Only the URLs whose domain is in the Alexa Top 1,000,000 ranked list are included.
However, research shows that the Alexa ranking (and similar rankings) can change significantly on a daily basis and can trivially be manipulated by malicious actors. The researchers present the Tranco list as a more robust alternative.
What do y’all think? Should HTTP Archive switch to using the Tranco list?
Hi @mathias - thanks for sharing! This looks really interesting.
@developit is right, we don’t currently have a ranking indicator in the HTTP Archive data. That Alexa list was outdated as well, since the top million list is no longer available to us. There was a GitHub issue on this topic here - https://github.com/HTTPArchive/httparchive.org/issues/125. If we can find a data source that contains both ranking data and category info then that would be great!
The HTTP Archive is now tracking more than 4 million sites. So a top million list will leave many sites unranked.
There is a one-to-many relationship between domain names in the dataset and fully qualified hostnames in the archive. So if you wanted to search for the top 1000 sites, a few hundred of them might be subdomains of popular domains. We have the same limitation with the Alexa ranked dataset though.
That said, it’s definitely worth reviewing the Tranco list to see whether it will be useful as a replacement for Alexa. I imported the dataset into BigQuery and ran some queries to compare the two datasets. I found that they both missed more than 75% of the URLs tracked by the archive (not surprising based on the size of the dataset), and that they both ranked over 200K sites that the other dataset did not include.
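For anyone who wants to repeat this kind of comparison outside BigQuery, here is a rough sketch of the set arithmetic involved. It is not the query I actually ran; the function name `coverage_report` and the sample hostnames are made up for illustration, and it assumes all three inputs have already been normalised to the same form (e.g. lowercase domains).

```python
def coverage_report(archive_sites, list_a, list_b):
    """Compare two ranked lists against the sites tracked by an archive.

    All arguments are iterables of already-normalised domain strings
    (hypothetical inputs; normalisation is assumed to happen upstream).
    """
    archive = set(archive_sites)
    a, b = set(list_a), set(list_b)
    if not archive:
        raise ValueError("archive_sites must not be empty")
    return {
        # Fraction of archive sites that each list does not rank at all.
        "missed_by_a": len(archive - a) / len(archive),
        "missed_by_b": len(archive - b) / len(archive),
        # Number of sites ranked by one list but absent from the other.
        "only_in_a": len(a - b),
        "only_in_b": len(b - a),
    }


if __name__ == "__main__":
    # Toy data only; real lists would be loaded from the exported datasets.
    archive = ["a.com", "b.com", "c.com", "d.com"]
    alexa_like = ["a.com", "x.com"]
    tranco_like = ["a.com", "b.com", "y.com", "z.com"]
    print(coverage_report(archive, alexa_like, tranco_like))
```

The two "missed" fractions correspond to the share of archive URLs left unranked, and the two "only in" counts correspond to sites that one list ranks and the other omits.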
The rankings really are SEO or advertising oriented. The CrUX list has the advantage (and disadvantage) of being entirely quantitative and, as it has no ranking, pointless to try and game. I think sectoral or national groupings probably make more sense than any kind of ranking.
If you are using the CrUX list in its entirety, I believe that it provides a better source of domains for HTTP Archive. In fact, in the near future we’re adding a configuration option to only include domains that are also in CrUX.
If you’d like to use Tranco to rank the CrUX domains, you could use the entire list (as of today, this contains 7,160,985 domains), though the difference between a domain ranked 5M and one ranked 7M will not be very representative of course. To get the full list, simply add some zeroes to the URL, e.g. https://tranco-list.eu/list/N7JW/1000000 → https://tranco-list.eu/list/N7JW/100000000
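As a rough illustration of what ranking CrUX origins against a downloaded list could look like, here is a minimal sketch. It assumes the downloaded file is a headerless CSV of `rank,domain` rows and that CrUX entries are full origins with a scheme; the helper names `load_ranks` and `rank_for_origin` are hypothetical. The parent-domain lookup is deliberately naive (it just strips leading labels) and does not consult the Public Suffix List.

```python
import csv
import io
from urllib.parse import urlsplit


def load_ranks(csv_text):
    """Parse headerless 'rank,domain' CSV text into a {domain: rank} dict."""
    ranks = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        rank, domain = row
        ranks[domain.strip().lower()] = int(rank)
    return ranks


def rank_for_origin(origin, ranks):
    """Return the rank of the closest listed parent domain, or None.

    The origin must include a scheme (e.g. 'https://www.example.com');
    otherwise no hostname can be extracted and None is returned.
    """
    host = (urlsplit(origin).hostname or "").lower()
    labels = host.split(".")
    # Try the full hostname first, then strip leading labels one at a time,
    # stopping before the bare top-level label.
    for i in range(len(labels) - 1):
        candidate = ".".join(labels[i:])
        if candidate in ranks:
            return ranks[candidate]
    return None


if __name__ == "__main__":
    sample = "1,google.com\n2,example.com\n"  # toy data, not real ranks
    ranks = load_ranks(sample)
    for origin in ("https://www.example.com", "https://mail.google.com",
                   "https://unknown.org"):
        print(origin, rank_for_origin(origin, ranks))
```

Note that several origins can map to the same ranked domain, so the resulting ranks are shared between subdomains rather than unique per origin.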
If you are facing some shortcomings/challenges with Alexa/CrUX/Tranco/…, please let us know. As part of ongoing research we’re trying to push forward the current state of practice, so we’re very interested in learning about different use cases and seeing how best to accommodate them.