Use Tranco list instead of Alexa Top 1M

Per the FAQ:

"Only the URLs whose domain is in the Alexa Top 1,000,000 ranked list are included."

However, research shows that the Alexa ranking (and similar rankings) can change significantly on a daily basis and can trivially be manipulated by malicious actors. The researchers present the Tranco list as a more robust alternative.

What do y’all think? Should HTTP Archive switch to using the Tranco list?

I believe HTTP Archive no longer uses Alexa rankings, though @rviscomi knows better than I.

Hi @mathias - thanks for sharing! This looks really interesting.

@developit is right, we don’t currently have a ranking indicator in the HTTP Archive data. That Alexa list was outdated as well, since the top million list is no longer available to us. There was a GitHub issue on this topic here - https://github.com/HTTPArchive/httparchive.org/issues/125. If we can find a data source that contains both ranking data as well as category info then that would be great!

@rviscomi wrote about a useful technique to include rank information in our queries by joining the last Alexa rank dataset we have with the latest HTTP Archive data. You can see this approach in "Getting domain rank with the new rank-less Chrome UX Report corpus".

Two potential problems I see:

  • The HTTP Archive is now tracking more than 4 million sites. So a top million list will leave many sites unranked.
  • There is a one-to-many relationship between domain names in the dataset and fully qualified hostnames in the archive. So if you wanted to search for the top 1000 sites, a few hundred of them might be subdomains of popular domains. We have the same limitation with the Alexa ranked dataset though.
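
To make the second point concrete, here is a minimal Python sketch of how several hostnames can end up sharing the rank of one listed domain. The function name `rank_for_host` and the sample data are made up for illustration. It simply walks up the parent domains of a hostname until it finds one in the ranking, which is an assumption about how you might do the lookup, not how HTTP Archive does it.

```python
from typing import Dict, Optional


def rank_for_host(host: str, ranks: Dict[str, int]) -> Optional[int]:
    """Return the rank of the host or of its closest ranked parent domain."""
    labels = host.lower().rstrip(".").split(".")
    # Try the full hostname first, then progressively shorter parent domains,
    # stopping before a bare top-level label such as "com".
    for i in range(len(labels) - 1):
        candidate = ".".join(labels[i:])
        if candidate in ranks:
            return ranks[candidate]
    return None


# Hypothetical ranking and hostnames, invented for this example.
ranks = {"example.com": 1, "example.org": 2}
hosts = ["www.example.com", "mail.example.com", "example.org", "unlisted.test"]

ranked = {h: rank_for_host(h, ranks) for h in hosts}
```

In this toy data, both `www.example.com` and `mail.example.com` inherit rank 1, so a "top N" query over hostnames can return more than N rows.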

That said, it’s definitely worth reviewing the Tranco list to see whether it will be useful as a replacement for Alexa. I imported the dataset into BigQuery and ran some queries to compare the two datasets. I found that they both missed more than 75% of the URLs tracked by the archive (not surprising based on the size of the dataset), and that they both ranked over 200K sites that the other dataset did not include.
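
For anyone who wants to repeat this kind of comparison outside BigQuery, the same idea can be sketched with Python sets. This is not the query I ran. The function name `compare_rankings` and the sample domains are hypothetical, and it assumes everything has already been normalized to plain domain names.

```python
from typing import Dict, Iterable


def compare_rankings(
    archive: Iterable[str], list_a: Iterable[str], list_b: Iterable[str]
) -> Dict[str, float]:
    """Summarize coverage of the archive and disagreement between two lists."""
    archive_set, a, b = set(archive), set(list_a), set(list_b)
    total = len(archive_set) or 1  # avoid dividing by zero on an empty archive
    return {
        "missed_by_a": len(archive_set - a) / total,  # share of archive not in A
        "missed_by_b": len(archive_set - b) / total,  # share of archive not in B
        "only_in_a": len(a - b),  # domains ranked by A but absent from B
        "only_in_b": len(b - a),  # domains ranked by B but absent from A
    }


# Toy data, invented for illustration only.
archive_domains = ["a.example", "b.example", "c.example", "d.example"]
alexa_domains = ["a.example", "x.example"]
tranco_domains = ["a.example", "b.example", "y.example"]

summary = compare_rankings(archive_domains, alexa_domains, tranco_domains)
```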

The rankings are really SEO- or advertising-oriented. The CrUX list has the advantage (and disadvantage) of being entirely quantitative and, as it has no ranking, pointless to try and game. I think sectoral or national groupings probably make more sense than any kind of ranking.

If we’re no longer using the Alexa list, can we update the FAQ entry accordingly? Thanks.

Hey! One of the authors of the Tranco list here.

If you are using the CrUX list in its entirety, I believe that it provides a better source of domains for HTTP Archive. In fact, in the near future we’re adding a configuration option to only include domains that are also in CrUX.

If you’d like to use Tranco to rank the CrUX domains, you could use the entire list (as of today, this contains 7,160,985 domains), though the difference between a domain ranked 5M and one ranked 7M will not be very representative, of course. To get the full list, simply add some zeroes to the URL, e.g. https://tranco-list.eu/list/N7JW/1000000 becomes https://tranco-list.eu/list/N7JW/100000000
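
As a rough sketch of how that could look in Python: the URL pattern is the one above, while the helper names `tranco_list_url` and `parse_rank_csv` are hypothetical, and the parser assumes the downloaded file is a plain CSV with one `rank,domain` pair per line. The example parses an inline string rather than fetching anything.

```python
import csv
import io
from typing import Dict


def tranco_list_url(list_id: str, size: int) -> str:
    """Build a list URL following the pattern shown above."""
    return f"https://tranco-list.eu/list/{list_id}/{size}"


def parse_rank_csv(text: str) -> Dict[str, int]:
    """Parse 'rank,domain' lines (assumed layout) into a domain -> rank map."""
    ranks: Dict[str, int] = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        rank, domain = row
        ranks[domain.strip().lower()] = int(rank)
    return ranks


full_list_url = tranco_list_url("N7JW", 100000000)

# Inline sample standing in for a downloaded file; the domains are made up.
sample = "1,example.com\n2,example.org\n"
ranks = parse_rank_csv(sample)
```

The resulting dictionary can then be joined against the CrUX or HTTP Archive domains to attach a rank where one exists.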

If you are facing some shortcomings/challenges with Alexa/CrUX/Tranco/…, please let us know. As part of ongoing research we’re trying to push forward the current state of practice, so we’re very interested in knowing different use cases and see how to best accommodate them.

Yes, thanks for pointing that out. Filed this bug: "Update the FAQ about where HA gets its URLs" (Issue #129 in HTTPArchive/httparchive.org).

PR #130 (merged), https://github.com/HTTPArchive/httparchive.org/pull/130, indicates that CrUX is used and Alexa is not.