Use Tranco list instead of Alexa Top 1M

mathias · February 25, 2019, 7:42am

The FAQ currently says: "Only the URLs whose domain is in the Alexa Top 1,000,000 ranked list are included."

However, research shows that the Alexa ranking (and similar rankings) can change significantly on a daily basis and can trivially be manipulated by malicious actors. The researchers present the Tranco list as a more robust alternative.

What do y’all think? Should HTTP Archive switch to using the Tranco list?

developit · February 25, 2019, 12:13pm

I believe HTTP Archive no longer uses Alexa rankings, though @rviscomi knows better than I.

paulcalvano · February 25, 2019, 1:18pm

Hi @mathias - thanks for sharing! This looks really interesting.

@developit is right, we don’t currently have a ranking indicator in the HTTP Archive data. That Alexa list was outdated as well, since the top million list is no longer available to us. There was a GitHub issue on this topic here - https://github.com/HTTPArchive/httparchive.org/issues/125. If we can find a data source that contains both ranking data as well as category info then that would be great!

@rviscomi wrote about a useful technique to include rank information in our queries by joining the last Alexa rank dataset we have with the latest HTTP Archive data. You can see this approach here - Getting domain rank with the new rank-less Chrome UX Report corpus.

Two potential problems I see:

- The HTTP Archive is now tracking more than 4 million sites, so a top million list will leave many sites unranked.
- There is a one-to-many relationship between domain names in the ranked dataset and fully qualified hostnames in the archive. So if you wanted to search for the top 1000 sites, a few hundred of them might be subdomains of popular domains. We have the same limitation with the Alexa ranked dataset though.
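To illustrate the one-to-many issue, here is a minimal Python sketch of how hostnames could be matched against a domain-level ranking. It is not the HTTP Archive pipeline; the function name `rank_for_hostname` and the sample data are made up for illustration. It walks each hostname from its full form down to shorter suffixes and takes the first one that appears in the ranking, so several hostnames can end up sharing one domain's rank.

```python
from typing import Dict, Optional


def rank_for_hostname(hostname: str, ranks: Dict[str, int]) -> Optional[int]:
    """Return the rank of the longest suffix of `hostname` present in `ranks`."""
    labels = hostname.lower().rstrip(".").split(".")
    # Try "www.mail.example.com", then "mail.example.com", then "example.com", ...
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in ranks:
            return ranks[candidate]
    return None  # hostname is not covered by the ranking


# Hypothetical data, for illustration only.
ranks = {"example.com": 1, "example.org": 2}
hostnames = ["www.example.com", "mail.example.com", "example.org", "unranked.test"]

ranked = {h: rank_for_hostname(h, ranks) for h in hostnames}
print(ranked)
```

Both `www.example.com` and `mail.example.com` inherit the rank of `example.com`, which is exactly why a "top 1000 sites" query can return several hostnames for the same popular domain.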

That said, it’s definitely worth reviewing the Tranco list to see whether it will be useful as a replacement for Alexa. I imported the dataset into BigQuery and ran some queries to compare the two datasets. I found that they both missed more than 75% of the URLs tracked by the archive (not surprising based on the size of the dataset), and that they both ranked over 200K sites that the other dataset did not include.
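For anyone who wants to repeat this kind of comparison outside BigQuery, the logic can be sketched with plain Python sets. This is only a rough equivalent of what I ran, assuming everything has already been normalized to domain names; `compare_lists` and the sample sets are hypothetical.

```python
from typing import Dict, Set, Union


def compare_lists(
    archive: Set[str], list_a: Set[str], list_b: Set[str]
) -> Dict[str, Union[int, float]]:
    """Compare how two ranked lists cover the domains tracked by the archive."""
    return {
        # Share of archive domains that each list does not rank.
        "missed_by_a": len(archive - list_a) / len(archive),
        "missed_by_b": len(archive - list_b) / len(archive),
        # Domains ranked by one list but absent from the other.
        "only_in_a": len(list_a - list_b),
        "only_in_b": len(list_b - list_a),
    }


# Hypothetical data, for illustration only.
archive = {"a.test", "b.test", "c.test", "d.test"}
alexa = {"a.test", "x.test"}
tranco = {"a.test", "b.test", "y.test"}

result = compare_lists(archive, alexa, tranco)
print(result)
```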

charlie.clark · February 25, 2019, 2:13pm

The rankings really are SEO or advertising oriented. The CrUX list has the advantage (and disadvantage) of being entirely quantitative and, as it has no ranking, pointless to try and game. I think sectoral or national groupings probably make more sense than any kind of ranking.

mathias · February 25, 2019, 2:49pm

If we’re no longer using the Alexa list, can we update the FAQ entry accordingly? Thanks.

tomvangoethem · February 25, 2019, 3:18pm

Hey! One of the authors of the Tranco list here

If you are using the CrUX list in its entirety, I believe that it provides a better source of domains for HTTP Archive. In fact, in the near future we’re adding a configuration option to only include domains that are also in CrUX.

If you’d like to use Tranco to rank the CrUX domains, you could use the entire list (as of today, this contains 7,160,985 domains), though the difference between a domain ranked 5M and one ranked 7M will not be very representative of course. To get the full list, simply add some zeroes to the URL, e.g. https://tranco-list.eu/list/N7JW/1000000 → https://tranco-list.eu/list/N7JW/100000000
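As a rough sketch of the URL tweak above, the snippet below swaps the trailing size segment of a list URL and parses the downloaded text into a domain-to-rank mapping. The helper names `with_limit` and `parse_ranks` are hypothetical, and the parser assumes the file is plain `rank,domain` rows without a header. No download is performed here.

```python
import csv
import io
from typing import Dict


def with_limit(list_url: str, limit: int) -> str:
    """Replace the trailing size segment of a list URL with `limit`."""
    base, _, _old_limit = list_url.rstrip("/").rpartition("/")
    return f"{base}/{limit}"


def parse_ranks(csv_text: str) -> Dict[str, int]:
    """Parse 'rank,domain' rows (assumed format) into a domain -> rank mapping."""
    ranks: Dict[str, int] = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:  # skip blank lines
            continue
        rank, domain = row
        ranks[domain] = int(rank)
    return ranks


full_url = with_limit("https://tranco-list.eu/list/N7JW/1000000", 100000000)
print(full_url)

# Hypothetical sample rows, for illustration only.
sample = "1,example.com\n2,example.org\n"
print(parse_ranks(sample))
```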

If you are facing any shortcomings or challenges with Alexa/CrUX/Tranco/…, please let us know. As part of ongoing research we’re trying to push forward the current state of practice, so we’re very interested in learning about different use cases and seeing how best to accommodate them.

rviscomi · February 25, 2019, 5:03pm

Yes, thanks for pointing that out. Filed this bug: Update the FAQ about where HA gets its URLs (Issue #129, HTTPArchive/httparchive.org).

sesam · March 13, 2019, 11:32am

PR #130 (merged) https://github.com/HTTPArchive/httparchive.org/pull/130 indicates that CrUX is used and Alexa is not.
