Use Tranco list instead of Alexa Top 1M

Hi @mathias - thanks for sharing! This looks really interesting.

@developit is right: we don’t currently have a ranking indicator in the HTTP Archive data. The Alexa list we used was also outdated, since the top million list is no longer available to us. There is a GitHub issue on this topic: https://github.com/HTTPArchive/httparchive.org/issues/125. If we can find a data source that contains both ranking data and category info, that would be great!

@rviscomi wrote about a useful technique for including rank information in our queries: join the last Alexa rank dataset we have with the latest HTTP Archive data. You can see this approach in the post “Getting domain rank with the new rank-less Chrome UX Report corpus”.

Two potential problems I see:

  • The HTTP Archive now tracks more than 4 million sites, so a top million list will leave most of them unranked.
  • There is a one-to-many relationship between domain names in the ranked list and fully qualified hostnames in the archive. So if you searched for the top 1000 sites, a few hundred of the results might be subdomains of popular domains. We have the same limitation with the Alexa ranked dataset, though.
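
To make the second point concrete, here is a minimal Python sketch of how one ranked domain can match several archive hostnames. It is not the query we actually run; the `ranks` table, the URLs, and the `rank_for_url` helper are all hypothetical, and the matching rule (strip leading labels until a ranked domain is found) is just one simple assumption about how a join could work.

```python
from urllib.parse import urlsplit

def rank_for_url(url, ranks):
    """Return (matched_domain, rank) for the closest ranked parent domain, or None."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    # Try the full hostname first, then drop leading labels one at a time.
    # Stop before the bare top-level label (e.g. "com").
    for i in range(len(labels) - 1):
        candidate = ".".join(labels[i:])
        if candidate in ranks:
            return candidate, ranks[candidate]
    return None

# Hypothetical rank table and archive URLs, for illustration only.
ranks = {"example.com": 1, "example.org": 2}
urls = [
    "https://www.example.com/",
    "https://mail.example.com/",
    "https://example.org/",
    "https://unranked.test/",
]
matches = {u: rank_for_url(u, ranks) for u in urls}
```

Here both `www.example.com` and `mail.example.com` inherit rank 1 from `example.com`, so a "top N" filter on rank returns more than N hostnames, while `unranked.test` gets no rank at all.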

That said, it’s definitely worth reviewing the Tranco list to see whether it will be useful as a replacement for Alexa. I imported the dataset into BigQuery and ran some queries to compare the two lists. I found that each list missed more than 75% of the URLs tracked by the archive (not surprising given the size of the archive), and that each ranked over 200K sites that the other did not include.
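
For anyone who wants to reproduce this kind of comparison outside BigQuery, the logic boils down to a few set operations. The sketch below is not the query I ran; the `compare_rankings` function and the tiny sample sets are hypothetical, and it assumes the archive URLs have already been reduced to the same domain form the lists use.

```python
def compare_rankings(archive_domains, list_a, list_b):
    """Summarise how two ranked lists cover the archive and how they differ."""
    archive = set(archive_domains)
    a, b = set(list_a), set(list_b)
    total = len(archive) or 1  # avoid dividing by zero on an empty archive
    return {
        # Share of archive domains that each list does not rank.
        "missed_by_a": len(archive - a) / total,
        "missed_by_b": len(archive - b) / total,
        # Domains ranked by one list but absent from the other.
        "only_in_a": len(a - b),
        "only_in_b": len(b - a),
    }

# Hypothetical sample data, for illustration only.
archive = ["a.test", "b.test", "c.test", "d.test"]
alexa = ["a.test", "x.test"]
tranco = ["a.test", "y.test", "z.test"]
summary = compare_rankings(archive, alexa, tranco)
```

With this sample data each list misses three of the four archive domains, and each contains entries the other lacks, which is the same shape of result as the real comparison at a much smaller scale.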

