I wanted to share an example query that shows how it’s still possible to infer site rank even though we switched over to the 1.3M unranked Chrome UX Report URLs.
The 1.3M URLs were explicitly selected to be on domains in the March 15, 2017 Alexa 1M list. In BigQuery, this list is accessible as `httparchive.urls.20170315`. We use this date because it's the last snapshot we have before Alexa switched to (what I consider to be) a much lower quality list.
Given that, every URL maps to a domain that has a rank in the Alexa 1M list. We can join any table with a URL field to the Alexa list to get its rank:
```sql
#standardSQL
SELECT
  Alexa_rank AS rank,
  url
FROM
  `httparchive.pages.2018_08_01_desktop`
JOIN
  `httparchive.urls.20170315`
ON
  NET.REG_DOMAIN(url) = Alexa_domain
WHERE
  JSON_EXTRACT(payload, '$._blinkFeatureFirstUsed.Features.V8SpeechRecognition_Start_Method') IS NOT NULL
ORDER BY
  rank
```
For example, this query shows us the domain rank and URL for all 4 pages that use some kind of speech recognition. Note that rank 46511 appears twice, across two separate URLs. That's because both URLs are on the same domain (mcs.gov.sa).
So we lose per-URL rank granularity, but we gain the ability to analyze multiple origins per domain, which is how we ended up with 1.3M URLs from a 1M-domain list. This is especially important for domains that host user-generated content, like WordPress or Blogger, and for large companies with many products under a single domain, like maps.google.com, mail.google.com, books.google.com, etc. Previously, we only knew that "google.com" was rank #1, so we'd simply crawl http://www.google.com and miss all of the other popular sites on that domain.
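To see this effect directly, the same join can be grouped by domain to count how many distinct URLs each ranked Alexa domain contributed. This is just a sketch using the same tables as the query above; I haven't tuned it, and column aliases are illustrative:

```sql
#standardSQL
-- Count how many crawled URLs map to each ranked Alexa domain.
-- Domains hosting many origins (e.g. google.com, blogspot.com)
-- should float to the top.
SELECT
  Alexa_rank AS rank,
  Alexa_domain AS domain,
  COUNT(DISTINCT url) AS num_urls
FROM
  `httparchive.pages.2018_08_01_desktop`
JOIN
  `httparchive.urls.20170315`
ON
  NET.REG_DOMAIN(url) = Alexa_domain
GROUP BY
  rank, domain
ORDER BY
  num_urls DESC
```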
We haven't done it yet, but at some point we will regenerate the list of URLs from newer Chrome UX Report datasets. We will probably also intersect it with the Alexa 1M to preserve coarse ranking info, though that depends on a few other factors, like our crawl capacity.