I noticed that there are fewer websites included in the HTTP Archive dataset than in the CrUX list. For example, of the CrUX top 10k, only 6531 websites appear in `httparchive.crawl.pages`. The difference seems far too high to be explained by crawler failures alone. Is there another reason for it?
This is the query I am using:
```sql
SELECT DISTINCT page
FROM httparchive.crawl.pages
JOIN `chrome-ux-report.all.202512`
ON NET.HOST(page) = NET.HOST(origin)
WHERE
  date = '2025-12-01'
  AND client = 'desktop'
  AND is_root_page
  AND experimental.popularity.rank <= 10000
```
There are a number of reasons for this, including:
- The ranks are shared between clients, and you're only looking at desktop. Mobile has more coverage, but even then it's far from 100%, for the other reasons below.
- Some sites block our crawler. As good net citizens we identify ourselves with PTST in the user agent, but the downside of that is a percentage of sites block us as "bot traffic".
- We crawl only from US datacenters, so some sites redirect us to the US version of the site, and we stop the crawl for that site because the origin changes (the US origin will likely already be in the CrUX list if it's popular enough).
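On the first point, you can check how much of the gap is explained by client by running a variant of your own query that counts matches for both clients. This is a sketch built from the same tables and columns as your original query; only the `GROUP BY` and the `client IN (...)` filter are new:

```sql
-- Compare desktop vs mobile coverage of the CrUX top 10k.
-- Ranks are shared across clients, so the same rank filter applies to both.
SELECT
  client,
  COUNT(DISTINCT page) AS matched_pages
FROM httparchive.crawl.pages
JOIN `chrome-ux-report.all.202512`
ON NET.HOST(page) = NET.HOST(origin)
WHERE
  date = '2025-12-01'
  AND client IN ('desktop', 'mobile')
  AND is_root_page
  AND experimental.popularity.rank <= 10000
GROUP BY client
```

If mobile's `matched_pages` is noticeably higher, that portion of the shortfall is just client coverage rather than blocking or redirects.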
See also this thread: Investigate missing Top 1k home pages · Issue #222 · HTTPArchive/data-pipeline · GitHub, where we explained that ~20% of sites are blocked. In that thread we also introduced the crawl_failures table, where you can see the rejection reasons for any missing entries.
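To see the rejection reasons aggregated, something like the following should work against that table. Note the table path and column names here (`url`, `failure_reason`) are assumptions for illustration; check the actual crawl_failures schema in the dataset before running:

```sql
-- Sketch: tally why origins were rejected in a given crawl.
-- Table path and column names are assumed — verify against the real schema.
SELECT
  failure_reason,          -- hypothetical column holding the rejection reason
  COUNT(*) AS failures
FROM httparchive.crawl.crawl_failures
WHERE date = '2025-12-01'
GROUP BY failure_reason
ORDER BY failures DESC
```

Filtering that table to the specific hosts missing from your join should tell you, origin by origin, whether the gap is blocking, redirects, or something else.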