The HTTP Archive dataset has a different number of URLs each month. Is this because the data Google provides changes from month to month, or does HTTP Archive do some filtering?
The URLs that HTTP Archive tests each month come from the latest Chrome UX Report (CrUX) dataset. CrUX is a collection of origins that reflect real UX trends on the web, so as usage changes the HTTP Archive dataset will fluctuate.
Thanks! And do you know why the number of domains in CrUX varies?
CrUX is a reflection of real-world web usage from month to month.
And is there any information about the discrepancy between Alexa and CrUX? I checked, and some URLs are in both datasets, but I couldn’t find any figures on, for example, how many of the URLs in Alexa are also in CrUX.
These are different datasets, collected from different user bases and
updated constantly, so such comparisons are not really meaningful even
if they are possible.
Charlie
And do you know if Google also provides a ranking in CrUX?
I’m not sure what you mean by “ranking” but you can find out everything
you need about CrUX from the website.
Charlie
CrUX is not ranked. You can join the URLs in HTTP Archive with old Alexa datasets (see “Getting domain rank with the new rank-less Chrome UX Report corpus”), but the Alexa dataset only provides ranking at the domain granularity*.
\* About 2/3 of the domains in CrUX are also found in the ranked Alexa domain list.
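
For anyone who wants to check the overlap on their own exports, here is a minimal Python sketch of the domain-level join described above. It is not the BigQuery query from the linked post. The function names (`origin_to_domain`, `alexa_overlap`, `attach_rank`) and the sample data are made up for illustration, and it assumes you already have a list of CrUX origins and a domain-to-rank mapping from an old Alexa list. It only strips a leading `www.`, so other subdomains will not match their parent domain; a proper registrable-domain lookup would likely need the Public Suffix List.

```python
from urllib.parse import urlsplit


def origin_to_domain(origin: str) -> str:
    """Reduce an origin like 'https://www.example.com' to 'example.com'.

    Simplifying assumption: only a leading 'www.' is removed. Other
    subdomains (e.g. 'shop.example.org') are kept as they are.
    """
    host = urlsplit(origin).hostname or ""
    return host[4:] if host.startswith("www.") else host


def alexa_overlap(crux_origins, alexa_domains) -> float:
    """Fraction of distinct CrUX domains that also appear in the Alexa list."""
    crux_domains = {origin_to_domain(o) for o in crux_origins}
    crux_domains.discard("")  # drop inputs that had no parsable hostname
    if not crux_domains:
        return 0.0
    alexa = {d.lower() for d in alexa_domains}
    return len(crux_domains & alexa) / len(crux_domains)


def attach_rank(crux_origins, alexa_ranks):
    """Map each CrUX origin to its Alexa domain rank, or None if unranked."""
    return {o: alexa_ranks.get(origin_to_domain(o)) for o in crux_origins}


# Hypothetical sample data, not real CrUX or Alexa entries.
sample_origins = [
    "https://www.example.com",
    "https://example.com",
    "https://shop.example.org",
    "http://news.example.net",
]
sample_ranks = {"example.com": 1, "example.net": 2}

overlap = alexa_overlap(sample_origins, sample_ranks)
ranked = attach_rank(sample_origins, sample_ranks)
print(f"{overlap:.0%} of CrUX domains found in Alexa")
```

Note that both `https://www.example.com` and `https://example.com` pick up the same rank here, which is the domain-granularity limitation mentioned above: the rank belongs to the domain, not to an individual origin or URL.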