Changes to the HTTP Archive corpus


#1


“Sample Size” on the State of the Web report

Our goal is to have a high-quality corpus of websites that we test regularly to accurately track the state of the web. To that end, we’ve started making changes to improve the quality of our corpus:

  1. As of 2018_07_01 we started using URLs from the Chrome UX Report (CrUX). We switched mobile URLs to CrUX on 2018_07_15. This increased our coverage from 500K URLs to ~1.3M.
  2. As of 2018_12_01 we have decreased the number of test runs per URL from 3 to 1. The loss of redundancy may affect the reliability of time-based performance metrics, but these are not especially useful in synthetic tests. For accurate real-world performance data, join with the CrUX dataset. This change reduces the time for the crawl to complete, allowing us to add more URLs.
  3. As of 2018_12_15 we have increased the desktop URLs to all CrUX URLs for desktop (3.9M).
  4. As of 2019_01_01 we are reducing the crawl frequency from semi-monthly to monthly. Combined with the reduced runs per URL, this change will enable us to afford testing the full CrUX corpus for both desktop and mobile. As of this crawl we will increase the mobile URLs to all CrUX URLs for mobile (4.2M).
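
To make the trade-off in items 2 and 4 concrete, here is a rough back-of-the-envelope sketch. It assumes crawl cost is proportional to the number of page loads and that each run is one page load. The function name `page_loads_per_month` is made up for illustration, and the URL counts are the rounded figures quoted above, not exact corpus sizes.

```python
def page_loads_per_month(urls: int, runs_per_url: int, crawls_per_month: int) -> int:
    """Hypothetical cost model: one page load per run, per URL, per crawl."""
    return urls * runs_per_url * crawls_per_month


# Cost of a single URL under the old and new settings.
old_per_url = page_loads_per_month(1, runs_per_url=3, crawls_per_month=2)
new_per_url = page_loads_per_month(1, runs_per_url=1, crawls_per_month=1)

# Rounded CrUX corpus sizes quoted in the list above.
desktop_urls = 3_900_000
mobile_urls = 4_200_000
full_corpus = desktop_urls + mobile_urls

# Full corpus with 1 run per URL, crawled monthly.
new_settings_load = page_loads_per_month(full_corpus, 1, 1)

# The same corpus with 3 runs per URL, crawled semi-monthly.
old_settings_load = page_loads_per_month(full_corpus, 3, 2)

print(f"page loads per URL per month: {old_per_url} -> {new_per_url}")
print(f"full corpus, new settings: {new_settings_load:,}")
print(f"full corpus, old settings: {old_settings_load:,}")
```

Under these assumptions each URL costs one sixth of what it used to per month, which is what makes the full desktop and mobile CrUX corpus affordable.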

Feel free to reply to this thread if you have any questions or comments.


#2