Hi everyone, a few updates related to the 10x project. Read last month’s announcement for the context.
Exceeding 10M home pages in the July crawl
The timing of the capacity increase could not have been better; it’s almost as if we knew it was coming! The Chrome UX Report (CrUX), which as a reminder is the upstream data source from which HTTP Archive gets its URLs, has exceeded 10M origins as of the most recent 202205 dataset, released on June 14: 11,024,795 origins to be exact, an increase of 28% over the previous month.
HTTP Archive will support 100% of those websites in our upcoming July crawl, starting on the first of the month. This is an exciting milestone for both CrUX and HTTP Archive. It’s also an important test of the resilience of our new infrastructure.
Testing websites with ambiguous form factors
As described in the CrUX 202205 release notes, this 28% increase in coverage is not organic; rather, it’s due to a methodological change in the way CrUX aggregates site data. Previously, the CrUX dataset would annotate in the form_factor dimension whether the user experience data corresponds to desktop users or mobile users. If there was insufficient data from desktop users, that segment would be omitted from the dataset. If the omitted segment represented more than 20% of the origin’s population, the omission was considered too large, and the entire origin would be excluded from the dataset. The important difference now is that, rather than excluding the origin, CrUX will attempt to combine desktop and mobile experiences together to give the origin a “second chance” to meet the sufficient data threshold. For those origins, the corresponding form_factor values are set to NULL.
HTTP Archive has a desktop crawl and a mobile crawl. We use the form_factor field from CrUX to know whether to test a page in one environment or the other, which affects how the page is rendered and sometimes even the content itself. So we have a decision to make about how to handle NULL device types.
What we’ve decided is to crawl home pages of origins with form_factor=NULL in both desktop and mobile environments. Given that it took the combination of desktop and mobile UX data to meet the sufficient data threshold, this solution is still true to the environments of real users as they visited the site.
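To make the rule concrete, here is a minimal sketch of how a form_factor value could map to crawl environments. The function name and the "desktop"/"mobile" labels are assumptions for illustration, not the actual HTTP Archive pipeline code, and Python's None stands in for NULL.

```python
from __future__ import annotations

# Assumed environment labels, for illustration only.
DESKTOP = "desktop"
MOBILE = "mobile"


def crawl_environments(form_factor: str | None) -> list[str]:
    """Return the environments in which to test an origin's home page.

    `form_factor` is the value from CrUX; None represents NULL.
    """
    if form_factor is None:
        # CrUX needed desktop and mobile data combined to meet its
        # threshold, so the home page is tested in both environments.
        return [DESKTOP, MOBILE]
    if form_factor in (DESKTOP, MOBILE):
        return [form_factor]
    raise ValueError(f"unexpected form_factor: {form_factor!r}")


print(crawl_environments(None))       # tested in both environments
print(crawl_environments("desktop"))  # tested in desktop only
```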
New streaming crawl pipeline
As part of our infrastructure improvements, we’re overhauling the entire data pipeline. This is the process that takes a URL, tests it using WebPageTest, and processes the results for writing into the relevant BigQuery tables.
The old way was to hold the results off to the side until every test was finished, then process everything all at once. It could take a day or two between the end of testing and the start of processing. The new behavior is to stream each test result into BigQuery as soon as it’s ready. If you’re writing queries that don’t depend on the full dataset, this will enable you to get some sample data weeks sooner.
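The difference between the two behaviors can be sketched roughly as follows. This is only an illustration: run_test and write_row are hypothetical stand-ins for the WebPageTest run and the BigQuery write, not real pipeline functions.

```python
from typing import Callable, Iterable


def run_batch(urls: Iterable[str],
              run_test: Callable[[str], dict],
              write_row: Callable[[dict], None]) -> None:
    """Old behavior: hold every result, then process everything at once."""
    results = [run_test(url) for url in urls]  # nothing is written yet
    for result in results:
        write_row(result)


def run_streaming(urls: Iterable[str],
                  run_test: Callable[[str], dict],
                  write_row: Callable[[dict], None]) -> None:
    """New behavior: write each result as soon as its test finishes."""
    for url in urls:
        write_row(run_test(url))
```

With streaming, the first rows become queryable after the first test completes rather than after the last one, which is why partial data can show up much earlier in the crawl.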
We’ll need to experiment with ways to make it clear when the dataset is “done” and fully ready to analyze. Maybe we’ll annotate tables as partial somehow, to be explicit that data is still streaming in. Let us know in the comments if the July behavior gives you any trouble or if you have any suggestions on how to improve the process.
Improvements to the CWV Technology Report
Starting with the May dataset, the CWV Technology Report (cwvtech.report) will have two exciting new capabilities:
- Lighthouse data for desktop pages
- Technology detections for secondary pages
Lighthouse for desktop
As of the May crawl, we now have Lighthouse data in BigQuery for desktop pages at httparchive.lighthouse.2022_05_01_desktop. We’ve added this to the CWV Technology Report’s Lighthouse section.
Given the different performance characteristics of desktop and mobile lab environments, we expect this to provide new insights about the opportunities available to each technology.
Bonus tip: did you know that the chart+gear icon allows you to select different Lighthouse category scores? Whenever you see this icon above a chart, click it to see options for different metrics to include.
Technology detections for secondary pages
Now that we’re able to crawl up to two pages per site (home page plus one secondary page), we can include the technologies found on secondary pages to improve the accuracy of our origin-level technology detections.
If either the home page or secondary page is found to use a technology, we count that origin as using the technology exactly once. This can only increase a technology’s measured adoption (or leave it unchanged), given that we’re expanding the pool of candidate pages on which to detect the technologies.
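As a rough sketch of that counting rule, the snippet below takes the union of technologies across an origin's pages and counts each origin once per technology. The origins and detections are made-up sample data, and the function names are hypothetical rather than taken from the report's actual queries.

```python
from collections import Counter


def origin_technologies(pages_by_origin):
    """Map each origin to the set of technologies found on any of its pages."""
    return {
        origin: set().union(*pages)
        for origin, pages in pages_by_origin.items()
    }


def adoption(pages_by_origin):
    """Count origins per technology; an origin counts at most once each."""
    counts = Counter()
    for technologies in origin_technologies(pages_by_origin).values():
        counts.update(technologies)
    return counts


# Hypothetical detections: [home page, secondary page] for each origin.
detections = {
    "https://a.example": [{"WordPress"}, {"WordPress", "AMP"}],
    "https://b.example": [{"React"}, set()],
}

counts = adoption(detections)
print(counts["WordPress"])  # found on both pages of one origin, counted once
print(counts["AMP"])        # found only on the secondary page, still counted
```

Because the secondary page can only add to an origin's set of technologies, a technology such as AMP that is absent from the home page is still attributed to the origin.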
Percentage of AMP origins having good CWV
This change in adoption may also result in a small but noticeable change in the overall performance of that technology. The chart above shows the performance of websites using AMP over time. Notice that there was an uptick in performance that actually started in April and continued through May.
Number of AMP origins
We can change the chart’s metric to show the total number of origins, and there is clearly a massive increase in adoption. It seems that home pages are less likely to be built with AMP, so by testing beyond home pages we’re able to massively improve the accuracy of the dataset. It’s nice that we don’t see any huge jumps in CWV performance, meaning that the home page data was still a representative sample for the most part. Let us know if you see any unexpected results.
Let us know what you think of all the new changes!
Rick