[10x project] Exceeding 10M and announcing new capabilities

rviscomi · June 30, 2022, 10:49pm

Hi everyone, a few updates related to the 10x project. Read last month’s announcement for the context.

Exceeding 10M home pages in the July crawl

The capacity timing could not have been better. It’s almost as if we knew it was coming! The Chrome UX Report (CrUX), which as a reminder is the upstream data source from which HTTP Archive gets its URLs, has exceeded 10M origins as of the most recent 202205 dataset, released on June 14. It contains 11,024,795 origins to be exact, an increase of 28% over the previous month.

HTTP Archive will support 100% of those websites in our upcoming July crawl, starting on the first of the month. This is an exciting milestone for both CrUX and HTTP Archive. It’s also an important test of the resilience of our new infrastructure.

Testing websites with ambiguous form factors

As described in the CrUX 202205 release notes, this 28% increase in coverage is not organic; rather, it’s due to a methodological change in the way CrUX aggregates site data. Previously, the CrUX dataset would annotate in the form_factor dimension whether the user experience data corresponds to desktop users or mobile users. If there was insufficient data for one of those segments, for example desktop users, that segment would be omitted from the dataset. If the omitted segment represented more than 20% of the origin’s population, the omission was considered too large, and the entire origin would be excluded from the dataset. The important difference now is that rather than excluding the origin, CrUX will attempt to combine desktop and mobile experiences to give the origin a “second chance” to meet the sufficient data threshold. In that case, the origin’s form_factor value is set to NULL.
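To make the old and new rules concrete, here is a minimal Python sketch of the logic as described above. It is not CrUX’s actual implementation: the `MIN_SAMPLES` cutoff, the function name `crux_rows`, and the sample counts are all assumptions for illustration, and only the 20% limit comes from the description.

```python
from typing import Optional

# Assumed value for illustration only; the real "sufficient data"
# threshold is not stated in this post.
MIN_SAMPLES = 100
# The 20% limit described above.
MAX_OMITTED_SHARE = 0.20


def crux_rows(desktop: int, mobile: int, combine: bool) -> list[Optional[str]]:
    """Return the form_factor values an origin would appear with.

    An empty list means the origin is excluded from the dataset.
    combine=False models the old behavior, combine=True the new one.
    None stands in for a NULL form_factor.
    """
    samples = {"desktop": desktop, "mobile": mobile}
    total = desktop + mobile
    if total == 0:
        return []

    # Segments with enough data are kept; the rest are omitted.
    kept = [ff for ff, n in samples.items() if n >= MIN_SAMPLES]
    omitted = sum(n for n in samples.values() if n < MIN_SAMPLES)

    # A small enough omission is tolerated under both behaviors.
    if kept and omitted / total <= MAX_OMITTED_SHARE:
        return kept

    # New behavior: combine the segments for a "second chance".
    if combine and total >= MIN_SAMPLES:
        return [None]

    return []
```

With these assumed numbers, an origin with 150 desktop and 60 mobile samples is excluded under the old rule (`crux_rows(150, 60, combine=False)` returns an empty list) but is kept with a NULL form factor under the new one.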

HTTP Archive has a desktop crawl and a mobile crawl. We use the form_factor field from CrUX to know whether to test a page in one environment or the other, which affects how the page is rendered and sometimes even the content itself. So we have a decision to make about how to handle NULL device types.

What we’ve decided is to crawl home pages of origins with form_factor=NULL in both desktop and mobile environments. Given that it took the combination of desktop and mobile UX data to meet the sufficient data threshold, this solution is still true to the environments of real users as they visited the site.
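As a rough sketch, the decision above amounts to a small mapping from form factor to crawl environments. The function name is hypothetical, and the literal strings "desktop" and "mobile" follow the wording of this post; they may not match the exact values stored in the CrUX tables.

```python
from typing import Optional


def crawl_environments(form_factor: Optional[str]) -> list[str]:
    """Pick the crawl(s) that should test an origin's home page.

    None stands in for form_factor=NULL.
    """
    if form_factor is None:
        # Combined desktop + mobile data: test in both environments.
        return ["desktop", "mobile"]
    if form_factor in ("desktop", "mobile"):
        return [form_factor]
    raise ValueError(f"unexpected form factor: {form_factor!r}")
```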

New streaming crawl pipeline

As part of our infrastructure improvements, we’re overhauling the entire data pipeline. This is the process that takes a URL, tests it using WebPageTest, and processes the results for writing into the relevant BigQuery tables.

Diagram of the new streaming pipeline, showing processing and testing in parallel

The old way was to hold the results off to the side until every test was finished, then process everything all at once. It could take a day or two between the end of testing and the start of processing. The new behavior is to stream each test result into BigQuery as soon as it’s ready. If you’re writing queries that don’t depend on the full dataset, this will enable you to get some sample data weeks sooner.
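The difference between the two approaches can be illustrated with a toy Python sketch. This is not the pipeline’s code: the function names and the fake test, process, and write steps are placeholders, and everything runs sequentially here, whereas the real pipeline tests and processes in parallel.

```python
def run_batch(urls, test, process, write):
    """Old behavior: finish every test, then process everything at once."""
    results = [test(url) for url in urls]
    for result in results:
        write(process(result))


def run_streaming(urls, test, process, write):
    """New behavior: process and write each result as soon as it is ready."""
    for url in urls:
        write(process(test(url)))


def demo(runner):
    """Run a pipeline over two placeholder URLs and log the order of events."""
    log = []

    def fake_test(url):
        log.append(f"test {url}")
        return {"url": url}

    def fake_process(result):
        return result["url"]

    def fake_write(row):
        log.append(f"write {row}")

    runner(["a.example", "b.example"], fake_test, fake_process, fake_write)
    return log


batch_log = demo(run_batch)
stream_log = demo(run_streaming)
```

In `stream_log` the first write happens right after the first test, while in `batch_log` no row is written until every test has finished.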

We’ll need to experiment with ways to make it clear when the dataset is “done” and fully ready to analyze. Maybe we’ll annotate tables as partial somehow, to be explicit that data is still streaming in. Let us know in the comments if the July behavior gives you any trouble or if you have any suggestions on how to improve the process.

Improvements to the CWV Technology Report

Starting with the May dataset, the CWV Technology Report (cwvtech.report) will have two exciting new capabilities:

- Lighthouse data for desktop pages
- Technology detections for secondary pages

Lighthouse for desktop

As of the May crawl, we now have Lighthouse data in BigQuery for desktop pages at httparchive.lighthouse.2022_05_01_desktop. We’ve added this to the CWV Technology Report’s Lighthouse section.

Given the different performance characteristics of desktop and mobile lab environments, we expect this to provide new insights about the opportunities available to each technology.

Screenshot of the CWV Tech Report showing performance, accessibility, PWA, and SEO score options

Bonus tip: did you know that the chart+gear icon allows you to select different Lighthouse category scores? Whenever you see this icon above a chart, click it to see options for different metrics to include.

Technology detections for secondary pages

Now that we’re able to crawl up to two pages per site (home page plus one secondary page), we can include the technologies found on secondary pages to improve the accuracy of our origin-level technology detections.

If either the home page or secondary page is found to use a technology, we count that origin as using the technology exactly once. This can only increase a technology’s measured adoption, never decrease it, given that we’re expanding the pool of candidate pages on which to detect the technologies.
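The counting rule can be sketched in Python as a set union per origin. The data structure, function names, and example origins below are made up for illustration and do not reflect how the report is actually computed.

```python
from collections import Counter


def origin_technologies(pages: dict[str, set[str]]) -> set[str]:
    """Union of the technologies detected on any of an origin's pages."""
    return set().union(*pages.values())


def adoption(origins: dict[str, dict[str, set[str]]]) -> Counter:
    """Count origins per technology, each origin at most once."""
    counts = Counter()
    for pages in origins.values():
        counts.update(origin_technologies(pages))
    return counts


# Made-up detections for two placeholder origins.
origins = {
    "https://a.example": {"home": {"React"}, "secondary": {"React", "AMP"}},
    "https://b.example": {"home": set(), "secondary": {"AMP"}},
}

# The same origins with only their home pages, as in earlier crawls.
home_only = {origin: {"home": pages["home"]} for origin, pages in origins.items()}
```

Here `adoption(home_only)` finds no AMP origins, while `adoption(origins)` finds two. React stays at one origin even though it was detected on both pages of the first origin.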

Percentage of AMP origins having good CWV

This change in adoption may also result in a small but noticeable change in the overall performance of that technology. The chart above shows the performance of websites using AMP over time. Notice that there was an uptick in performance that actually started in April and continued through May.

Number of AMP origins

We can change the chart’s metric to show the total number of origins, and there is clearly a massive increase in adoption. It seems that home pages are less likely to be built with AMP, so by testing beyond home pages we’re able to massively improve the accuracy of the dataset. It’s nice that we don’t see any huge jumps in CWV performance, meaning that the home page data was still a representative sample for the most part. Let us know if you see any unexpected results.

Let us know what you think of all the new changes!

Rick

imkevdev · July 1, 2022, 10:29am

Great stuff! It would be great if the CWV Tech Report could also show the ratio of origins with a technology to the total number of origins. With the changes to CrUX, the increased number of origins, and the addition of secondary pages in HTTP Archive, it may not be obvious whether the increase in AMP is because of secondary pages using AMP or just an increase in the absolute number of origins because the dataset grew.

Perhaps in the case of AMP it may be easy to understand, but it may not be as obvious for all technologies, e.g. React.

rviscomi · July 2, 2022, 3:22pm

Thanks @imkevdev! Getting # origins as a relative % could be useful. Here’s one way to do it manually:

1. Select the technologies you’re interested in, plus ALL (ALL, React, AMP)
2. Switch the chart to # origins
3. In the three-dot menu of the chart, select Export > Google Sheets (or CSV if that’s your thing)
4. Manually calculate the relative % by dividing each technology by the ALL total
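If you’d rather script the last step, a minimal Python sketch could look like this. The function name is hypothetical and the example counts in the usage note are made up, not values from the report.

```python
def relative_adoption(origin_counts: dict[str, int], total_key: str = "ALL") -> dict[str, float]:
    """Express each technology's origin count as a percentage of the total."""
    total = origin_counts[total_key]
    if total == 0:
        raise ValueError("total number of origins must be non-zero")
    return {
        tech: 100 * count / total
        for tech, count in origin_counts.items()
        if tech != total_key
    }
```

For example, `relative_adoption({"ALL": 1000, "React": 80, "AMP": 25})` gives 8.0 for React and 2.5 for AMP.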

Here are the corresponding charts for AMP and React:

Interestingly, we can see there were two major events for each technology:

AMP
- Adoption increased August 2020, maybe a detection change?
- Adoption increased May 2022, due to secondary page detections
React
- Adoption dropped July 2020
- Adoption recovered July 2021 (I think both the drop and the recovery were due to a bug and its fix in Wappalyzer)

I’ll look into adding this as a standard feature of the CWV Tech Report. Meanwhile we have this manual workaround.

Eventually, I think the most useful thing would be to move the dashboard off of Data Studio and rebuild it as a web app somewhere on httparchive.org. That’ll give us full flexibility to do things like normalizing adoption and even CWV performance trends in a toggle-able way.
