Improving the HTTP Archive pipeline and dataset by 10x

Hi everyone,

We’re making some major changes to the pipeline and dataset, so if you analyze HTTP Archive data on BigQuery or browse stats on our website or in the Web Almanac, continue reading.

HTTP Archive is over 11 years old and has gone through a few stages of evolution. You can see just how much it has grown since the first crawl of 16k pages in 2010, all the way up to over 8M pages in the latest dataset.

The biggest change to the corpus came in 2018 when we switched from Alexa to the Chrome UX Report as the source for our test URLs. This required a big investment in our infrastructure to scale our capacity up from around 1M pages per month to 10M.

Now we’re ready to make the next leap to scale our infrastructure to 100M pages per month! :rocket:

Don’t we already have enough capacity?

Not really! As you can tell from the chart above, the number of websites in the upstream CrUX corpus has been growing steadily, especially for mobile websites. We’re approaching our ceiling of 10M pages, which would mean that we’d need to start excluding sites from our crawl in order to complete in the monthly time frame. Doing so would hamper our ability to track the state of the web by losing out on valuable data from the long tail.

We’re not running all of our capabilities at 100% either. Specifically, Lighthouse auditing is only run on mobile pages because of these capacity limitations. We’d love to have richer insights into how desktop pages could be optimized by running them through Lighthouse.

Another oft-cited limitation of HTTP Archive has been our inability to crawl beyond home pages. This has introduced biases into our results, because the web goes so much deeper than home pages. We’d love to improve the quality of our dataset by also crawling secondary pages linked from those home pages.

So in order to keep pace with the growth of the web and provide the best quality data, we need to expand our capacity.

How are we going to get 10x capacity?

The “legacy” pipeline built on PHP and MySQL has been serving us well for 11+ years and is finally getting a well-deserved retirement. Maintaining that code and the physical servers it lives on is becoming infeasible at the scale we’re reaching for, so we’ll be turning to more cloud-based infrastructure.

We’ve used Google Cloud Platform (GCP) to supplement our infrastructure needs, thanks to Google’s generous sponsorship of the HTTP Archive project. In order to sustain 100M pages each month, we’ll need to increase our consumption of GCP resources. While it’d be nice to turn a dial and easily increase capacity, the truth is that the entire pipeline needs to be rebuilt to successfully operate at that scale.

Thankfully, Google has also awarded HTTP Archive a grant of $40,000 to invest in the engineering efforts to support this migration. We’ll be working with a contractor to rebuild the pipeline using the GCP tech stack.

So we’re decommissioning our old servers and rebuilding the data pipeline to be able to achieve 10x capacity.

How will this affect the way you access HTTP Archive data on BigQuery and on the web?

Regrettably, querying HTTP Archive data ain’t cheap. BigQuery users get a free quota of 1 TB of processing per month, but in many cases that’s not enough to answer basic questions without breaking the bank. And that’s at the current capacity, let alone at 10x.

To avoid burdening our users with additional BigQuery expenses, we’re going to need to rearchitect the way the datasets are organized. We’ve seen some early successes with the annual httparchive.almanac dataset, which takes advantage of BigQuery best practices like partitioning and clustering to ensure that you don’t incur costs for more than you need. We’ll be creating new datasets for the monthly crawls that use a similar config. We still strongly recommend setting up cost controls for BigQuery to avoid any surprise bills.
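As a rough illustration of why partitioning and clustering matter, here’s a sketch of the kind of query that stays cheap (the table and crawl date are from the existing almanac dataset, but treat the exact column layout as an assumption rather than a guarantee):

SELECT
  page,
  url
FROM
  # An almanac-style table, partitioned by date and clustered by client.
  `httparchive.almanac.requests`
WHERE
  # Filtering on the partition column prunes whole partitions...
  date = '2021-07-01' AND
  # ...and filtering on the clustering column prunes blocks within them,
  # so far fewer bytes are scanned (and billed) than a full-table scan.
  client = 'desktop' AND
  NET.REG_DOMAIN(url) = 'example.com'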

BigQuery becomes unsustainably expensive when querying tables with enormous payloads. The response_bodies dataset enables incredibly powerful analyses by querying the raw resource data. The lighthouse and pages datasets contain huge JSON blobs containing thousands of data points each. But these datasets are already prohibitively expensive, weighing in at 1 TB or more (sometimes much more). If we were to crawl 10x as many pages, the entry-level cost to query one of these tables would be at least $50, which is unacceptably high.

To make response bodies more efficient, we’re encouraging everyone to contribute to our custom metrics effort, which we use to evaluate the contents of the page at runtime and emit summary statistics. There are other benefits to doing more work in custom metrics, like being able to query the rendered DOM. This doesn’t work for historical data, but it does make it much cheaper and easier to query going forward.
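To sketch what that looks like on the querying side: the custom metric does the heavy lifting at crawl time, and the query only has to pull a small value out of the resulting JSON. The `custom_metrics` column and the `example_metric` name below are hypothetical, purely to illustrate the shape of such a query:

SELECT
  page,
  # Pull a single pre-computed value out of the custom metrics JSON
  # instead of scanning and parsing raw response bodies.
  SAFE_CAST(JSON_VALUE(custom_metrics, '$.example_metric.count') AS INT64) AS example_count
FROM
  # Hypothetical new-schema table; see the experimental example later
  # in this thread.
  `httparchive.all.pages_copy`
WHERE
  date = '2022-05-01' AND
  client = 'desktop'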

To make existing datasets more efficient, we’re proposing a new schema in which high-value stats are extracted from the monolithic JSON objects and surfaced in top-level, JSON-encoded summarization fields. Since you only incur costs for the bytes you process, these smaller fields should be much cheaper to query.
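For example, a query under the proposed schema might only touch a small summary field rather than the full payload. This is a hypothetical sketch; the `summary` field name and its keys are placeholders, not a final schema:

SELECT
  page,
  # Read one value from the compact summary field; the monolithic
  # payload column is never scanned, so the bytes billed stay small.
  JSON_VALUE(summary, '$.bytesTotal') AS bytes_total
FROM
  # Placeholder for the proposed `pages` table in the new dataset.
  `httparchive.all.pages`
WHERE
  date = '2022-05-01' AND
  client = 'desktop'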

Crawling beyond home pages also comes with unique challenges when it comes to dataset organization. Up until now, queries have been written under the assumption that there is always one page per website. But in a world with secondary pages, we need a way to distinguish them from root pages. This is also something we’ll address with the new schema.
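As a sketch of how that could look in practice, using the `is_root_page` flag from the experimental new-schema table shown later in this thread (field and table names aren’t final):

SELECT
  # TRUE for home pages, FALSE for secondary pages.
  is_root_page,
  COUNT(0) AS pages
FROM
  `httparchive.all.pages_copy`
WHERE
  date = '2022-05-01' AND
  client = 'desktop'
GROUP BY
  is_root_page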

The flip side to this is that it would be great if the community could find answers to their questions without having to write any SQL or worry about costs. This is more of a long-term goal, but we’d love to make better use of the httparchive.org website to surface stats and trends of interest. The Web Almanac effort, which we revisit on an annual basis, has been a great way for us to figure out what kinds of questions are worth asking. The dream is to effectively make httparchive.org an evergreen dashboard of stats, with the Web Almanac continuing to exist as an annual snapshot with expert-curated reporting.

When is this happening?

It’s already happening! :tada:

Thanks to some of our capacity improvements, you may have noticed that instead of taking ~4 weeks to finish, recent crawls have been available on BigQuery after only a matter of days.

This month (May 2022) we’ll be switching over to our new pipeline to generate the summary_pages and summary_requests tables, deprecating the legacy pipeline. There may be some discrepancies in the way summary stats are calculated, so if you notice anything amiss, please let us know by filing an issue in our data-pipeline repository.

We don’t plan to make any breaking changes to the datasets without an announcement like this first, so make sure you subscribe to the Announcements channel for updates. We’ll support both old and new BigQuery schemas for a period of time before transitioning over.


The HTTP Archive team is working really hard to maintain the best public dataset on the state of the web. We’ll try to keep the disruptions to a minimum, but thank you all for bearing with us as we pull off this tricky migration. Hopefully it unlocks many powerful new capabilities that help you discover even higher quality insights.

Let us know in the comments, in the data-pipeline repository, or in the #httparchive Slack if you have any feedback, suggestions, or issues.

Rick


One side effect of this migration that’s also worth mentioning is that we’re planning to shut down legacy.httparchive.org. We deprecated it in 2018 and have been putting off shutting it down for as long as we could, but with this major infrastructure change it’s time to pull the plug.

What’s going away? The legacy.httparchive.org website will stop serving traffic as early as 90 days from now. Until then, you’ll get an annoying upgrade prompt :smile:. The downloadable CSV and database dumps will not be updated after April 2022 and will be completely inaccessible after the website stops serving traffic. All of the raw data will continue to exist in Google Cloud Storage, and you can export BigQuery tables in CSV format if needed.
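For anyone who relied on the CSV dumps, one option is BigQuery’s EXPORT DATA statement, which writes query results to a Cloud Storage bucket you own. A sketch (the bucket path is a placeholder, and the query counts against your BigQuery quota like any other):

EXPORT DATA OPTIONS (
  # Replace with a bucket you control; the wildcard in the path is required.
  uri = 'gs://your-bucket/httparchive/2022_04_01_desktop_*.csv',
  format = 'CSV',
  overwrite = true,
  header = true
) AS
SELECT
  *
FROM
  `httparchive.summary_pages.2022_04_01_desktop`
WHERE
  NET.REG_DOMAIN(url) = 'example.com'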

Yes! Can’t wait to learn more on this…


Hi Rick,

I can understand the desire to move away from the legacy database structure: it would probably be fine for individual site reports but the derived metrics certainly would put strain on it. In addition, using PHP for batch processing has left the code fairly hard to maintain.

As I continue to use the site reports for work, I’d like to know more about either doing the work directly in BigQuery or exporting as CSV to match my existing pipeline. I have a limited set of URLs I’m interested in and a date range of either the last month or the last 13 months, depending on the “cost”. Is it fairly straightforward, and what’s the approximate cost for the datasets for, say, 100 URLs for 2022-05?

Hi @charlie.clark. Here’s an example query to extract the ~200 URLs on the google.com domain from the most recent desktop summary pages table:

SELECT
  *
FROM
  `httparchive.summary_pages.2022_04_01_desktop`
WHERE
  NET.REG_DOMAIN(url) = 'google.com'

The total number of bytes processed is 4.12 GB. BigQuery gives everyone a 1 TB/month free quota, so you could comfortably query the dataset without incurring any costs.

As I mentioned, we’ll be changing the database schema soon, so here’s a preview of how that might work:

SELECT
  page,
  # Summary data not yet available,
  # so querying the comparable `metadata` field.
  metadata
FROM
  # This is just a temporary copy of the table.
  # The final name will be `pages`.
  `httparchive.all.pages_copy`
WHERE
  date = '2022-05-01' AND
  client = 'desktop' AND
  is_root_page AND
  NET.REG_DOMAIN(page) = 'google.com'

This experimental table also returns ~200 URLs from the google.com domain. The number of bytes processed is similarly small, at only 1.49 GB.

So it’s fairly straightforward and should not exceed the free quota.

Hi Rick,

Looks fairly straightforward. Presumably, I can pass in a list of URLs. I’ll give it a spin later.

Charlie


I’ve now had time to have a go at creating my reports using the datasets in May and June. Unfortunately, I’m a little confused. Firstly, there are dumps in May from 2022-05-01 and 2022-05-12, both of which have the label “May 12 2022”, but they seem incomplete: I have a list of 60 domains that I use for comparison and I’m not getting all of them. Furthermore, the dump from 2022-05-12 seems to contain secondary pages. Am I doing something wrong? Has the data moved? Is there a way to check for home pages versus secondary pages? Also, for historical purposes, it would be useful to continue incrementing the crawlid for each report.


I’ve also noticed that there may be duplicates; I’ve found at least one in 2022-06-01 (actually 2022-06-09). I can avoid this using DISTINCT, but I don’t think it should be happening.
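For reference, something along these lines works as a stopgap (just the page URLs here, so DISTINCT is cheap; the table name is the experimental one from earlier in the thread, purely for illustration):

SELECT DISTINCT
  page
FROM
  `httparchive.all.pages_copy`
WHERE
  date = '2022-06-01' AND
  client = 'desktop'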


Thanks @charlie.clark. Could you file issues for anything unexpected on GitHub?