Quickstart guide to exploring the HTTP Archive

Curious about how the web is built and how it is evolving over time? You’re in the right place.

Every month, HTTP Archive crawls the top sites on the web and records everything it sees: request and response headers for every request on each page, timing and profiling data, and even the response bodies for text-based resources. To learn more about this process, check out the FAQ. This data is what powers the trends and insights you find on this site, and it is also available for you to query and explore.

Ready to get started? Let’s explore the available options for analyzing this data…

Visit HTTP Archive stats and trends

As a first step, check out the precomputed summary reports in your browser:

  • Stats report provides a per-crawl view of how web pages are constructed: their size, number of images, use of web fonts, and so on.
  • Trends report provides a time-series view of how key aspects of those pages (the ones highlighted in the stats report) are changing over time.

The data exposed in the above reports is not an exhaustive list of available metrics. In fact, there is a lot more to explore, and if you can’t find what you’re looking for, or want to dig deeper into a particular trend or metric, you can do so via one of the following routes.

Explore the HTTP Archive dataset

HTTP Archive is powered by WebPageTest. Each page is crawled using the WPT agent, and the results, which include request and response headers, text response bodies, and profiling data, are archived. This yields ~1 TB of data per crawl. There are two options to access this data for your own analysis:

  1. Download it for local analysis.
  2. Use one of the cloud tools below to explore the data.

The former allows you to use your own local tools to explore the data, but comes with high upfront download and setup costs. The latter provides easy and instant “cloud access” to the data, but requires a Google developer account and, depending on the amount and type of analysis you run, may incur extra costs.

Exploring via BigQuery

All of the HTTP Archive crawl data is available on Google BigQuery, a data warehouse and engine for large-scale data analytics.

There are two datasets that you can query:

  1. https://bigquery.cloud.google.com/dataset/httparchive:har - these tables contain the full denormalized HAR output for every page in the crawl. Navigation requests are stored in the pages table, and subresource requests are recorded as separate rows in the corresponding requests table.
  • Take a look at the pages and requests table schemas. Tip: click “preview” to see example data.
  • Note that this dataset keeps most of its data inside payload fields that contain JSON data. This is intentional, to allow the schema to change over time. To access fields inside these payloads, use the JSON functions provided by BigQuery (see the example after this list).
  2. https://bigquery.cloud.google.com/dataset/httparchive:runs - these tables contain summary reports for each crawl, with a collection of precomputed metrics.
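
To make this concrete, here is a minimal Python sketch that pulls a value out of the payload JSON using the google-cloud-bigquery client. The table name, crawl date, and JSON path below are illustrative assumptions, not the exact schema; use the “preview” tab to find the real field names:

```python
# Sketch: extract a value from the JSON payload of a HAR pages table.
# The table name, crawl date, and JSON path are illustrative; check the
# schema "preview" for the actual field names and available crawls.
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")  # assumes a GCP project

query = """
SELECT
  url,
  JSON_EXTRACT_SCALAR(payload, '$.startedDateTime') AS started  -- hypothetical path
FROM `httparchive.har.2017_09_01_chrome_pages`  -- illustrative table name
LIMIT 10
"""

for row in client.query(query).result():
    print(row.url, row.started)
```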

For some fun and insightful examples of using BigQuery, check out the many hands-on examples in our Analysis section and the lightning talk from the Velocity conference.

Exploring with Datalab

If you want to go beyond running a simple query and compose a report or do more in-depth analysis, consider using Cloud Datalab: you can query BigQuery from a Jupyter notebook, then crunch and visualize the results with pandas, matplotlib, and other popular tools.
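
As a rough sketch of that workflow, the snippet below pulls page weights from a summary table into a pandas DataFrame and plots their distribution. The table and column names are assumptions based on the runs dataset described above; verify them against the actual schemas:

```python
# Notebook-style sketch: query a summary table into pandas and plot.
# Table and column names are illustrative; verify them against the
# httparchive:runs schemas before running.
import pandas as pd
import matplotlib.pyplot as plt

query = """
SELECT bytesTotal / 1024 AS kbytes
FROM [httparchive:runs.2017_09_01_pages]
LIMIT 10000
"""

df = pd.read_gbq(query, project_id="your-project-id", dialect="legacy")
df["kbytes"].plot.hist(bins=50)
plt.xlabel("Total page weight (KB)")
plt.show()
```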

Demo: https://github.com/HTTPArchive/bigquery/blob/master/datalab/histograms.ipynb

Crunching data with Dataflow

The most involved—and the most powerful—way to analyze the full HTTP Archive dataset is via Dataflow, which enables you to write and run custom ETL and batch computation jobs in the cloud. For example, HTTP Archive uses Dataflow to load the raw HAR archives produced by WebPageTest into BigQuery.
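
To give a sense of the shape of such a job, here is a minimal Apache Beam sketch in Python. The bucket, dataset, and field names are hypothetical placeholders; the actual pipeline HTTP Archive runs is in the repo linked below:

```python
# Minimal Apache Beam sketch: read JSON-encoded HAR records from Cloud
# Storage and load a couple of fields into BigQuery. All names below
# (bucket, dataset, fields) are hypothetical placeholders.
import json
import apache_beam as beam

def to_row(line):
    # Parse one JSON record and keep the page URL plus the raw payload.
    har = json.loads(line)
    return {"url": har.get("url"), "payload": line}

# Runs locally by default; pass --runner=DataflowRunner (plus project and
# staging options) to execute the same pipeline on Cloud Dataflow.
with beam.Pipeline() as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://your-bucket/har/*.json")
     | "Parse" >> beam.Map(to_row)
     | "Load" >> beam.io.WriteToBigQuery(
         "your-project:your_dataset.pages",
         schema="url:STRING,payload:STRING"))
```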

Demo: https://github.com/HTTPArchive/bigquery/tree/master/dataflow
