For those just getting started with these datasets, here’s a quick introduction.
Read @paulcalvano’s excellent Getting Started with BigQuery guide to get your environment set up. Paul has also completed a few sections of his Guided Tour, a work in progress that goes into detail on each dataset.
HTTP Archive
HTTP Archive is a monthly dataset of how the web is built, containing metadata from ~5 million web pages. This is considered “lab data” in that the results come from a single test of a page load, but those results contain a wealth of information about the page.
The March 2020 dataset is available in tables named with the “2020_03_01” date and a “desktop” or “mobile” suffix. For example, the 2020_03_01_mobile table of the summary_pages dataset contains summary data about 5,484,239 mobile pages.
Note that the table name corresponds with March 1, 2020 but the tests took place throughout the month of March.
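That naming convention can be sketched with a small helper (the function name is mine, for illustration only):

```python
from datetime import date

def table_name(d: date, client: str) -> str:
    """Build an HTTP Archive table name like '2020_03_01_mobile'."""
    # Tables are named for the first of the month, even though the tests
    # run throughout that month; client is 'desktop' or 'mobile'.
    return f"{d.year}_{d.month:02d}_01_{client}"

print(table_name(date(2020, 3, 1), "mobile"))  # 2020_03_01_mobile
```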
Other datasets exist with different kinds of information about each page:
Dataset | Description |
---|---|
pages | JSON data about each page including loading metrics, CPU stats, optimization info, and more |
requests | JSON data about each request including request/response headers, networking info, payload size, MIME type, etc. |
response_bodies | (very expensive) Full payloads for text-based resources like HTML, JS, and CSS |
summary_pages | A subset of high-level stats per page |
summary_requests | A subset of high-level stats per request |
technologies | A list of which technologies are used per page, detected by Wappalyzer |
blink_features | A list of which JavaScript, HTML, or CSS APIs are used per page |
lighthouse | (mobile only) Full JSON Lighthouse report containing audit results in areas of accessibility, performance, mobile friendliness, SEO, and more |
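To tie the table names and datasets together, here’s a hedged sketch of composing a query string against one of these tables. The `httparchive.<dataset>.<table>` layout follows the conventions described above, but verify it in the BigQuery console before running a query:

```python
def count_pages_query(dataset: str, table: str) -> str:
    # Hypothetical helper: compose a fully-qualified table reference of
    # the form httparchive.<dataset>.<table> and count its rows.
    return f"SELECT COUNT(0) AS pages FROM `httparchive.{dataset}.{table}`"

print(count_pages_query("summary_pages", "2020_03_01_mobile"))
```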
Chrome UX Report
The Chrome UX Report is a monthly dataset of how the web is experienced. This is considered “field data” in that it is sourced from real Chrome users. Data in this project encapsulates the user experience with a small number of metrics, including time to first byte, first paint, first contentful paint, largest contentful paint, DOM content loaded, onload, first input delay, cumulative layout shift, and notification permission acceptance rates. You can query the data by origin (website), month, country, form factor (desktop/mobile/tablet), and effective connection type (4G, 3G, 2G, slow 2G, offline).
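For illustration, CrUX tables can be referenced by month and dataset. This tiny helper is my own sketch, based on the `chrome-ux-report` project’s `<dataset>.<YYYYMM>` naming convention; double-check the exact names in the BigQuery console:

```python
def crux_table(yyyymm: int, dataset: str = "all") -> str:
    # Hypothetical helper: the global dataset is 'all'; country-specific
    # datasets use codes like 'country_us'. Months are formatted YYYYMM.
    return f"`chrome-ux-report.{dataset}.{yyyymm}`"

print(crux_table(202002))                # `chrome-ux-report.all.202002`
print(crux_table(202002, "country_us"))  # `chrome-ux-report.country_us.202002`
```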
The most recent dataset is 202002 (February 2020). The March 2020 dataset will be released on April 14 and will include user experience data for the full calendar month of March.
Data for each metric is organized as a histogram, so you can measure the percent of user experiences for a given range of times, for example, how often users experience TTFB between 0 and 200 ms. If you want fine-grained control over these ranges, you can query the all or country-specific datasets. If you want to query these ranges over time, you should use the experimental dataset, which is optimized for month-to-month analysis. See this post for more info.

If you don’t need fine-grained control of the histogram ranges, we summarize key “fast”, “average”, and “slow” percentages in the materialized dataset, which is also optimized for month-to-month analysis.
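As a rough illustration of reading such a histogram, summing bin densities gives the share of experiences that fall in a range. The bin boundaries and densities below are made-up values for the sketch, not real CrUX data:

```python
# Each bin gives the share ("density") of page loads whose TTFB fell
# in [start, end) milliseconds. Illustration values only.
ttfb_bins = [
    {"start": 0,   "end": 200,  "density": 0.35},
    {"start": 200, "end": 500,  "density": 0.40},
    {"start": 500, "end": None, "density": 0.25},  # open-ended final bin
]

def share_between(bins, lo, hi):
    """Sum the densities of bins that lie entirely within [lo, hi)."""
    return sum(
        b["density"]
        for b in bins
        if b["start"] >= lo and b["end"] is not None and b["end"] <= hi
    )

# e.g. how often users experienced TTFB between 0 and 200 ms:
print(share_between(ttfb_bins, 0, 200))  # 0.35
```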
For even more info, check out the Web Almanac methodology for an in-depth explanation of how this transparency data is used for a large-scale research project.