[NOTICE] Reorganizing the har and runs datasets on BigQuery

This is a notice that we’re making changes to the BigQuery datasets. If you run queries against the HTTP Archive, especially automated ones, please read on.

In short, we’re moving from a few datasets with many tables to more datasets with fewer tables each. The har and runs datasets will no longer be updated. The data itself is not changing, only how it’s organized.

More specifically, the har dataset will be broken up into: lighthouse, pages, requests, and response_bodies. The runs dataset will be broken up into: summary_pages, summary_requests.

There are a few motivations for this change:

  • Make navigating BigQuery easier. By having more top-level datasets, you can find what you’re looking for without all the scrolling. This is especially important because with each crawl the datasets are only getting larger!
  • Support more advanced queries like table wildcards. These wouldn’t work reliably with heterogeneous datasets. See an example of the beta site making use of this feature.
  • Make the naming schemes of the tables more intuitive/consistent. For example, instead of requests_bodies the dataset will be response_bodies. And going forward, we will always annotate the client name in the table as either _desktop or _mobile. Today the runs dataset uses no suffix at all for desktop data but does use _mobile, while the har dataset uses _chrome and _android respectively. This change makes sure everyone is using the same suffixes all the time.
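To make the table wildcard point concrete, here is a minimal sketch of how a wildcard query against one of the new datasets could be assembled. It only builds the query text and does not run it. The project name `httparchive`, the table name pattern `<YYYY_MM_DD>_<client>`, and the helper name `wildcard_query` are assumptions for illustration; `_TABLE_SUFFIX` and `ENDS_WITH` are standard BigQuery SQL features.

```python
# Dataset names taken from the announcement above.
NEW_DATASETS = {
    "lighthouse", "pages", "requests", "response_bodies",
    "summary_pages", "summary_requests",
}


def wildcard_query(dataset: str, client: str, year: str) -> str:
    """Build a standard SQL query that counts rows per crawl table.

    Assumes tables are named <YYYY_MM_DD>_<client> inside a project
    called `httparchive` (both are assumptions, not confirmed here).
    """
    if dataset not in NEW_DATASETS:
        raise ValueError(f"unknown dataset: {dataset}")
    if client not in ("desktop", "mobile"):
        raise ValueError("client must be 'desktop' or 'mobile'")
    # The trailing * matches every table whose name starts with the year;
    # _TABLE_SUFFIX then holds the part of the name matched by the *.
    return (
        "SELECT _TABLE_SUFFIX AS crawl, COUNT(0) AS num_rows\n"
        f"FROM `httparchive.{dataset}.{year}_*`\n"
        f"WHERE ENDS_WITH(_TABLE_SUFFIX, '_{client}')\n"
        "GROUP BY crawl\n"
        "ORDER BY crawl"
    )


if __name__ == "__main__":
    print(wildcard_query("pages", "desktop", "2018"))
```

Because every table in a dataset now holds the same kind of data, a single wildcard can span all crawls, and the consistent _desktop/_mobile suffix is enough to pick out one client.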

Here’s a screenshot of how the most recent crawl is organized in BigQuery:


You can see that each type of table in har will now have its own specialized dataset (lighthouse, pages, requests, response_bodies), and the same goes for runs (summary_pages, summary_requests).

Starting with the 2018_02_15 crawl, tables will only be added to the new datasets. har and runs will no longer be updated. In about 3 months, once everyone is used to the new datasets, we’ll remove har and runs entirely. A separate announcement will follow.
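If you have automated queries to migrate, a small helper along these lines could translate old table references into the new layout. This is only a sketch under stated assumptions: the old name formats `har.<YYYY_MM_DD>_<chrome|android>_<type>` and `runs.<YYYY_MM_DD>_<pages|requests>[_mobile]`, the new format `<dataset>.<YYYY_MM_DD>_<desktop|mobile>`, and the function name `new_table_name` are all hypothetical. It also assumes the corresponding table actually exists in the new dataset, which you should check for crawls before 2018_02_15.

```python
import re

# har table type -> new dataset (requests_bodies is renamed).
HAR_DATASETS = {
    "pages": "pages",
    "requests": "requests",
    "requests_bodies": "response_bodies",
    "lighthouse": "lighthouse",
}
# har client label -> new client suffix.
HAR_CLIENTS = {"chrome": "desktop", "android": "mobile"}

# Assumed old naming patterns (not confirmed by the announcement).
_HAR_RE = re.compile(
    r"(\d{4}_\d{2}_\d{2})_(chrome|android)_"
    r"(pages|requests_bodies|requests|lighthouse)"
)
_RUNS_RE = re.compile(r"(\d{4}_\d{2}_\d{2})_(pages|requests)(_mobile)?")


def new_table_name(old: str) -> str:
    """Translate an old 'dataset.table' reference to the new layout."""
    dataset, _, table = old.partition(".")
    if dataset == "har":
        m = _HAR_RE.fullmatch(table)
        if m:
            date, client, kind = m.groups()
            return f"{HAR_DATASETS[kind]}.{date}_{HAR_CLIENTS[client]}"
    elif dataset == "runs":
        m = _RUNS_RE.fullmatch(table)
        if m:
            date, kind, mobile = m.groups()
            # runs had no suffix for desktop and _mobile for mobile.
            client = "mobile" if mobile else "desktop"
            return f"summary_{kind}.{date}_{client}"
    raise ValueError(f"unrecognized table reference: {old}")


if __name__ == "__main__":
    for name in ("har.2018_02_01_chrome_pages", "runs.2018_02_01_requests_mobile"):
        print(name, "->", new_table_name(name))
```

Raising on anything unrecognized is deliberate: a query that silently keeps pointing at har or runs would start failing once those datasets are removed.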

For even more context on this change, you can read the design doc.

As a heads up, make sure to click on the screenshot above! The preview cuts off the remainder of the available tables.

Hi igrigorik, I used to be able to look up a website and see which CDN it used, as in the Wayback Machine. Is this still possible?
Cheers