If you want to use your own local tools to explore the HTTP Archive dataset, you’ll need a local copy of the relevant data. Before you rush ahead, note that the datasets can be very large (e.g. >1TB), so plan accordingly and read through the available options to determine the best one for your particular case.
Downloading the full HAR files
You can download the individual HAR files for each and every site crawled by HTTP Archive. Each HAR file contains the full log of the navigation, all of the associated metadata, and even the response bodies for text content-types (e.g. HTML, CSS, JavaScript).
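Once a HAR file is on local disk, it can be inspected with ordinary JSON tooling. The following is a minimal sketch, not part of HTTP Archive's own tooling: it assumes the file follows the standard HAR layout (`log.entries[].response.content` with `mimeType` and `size` fields), and the function names and the in-memory `SAMPLE_HAR` document are hypothetical, used only for illustration.

```python
import json
from collections import defaultdict


def summarize_har(har):
    """Count requests and sum response content sizes per MIME type."""
    totals = defaultdict(lambda: {"requests": 0, "bytes": 0})
    for entry in har.get("log", {}).get("entries", []):
        content = entry.get("response", {}).get("content", {})
        # Drop parameters such as "; charset=utf-8" from the MIME type.
        mime = (content.get("mimeType") or "unknown").split(";")[0].strip().lower()
        size = content.get("size") or 0
        totals[mime]["requests"] += 1
        # Treat missing or negative sizes as zero bytes.
        totals[mime]["bytes"] += max(size, 0)
    return dict(totals)


def summarize_har_file(path):
    """Load one HAR file from disk and summarize it."""
    with open(path, encoding="utf-8") as f:
        return summarize_har(json.load(f))


# Hypothetical, hand-written HAR fragment used only to exercise the function.
SAMPLE_HAR = {
    "log": {
        "entries": [
            {"response": {"content": {"mimeType": "text/html; charset=UTF-8", "size": 1200}}},
            {"response": {"content": {"mimeType": "text/css", "size": 300}}},
            {"response": {"content": {"mimeType": "text/css", "size": 200}}},
        ]
    }
}

if __name__ == "__main__":
    for mime, stats in sorted(summarize_har(SAMPLE_HAR).items()):
        print(mime, stats["requests"], stats["bytes"])
```

For a downloaded file, `summarize_har_file("some-page.har")` would apply the same logic; the filename here is a placeholder, and real HAR files may carry additional or missing fields that this sketch does not handle.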
Demo HAR file for google.com, as recorded by HTTP Archive on 01/01/2015:
To view the available HAR datasets for download:
- https://console.cloud.google.com/storage/browser/httparchive?prefix=android
- https://console.cloud.google.com/storage/browser/httparchive?prefix=chrome
To download the full list of mobile (Chrome for Android) HAR files for a particular run:
$> gsutil -m rsync gs://httparchive/android-Jan_1_2016 .
Note: the denormalized HAR data is also available via BigQuery: httparchive:har dataset.
Adjust the above bucket name to match the dataset you would like to sync. Also, do keep in mind that you might be downloading hundreds of thousands of individual HAR files to local disk - plan accordingly!
Downloading the summary tables
HTTP Archive builds a set of summary tables from the above HAR dataset. These tables contain per-page aggregate statistics that are used to power the “trends” and “stats” pages on the site. You can download them in both MySQL and CSV formats.
Note: the summary tables contain only a subset of the data in the HAR files, but still weigh in at ~5GB.
Selective export via BigQuery
If you’re only interested in downloading a subset of the available data, consider using BigQuery to select and export the relevant subset. For example, you can export a subset of the available fields, or a subset of the sites (e.g. top 10K), and then use local tools to run your analysis.
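As an illustration of the "local tools" step, here is a minimal sketch that computes a median over one column of an exported CSV. It assumes the export has a header row; the column names `url` and `bytesTotal`, the function name, and the inline `SAMPLE_EXPORT` data are hypothetical stand-ins for whatever fields you actually selected.

```python
import csv
import io
import statistics


def median_of_column(csv_file, column="bytesTotal"):
    """Return the median of an integer column, or None if no values are present."""
    reader = csv.DictReader(csv_file)
    # Skip rows where the column is missing or empty.
    values = [int(row[column]) for row in reader if row.get(column)]
    return statistics.median(values) if values else None


# Hypothetical export with made-up values, used only to exercise the function.
SAMPLE_EXPORT = (
    "url,bytesTotal\n"
    "http://a.example/,1000\n"
    "http://b.example/,3000\n"
    "http://c.example/,2000\n"
)

if __name__ == "__main__":
    print(median_of_column(io.StringIO(SAMPLE_EXPORT)))
```

With a real export, pass an open file instead of the `io.StringIO` wrapper, for example `open("export.csv", newline="", encoding="utf-8")`, where the filename is again a placeholder.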