Hi everyone,
For scientific research, we would like to download the HAR or summary data from the httparchive folder, as described in the tutorial.
However, when checking the available datasets for download, the list only goes up to May 2022. Was this stopped somehow, or is the data still available somewhere? I can't use BigQuery to get the data, as the free tier doesn't seem to be enough here.
Thanks in advance!
Hi @researcher
Last year we moved from individual top-level crawl directories to a crawls parent directory. You can access later HARs in GCS here: https://console.cloud.google.com/storage/browser/httparchive/crawls
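Since the bucket is publicly readable, you can also enumerate and fetch objects programmatically instead of clicking through the console. Here's a minimal sketch using only the Python standard library and the public GCS JSON API; the helper names are my own, and the `crawls/` prefix is the one from the bucket above:

```python
import json
import urllib.parse
import urllib.request

BUCKET = "httparchive"

def list_objects_url(prefix, page_token=None):
    """Build a GCS JSON API URL listing objects under a prefix.

    The httparchive bucket is public, so no credentials are needed.
    """
    params = {"prefix": prefix, "maxResults": 100}
    if page_token:
        params["pageToken"] = page_token
    return ("https://storage.googleapis.com/storage/v1/b/"
            f"{BUCKET}/o?" + urllib.parse.urlencode(params))

def download_url(object_name):
    """Direct download URL for a single object in the public bucket."""
    return (f"https://storage.googleapis.com/{BUCKET}/"
            + urllib.parse.quote(object_name))

def list_objects(prefix):
    """Yield object names under a prefix, following pagination."""
    token = None
    while True:
        with urllib.request.urlopen(list_objects_url(prefix, token)) as resp:
            data = json.load(resp)
        for item in data.get("items", []):
            yield item["name"]
        token = data.get("nextPageToken")
        if not token:
            return

# Example (requires network access):
# for name in list_objects("crawls/"):
#     print(download_url(name))
```

For bulk downloads, `gsutil -m cp` against the same `gs://httparchive/crawls` paths is usually faster than fetching objects one by one.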
Hi @rviscomi , I was looking for the same and found your answer here, but I have a follow-up question: is there a way to download the full ‘pages’ data for one month as a single zipped file? I think I remember doing that years ago from some legacy links, but now I only see the option to download HARs as separate objects from the buckets you mentioned above.
Alternatively, is there a way to download e.g. only the HARs for the top 1K pages? I was looking at Minimizing query costs, which suggests using ‘rank=1000’ for this, but it looks like grabbing the payload from BigQuery is still a very expensive query.
Here’s an example query that extracts the HAR info for the top 1000 pages in September:
SELECT
  client,
  page,
  payload AS HAR
FROM
  `httparchive.all.pages`
WHERE
  date = '2023-09-01'
  AND is_root_page
  AND rank = 1000
The actual number of bytes billed for this query is only 941 MB (thanks to the WHERE
clause). I think there’s a bug in BigQuery where the estimate is inaccurate for anyone who isn’t a project editor, but rest assured this is a relatively inexpensive query.
After you’ve run the query, you have a few options to save the data locally or to Google Drive.
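If you'd rather script it than use the console export options, here's a rough sketch of running the same query with the `google-cloud-bigquery` client library and writing the HARs to a local newline-delimited JSON file. It assumes `pip install google-cloud-bigquery` and application-default credentials; the project ID is a placeholder:

```python
def top_pages_query(date, rank=1000):
    """Build the HAR-extraction query for a given crawl date and rank bucket."""
    return f"""
        SELECT client, page, payload AS HAR
        FROM `httparchive.all.pages`
        WHERE date = '{date}'
          AND is_root_page
          AND rank = {rank}
    """

def save_hars(date, out_path, project="your-project-id"):
    """Run the query and write one HAR payload per line to out_path."""
    # Imported lazily so the query builder works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    with open(out_path, "w") as f:
        for row in client.query(top_pages_query(date)):
            f.write(row["HAR"] + "\n")

# Example (requires credentials and billing enabled on your project):
# save_hars("2023-09-01", "hars_2023-09.ndjson")
```

Billing is against your own project, so the same ~941 MB scan applies regardless of how you run the query.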
Indeed, it was the estimate that was scaring me. Thanks for clarifying that!