Hi everyone,
For scientific research, we would like to download the HAR or summary data from the httparchive folder, as described in the tutorial.
However, when checking the available datasets for download, the list only goes up to May 2022. Was this stopped somehow, or is the data still available somewhere? I can't use BigQuery to get the data, as the free tier doesn't seem to be enough here.
Thanks in advance!
Hi @researcher
Last year we moved from individual top-level crawl directories to a crawls parent directory. You can access later HARs in GCS here: https://console.cloud.google.com/storage/browser/httparchive/crawls
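Since the bucket is publicly readable, you can also enumerate and fetch objects programmatically instead of clicking through the console. Here's a minimal sketch using only the Python standard library and the public GCS JSON API; the helper names are my own, and the `crawls/` prefix is the one from the bucket above:

```python
import json
import urllib.parse
import urllib.request

BUCKET = "httparchive"

def list_objects_url(prefix, page_token=None):
    """Build a GCS JSON API URL listing objects under a prefix.

    The httparchive bucket is public, so no credentials are needed.
    """
    params = {"prefix": prefix, "maxResults": 100}
    if page_token:
        params["pageToken"] = page_token
    return ("https://storage.googleapis.com/storage/v1/b/"
            f"{BUCKET}/o?" + urllib.parse.urlencode(params))

def download_url(object_name):
    """Direct download URL for a single object in the public bucket."""
    return (f"https://storage.googleapis.com/{BUCKET}/"
            + urllib.parse.quote(object_name))

def list_objects(prefix):
    """Yield object names under a prefix, following pagination."""
    token = None
    while True:
        with urllib.request.urlopen(list_objects_url(prefix, token)) as resp:
            data = json.load(resp)
        for item in data.get("items", []):
            yield item["name"]
        token = data.get("nextPageToken")
        if not token:
            return

# Example (requires network access):
# for name in list_objects("crawls/"):
#     print(download_url(name))
```

For bulk downloads, `gsutil -m cp` against the same `gs://httparchive/crawls` paths is usually faster than fetching objects one by one.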
Hi @rviscomi , I was looking for the same and found your answer here, but I have a follow-up question: is there a way to download the full ‘pages’ data for one month as a single zipped file? I think I remember doing that years ago from some legacy links, but now I only see the option to download HARs as separate objects from the buckets you mentioned above.
Alternatively, is there a way to download e.g. only the HARs for the top 1K pages? I was looking at Minimizing query costs, which suggests using ‘rank=1000’ for this, but it looks like grabbing the payload from BigQuery is still a very expensive query.
Here’s an example query that extracts the HAR info for the top 1000 pages in September:
SELECT
  client,
  page,
  payload AS HAR
FROM
  `httparchive.all.pages`
WHERE
  date = '2023-09-01'
  AND is_root_page
  AND rank = 1000
The actual number of bytes billed for this query is only 941 MB (thanks to the WHERE
clause). I think there’s a bug in BigQuery where the estimate is inaccurate for anyone who isn’t a project editor, but rest assured this is a relatively inexpensive query.
After you’ve run the query, you have a few options to save the data locally or to Google Drive.
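If you'd rather script it than use the console export options, here's a rough sketch of running the same query with the `google-cloud-bigquery` client library and writing the HARs to a local newline-delimited JSON file. It assumes `pip install google-cloud-bigquery` and application-default credentials; the project ID is a placeholder:

```python
def top_pages_query(date, rank=1000):
    """Build the HAR-extraction query for a given crawl date and rank bucket."""
    return f"""
        SELECT client, page, payload AS HAR
        FROM `httparchive.all.pages`
        WHERE date = '{date}'
          AND is_root_page
          AND rank = {rank}
    """

def save_hars(date, out_path, project="your-project-id"):
    """Run the query and write one HAR payload per line to out_path."""
    # Imported lazily so the query builder works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    with open(out_path, "w") as f:
        for row in client.query(top_pages_query(date)):
            f.write(row["HAR"] + "\n")

# Example (requires credentials and billing enabled on your project):
# save_hars("2023-09-01", "hars_2023-09.ndjson")
```

Billing is against your own project, so the same ~941 MB scan applies regardless of how you run the query.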
Indeed, it was the estimate that was scaring me. Thanks for clarifying that!