For scientific research, we would like to download the HAR or summary data from the httparchive folder, as described in the tutorial. However, when checking the available datasets for download, it only shows downloads up to May 2022. Was this stopped at some point, or is the data still available somewhere? I cannot use BigQuery to get the data, as the free tier does not seem to be enough here.
Thanks in advance!
Last year we moved from individual top-level crawl directories to a `crawls` parent directory. You can access later HARs in GCS here: https://console.cloud.google.com/storage/browser/httparchive/crawls
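As an aside, objects in a public GCS bucket can also be fetched over plain HTTPS at `https://storage.googleapis.com/<bucket>/<object>`, which is handy for scripted downloads without `gsutil`. A small sketch of building such a URL (the object path here is hypothetical; browse the bucket above for real names):

```python
from urllib.parse import quote

def gcs_https_url(bucket: str, object_name: str) -> str:
    """Build the public HTTPS download URL for a GCS object.

    Public objects are served at
    https://storage.googleapis.com/<bucket>/<object>.
    """
    # quote() keeps "/" intact by default, so path segments survive.
    return f"https://storage.googleapis.com/{bucket}/{quote(object_name)}"

# Hypothetical object name -- browse gs://httparchive/crawls for real paths.
url = gcs_https_url("httparchive", "crawls/example.har.gz")
print(url)
```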
Hi @rviscomi, I was looking for the same and found your answer here, but I have a follow-up question: is there a way to download the full ‘pages’ data for one month as a single zipped file? I think I remember doing that years ago from some legacy links, but now I only see the option to download HARs as separate objects from the buckets you mentioned above.
Alternatively, is there a way to download e.g. only the HARs for the top 1K pages? I was looking at Minimizing query costs, which suggests using `rank=1000` for this, but it looks like grabbing the payload from BigQuery is still a very expensive query.
Here’s an example query that extracts the HAR info for the top 1000 pages in September:

```sql
SELECT
  page,
  payload AS HAR
FROM
  -- table name reconstructed; `httparchive.all.pages` is the unified pages table
  `httparchive.all.pages`
WHERE
  date = '2023-09-01'
  AND rank = 1000
```
The actual number of bytes billed for this query is only 941 MB (thanks to the WHERE clause). I think there’s a bug in BigQuery where the estimate is inaccurate for anyone who isn’t a project editor, but rest assured this is a relatively inexpensive query.
After you’ve run the query, you have a few options to save the data locally or to Google Drive.
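Whichever route you take (a BigQuery export or individual HAR files from the GCS bucket), a HAR is just JSON. Here’s a quick sketch of pulling the request URLs out of one, assuming the standard HAR 1.2 `log.entries` layout (the sample payload below is fabricated just to show the shape; note the `pages` table payload may not carry the full entries list — request-level data lives in the requests table):

```python
import json

def har_request_urls(payload: str) -> list[str]:
    """Extract the request URL of every entry from a JSON-encoded HAR.

    Assumes the standard HAR 1.2 shape: {"log": {"entries": [...]}},
    where each entry has entry["request"]["url"].
    """
    har = json.loads(payload)
    entries = har.get("log", {}).get("entries", [])
    return [entry["request"]["url"] for entry in entries]

# Fabricated minimal payload, just to demonstrate the structure:
sample = json.dumps({
    "log": {"entries": [
        {"request": {"url": "https://example.com/"}},
        {"request": {"url": "https://example.com/style.css"}},
    ]}
})
print(har_request_urls(sample))
# -> ['https://example.com/', 'https://example.com/style.css']
```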
Indeed, it was the estimate that was scaring me. Thanks for clarifying that!