Response body is empty in the crawl dataset

jiboncom · August 3, 2025, 4:49pm

I’ve been trying to retrieve the response_body field for certain websites, but I’m running into some issues. For all data before 2015, the content.response_body field consistently returns NULL. After 2015, the field is populated, but the query size grows very quickly:

2012: 2.25 GB
2015: 7.64 GB
2016: 785 GB
2017: 1.48 TB

Here’s the query I’ve been using:

SELECT
    page,
    MIN(`date`) AS date,
    ARRAY_AGG(STRUCT(url, type, is_main_document, response_body)) AS content
FROM
    `httparchive.crawl.requests`
WHERE
    date = '2012-07-01'
    AND client = 'desktop'
    AND type IN ('html', 'script', 'css', 'font')
GROUP BY page

Does anyone know if the response_body field was consistently collected before 2015, or if there’s a better approach for managing the size of these queries for later years?

Best,

Javier

max_ostapenko · August 5, 2025, 7:08pm

Hi Javier,

The response_body data is indeed available starting in 2015.

Currently there is no cost-efficient way to retrieve the response bodies for a particular website from the dataset.

This issue may be related to your question: Cost-efficient alternatives to HAR files · Issue #1092 · HTTPArchive/httparchive.org · GitHub
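
Until there is a cheaper option, a dry run at least shows how much a query would scan before it is billed. Below is a minimal sketch, assuming the google-cloud-bigquery Python client is installed and authenticated. The helper names `estimate_cost_usd` and `dry_run_bytes` are hypothetical, and the price per TiB is an assumed placeholder that should be checked against current BigQuery pricing.

```python
# Assumption: on-demand rate in USD per TiB; check current BigQuery pricing.
ASSUMED_USD_PER_TIB = 6.25


def estimate_cost_usd(bytes_processed, usd_per_tib=ASSUMED_USD_PER_TIB):
    """Convert a scanned-bytes figure into an approximate on-demand cost."""
    return bytes_processed / 2**40 * usd_per_tib


def dry_run_bytes(sql):
    """Return the number of bytes a query would scan, without running it.

    Needs the google-cloud-bigquery package and default credentials,
    so the import is kept local to this function.
    """
    from google.cloud import bigquery

    client = bigquery.Client()
    # dry_run=True validates the query and reports its size only.
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)
    return job.total_bytes_processed
```

Calling `estimate_cost_usd(dry_run_bytes(sql))` with the query text gives a rough figure for a single crawl date before committing to the full scan. A dry run does not reduce what the query scans; it only makes the size visible up front.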

jiboncom · August 5, 2025, 8:30pm

Hi Max,

Thanks for the information and the link to the issue, that’s exactly our problem. I have made a suggestion there to make things a bit easier and cheaper.

Thanks again!

jiboncom · August 11, 2025, 9:03am

Hi @max_ostapenko,
Can you help me out with two more things? I tried querying `httparchive.blink_features.usage` and couldn’t get data from before 2017. Is that the earliest available data, or am I doing something wrong? Also, is there any way to get the response bodies from before 2015? Maybe from the old HAR files?

Many thanks!

max_ostapenko · August 12, 2025, 6:42pm

Hi @jiboncom,

The Blink features metrics have been collected since 2017. This query lists the crawl dates before June 2017 that have any feature data:

SELECT DISTINCT
  date
FROM `httparchive.crawl.pages`
WHERE
  date < '2017-06-01' AND
  ARRAY_LENGTH(features) > 0
ORDER BY 1

HTTP Archive doesn’t store copies of the HAR files, so there is no request information available from before 2015.
Starting with 2015, all the HAR information should be available in the crawl.pages and crawl.requests tables.
