I’ve been trying to retrieve the response_body field for certain websites, but I’m running into some issues. For all data before 2015, the content.response_body field consistently returns NULL. After 2015, the field is populated, but the query size grows very quickly:
- 2012: 2.25 GB
- 2015: 7.64 GB
- 2016: 785 GB
- 2017: 1.48 TB
Here’s the query I’ve been using:
SELECT
  page,
  MIN(`date`) AS date,
  ARRAY_AGG(STRUCT(url, type, is_main_document, response_body)) AS content
FROM
  `httparchive.crawl.requests`
WHERE
  date = '2012-07-01'
  AND client = 'desktop'
  AND type IN ('html', 'script', 'css', 'font')
GROUP BY page
Does anyone know if the response_body field was consistently collected before 2015, or if there’s a better approach for managing the size of these queries for later years?
Best,
Javier
Hi Javier,
The response_body data is indeed only available starting in 2015.
Currently there is no cost-efficient way to retrieve the response bodies of a particular website from the dataset.
This issue may be related to your question: Cost-efficient alternatives to HAR files · Issue #1092 · HTTPArchive/httparchive.org · GitHub
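To put the scan sizes quoted earlier in the thread into cost terms, here is a rough sketch in Python. It assumes on-demand billing at 6.25 USD per TiB scanned and treats the reported GB/TB figures as binary units; both are assumptions, so check the current BigQuery pricing page before relying on the numbers. The names `ASSUMED_USD_PER_TIB`, `size_to_tib`, and `estimate_cost_usd` are made up for this illustration, and any free monthly quota is ignored.

```python
# Rough cost estimate for the scan sizes reported in this thread.
# ASSUMPTION: on-demand price in USD per TiB scanned; verify against current pricing.
ASSUMED_USD_PER_TIB = 6.25

# Scan sizes as quoted in the original post, keyed by crawl year.
SCAN_SIZES = {2012: "2.25 GB", 2015: "7.64 GB", 2016: "785 GB", 2017: "1.48 TB"}

# ASSUMPTION: the reported units are binary (1 TB = 1024 GB).
UNIT_TO_TIB = {"GB": 1 / 1024, "TB": 1.0}


def size_to_tib(size: str) -> float:
    """Convert a string such as '785 GB' to a size in TiB."""
    value, unit = size.split()
    return float(value) * UNIT_TO_TIB[unit]


def estimate_cost_usd(size: str, usd_per_tib: float = ASSUMED_USD_PER_TIB) -> float:
    """Estimate the on-demand cost of scanning `size` bytes once."""
    return size_to_tib(size) * usd_per_tib


for year, size in SCAN_SIZES.items():
    print(f"{year}: {size:>8} -> ~${estimate_cost_usd(size):.2f} per query")
```

Under these assumptions a single month of 2017 data costs several dollars per run, so repeating the query across many crawl dates adds up quickly.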
Hi Max,
Thanks for the information and the link to the issue; that’s exactly our problem. I have made a suggestion there to make things a bit easier and cheaper.
Thanks again!
Hi @max_ostapenko ,
Can you help me out with two more things? I tried querying `httparchive.blink_features.usage` and couldn’t get any data before 2017. Is that the earliest available data, or am I doing something wrong? Also, is there any way to get the response bodies from before 2015? Maybe from the old HAR files?
Many thanks!
Hi @jiboncom ,
The Blink features metrics were collected starting in 2017:
SELECT DISTINCT
  date
FROM `httparchive.crawl.pages`
WHERE
  date < '2017-06-01'
  AND ARRAY_LENGTH(features) > 0
ORDER BY date
HTTP Archive doesn’t store HAR file copies, so before 2015 there is no request information available.
Starting with 2015, all the HAR information should be available in the crawl.pages and crawl.requests tables.