Earlier this year HTTP Archive began saving response bodies of HTML, CSS, and JavaScript resources. The responses are available for desktop runs only, and binary responses are omitted due to size constraints - even with the HTML/CSS/JS-only restriction, each run results in ~250-350GB of payload data.
If you want to run analysis on the response bodies, you can download the raw data and do science to it. Alternatively, as an experiment, I’ve imported the response body payloads directly into BigQuery, which allows you to run queries against them from the comfort of your browser (check out the BigQuery query reference docs):
- If you’re new to BigQuery, read these instructions.
- Head to: https://bigquery.cloud.google.com/project/httparchive
- Expand the “runs” dataset and look for the following tables:
- 2014_08_01_requests_body
- 2014_08_15_requests_body
The schema is very simple: `page` maps to `XXXX_XX_XX_pages.url`, `url` maps to `XXXX_XX_XX_requests.url`, and the rest is self-explanatory. The first two fields allow you to “join” against the requests and/or pages tables to filter the table based on arbitrary conditions - e.g. run analysis on the top 1K sites only. But enough handwaving…
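For example, here is a minimal sketch of such a join that restricts the body scan to the top 1K sites - it assumes the pages table exposes each site’s crawl position in a `rank` field, so double-check the schema before running it:

SELECT COUNT(*) AS num
FROM [httparchive:runs.2014_08_15_requests_body] AS req
-- join against the pages table to pull in per-site metadata
JOIN [httparchive:runs.2014_08_15_pages] AS pages
  ON req.page = pages.url
-- assumed field: keep only the top 1K sites by rank
WHERE pages.rank <= 1000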
A sample query courtesy of @stevesoudersorg that counts occurrences of async vs blocking Google Analytics scripts in HTML markup:
SELECT COUNT(*) AS num,
  CASE
    WHEN REGEXP_MATCH(body, r'ga\.src.*\.google-analytics\.com/ga\.js') THEN "async"
    WHEN REGEXP_MATCH(body, r'gaJsHost.*\.google-analytics\.com/ga\.js') THEN "sync"
    ELSE "other"
  END AS stat
FROM [httparchive:runs.2014_08_15_requests_body]
WHERE mimeType CONTAINS 'html'
  AND REGEXP_MATCH(body, r'\.google-analytics\.com/ga\.js')
GROUP BY stat
ORDER BY num DESC
A few seconds later, after processing 200GB of HTML responses, you have your answer.
Note that the `body` tables are large (~200GB), so it’s easy to exceed your free quota if you’re not careful. With that in mind, a few tips:
- If you’re only interested in a particular content type, restrict your queries by `mimeType` to speed up processing - e.g. the query above checks for “html”.
- While you’re iterating on your query, add a `LIMIT X` clause to avoid executing it against the full dataset (see the sketch below).
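To make the tips concrete, here is a minimal sketch that combines both - the content type, the regex, and the limit value are all placeholders to swap for your own:

SELECT page, url
FROM [httparchive:runs.2014_08_15_requests_body]
-- restrict the scan to JavaScript responses only
WHERE mimeType CONTAINS 'javascript'
  -- placeholder pattern: match any use of document.write
  AND REGEXP_MATCH(body, r'document\.write')
-- cap the result set while iterating on the query
LIMIT 100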
P.S. At Velocity EU I also demoed “user defined functions” inside of BigQuery - i.e. the ability to execute arbitrary JavaScript functions against each record. However, this feature is still in beta and is not yet publicly available. Stay tuned; I’ll share more information and example queries for it once we iron out a few outstanding issues.