If you want to run analysis on the response bodies, you can download the raw data and do science to it. Alternatively, as an experiment, I’ve imported the response body payloads directly into BigQuery, which allows you to run queries against them from the comfort of your browser (check out the BigQuery query reference docs):
- If you’re new to BigQuery, read these instructions.
- Head to: https://bigquery.cloud.google.com/project/httparchive
- Expand the “runs” dataset and look for the body tables, e.g. `2014_08_15_requests_body`.
The schema is very simple: `page` maps to the matching entry in the pages table, `url` maps to `XXXX_XX_XX_requests.url`, and the rest is self-explanatory. The first two fields allow you to “join” against the requests and/or pages tables to filter the table based on arbitrary conditions - e.g. run analysis on the top 1K sites. But enough handwaving…
A sample query courtesy of @stevesoudersorg that counts occurrences of async vs blocking Google Analytics scripts in HTML markup:
    SELECT COUNT(*) as num,
      CASE
        WHEN REGEXP_MATCH(body, r'ga\.src.*\.google-analytics\.com/ga\.js') THEN "async"
        WHEN REGEXP_MATCH(body, r'gaJsHost.*\.google-analytics\.com/ga\.js') THEN "sync"
        ELSE "other"
      END as stat
    FROM [httparchive:runs.2014_08_15_requests_body]
    WHERE mimeType CONTAINS 'html'
      AND REGEXP_MATCH(body, r'\.google-analytics\.com/ga\.js')
    GROUP BY stat
    ORDER BY num desc
The results come back a few seconds later, after processing 200GB of HTML responses.
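If you went the raw-data route instead, the same classification can be sketched locally. The snippet below is a rough Python equivalent of the query above, not part of the HTTP Archive tooling: it reuses the three regular expressions verbatim, while the function names and the `(mime_type, body)` record shape are assumptions made for illustration.

```python
import re
from collections import Counter

# Same patterns as the REGEXP_MATCH calls in the query above.
GA_ANY = re.compile(r'\.google-analytics\.com/ga\.js')
GA_ASYNC = re.compile(r'ga\.src.*\.google-analytics\.com/ga\.js')
GA_SYNC = re.compile(r'gaJsHost.*\.google-analytics\.com/ga\.js')


def classify(body):
    """Return "async", "sync" or "other" for a body that references ga.js,
    or None if it does not (this mirrors the WHERE clause)."""
    if not GA_ANY.search(body):
        return None
    if GA_ASYNC.search(body):
        return "async"
    if GA_SYNC.search(body):
        return "sync"
    return "other"


def count_ga(records):
    """Count classifications over (mime_type, body) pairs.

    The pair shape is hypothetical; adapt it to however you load the dump.
    """
    counts = Counter()
    for mime_type, body in records:
        # Equivalent of: WHERE mimeType CONTAINS 'html'
        if "html" not in mime_type:
            continue
        stat = classify(body)
        if stat is not None:
            counts[stat] += 1
    return counts
```

Calling `count_ga(records).most_common()` gives the counts in descending order, similar to the `ORDER BY num desc` in the query.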
Note that the body tables are large (~200GB), so it’s easy to exceed your free quota if you’re not careful. With that in mind, a few tips:
- If you’re only interested in a particular content type, restrict your queries by `mimeType` to speed up processing - e.g. the query above checks for “html”.
- While you’re iterating on your query, add a `LIMIT X` clause to avoid executing it against the full dataset.
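As a small illustration of the two tips, here is a hypothetical Python helper that composes a query string with an optional `mimeType` filter and an optional `LIMIT`. The helper name and its parameters are made up for this sketch; the column names (`page`, `url`, `body`, `mimeType`) and the bracketed table syntax are taken from the schema notes and the sample query above.

```python
def build_query(table, mime_substring=None, limit=None):
    """Compose a query string over a body table.

    table          -- e.g. "httparchive:runs.2014_08_15_requests_body"
    mime_substring -- if set, adds a mimeType CONTAINS filter
    limit          -- if set, appends a LIMIT clause while iterating

    Sketch only: inputs are not escaped, so pass trusted values.
    """
    sql = "SELECT page, url, body FROM [{}]".format(table)
    if mime_substring is not None:
        sql += " WHERE mimeType CONTAINS '{}'".format(mime_substring)
    if limit is not None:
        sql += " LIMIT {}".format(int(limit))
    return sql


print(build_query("httparchive:runs.2014_08_15_requests_body",
                  mime_substring="html", limit=100))
```

The printed string can be pasted into the BigQuery console; drop the `limit` argument once the query does what you want.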