Analyzing HTML, CSS, and JavaScript response bodies

igrigorik · December 4, 2014, 11:49pm

Earlier this year HTTP Archive began saving response bodies of HTML, CSS, and JavaScript resources. The responses are available for desktop runs only, and binary responses are omitted due to size constraints – even with the HTML/CSS/JS-only restriction, each run results in ~250-350GB of payload data.

If you want to run analysis on the response bodies, you can download the raw data and do science to it. Alternatively, as an experiment, I’ve imported the response body payloads directly into BigQuery, which allows you to run queries against them from the comfort of your browser (check out BigQuery query reference docs):

If you’re new to BigQuery, read these instructions.
Head to: https://bigquery.cloud.google.com/project/httparchive
Expand the “runs” dataset and look for the following tables:

2014_08_01_requests_body
2014_08_15_requests_body

The schema is very simple: page maps to XXXX_XX_XX_pages.url, url maps to XXXX_XX_XX_requests.url, and the rest is self-explanatory. The first two fields allow you to “join” against the requests and/or pages tables to filter the table based on arbitrary conditions - e.g. run analysis on top 1K sites. But enough handwaving…
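To make the “join” idea concrete, here is a toy sketch in Python that models it with an in-memory SQLite database rather than BigQuery. The table contents, the `rank` column on the pages table, and the `bodies_for_top_sites` helper are all assumptions made for illustration; only the page-to-pages.url mapping comes from the schema described above.

```python
import sqlite3

# Toy stand-ins for XXXX_XX_XX_pages and XXXX_XX_XX_requests_body.
# The "rank" column and every row below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pages (url TEXT, rank INTEGER);
CREATE TABLE requests_body (page TEXT, url TEXT, body TEXT);
""")
conn.executemany("INSERT INTO pages VALUES (?, ?)", [
    ("http://www.example.com/", 1),
    ("http://www.example.org/", 5000),
])
conn.executemany("INSERT INTO requests_body VALUES (?, ?, ?)", [
    ("http://www.example.com/", "http://www.example.com/", "<html>top site</html>"),
    ("http://www.example.com/", "http://www.example.com/app.js", "var a = 1;"),
    ("http://www.example.org/", "http://www.example.org/", "<html>long tail</html>"),
])

def bodies_for_top_sites(conn, max_rank):
    """Return (url, body) rows for pages ranked at or above max_rank."""
    rows = conn.execute("""
        SELECT b.url, b.body
        FROM requests_body AS b
        JOIN pages AS p ON b.page = p.url  -- body.page maps to pages.url
        WHERE p.rank <= ?
        ORDER BY b.url
    """, (max_rank,))
    return rows.fetchall()

top_rows = bodies_for_top_sites(conn, 1000)
```

The same shape (join the body table to pages on page = url, then filter on a pages column) should carry over to BigQuery, though the exact column names there need to be checked against the real schema.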

A sample query courtesy of @stevesoudersorg that counts occurrences of async vs blocking Google Analytics scripts in HTML markup:

SELECT COUNT(*) as num,
 CASE
  WHEN REGEXP_MATCH(body, r'ga\.src.*\.google-analytics\.com/ga\.js') THEN "async"
  WHEN REGEXP_MATCH(body, r'gaJsHost.*\.google-analytics\.com/ga\.js') THEN "sync"
  ELSE "other"
 END as stat
FROM [httparchive:runs.2014_08_15_requests_body]
WHERE mimeType CONTAINS 'html'
  AND REGEXP_MATCH(body, r'\.google-analytics\.com/ga\.js')
GROUP BY stat
ORDER BY num desc

A few seconds later, after processing 200GB of HTML responses:

Note that the body tables are large (~200GB), so it’s easy to exceed your free quota if you’re not careful. With that in mind, a few tips:

If you’re only interested in a particular content type, restrict your queries by mimeType to speed up processing - e.g. the query above checks for “html”.
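While you’re iterating on your query, add a LIMIT X clause to keep the result set small. Keep in mind that LIMIT only caps the rows returned: BigQuery still scans, and bills for, every column the query references.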
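Another way to iterate without spending any quota is to try the regular expressions on a handful of sample bodies locally before running them in BigQuery. A minimal Python sketch that mirrors the CASE expression from the sample query; the `classify_ga` name and the sample snippets are hypothetical, and Python's `re` module is not identical to BigQuery's regex engine, so treat a local match as a sanity check only.

```python
import re

# Same patterns as the sample query above.
ASYNC_RE = re.compile(r'ga\.src.*\.google-analytics\.com/ga\.js')
SYNC_RE = re.compile(r'gaJsHost.*\.google-analytics\.com/ga\.js')
ANY_GA_RE = re.compile(r'\.google-analytics\.com/ga\.js')

def classify_ga(body):
    """Mirror the query: None if the WHERE filter would drop the row,
    otherwise "async", "sync" or "other" in CASE order."""
    if not ANY_GA_RE.search(body):
        return None
    if ASYNC_RE.search(body):
        return "async"
    if SYNC_RE.search(body):
        return "sync"
    return "other"

# Hypothetical sample bodies, one per expected outcome.
samples = {
    "async": "ga.src = 'http://www.google-analytics.com/ga.js';",
    "sync": 'var gaJsHost = "http://www.google-analytics.com/ga.js";',
    "other": '<script src="http://www.google-analytics.com/ga.js"></script>',
    "none": "<p>no analytics here</p>",
}
results = {name: classify_ga(body) for name, body in samples.items()}
```

Once the patterns behave as expected on the samples, they can be pasted back into the query unchanged.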

P.S. At Velocity EU I also demoed “user defined functions” inside of BigQuery - i.e. the ability to execute arbitrary JavaScript functions against each record. However, this feature is still in beta and is not yet publicly available. Stay tuned, I’ll share more information and example queries for it once we iron out a few outstanding issues.

kevinSuttle · September 1, 2016, 7:33pm

Is there a way to query the actual content of these pages in terms of HTML, CSS, and JS?

patmeenan · September 2, 2016, 12:21pm

Yes, if you run queries against the HAR data set, it has the response bodies for text resources (HTML, CSS, JS).

kevinSuttle · September 2, 2016, 3:00pm

Thanks, Pat. I’m looking to evaluate class names in HTML.

Also, does this take into account markup rendered via JS?

patmeenan · September 2, 2016, 5:28pm

No (well, unless you pull the code out of the js). It is providing the raw responses from the wire.

kevinSuttle · September 8, 2016, 1:12am

Sorry, I mixed 2 questions—the first being the more important.

I’m looking to evaluate class names in HTML.

Has someone already done this, and if not, do you know if it’s possible?

zcorpan · September 9, 2016, 10:31am

See e.g. Usage of ARIA attributes for how you could write such a query.
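
For a rough idea of what evaluating class names could look like once you have some response bodies locally, here is a sketch in Python using the standard library's `html.parser`. The `ClassNameCounter` and `count_class_names` names and the sample markup are hypothetical, and, as noted above, this only sees the raw markup from the wire, not anything rendered via JS.

```python
from collections import Counter
from html.parser import HTMLParser

class ClassNameCounter(HTMLParser):
    """Tally every class name found in class="..." attributes."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name == "class" and value:
                self.counts.update(value.split())

def count_class_names(body):
    """Return a Counter mapping class name -> occurrences in one HTML body."""
    parser = ClassNameCounter()
    parser.feed(body)
    parser.close()
    return parser.counts

# Hypothetical sample body.
sample = '<div class="nav nav-main"><a class="nav">x</a><img class="logo" /></div>'
class_counts = count_class_names(sample)
```

Summing the counters across many bodies would give a crude popularity ranking of class names.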
