Analyzing HTML, CSS, and JavaScript response bodies


#1

Earlier this year HTTP Archive began saving response bodies of HTML, CSS, and JavaScript resources. The responses are available for desktop runs only, and binary responses are omitted due to size constraints – even with the HTML/CSS/JS-only restriction, each run results in ~250-350GB of payload data.

If you want to run analysis on the response bodies, you can download the raw data and do science to it. Alternatively, as an experiment, I’ve imported the response body payloads directly into BigQuery, which allows you to run queries against them from the comfort of your browser (check out the BigQuery query reference docs):

  • 2014_08_01_requests_body
  • 2014_08_15_requests_body

The schema is very simple: page maps to XXXX_XX_XX_pages.url, url maps to XXXX_XX_XX_requests.url, and the rest is self-explanatory. The first two fields allow you to “join” against the requests and/or pages tables to filter the table based on arbitrary conditions - e.g. run analysis on the top 1K sites. But enough handwaving…

A sample query courtesy of @stevesoudersorg that counts occurrences of async vs blocking Google Analytics scripts in HTML markup:

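-- Classify each HTML response that references ga.js by how the script is loaded.
SELECT COUNT(*) AS num,
 CASE
  -- async snippet: assigns the script URL to ga.src
  WHEN REGEXP_MATCH(body, r'ga\.src.*\.google-analytics\.com/ga\.js') THEN "async"
  -- blocking snippet: builds the URL from the gaJsHost variable
  WHEN REGEXP_MATCH(body, r'gaJsHost.*\.google-analytics\.com/ga\.js') THEN "sync"
  ELSE "other"
 END AS stat
FROM [httparchive:runs.2014_08_15_requests_body]
WHERE mimeType CONTAINS 'html'
  AND REGEXP_MATCH(body, r'\.google-analytics\.com/ga\.js')
GROUP BY stat
ORDER BY num DESC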

A few seconds later, after processing 200GB of HTML responses, the query returns a count for each of the categories.

Note that the body tables are large (~200GB), so it’s easy to exceed your free quota if you’re not careful. With that in mind, a few tips:

  • If you’re only interested in a particular content-type, restrict your queries by mimeType to speed up processing - e.g. the query above checks for “html”.
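  • While you’re iterating on your query, add a LIMIT X clause to keep the result set small. Note that BigQuery charges for every column the query reads, so LIMIT on its own does not reduce the amount of data processed.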
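
To make the “join” against the pages table a bit more concrete, here is a rough sketch that reruns the Google Analytics count for top-ranked sites only. It assumes the pages table exposes a numeric rank column and that rank <= 1000 is a reasonable stand-in for “top 1K”; neither is spelled out above, so check the table schema before relying on it. The aliases bodies and top_pages are made up for illustration.

```sql
-- Sketch only: legacy BigQuery SQL, same dialect as the sample query above.
-- ASSUMPTION: [httparchive:runs.2014_08_15_pages] has a numeric `rank` column.
SELECT COUNT(*) AS num,
 CASE
  WHEN REGEXP_MATCH(bodies.body, r'ga\.src.*\.google-analytics\.com/ga\.js') THEN "async"
  WHEN REGEXP_MATCH(bodies.body, r'gaJsHost.*\.google-analytics\.com/ga\.js') THEN "sync"
  ELSE "other"
 END AS stat
FROM [httparchive:runs.2014_08_15_requests_body] AS bodies
JOIN (
  -- Subquery keeps the right-hand side of the join small.
  SELECT url
  FROM [httparchive:runs.2014_08_15_pages]
  WHERE rank <= 1000
) AS top_pages
ON bodies.page = top_pages.url
WHERE bodies.mimeType CONTAINS 'html'
  AND REGEXP_MATCH(bodies.body, r'\.google-analytics\.com/ga\.js')
GROUP BY stat
ORDER BY num DESC
```

The join narrows which rows are counted, but the query still reads the full body column, so it counts against your quota just like the unfiltered version.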

P.S. At Velocity EU I also demoed “user defined functions” inside of BigQuery - i.e. the ability to execute arbitrary JavaScript functions against each record. However, this feature is still in beta and is not yet publicly available. Stay tuned, I’ll share more information and example queries for it once we iron out a few outstanding issues.


#2

Is there a way to query the actual content of these pages in terms of HTML, CSS, and JS?


#3

Yes, if you run queries against the HAR data set, it has the response bodies for text resources (HTML, CSS, JS).


#4

Thanks, Pat. I’m looking to evaluate class names in HTML.

Also does this take into account markup rendered via JS?


#5

No (well, unless you pull the code out of the JS). It provides the raw responses from the wire.


#6

Sorry, I mixed 2 questions—the first being the more important.

I’m looking to evaluate class names in HTML.

Has someone already done this, and if not, do you know if it’s possible?


#7

See e.g. Usage of ARIA attributes for how you could write such a query.
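
This is not the linked query, just a minimal sketch in the same spirit. It reuses the requests_body table and legacy SQL dialect from the first post and counts HTML responses whose markup contains a given class name. The class name clearfix is a placeholder picked purely for illustration, and the pattern only handles double-quoted class attributes, so treat it as a starting point rather than a complete answer.

```sql
-- Sketch only: count HTML responses that use a given class name.
-- ASSUMPTION: "clearfix" is a placeholder; swap in the class you care about.
-- Only matches class="..." with double quotes; single-quoted or unquoted
-- attributes are not covered.
SELECT COUNT(*) AS num,
 CASE
  WHEN REGEXP_MATCH(body, r'class\s*=\s*"([^"]*\s)?clearfix(\s[^"]*)?"') THEN "uses class"
  ELSE "no match"
 END AS stat
FROM [httparchive:runs.2014_08_15_requests_body]
WHERE mimeType CONTAINS 'html'
GROUP BY stat
ORDER BY num DESC
```

As noted earlier in the thread, the bodies are the raw responses from the wire, so class names that are only added by JavaScript at runtime will not show up in these counts.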