Analyzing stylesheets with a JS-based parser

Update on parsing and querying CSS properties/styles.

The parser was timing out when attempting to run it over all ~83M stylesheets. Not that the quantity of the stylesheets was a problem, but the script itself was taking 5+ minutes to parse specific stylesheets, which caused the query to timeout and fail.

The remediation process is boring but I’ll describe it here for posterity:

  • I binary searched the ~83M stylesheets until I found a single stylesheet that triggered the timeout
  • I locally parsed that stylesheet with debugging statements in the script to elucidate which part of the script was getting hung up
  • It became clear that it was a particular selector that was taking a very long time to trim whitespace. The selector itself wasn’t remarkable except for the thousands of tab characters in the middle of it. The trim function only removed leading and trailing whitespace, not thousands of tabs in the middle. (more info on the debugging in this thread)
  • I modified the parser to collapse repeated whitespace down to a single space character. CSS shouldn’t be whitespace sensitive (beyond the descendent combinator, which is just a single space) so this change should be safe.
  • I ran the following query to save the resulting JSON object for each stylesheet to a dedicated partitioned/clustered table
#standardSQL
CREATE TEMP FUNCTION parseCSS(stylesheet STRING)
RETURNS STRING LANGUAGE js AS '''
  try {
   var css = parse(stylesheet)
   return JSON.stringify(css);
  } catch (e) {
    '';
  }
'''
OPTIONS (library="gs://httparchive/lib/parse-css.js");

CREATE TABLE `httparchive.almanac.parsed_css`
PARTITION BY date
CLUSTER BY client, page, url AS
SELECT
  date,
  client,
  page,
  url,
  parseCSS(body) AS css
FROM
  `httparchive.almanac.summary_response_bodies`
WHERE
  type = 'css'

The resulting httparchive.almanac.parsed_css table has ~83M rows and is 13.4 TB.

Here’s an example of querying the table to find the top 10 color values:

#standardSQL
CREATE TEMPORARY FUNCTION getAllValues(css STRING, property STRING)
RETURNS ARRAY<STRING> LANGUAGE js AS '''
try {
  var reduceValues = (values, rule) => {
    if (rule.type == 'media') {
      return reduceValues(values, rule.rules);
    }
    
    return values.concat(rule.declarations.filter(d => d.property == property).map(d => d.value));
  };
  var $ = JSON.parse(css);
  return $.stylesheet.rules.reduce(reduceValues, []);
} catch (e) {
  return [];
}
''';

SELECT
  APPROX_TOP_COUNT(color, 10)
FROM
  `httparchive.almanac.parsed_css`,
  UNNEST(getAllValues(css, 'color')) AS color
color frequency
#fff 10504119
#000 4844268
#333 2842007
#ffffff 2263685
#999 2031777
#90949c 1789373
#666 1617707
#888 1437182
#1d2129 1405315
#000000 1344227

This table should help unlock many of the CSS chapter metrics for the Almanac.