Update on parsing and querying CSS properties/styles.
The parser was timing out when attempting to run it over all ~83M stylesheets. Not that the quantity of the stylesheets was a problem, but the script itself was taking 5+ minutes to parse specific stylesheets, which caused the query to timeout and fail.
The remediation process is boring but I’ll describe it here for posterity:
- I binary searched the ~83M stylesheets until I found a single stylesheet that triggered the timeout
- I locally parsed that stylesheet with debugging statements in the script to elucidate which part of the script was getting hung up
- It became clear that it was a particular selector that was taking a very long time to trim whitespace. The selector itself wasn’t remarkable except for the thousands of tab characters in the middle of it. The trim function only removed leading and trailing whitespace, not thousands of tabs in the middle. (more info on the debugging in this thread)
- I modified the parser to collapse repeated whitespace down to a single space character. CSS shouldn’t be whitespace sensitive (beyond the descendent combinator, which is just a single space) so this change should be safe.
- I ran the following query to save the resulting JSON object for each stylesheet to a dedicated partitioned/clustered table
#standardSQL
CREATE TEMP FUNCTION parseCSS(stylesheet STRING)
RETURNS STRING LANGUAGE js AS '''
try {
var css = parse(stylesheet)
return JSON.stringify(css);
} catch (e) {
'';
}
'''
OPTIONS (library="gs://httparchive/lib/parse-css.js");
CREATE TABLE `httparchive.almanac.parsed_css`
PARTITION BY date
CLUSTER BY client, page, url AS
SELECT
date,
client,
page,
url,
parseCSS(body) AS css
FROM
`httparchive.almanac.summary_response_bodies`
WHERE
type = 'css'
The resulting httparchive.almanac.parsed_css table has ~83M rows and is 13.4 TB.
Here’s an example of querying the table to find the top 10 color values:
#standardSQL
CREATE TEMPORARY FUNCTION getAllValues(css STRING, property STRING)
RETURNS ARRAY<STRING> LANGUAGE js AS '''
try {
var reduceValues = (values, rule) => {
if (rule.type == 'media') {
return reduceValues(values, rule.rules);
}
return values.concat(rule.declarations.filter(d => d.property == property).map(d => d.value));
};
var $ = JSON.parse(css);
return $.stylesheet.rules.reduce(reduceValues, []);
} catch (e) {
return [];
}
''';
SELECT
APPROX_TOP_COUNT(color, 10)
FROM
`httparchive.almanac.parsed_css`,
UNNEST(getAllValues(css, 'color')) AS color
| color | frequency |
|---|---|
| #fff | 10504119 |
| #000 | 4844268 |
| #333 | 2842007 |
| #ffffff | 2263685 |
| #999 | 2031777 |
| #90949c | 1789373 |
| #666 | 1617707 |
| #888 | 1437182 |
| #1d2129 | 1405315 |
| #000000 | 1344227 |
This table should help unlock many of the CSS chapter metrics for the Almanac.