Crazy idea that I’d like to explore with the help of the HTTP Archive community. I wonder if it’s possible to parse and tokenize a stylesheet in BigQuery using a JS-based parser.
For the Web Almanac, we have many metrics that require understanding stylesheets to answer questions like the number of fonts declared, relative use of em/rem/px/ex/cm, popular snap points in media queries, etc. The most brute force way to analyze these metrics would be to write a regular expression and run it over every stylesheet in the response_bodies table. I’m hoping that there’s a better way.
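For illustration, the brute-force version of the unit-counting metric might look something like this (a sketch only; cssText stands in for a response body, and a regex like this happily miscounts matches inside comments and strings):

// Brute-force unit counting over raw CSS text (illustrative only).
const units = { em: 0, rem: 0, px: 0, ex: 0, cm: 0 };
for (const match of cssText.matchAll(/[\d.]+(em|rem|px|ex|cm)\b/g)) {
  units[match[1]]++;
}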
BigQuery supports user-defined functions (UDFs) written in JavaScript. These functions can load external JS libraries hosted on Cloud Storage and invoke their functions. So I wonder if it’s possible/feasible to actually parse a stylesheet using a JS-based CSS parser like Rework CSS and explore the AST to more reliably answer these questions about the state of the web.
Yes, this would be very useful. But when I looked into Rework CSS, it also seems to rely on regexes to parse stylesheets.
The ultimate solution would involve something like Headless Chrome, which would handle all the user edge cases out in the wild. I’m just not sure how viable it would be to pass our CSS data from the crawl through Headless Chrome and back into BigQuery.
UPDATE: I found this article, which would imply this is possible!?
It feels like some of them might be better off as custom metrics accessing document.styleSheets and running the processing logic locally. That way the parsing is handled by the browser and the business logic is fairly self-contained. That doesn’t help with post-processing the bodies, but it will be more comprehensive and will catch inline styles.
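As a rough sketch, such a custom metric could look something like this (assuming it runs in the page context during the crawl; the metric shape and names are illustrative, not an actual Almanac metric):

// Hypothetical custom metric: count @font-face rules via the CSSOM.
// WebPageTest custom metrics run inside a function body, so a bare
// return is valid here.
let fontFaces = 0;
for (const sheet of document.styleSheets) {
  let rules;
  try {
    rules = sheet.cssRules; // throws for CORS-restricted sheets
  } catch (e) {
    continue; // skip stylesheets we aren't allowed to read
  }
  for (const rule of rules) {
    // Top-level rules only, for brevity; @font-face can also appear
    // inside conditional group rules like @media.
    if (rule instanceof CSSFontFaceRule) fontFaces++;
  }
}
return JSON.stringify({ fontFaces });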
Hmm - wouldn’t it be significantly cheaper to use a lightweight JS parser like Stylis for this? At 3 KB, it’s small enough to embed into a BigQuery user-defined function.
I think something like that could be feasible by using Dataflow to glue the response bodies with headless Chrome. Would be quite a bit of work.
Nice, I’d definitely be interested to learn more about that!
Wow, I didn’t know that API existed! We could totally do this work in a custom metric. Too late for the July crawl, but worst case we could try to squeeze it into the August crawl. Playing around with it, it seems like it can’t handle some use cases, like Google Fonts.
Maybe a CORS issue? But overall I think this gets us most of the way there.
AFAICT Stylis seems more like a JS-based preprocessor for converting shorthand into pure CSS. Do you know if it also exposes an API into the CSS tokens?
Just to add: I’ve used document.styleSheets a bit and yes, there’s a CORS issue - you can’t see any stylesheets that aren’t served from the page’s own origin. Unfortunately that can include a large percentage of stylesheets, so you’ll struggle to get meaningful stats that way.
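To get a feel for how much that restriction bites on a given page, a quick console sketch like this tallies the unreadable sheets:

// Which stylesheets are CORS-blocked on this page?
const blocked = [];
for (const sheet of document.styleSheets) {
  try {
    void sheet.cssRules; // accessing cssRules throws for cross-origin sheets
  } catch (e) {
    blocked.push(sheet.href); // e.g. a fonts.googleapis.com stylesheet
  }
}
console.log(blocked.length + ' of ' + document.styleSheets.length + ' stylesheets are unreadable', blocked);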
I don’t have any experience with user-defined functions, but if you could get an NPM package in, then you could use PostCSS with all of its plugins - it parses CSS into a full object model and you’d be certain to get what you need from that.
CREATE TEMP FUNCTION parseCSS(stylesheet STRING)
RETURNS STRING LANGUAGE js AS '''
  return JSON.stringify(parse(stylesheet));
'''
OPTIONS (library="gs://httparchive/lib/parse-css.js");

SELECT
  url,
  parseCSS(body)
FROM
  `httparchive.sample_data.response_bodies_desktop_1k`
JOIN
  `httparchive.sample_data.summary_requests_desktop_1k`
USING (url)
WHERE
  type = 'css'
LIMIT 1
It found an old version of jQuery UI and parsed it successfully (17K lines of JSON).
I’m so happy that folks are looking at in-browser CSS parsers again. I want to share a few suggestions. These suggestions are merely my opinions, and they are only based on my own experience writing and using client-side and server-side CSS parsers for the last decade.
I believe there are 3 things a client-side CSS parser would need to gain significant adoption or reputation in production environments.
The JavaScript would need to have a compressed file size of less than 1 KB. The recently shared parse-css.js has about 4.5 KB of functional code, which compresses down to about 1.8 KB.
As Addy Osmani wrote, “Not all bytes weigh the same. [While] a JPEG image needs to be decoded, rasterized, and painted on the screen, a JavaScript bundle needs to be downloaded and then parsed, compiled, executed, and there are a number of other steps that an engine needs to complete.”
While “1 KB” is an arbitrary size, it’s also commonly perceived as the lowest significant unit worth counting, similar to an American penny. Therefore, anything marketed as less than 1 KB compressed is perceived as adding no cost to production JS.
The parser would need to be “mostly loose”: loose enough to support unknown selectors and unknown declarations, but it should avoid look-aheads that could never realistically be added to real CSS.
There are clever ways around this limitation — for instance, the CSS Nesting Specification requires the first character of a nested rule to be either @ or &, which is why you can either write .foo { & .bar {} } instead of .foo .bar, or .foo { @nest .bar & {} } instead of .bar .foo {}.
The parser needs to include certain sugary features. For instance:
The parser needs to also work (or have a version that works) on the server-side.
The parser needs to produce an object that can be traversed and manipulated, like an AST (a minimal traversal sketch follows these lists).
The parser needs to include an AST stringifier or some way to write the AST back to a stylesheet in CSSOM.
Light token parsing of selectors and values – something as limited as grouping tokens in parens (think foo (bar baz)) or functions (think var(--foo, bar)).
In my opinion, this parser would not necessarily need to support:
Source maps. In the client-side version, at least, since sources could be traced in other ways.
Legacy browser parsing, like \9 or other “hacks”.
Comments. In the client-side version, at least, since sources could be commented in other ways.
Visitors, walkers, or other sugary traversal utilities that would otherwise bump the file size above 1 KB.
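For what it’s worth, a minimal traversal over such an AST could look like this (a sketch assuming a Rework-style shape, { type: 'stylesheet', stylesheet: { rules: [...] } }; the property names will differ for other parsers):

// Walk every declaration in a Rework-style AST (shape is an assumption).
function walkDeclarations(node, visit) {
  const rules = node.stylesheet ? node.stylesheet.rules : (node.rules || []);
  for (const rule of rules) {
    if (rule.declarations) {
      for (const decl of rule.declarations) visit(decl, rule);
    }
    if (rule.rules) walkDeclarations(rule, visit); // recurse into @media etc.
  }
}

// Example: count declarations whose value uses rem units.
let remCount = 0;
walkDeclarations(ast, decl => {
  if (decl.value && /[\d.]+rem\b/.test(decl.value)) remCount++;
});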
Hi @jon_neal, I should clarify that the parser we’re running is part of a data processing pipeline using HTML/CSS collected during monthly synthetic WebPageTest runs to understand the structure/contents of the CSS. Parsing CSS on the client-side is a wholly separate discussion.
Sorry, my fault I think… For clarity, I mentioned this issue to @jon_neal, who has done a lot of work on things that parse CSS and has made a number of comments/opened issues about the shortcomings of various parsers, in the hope that he could let us know what sorts of issues we might run into with whatever parsers we were tossing around. I’d guess Sass’s parser must be pretty good too, given the amount of use and testing it gets.
Update on parsing and querying CSS properties/styles.
The parser was timing out when run over all ~83M stylesheets. The quantity of stylesheets wasn’t the problem; the script itself was taking 5+ minutes to parse particular stylesheets, which caused the query to time out and fail.
The remediation process is boring but I’ll describe it here for posterity:
I binary searched the ~83M stylesheets until I found a single stylesheet that triggered the timeout
I locally parsed that stylesheet with debugging statements to elucidate which part of the script was getting hung up
It became clear that a particular selector was taking a very long time to trim whitespace. The selector itself wasn’t remarkable except for the thousands of tab characters in the middle of it; the trim function only removed leading and trailing whitespace, not the thousands of tabs in the middle. (More info on the debugging in this thread.)
I modified the parser to collapse repeated whitespace down to a single space character (a sketch of this fix follows the query below). CSS shouldn’t be whitespace-sensitive (beyond the descendant combinator, which is just a single space), so this change should be safe.
I ran the following query to save the resulting JSON object for each stylesheet to a dedicated partitioned/clustered table:
#standardSQL
CREATE TEMP FUNCTION parseCSS(stylesheet STRING)
RETURNS STRING LANGUAGE js AS '''
  try {
    var css = parse(stylesheet);
    return JSON.stringify(css);
  } catch (e) {
    return '';
  }
'''
OPTIONS (library="gs://httparchive/lib/parse-css.js");

CREATE TABLE `httparchive.almanac.parsed_css`
PARTITION BY date
CLUSTER BY client, page, url AS
SELECT
  date,
  client,
  page,
  url,
  parseCSS(body) AS css
FROM
  `httparchive.almanac.summary_response_bodies`
WHERE
  type = 'css'
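For reference, the whitespace fix from step 4 amounts to something like this (a sketch; the actual change to the parser script may differ):

// Collapse any run of whitespace to a single space before parsing.
// CSS treats the descendant combinator as a single space, so this
// should be safe, though note it also collapses whitespace inside
// quoted strings.
function collapseWhitespace(stylesheet) {
  return stylesheet.replace(/\s+/g, ' ');
}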
I made a couple of changes to the parsed_css table:
Modified the parser script to omit position properties and entries for CSS comments, reducing the size of the table from 13.4 TB to 6.65 TB (a sketch of the position-stripping approach follows this list)
Parsed and included inline style blocks, which will appear in the table as having a URL value of inline
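As mentioned in the first bullet, stripping position info can be as simple as a JSON.stringify replacer (a sketch, assuming the AST stores it under a position key, as Rework-style ASTs do):

// Omit position entries while serializing the parsed AST.
const json = JSON.stringify(css, (key, value) =>
  key === 'position' ? undefined : value);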
Here’s the query to parse the inline style blocks:
#standardSQL
CREATE TEMP FUNCTION parseCSS(stylesheet STRING)
RETURNS STRING LANGUAGE js AS '''
  try {
    var css = parse(stylesheet);
    return JSON.stringify(css);
  } catch (e) {
    return '';
  }
'''
OPTIONS (library="gs://httparchive/lib/parse-css.js");

SELECT
  date,
  client,
  page,
  'inline' AS url,
  parseCSS(style) AS css
FROM
  (SELECT date, client, page, url, body
   FROM `httparchive.almanac.summary_response_bodies`
   WHERE firstHtml),
  UNNEST(REGEXP_EXTRACT_ALL(body, '(?i)<style[^>]*>(.*)</style>')) AS style
WHERE
  style IS NOT NULL AND LENGTH(style) > 0
After these changes, the table is now 6.7 TB with 91M rows.