Analyzing stylesheets with a JS-based parser

Crazy idea that I’d like to explore with the help of the HTTP Archive community. I wonder if it’s possible to parse and tokenize a stylesheet in BigQuery using a JS-based parser.

For the Web Almanac, we have many metrics that require understanding stylesheets to answer questions like the number of fonts declared, relative use of em/rem/px/ex/cm, popular snap points in media queries, etc. The most brute force way to analyze these metrics would be to write a regular expression and run it over every stylesheet in the response_bodies table. I’m hoping that there’s a better way.
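To make the brute-force baseline concrete, here is a hypothetical sketch (not from the thread) of what a regex-based unit tally might look like, and why it's brittle:

```javascript
// Hypothetical sketch of the brute-force regex approach: tally how often
// each length unit appears in a stylesheet string. Brittle by design --
// it will also match inside comments, strings, and url() values.
function countUnits(css) {
  const counts = { px: 0, em: 0, rem: 0, ex: 0, cm: 0 };
  // a number immediately followed by a unit keyword at a word boundary
  const re = /\d*\.?\d+(px|rem|em|ex|cm)\b/g;
  for (const match of css.matchAll(re)) {
    counts[match[1]] += 1;
  }
  return counts;
}
```

For example, `countUnits('a { width: 10px; margin: 2rem 1em; }')` counts one each of px, rem, and em, but a real parser would be needed to tell a declaration value apart from a comment or a url().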

BigQuery supports user-defined functions (UDFs) written in JavaScript. These functions can load external JS libraries hosted on Cloud Storage and invoke their functions. So I wonder if it's possible/feasible to actually parse a stylesheet using a JS-based CSS parser like Rework CSS and explore the AST to more reliably answer these questions about the state of the web.

@fhoffa any other ideas?

Yes, this would be very useful. But when I looked into Rework CSS, it seems to also be based on regexes to parse stylesheets.

The ultimate solution would involve something like Headless Chrome, which would handle all the edge cases out in the wild. I'm just not sure how viable it would be to pipe our CSS data from the crawl through Headless Chrome and back into BigQuery.

UPDATE: I found this article, which would imply this is possible!?


@tabatkins has what I think is the most accurate parser, assuming there’s been no change I’m unaware of


It feels like some of these metrics might be better off as custom metrics accessing document.styleSheets and running the processing logic locally. That way the parsing is handled by the browser and the business logic is fairly well self-contained. That doesn't help with post-processing the bodies, but it will be more comprehensive and will catch inline styles.
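As a rough sketch of what such a custom metric might compute (hypothetical, not from the thread): the function below takes any styleSheets-like iterable so it can be exercised outside a browser; in an actual custom metric you would pass document.styleSheets.

```javascript
// Hypothetical custom-metric sketch: count CSS rules via the CSSOM.
// Accepts any styleSheets-like iterable so it can run outside a browser;
// in a real custom metric you would pass document.styleSheets.
function countRules(styleSheets) {
  let total = 0;
  for (const sheet of styleSheets) {
    try {
      // Accessing cssRules throws a SecurityError for cross-origin sheets
      total += sheet.cssRules.length;
    } catch (e) {
      // Skip sheets we are not allowed to read
    }
  }
  return total;
}

// In a custom metric: return countRules(document.styleSheets);
```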


Hmm - wouldn’t it be significantly cheaper to use a lightweight JS parser like stylis for this? At 3 KB, it’s small enough to embed into a BigQuery user-defined function.


I think something like that could be feasible by using Dataflow to glue the response bodies with headless Chrome. Would be quite a bit of work.

Nice, I’d definitely be interested to learn more about that!

Wow, I didn’t know that API existed! We could totally do this work in a custom metric. Too late for the July crawl, but worst case scenario we could try to squeeze it into the August crawl. Playing around with it, it seems like it can’t handle some use cases, like Google Fonts.

Maybe a CORS issue? But overall I think this gets us most of the way there.

AFAICT Stylis seems more like a JS-based preprocessor for converting shorthand into pure CSS. Do you know if it also exposes an API into the CSS tokens?

Just to add, I’ve used document.styleSheets a bit and yes, there’s a CORS issue - you can’t see any stylesheets that aren’t from the home domain. Unfortunately that can include a large percentage of stylesheets, so you’ll be struggling to get meaningful stats that way.
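One way to quantify that limitation before committing to the approach: a hypothetical sketch that measures what fraction of a page's stylesheets are actually readable through the CSSOM (again written against a styleSheets-like iterable so it can run anywhere; in the browser you would pass document.styleSheets).

```javascript
// Hypothetical sketch: what fraction of stylesheets can we actually read?
// Cross-origin sheets without CORS headers throw when cssRules is accessed.
function readableSheetFraction(styleSheets) {
  let readable = 0;
  let total = 0;
  for (const sheet of styleSheets) {
    total += 1;
    try {
      void sheet.cssRules; // throws SecurityError for blocked sheets
      readable += 1;
    } catch (e) {
      // cross-origin sheet; its contents are hidden from us
    }
  }
  return total === 0 ? 1 : readable / total;
}
```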


I don’t have any experience with user-defined functions, but if you could get an NPM package in, then you could use PostCSS with all of its plugins - it’s a full CSS parser and you’d be certain to get what you need from that.


I ran a simple test using Rework:

CREATE TEMP FUNCTION parseCSS(stylesheet STRING)
RETURNS STRING LANGUAGE js AS '''
   return JSON.stringify(parse(stylesheet));
'''
OPTIONS (
  library="gs://httparchive/lib/parse-css.js");
  
SELECT parseCSS('#foo { color: red; }')

parse-css.js is a slightly modified version of Rework’s parse/index.js. I had to change the module.exports to declare a parse function.
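The change described above might look roughly like this (a sketch only - the real parser body is Rework's, stubbed out here purely for illustration):

```javascript
// Sketch of the adaptation described above. Rework's parse/index.js
// exports its parser via CommonJS:
//
//   module.exports = function (css, options) { /* ... */ };
//
// BigQuery UDF libraries are plain scripts with no module system, so the
// export is replaced with a top-level declaration. The body below is a
// stand-in, not Rework's actual parser.
function parse(css, options) {
  return { type: 'stylesheet', stylesheet: { rules: [], parsingErrors: [] } };
}
```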

It works!

{
  "type": "stylesheet",
  "stylesheet": {
    "rules": [
      {
        "type": "rule",
        "selectors": [
          "#foo"
        ],
        "declarations": [
          {
            "type": "declaration",
            "property": "color",
            "value": "red",
            "position": {
              "start": {
                "line": 1,
                "column": 8
              },
              "end": {
                "line": 1,
                "column": 18
              }
            }
          }
        ],
        "position": {
          "start": {
            "line": 1,
            "column": 1
          },
          "end": {
            "line": 1,
            "column": 21
          }
        }
      }
    ],
    "parsingErrors": []
  }
}

I also ran it against live data:

CREATE TEMP FUNCTION parseCSS(stylesheet STRING)
RETURNS STRING LANGUAGE js AS '''
   return JSON.stringify(parse(stylesheet));
'''
OPTIONS (
  library="gs://httparchive/lib/parse-css.js");
  
SELECT url, parseCSS(body)
FROM `httparchive.sample_data.response_bodies_desktop_1k`
JOIN `httparchive.sample_data.summary_requests_desktop_1k`
USING (url)
WHERE type = 'css'
LIMIT 1

It found an old version of jQuery UI and it was able to parse it successfully (17K lines of JSON):
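Once the AST is in hand, answering the Almanac-style questions becomes a tree walk. A hypothetical sketch that tallies declared properties from a Rework-style AST like the example output above (top-level rules only; recursing into @media and other at-rules is left out for brevity):

```javascript
// Hypothetical post-processing sketch: tally declared properties in a
// Rework-style AST (like the example output above). Walks only top-level
// rules; a real version would also recurse into @media blocks.
function countProperties(ast) {
  const counts = {};
  for (const rule of ast.stylesheet.rules) {
    if (rule.type !== 'rule') continue; // skip at-rules and comments
    for (const decl of rule.declarations || []) {
      if (decl.type !== 'declaration') continue;
      counts[decl.property] = (counts[decl.property] || 0) + 1;
    }
  }
  return counts;
}
```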

I’m so happy that folks are looking at in-browser CSS parsers again. I want to share a few suggestions. These suggestions are merely my opinions, and they are only based on my own experience writing and using client-side and server-side CSS parsers for the last decade.

I believe there are 3 things a client-side CSS parser would need to gain significant adoption or reputation in production environments.

  1. The JavaScript would need to have a compressed file size of less than 1 KB. The recently shared parse-css.js has about 4.5 KB of functional code which compresses down to about 1.8 KB.

    1. As Addy Osmani wrote: “Not all bytes weigh the same. [While] a JPEG image needs to be decoded, rasterized, and painted on the screen, a JavaScript bundle needs to be downloaded and then parsed, compiled, executed — and there are a number of other steps that an engine needs to complete.”
    2. While “1 KB” is just an arbitrary size, it’s also commonly perceived as the lowest significant unit worth counting, similar to an American penny. Therefore, in production JS, anything marketed as being less than 1 KB when compressed is perceived as adding no cost to production.
  2. The parser would need to be “mostly loose”. It would need to be loose enough to support unknown selectors and unknown declarations, but it should avoid look-aheads that could never realistically be added to real CSS.

    1. There are clever ways around this limitation — for instance, the CSS Nesting Specification requires the first character of a nested rule to be either @ or &, which is why you can either write .foo { & .bar {} } instead of .foo .bar, or .foo { @nest .bar & {} } instead of .bar .foo {}.
  3. The parser needs to include certain sugary features. For instance:

    1. The parser needs to also work (or have a version that works) on the server-side.
    2. The parser needs to produce an object that can be traversed and manipulated, like an AST.
    3. The parser needs to include an AST stringifier or some way to write the AST back to a stylesheet in CSSOM.
    4. Light token parsing of selectors and values – something as limited as grouping tokens in parens (think foo (bar bax)) or functions (think var(--foo, bar)).

In my opinion, this parser would not necessarily need to support:

  1. Source maps. At least in the client-side version, since sources could be traced in other ways.
  2. Legacy browser parsing, like /9 or other “hacks”.
  3. Comments. At least in the client-side version, since sources could be commented in other ways.
  4. Visitors or walkers or other sugar traversing utilities that otherwise bump the file size above 1 KB.

Hi @jon_neal, I should clarify that the parser we’re running is part of a data processing pipeline using HTML/CSS collected during monthly synthetic WebPageTest runs to understand the structure/contents of the CSS. Parsing CSS on the client-side is a wholly separate discussion.

Sorry, my fault I think… For clarity: I mentioned this issue to @jon_neal, who has built a lot of things that parse CSS and has made a number of comments/opened issues about the shortcomings of various parsers, in the hope that he could let us know what sorts of issues we might run into with whatever parsers we were tossing around. I’d guess Sass’s parser must be pretty good too, given the amount of use and testing it has seen.