On Twitter, @v21 asks:
Does anyone want to take a stab at researching this one?
I think one of the challenges here is that the least used color will be used just once. And there are many colors that appear to be used only once.
I ran the following query to extract color hex strings from CSS files in the requests_bodies table, and it returned a list of 88,231 unique colors. (Note: this doesn’t factor in colors set by HTML and JS.) The query I ran is below, but be warned that it will chew through 770GB of processed data…
SELECT color, COUNT(*)
FROM (
  -- REGEXP_EXTRACT returns only the first match in each response body
  SELECT page, url, REGEXP_EXTRACT(LOWER(body), r'color[\s]?:[\s]?#([A-Fa-f0-9]+)') AS color
  FROM `httparchive.har.2017_09_01_chrome_requests_bodies`
  WHERE url LIKE "%.css"
)
GROUP BY color
The regex match is basically looking for color:#FFFFFF, where FFFFFF is any hex string, and a single whitespace character is optionally allowed before/after the : character.
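As a quick sanity check of what this pattern does and does not match, here is a minimal sketch in Python (used instead of BigQuery so it can be run locally). The sample CSS strings and the helper name `first_hex_color` are made up for illustration, and I'm assuming Python's `re` treats this simple pattern the same way BigQuery's regex engine does.

```python
import re

# The pattern from the query, unchanged.
COLOR_RE = re.compile(r'color[\s]?:[\s]?#([A-Fa-f0-9]+)')

def first_hex_color(css):
    """Return the first captured hex string (like REGEXP_EXTRACT), or None."""
    match = COLOR_RE.search(css.lower())
    return match.group(1) if match else None

# A plain color declaration is captured.
print(first_hex_color("a { color: #FFF; }"))
# background-color is captured too, because it ends in "color".
print(first_hex_color("a { background-color:#00FF00; }"))
# Two spaces after the colon are not matched: [\s]? allows at most one.
print(first_hex_color("a { color:  #123456; }"))
# Named colors are not captured at all.
print(first_hex_color("a { color: red; }"))
```

So the counts include any property whose name ends in "color", and declarations with extra whitespace around the colon are skipped.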
Out of 88,231 unique colors, 64,335 of them were used only once! Since there are so many colors tied for last place, I don’t think this can be answered.
BTW - the most common color is #FFFFFF (and its shorthand #FFF), which is just white. I’m guessing that I caught a lot of background colors in my search!
Another interesting tidbit I ran into while looking at this is the length of the color codes. I was surprised to see that the shorthand RGB codes are used slightly more than the 6 char RRGGBB versions.
Actually, looking at the REGEXP_EXTRACT documentation, this is only capturing the first CSS color in each URL. Does anyone know of a way to extend this to capture all of them? I tried replacing REGEXP_EXTRACT with REGEXP_EXTRACT_ALL, but it didn’t seem to return any results.
Ah ha! I just learned something new today :). Reading up on working with arrays in BigQuery here: https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays
Here’s an updated query that accounts for all colors in the CSS files:
SELECT color, COUNT(*) AS color_count
FROM (
  -- REGEXP_EXTRACT_ALL returns an array of every match in the body
  SELECT REGEXP_EXTRACT_ALL(LOWER(body), r'color[\s]?:[\s]?#([A-Fa-f0-9]+)') AS colors
  FROM `httparchive.har.2017_09_01_chrome_requests_bodies`
  WHERE url LIKE "%.css"
) AS color_arrays
-- UNNEST flattens each array into one row per color
CROSS JOIN UNNEST(color_arrays.colors) AS color
GROUP BY color
ORDER BY color_count DESC
So now I have a list of 1,728,903 unique colors used on the web. Ultimately, though, I came to the same conclusion as before about the least common color: there are 604,286 colors that are used exactly once across all sites. That’s a lot tied for last place. We could take this one step further and look for a color used exactly once on the lowest Alexa-ranked site, but numerous sites in the HTTP Archive have a null rank, so that would also produce numerous colors tied for last place!
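To make the "tied for last place" problem concrete, here is a small local sketch of the same extract-all, flatten, and count idea in Python. The three CSS bodies are invented sample data, and `count_colors` and `least_used` are hypothetical helper names, not anything from the HTTP Archive tooling.

```python
import re
from collections import Counter

COLOR_RE = re.compile(r'color[\s]?:[\s]?#([A-Fa-f0-9]+)')

def count_colors(bodies):
    """Count every hex color across all bodies (findall + flatten + group)."""
    counts = Counter()
    for body in bodies:
        # findall returns every captured group, like REGEXP_EXTRACT_ALL
        counts.update(COLOR_RE.findall(body.lower()))
    return counts

def least_used(counts):
    """Return all colors that share the lowest count, sorted."""
    if not counts:
        return []
    lowest = min(counts.values())
    return sorted(color for color, n in counts.items() if n == lowest)

# Made-up response bodies standing in for rows of the table.
bodies = [
    "a { color: #fff; } b { color: #000; }",
    "c { color: #FFF; background-color: #abcdef; }",
    "d { color: #123; }",
]
counts = count_colors(bodies)
print(counts.most_common(1))  # the single most frequent color
print(least_used(counts))     # every color tied for least used
```

Even in this tiny sample, three of the four colors tie for last place, which is the same shape as the real result.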
BTW - the most frequent color is white if you combine #FFF and #FFFFFF. Next is black (#0).
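Combining #FFF with #FFFFFF amounts to expanding each 3-digit shorthand to its 6-digit form before counting. A minimal sketch, assuming counts keyed by the captured hex string; the numbers below are invented and `normalize_hex` / `merge_shorthand` are hypothetical names:

```python
from collections import Counter

def normalize_hex(color):
    """Expand 3-digit shorthand (fff) to 6 digits (ffffff); leave others as-is."""
    color = color.lower()
    if len(color) == 3:
        return "".join(ch * 2 for ch in color)
    return color

def merge_shorthand(counts):
    """Re-key a Counter so shorthand and long forms are counted together."""
    merged = Counter()
    for color, n in counts.items():
        merged[normalize_hex(color)] += n
    return merged

# Invented counts: black leads until the two spellings of white are merged.
raw = Counter({"fff": 5, "ffffff": 4, "000": 6, "abc": 1})
merged = merge_shorthand(raw)
print(merged.most_common(2))
```

Strings of other lengths (for example 4- or 8-digit codes with alpha) are left untouched here; how to treat those is a separate decision.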
And because treemaps are so much fun for exploring this type of data, here is one:
Lastly, I mentioned above that the length of the color code seemed interesting. Looking at the full dataset, I see that 39% of color codes are #RGB while 61% are #RRGGBB.
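A split like this can be computed from the per-color counts by grouping on the length of the hex string. The sketch below weights each color by how often it occurs, which is an assumption on my part (the split could also be taken over unique codes); the sample counts are invented.

```python
from collections import Counter

def length_shares(counts):
    """Share of color occurrences by hex-string length (3 = #RGB, 6 = #RRGGBB)."""
    by_length = Counter()
    for color, n in counts.items():
        by_length[len(color)] += n
    total = sum(by_length.values())
    if total == 0:
        return {}
    return {length: n / total for length, n in by_length.items()}

# Invented counts, not the real dataset.
sample = Counter({"fff": 3, "000": 1, "ffffff": 5, "abcdef": 1})
shares = length_shares(sample)
print(shares)
```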
This is really cool, thanks for digging into this, Paul! A couple of observations:
WHERE url LIKE "%.css"
This restricts the search to files ending in .css, but people frequently use cache-busting parameters after the file extension. You could work around that by changing the clause to WHERE url LIKE "%.css%", but that brings me to my next thought. HTML documents can also have inline style blocks. So maybe instead of matching file extensions, you could look at the content type of the response and filter out resources that aren’t HTML or stylesheets.
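To illustrate the difference between the two filters, here is a hedged Python sketch. `is_css_url` checks only the path part of the URL, so a query string like ?v=123 does not matter, and `is_css_or_html` checks a Content-Type header value. Both function names and the example URLs are made up, and how the content type is exposed in the HTTP Archive tables is not something this sketch assumes.

```python
from urllib.parse import urlsplit

def is_css_url(url):
    """True if the URL path ends in .css, ignoring query string and fragment."""
    return urlsplit(url).path.lower().endswith(".css")

def is_css_or_html(content_type):
    """True if a Content-Type header value is a stylesheet or an HTML document."""
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime in {"text/css", "text/html"}

print(is_css_url("https://example.com/site.css"))        # plain stylesheet URL
print(is_css_url("https://example.com/site.css?v=123"))  # cache-busting parameter
print(is_css_url("https://example.com/app.css.map"))     # would match LIKE "%.css%"
print(is_css_or_html("text/css; charset=utf-8"))
```

Filtering on the content type has the added benefit of covering stylesheets served from URLs with no .css in them at all.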
r'color[\s]?:[\s]?#([A-Fa-f0-9]+)'
This is limited to colors represented in hex, but there are many other representations (hsl, rgba, color name). If your regex fu is strong, you could try extracting just the style value after the color property name and grouping by the values. You should also strip out things like !important. One interesting analysis may just be to limit the query to color names (black, white, red, palegoldenrod, etc). Since the list is finite, it may be easier to query against a whitelist. That was mentioned in the tweet thread, so it may be the original intent.
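Here is a rough sketch of the whitelist idea in Python: capture the value after the color property, strip !important, and count only values that are known color names. `NAMED_COLORS` is a tiny illustrative subset rather than the full CSS list, the sample CSS is invented, and restricting the match to the color property itself (not background-color and friends) is a choice made for this sketch.

```python
import re
from collections import Counter

# Tiny illustrative subset; the real whitelist would be the full CSS named-color list.
NAMED_COLORS = {"black", "white", "red", "palegoldenrod", "mediumspringgreen"}

# The lookbehind skips properties like background-color; the group captures
# everything up to the next ";" or "}".
DECL_RE = re.compile(r'(?<![-\w])color\s*:\s*([^;}]+)')

def named_color_counts(css):
    """Count named colors used as values of the color property."""
    counts = Counter()
    for value in DECL_RE.findall(css.lower()):
        value = value.replace("!important", "").strip()
        if value in NAMED_COLORS:
            counts[value] += 1
    return counts

css = ("a { color: RED !important; } b { background-color: white; } "
       "c { color:palegoldenrod } d { color: #fff; }")
counts = named_color_counts(css)
print(counts)
```

Dropping the lookbehind would bring background-color, border-color, and so on back in, which may or may not be what the original question intended.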
Did some more analysis to bucket the colors from Paul’s data down to just 125 and find the most common ones.
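The post doesn't say how the bucketing was done. Since 125 is 5 × 5 × 5, one plausible approach (purely an assumption here) is to snap each of the R, G, and B channels to the nearest of 5 evenly spaced levels. A minimal sketch of that idea, with `bucket_color` as a hypothetical helper:

```python
def bucket_color(hex_color, levels=5):
    """Snap a 3- or 6-digit hex color to the nearest of levels**3 buckets.

    Returns the bucket as a 6-digit hex string, or None for other lengths.
    """
    hex_color = hex_color.lower()
    if len(hex_color) == 3:
        # Expand shorthand: "abc" -> "aabbcc"
        hex_color = "".join(ch * 2 for ch in hex_color)
    if len(hex_color) != 6:
        return None
    step = 255 / (levels - 1)  # spacing between levels on one channel
    channels = [int(hex_color[i:i + 2], 16) for i in (0, 2, 4)]
    snapped = [round(round(c / step) * step) for c in channels]
    return "".join(f"{c:02x}" for c in snapped)

print(bucket_color("fe0102"))  # a near-red lands in the pure red bucket
print(bucket_color("fff"))     # shorthand white
```

Summing the per-color counts by bucket would then give a ranking over at most 125 colors instead of more than a million.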
White, black, and grey dominate the top, along with moderate blues and reds, while neon greens, purples, and pinks dominate the bottom. Interesting stuff!
This question was answered in the Web Almanac 2022 CSS Survey.
They sampled sites and found the least used named color was mediumspringgreen.