Analyzing Lighthouse Scores Across the Web

Lighthouse is an amazing tool that you can use to quickly audit a web page and learn how it stacks up on performance, accessibility, best practices, PWA support and more. You can run it from Chrome DevTools, run it as part of a WebPageTest measurement, or analyze the results in bulk here! A Lighthouse audit is run for every page measured in the HTTP Archive, and the results are stored in the lighthouse tables.

Recently, while analyzing a poorly performing website, I wondered how many sites in the HTTP Archive perform so badly that they get a single-digit Performance score!

Fortunately, the results are fairly easy to query. Since the Lighthouse report is stored as a JSON string, it’s just a matter of extracting the values we need with the JSON_EXTRACT function:

JSON_EXTRACT(report, '$.categories.performance.score')

If we want to analyze the scores, we’ll need to CAST the output of the expression above to a numeric type:

CAST(JSON_EXTRACT(report, '$.categories.performance.score') as NUMERIC)
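
Putting those two pieces together, here’s a minimal sketch of an extraction query (it assumes the lighthouse tables expose a url column alongside report, and note that BigQuery bills for the columns it scans even with a LIMIT):

-- Sketch: preview a few pages with their extracted Performance scores.
-- Assumes the lighthouse tables expose a `url` column alongside `report`.
SELECT url,
       CAST(JSON_EXTRACT(report, '$.categories.performance.score') as NUMERIC) AS score
FROM `httparchive.latest.lighthouse_mobile`
LIMIT 10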

I thought it would be interesting to look at the ranges of scores for the Performance category across the 4 million websites tracked in the HTTP Archive. How many score well on performance, SEO, etc.? And how many score poorly? Here’s a query that summarizes all of the Performance scores (note: this will process 895GB, which would consume most of the free monthly quota):

SELECT "Performance" as Category,
       COUNTIF(score<.10) LT10,
       COUNTIF(score >=.1 and score < .25) Between10_25,
       COUNTIF(score >=.25 and score < .50) Between25_50,
       COUNTIF(score >=.50 and score < .75) Between50_75,
       COUNTIF(score >=.75 and score < .9) Between75_90,
       COUNTIF(score >=.9) GT90
FROM (
       SELECT CAST(JSON_EXTRACT(report, '$.categories.performance.score') as NUMERIC) AS score
       FROM `httparchive.latest.lighthouse_mobile`
)

There are 444,657 websites that have a Performance score less than 10%!

We can also UNION a bunch of queries together to add the other categories to the results. The following query also processes 895GB.

SELECT "Performance" as Category,
       COUNTIF(score<.10) LT10,
       COUNTIF(score >=.1 and score < .25) Between10_25,
       COUNTIF(score >=.25 and score < .50) Between25_50,
       COUNTIF(score >=.50 and score < .75) Between50_75,
       COUNTIF(score >=.75 and score < .9) Between75_90,
       COUNTIF(score >=.9) GT90
FROM (
    SELECT CAST(JSON_EXTRACT(report, '$.categories.performance.score') as NUMERIC) AS score
    FROM `httparchive.latest.lighthouse_mobile`
)
UNION ALL
SELECT "Accessibility" as Category,
       COUNTIF(score<.10) LT10,
       COUNTIF(score >=.1 and score < .25) Between10_25,
       COUNTIF(score >=.25 and score < .50) Between25_50,
       COUNTIF(score >=.50 and score < .75) Between50_75,
       COUNTIF(score >=.75 and score < .9) Between75_90,
       COUNTIF(score >=.9) GT90
FROM (
    SELECT CAST(JSON_EXTRACT(report, '$.categories.accessibility.score') as NUMERIC) AS score
    FROM `httparchive.latest.lighthouse_mobile`
)
UNION ALL
SELECT "Best Practices" as Category,
       COUNTIF(score<.10) LT10,
       COUNTIF(score >=.1 and score < .25) Between10_25,
       COUNTIF(score >=.25 and score < .50) Between25_50,
       COUNTIF(score >=.50 and score < .75) Between50_75,
       COUNTIF(score >=.75 and score < .9) Between75_90,
       COUNTIF(score >=.9) GT90
FROM (
    SELECT CAST(JSON_EXTRACT(report, '$.categories.best-practices.score') as NUMERIC) AS score
    FROM `httparchive.latest.lighthouse_mobile`
)
UNION ALL
SELECT "SEO" as Category,
       COUNTIF(score<.10) LT10,
       COUNTIF(score >=.1 and score < .25) Between10_25,
       COUNTIF(score >=.25 and score < .50) Between25_50,
       COUNTIF(score >=.50 and score < .75) Between50_75,
       COUNTIF(score >=.75 and score < .9) Between75_90,
       COUNTIF(score >=.9) GT90
FROM (
    SELECT CAST(JSON_EXTRACT(report, '$.categories.seo.score') as NUMERIC) AS score
    FROM `httparchive.latest.lighthouse_mobile`
)
UNION ALL
SELECT "PWA" as Category,
       COUNTIF(score<.10) LT10,
       COUNTIF(score >=.1 and score < .25) Between10_25,
       COUNTIF(score >=.25 and score < .50) Between25_50,
       COUNTIF(score >=.50 and score < .75) Between50_75,
       COUNTIF(score >=.75 and score < .9) Between75_90,
       COUNTIF(score >=.9) GT90
FROM (
    SELECT CAST(JSON_EXTRACT(report, '$.categories.pwa.score') as NUMERIC) AS score
    FROM `httparchive.latest.lighthouse_mobile`
)
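
As an aside, the same breakdown can be written without repeating the bucketing logic for every category. Here’s a sketch (standard BigQuery SQL, though I haven’t compared its cost against the UNION version) that extracts each category once and then unpivots the scores with UNNEST:

-- Sketch: extract all five category scores once, then unpivot and bucket them.
WITH scores AS (
  SELECT CAST(JSON_EXTRACT(report, '$.categories.performance.score') as NUMERIC) AS performance,
         CAST(JSON_EXTRACT(report, '$.categories.accessibility.score') as NUMERIC) AS accessibility,
         CAST(JSON_EXTRACT(report, '$.categories.best-practices.score') as NUMERIC) AS best_practices,
         CAST(JSON_EXTRACT(report, '$.categories.seo.score') as NUMERIC) AS seo,
         CAST(JSON_EXTRACT(report, '$.categories.pwa.score') as NUMERIC) AS pwa
  FROM `httparchive.latest.lighthouse_mobile`
)
SELECT pair.category AS Category,
       COUNTIF(pair.score < .10) LT10,
       COUNTIF(pair.score >= .10 and pair.score < .25) Between10_25,
       COUNTIF(pair.score >= .25 and pair.score < .50) Between25_50,
       COUNTIF(pair.score >= .50 and pair.score < .75) Between50_75,
       COUNTIF(pair.score >= .75 and pair.score < .90) Between75_90,
       COUNTIF(pair.score >= .90) GT90
FROM scores,
     UNNEST([STRUCT("Performance" AS category, performance AS score),
             ("Accessibility", accessibility),
             ("Best Practices", best_practices),
             ("SEO", seo),
             ("PWA", pwa)]) AS pair
GROUP BY Category

It should return the same buckets; check the query validator’s estimate to see how the bytes processed compare before running it.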

The results count the number of sites that have Lighthouse scores of < 10%, 10-25%, 25-50%, 50-75%, 75-90% and >= 90%.

If we look at these as percentages of the total, we can see that the Performance scores follow a roughly normal distribution, with almost as many poorly performing sites as there are sites with excellent performance scores. The majority of sites pass the SEO audits. PWA support is relatively low across the web, with only 3,518 sites achieving a score above 90%. But we can track PWA adoption over time and see that the scores are increasing each month!
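
For reference, here’s a sketch of how one category’s buckets can be expressed as percentages, counting only pages that actually have a score:

-- Sketch: Performance buckets as a percentage of pages with a non-NULL score.
SELECT ROUND(COUNTIF(score < .10) / COUNT(score) * 100, 1) AS PctLT10,
       ROUND(COUNTIF(score >= .10 and score < .25) / COUNT(score) * 100, 1) AS Pct10_25,
       ROUND(COUNTIF(score >= .25 and score < .50) / COUNT(score) * 100, 1) AS Pct25_50,
       ROUND(COUNTIF(score >= .50 and score < .75) / COUNT(score) * 100, 1) AS Pct50_75,
       ROUND(COUNTIF(score >= .75 and score < .90) / COUNT(score) * 100, 1) AS Pct75_90,
       ROUND(COUNTIF(score >= .90) / COUNT(score) * 100, 1) AS PctGT90
FROM (
    SELECT CAST(JSON_EXTRACT(report, '$.categories.performance.score') as NUMERIC) AS score
    FROM `httparchive.latest.lighthouse_mobile`
)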

How about giving the web a Lighthouse summary score? We can use the following query to calculate the median score for each category across all 4 million websites (note: this is another 895GB query):

SELECT APPROX_QUANTILES(CAST(JSON_EXTRACT(report, '$.categories.performance.score') as NUMERIC), 100)[SAFE_ORDINAL(50)] AS Performance,
	   APPROX_QUANTILES(CAST(JSON_EXTRACT(report, '$.categories.accessibility.score') as NUMERIC), 100)[SAFE_ORDINAL(50)] AS Accessibility,
	   APPROX_QUANTILES(CAST(JSON_EXTRACT(report, '$.categories.best-practices.score') as NUMERIC), 100)[SAFE_ORDINAL(50)] AS BestPractices,
	   APPROX_QUANTILES(CAST(JSON_EXTRACT(report, '$.categories.seo.score') as NUMERIC), 100)[SAFE_ORDINAL(50)] AS SEO,
	   APPROX_QUANTILES(CAST(JSON_EXTRACT(report, '$.categories.pwa.score') as NUMERIC), 100)[SAFE_ORDINAL(50)] AS PWA
FROM `httparchive.latest.lighthouse_mobile` 

Clearly we have much work to do on Performance, Accessibility and Best Practices!
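
If we want to see whether that summary score is trending in the right direction, the same medians can be computed per crawl against the dated tables. A sketch, assuming the httparchive.lighthouse.YYYY_MM_DD_mobile naming convention (be warned: a wildcard like this scans every crawl it matches, so the cost multiplies accordingly):

-- Sketch: median Performance and PWA scores per monthly crawl.
-- Assumes the dated tables follow the YYYY_MM_DD_mobile naming convention.
SELECT _TABLE_SUFFIX AS crawl,
       APPROX_QUANTILES(CAST(JSON_EXTRACT(report, '$.categories.performance.score') as NUMERIC), 100)[SAFE_ORDINAL(50)] AS Performance,
       APPROX_QUANTILES(CAST(JSON_EXTRACT(report, '$.categories.pwa.score') as NUMERIC), 100)[SAFE_ORDINAL(50)] AS PWA
FROM `httparchive.lighthouse.*`
WHERE ENDS_WITH(_TABLE_SUFFIX, 'mobile')
GROUP BY crawl
ORDER BY crawl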


Brilliant. This is great insight, Paul. Thanks for this.

Well, that’s certainly depressing to see where developer concerns lie.

Additionally, I’m starting to worry that between this and the dataset changing to the CrUX report, our web analysis is becoming extremely Chrome-centric.

I’d be really curious to hear more about what your worries are. Could you elaborate?

Shouldn’t we expect a normal distribution based on the way the dataset is put together?
With the larger dataset, I’d expect the choice of CMS / theme to dominate many of these issues. There will be lots of websites clicked together and subsequently only maintained if a new software release is forced on them.


Sure thing. I worry that this situation makes it easy to create blind spots.

  • If the analyzed URLs are from Chrome, domains that work poorly in Chrome may be underrepresented.

  • Sites serving countries that prefer other browsers will be even more underrepresented, and this could be dangerous from an imperialism perspective.

  • The Chrome team publishes great performance advice, but Blink has idiosyncrasies like all engines. With data coming from sites Chrome already enjoys the most usage on, it would be really easy for Lighthouse advice to skip optimizations that see no benefit in Chrome, but do in other browsers. (Like that MR for incremental-dom that wasn’t merged because while giving Firefox a huge boost, it was negligible for Chrome.)

    • I often see advice from the Chrome team about avoiding inline SVG in favor of icon fonts, or even avoiding SVG altogether due to lack of hardware acceleration. This worries me when SVG is clearly the right tool for the job, for accessibility or robustness reasons.

  • It would be easy for Lighthouse to give performance advice that is beneficial to Chrome, but negligible or possibly even harmful to other engines. (I’m mostly thinking about Chrome’s GPU promotion heuristics here.)

  • Chrome has historically been kind of pants at accessibility and interfacing with screen-readers (see text-transform:uppercase triggering acronym pronunciation heuristics). The Lighthouse accessibility advice may be well-intentioned, but unnecessary or even harmful in the browser/reader combinations that disabled users actually employ, instead of what most devs probably test with: the unsupported VoiceOver + Chrome combination.

Ultimately, it makes me uncomfortable that the same source channels much usage of the web with its search engine, drives much of web standards, shapes user expectations and habits through an overwhelmingly popular browser, and now provides the source of what history is acceptable to analyze for trends.

The Chrome team has stated that browser competition is crucial for a healthy web: experimentation in multiple directions at once (asm.js vs. PNaCL), the ability to perform failed experiments for useful data (no interpreter in v8), and good ol’ stealing ideas from each other (Safari’s ongoing work in tracking prevention, that selector memory shootout between Chrome and FF).

The Alexa list was flawed. CrUX does contain more, less gameable data. But if we truly want to not be/do evil, we should look to supplement the CrUX report with data from other sources as well. Avoiding negligence, really.
