Lighthouse scores as predictors of page-level CrUX data

Inspired by Rick’s earlier thread “Correlating Lighthouse scores to page-level CrUX data”, I started looking at the question of how to evaluate the Lighthouse score as a predictor of field data. As discussed in that thread, basing it on raw correlation alone is problematic, but what is a good way to measure efficacy of the Lighthouse score in terms of Core Web Vitals (CWV)?

A Lighthouse scoring FAQ gives one way of approaching the question: “the lower the [Lighthouse] score, the more likely the user will struggle with load performance, responsiveness, or content stability”. This likelihood is something we can quantify:

If we have two pages with different Lighthouse scores, is the higher-scored page more likely to meet the recommended Core Web Vitals thresholds?

Limiting the conclusions to just the URLs in the September 2021 HTTP Archive dataset (“if we have two pages from the HTTP Archive dataset with different Lighthouse scores…”), we can assess the likelihood as the percentage of pages that meet all three Core Web Vital “good” thresholds at each Lighthouse score, 0-100.
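As a rough sketch of that computation (not the actual notebook code), assuming the joined query results from the end of this post have been loaded into a pandas DataFrame df with an lh_perf_score column and the has_good_lcp / has_good_cls / has_good_fid booleans (the function name here is mine):

def pct_meeting_all_cwv(df):
    # True where the page meets all three CWV "good" thresholds in the field.
    meets_all = df["has_good_lcp"] & df["has_good_cls"] & df["has_good_fid"]
    # Fraction of passing pages at each Lighthouse score, as a percentage.
    return meets_all.groupby(df["lh_perf_score"]).mean() * 100

Plotting pct_meeting_all_cwv(df) against the Lighthouse score gives the curve discussed below.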

There’s some noise and the passing rate starts to level off at the higher Lighthouse scores, but we can say pretty definitively that the higher the LH score, the more likely a page is to meet all three CWV thresholds, at least for these URLs.

But this is hiding some detail: a page easily meeting two of the CWV thresholds and just barely missing the third would show up as a “failure” in this graph. Looking at the percentage meeting at least two and then at least one threshold starts to give some idea of the full shape of how better Lighthouse scores correspond with better CWV numbers:

This reveals that a large number of the pages that score well in Lighthouse but do not meet all the recommended CWV thresholds are really only missing one of the three thresholds.
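The "at least two" and "at least one" variants are a small tweak on the same sketch, counting how many of the three thresholds each page meets:

def pct_meeting_at_least(df, k):
    # Number of "good" CWV thresholds met per page (0-3).
    n_met = (
        df["has_good_lcp"].astype(int)
        + df["has_good_cls"].astype(int)
        + df["has_good_fid"].astype(int)
    )
    # Percentage of pages at each Lighthouse score meeting at least k thresholds.
    return (n_met >= k).groupby(df["lh_perf_score"]).mean() * 100

pct_meeting_at_least(df, 3) reproduces the first curve; k=2 and k=1 give the two looser ones.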

A continuous approach

But what about when a page almost meets a threshold? It seems strange to treat a page with a 3-second LCP in the field the same as a page with a 10-second LCP. To help include pages that are approaching good CWV thresholds, we can formulate a kind of “continuous CWV assessment”.

(I want to triply emphasize and underline that this “continuous CWV assessment” is something I’m making up right now for visualizing data because it’s a convenient way to collapse three dimensions into one, and is not in any way a real thing or something anyone (AFAIK) has considered using for anything ever)

For each metric:

  • a value in the “good” range gets 100
  • a value in the “poor” range gets 0
  • a value in the “needs improvement” range is put on a linear ramp between 0 and 100

The overall assessment is the mean of all the metric assessments.

So for instance, a page with an LCP of 3.25 seconds gets a 50 for LCP because that’s half of the way between the 2.5-second “good” threshold and the 4-second “poor” threshold. If the page fully meets the FID “good” threshold but has a “poor” CLS it would get a 100 and a 0 for each of those metrics, respectively, and its overall score would be (50 + 100 + 0) / 3 = 50.
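In code, the assessment is roughly the following (a sketch only, using the published "good"/"poor" CWV boundaries, with LCP and FID in milliseconds; as stressed above, this is just a visualization aid):

# (good, poor) boundaries for each metric.
THRESHOLDS = {
    "lcp": (2500, 4000),   # milliseconds
    "fid": (100, 300),     # milliseconds
    "cls": (0.1, 0.25),    # unitless
}

def metric_assessment(value, good, poor):
    # 100 in the "good" range, 0 in the "poor" range, linear ramp in between.
    if value <= good:
        return 100.0
    if value >= poor:
        return 0.0
    return 100.0 * (poor - value) / (poor - good)

def continuous_cwv_assessment(lcp_p75, fid_p75, cls_p75):
    # Overall assessment is the mean of the three per-metric assessments.
    return (
        metric_assessment(lcp_p75, *THRESHOLDS["lcp"])
        + metric_assessment(fid_p75, *THRESHOLDS["fid"])
        + metric_assessment(cls_p75, *THRESHOLDS["cls"])
    ) / 3

# The worked example above: 3.25s LCP (50), "good" FID (100), "poor" CLS (0).
assert continuous_cwv_assessment(3250, 100, 0.3) == 50.0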

Any attempt to summarize three such different things as LCP, FID, and CLS in a single number is going to be difficult and have tradeoffs, but I like that a page struggling with one metric doesn’t have its efforts in the other two metrics disregarded, and I like that it recognizes that “almost good” isn’t the same as “really poor”.

If we then look at various percentiles of this assessment at each Lighthouse score, we get the following results:

This is a complementary graph to the earlier ones, revealing what’s going on with pages not simultaneously meeting all three thresholds. One place this corresponds with the first graph in this post is that p50 here reaches 100 on the assessment (so, “good” on all three CWV thresholds) at about a Lighthouse score of 70, which in the original graph is when the “meets all three” line passes 50%.
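A sketch of how curves like these can be computed, assuming continuous_cwv_assessment from above is applied row-wise. The cwv_assessment column name and the particular quantiles are my own choices, not necessarily what the graph uses, and a page with no FID data is treated as fully "good" here, mirroring the optional-FID handling in the query at the end of the post:

import pandas as pd

def add_assessment_column(df):
    # Apply the continuous assessment to each page's field p75 values.
    df["cwv_assessment"] = df.apply(
        lambda row: continuous_cwv_assessment(
            row["lcp_p75"],
            0.0 if pd.isna(row["fid_p75"]) else row["fid_p75"],
            row["cls_p75"],
        ),
        axis=1,
    )
    return df

def assessment_percentiles(df, quantiles=(0.10, 0.25, 0.50, 0.75, 0.90)):
    # Percentiles of the assessment at each Lighthouse score.
    return (
        df.groupby("lh_perf_score")["cwv_assessment"]
        .quantile(list(quantiles))
        .unstack()
    )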

We can see that at all these percentiles, even across the groups of pages most struggling to meet the CWVs, there is a definite improvement in field numbers as Lighthouse scores improve. On the other hand, at the lowest percentiles there are still pages barely budging past meeting a single CWV that Lighthouse is giving a 90+ score to.

Finally, I wanted to see if we could bridge the two visualizations of the relationship between field CWV and Lighthouse scores. If we flip this graph around to make a kind of stacked bar graph at each Lighthouse score, showing the percentage of pages at that LH score that scored 100 on the CWV “continuous assessment”, the percentage that scored 99, 98, and so on, we get something like:

The blue line is the same boundary as the blue line in the first graph: below it are the pages meeting all the CWV thresholds at that Lighthouse score, and above it the pages missing at least one. The area above the dotted green line is the pages with no “good” CWV thresholds met at that Lighthouse score. The gradations between the two lines show the gradual transition between them.
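One way to sketch that view is to compute, for each Lighthouse score, the share of pages whose assessment is at or above each value from 0 to 100; this is only a rough reconstruction of the idea, not the exact plotting code:

import numpy as np
import pandas as pd

def assessment_distribution(df):
    # Rows: Lighthouse score; columns: assessment level 0..100;
    # values: share of pages at that LH score with an assessment >= that level.
    levels = np.arange(0, 101)
    rows = {
        score: [(values.to_numpy() >= level).mean() for level in levels]
        for score, values in df.groupby("lh_perf_score")["cwv_assessment"]
    }
    return pd.DataFrame.from_dict(rows, orient="index", columns=levels)

The column for level 100 then corresponds to the "meets all three" boundary (the blue line).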

What is Lighthouse missing?

Despite the strong connection between Lighthouse scores and pages meeting the CWV thresholds, there are still a decent number of pages that Lighthouse gives a good score but that don’t meet the “good” threshold on one or more of the CWV metrics. Of the pages that got a 90+ in Lighthouse in September, 43% missed at least one of the CWV thresholds. So what kind of problems is Lighthouse missing?

Some of the discrepancy is expected due to testing, machine, network, location, and especially site-content variability. That’s not measurement error; it’s real variability that comes from existing in the real world, and a single Lighthouse run on a single machine, in a single place, at a single time isn’t going to be able to capture it all. However, there are undoubtedly signals that Lighthouse is missing that could be used to improve its accuracy.

Of those 43% of Lighthouse 90+ pages with not-quite-good Core Web Vitals (a sketch for reproducing this kind of breakdown follows the list):

  • 46.6% miss only the FID threshold
  • 30.8% miss only the LCP threshold
  • 8.9% miss only the CLS threshold
  • 6.6% miss LCP and CLS
  • 4.8% miss LCP and FID
  • 1.8% miss CLS and FID
  • 0.6% miss all three thresholds
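As referenced above, here is the kind of sketch that breakdown comes from: take the Lighthouse 90+ pages missing at least one threshold and count which combination of thresholds each one misses (again assuming the joined df from the queries below):

import pandas as pd

def failure_breakdown(df):
    meets_all = df["has_good_lcp"] & df["has_good_cls"] & df["has_good_fid"]
    # Lighthouse 90+ pages that miss at least one CWV threshold.
    subset = df[(df["lh_perf_score"] >= 90) & ~meets_all]
    missing = pd.DataFrame({
        "LCP": ~subset["has_good_lcp"],
        "FID": ~subset["has_good_fid"],
        "CLS": ~subset["has_good_cls"],
    })
    # Label each page with the set of thresholds it misses, then tally percentages.
    combos = missing.apply(lambda row: " + ".join(row.index[row]), axis=1)
    return combos.value_counts(normalize=True) * 100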

There’s still a ton of data here to dig into, but I wanted to include one signal we found that Lighthouse was missing:

Missing the FID threshold

It turns out that there’s a very obvious cause for a large portion of the pages missing the FID threshold: they don’t include a mobile meta viewport tag and so incur a heavy FID penalty (up to 300ms) due to double-tap-to-zoom behavior. The What’s new in Lighthouse 8.4 post explains this a bit more in depth, but because Lighthouse doesn’t simulate user clicks today, this is an interaction delay that was previously invisible to Lighthouse.

Luckily this is usually a simple fix for pages that are still actively maintained, and Lighthouse already looks for a mobile meta viewport, so it can be better incorporated into performance results. As explained in the What’s new post, a missing viewport is now highlighted in the Lighthouse performance section, and in future versions of Lighthouse there may be a perf score penalty applied to make it clear that page performance suffers without a mobile viewport.

The Lighthouse/CWV trend improves considerably when looking at the 93% of pages in this dataset that have a mobile viewport:

The rate of passing all three CWVs peaks at 77% at a Lighthouse score of 99 in this group, and of the pages with a mobile viewport and a Lighthouse score of 90+, only 29% fail to meet at least one CWV threshold. Here’s the continuous view updated to only include pages with a mobile viewport:
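For reference, restricting the earlier pass-rate sketch to this group is just a filter on the has_mobile_viewport column from the join below:

def pct_meeting_all_cwv_with_viewport(df):
    # Keep only pages where the Lighthouse viewport audit passed.
    viewport_df = df[df["has_mobile_viewport"].fillna(False).astype(bool)]
    meets_all = (
        viewport_df["has_good_lcp"]
        & viewport_df["has_good_cls"]
        & viewport_df["has_good_fid"]
    )
    return meets_all.groupby(viewport_df["lh_perf_score"]).mean() * 100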

Conclusion

To answer the question posed at the beginning of this post: this data does suggest that a site with a lower Lighthouse score is more likely to see users struggle with load performance, responsiveness, or content stability compared to a site with a higher Lighthouse score.

However, it’s also clear that for some pages, there are issues troubling performance in the field that Lighthouse isn’t (yet?) able to capture. I’d love to discuss any ideas folks have for further exploring this space, and in the meantime, remember to check both lab and field data if you’re looking to accurately assess your page’s performance :)

Source Queries

More and more I’ve been enjoying using numpy/matplotlib in a notebook to explore HTTP Archive data (using pandas.read_gbq as the connector to BigQuery), but this also makes it a little harder to document query workflows since most of the computation moves to Python.
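The loading side of that workflow is just a couple of lines (requires the pandas-gbq package). The query here is a trivial placeholder to show the shape of the call; the real analysis uses the join query at the end of this post, and "your-project-id" is whatever billing project you run BigQuery under:

import pandas as pd

# Placeholder query; swap in the full join query shown below.
query = """
SELECT url
FROM `httparchive.lighthouse.2021_09_01_mobile`
LIMIT 10
"""

# pandas.read_gbq hands the query to BigQuery and returns the result as a DataFrame.
df = pd.read_gbq(query, project_id="your-project-id")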

To join HTTP Archive’s CrUX and Lighthouse data, lately I’ve been starting by creating two temporary tables to save a lot of time and $$:

# crux_values_opt_fid_09_2021: per-URL CrUX p75 values and "good"-bucket
# densities extracted from the HTTP Archive pages payload (FID optional).
SELECT * FROM
(SELECT
    url,
    
    CAST(JSON_EXTRACT_SCALAR(payload, "$._CrUX.metrics.first_contentful_paint.histogram[0].density") AS FLOAT64) AS pct_good_fcp,
    CAST(JSON_EXTRACT_SCALAR(payload, "$._CrUX.metrics.first_contentful_paint.percentiles.p75") AS FLOAT64) AS fcp_p75,

    CAST(JSON_EXTRACT_SCALAR(payload, "$._CrUX.metrics.largest_contentful_paint.histogram[0].density") AS FLOAT64) AS pct_good_lcp,
    CAST(JSON_EXTRACT_SCALAR(payload, "$._CrUX.metrics.largest_contentful_paint.percentiles.p75") AS FLOAT64) AS lcp_p75,

    # If density is 0, the `density` property is dropped, so explicitly set to 0 if FID is defined but density is not.
    CASE
      WHEN JSON_EXTRACT(payload, "$._CrUX.metrics.first_input_delay.histogram") IS NULL
        THEN NULL
      WHEN JSON_EXTRACT_SCALAR(payload, "$._CrUX.metrics.first_input_delay.histogram[0].density") IS NULL
        THEN 0.0
      ELSE
        CAST(JSON_EXTRACT_SCALAR(payload, "$._CrUX.metrics.first_input_delay.histogram[0].density") AS FLOAT64)
      END AS pct_good_fid,
    CAST(JSON_EXTRACT_SCALAR(payload, "$._CrUX.metrics.first_input_delay.percentiles.p75") AS FLOAT64) AS fid_p75,
    
    CAST(JSON_EXTRACT_SCALAR(payload, "$._CrUX.metrics.cumulative_layout_shift.histogram[0].density") AS FLOAT64) AS pct_good_cls,
    CAST(JSON_EXTRACT_SCALAR(payload, "$._CrUX.metrics.cumulative_layout_shift.percentiles.p75") AS FLOAT64) AS cls_p75

  FROM
    `httparchive.pages.2021_09_01_mobile`)
WHERE
  pct_good_lcp IS NOT NULL AND lcp_p75 IS NOT NULL AND
  pct_good_cls IS NOT NULL AND cls_p75 IS NOT NULL

and

# lh_extract_scores_2021_09_01: Lighthouse performance score, LH version, and
# viewport audit result per final URL (error-free LH 8.x runs only).
SELECT * FROM (
  SELECT
    JSON_EXTRACT_SCALAR(report, '$.finalUrl') AS final_url,
    JSON_EXTRACT_SCALAR(report, '$.runtimeError.code') AS runtime_error_code,
    JSON_EXTRACT_SCALAR(report, '$.lighthouseVersion') AS lh_version,
    CAST(JSON_EXTRACT_SCALAR(report, '$.categories.performance.score') AS FLOAT64) AS performance_score,
    
    CASE
      WHEN CAST(JSON_EXTRACT_SCALAR(report, '$.audits.viewport.score') AS FLOAT64) = 1
        THEN TRUE
      WHEN CAST(JSON_EXTRACT_SCALAR(report, '$.audits.viewport.score') AS FLOAT64) = 0
        THEN FALSE
      ELSE
        NULL
     END AS has_mobile_viewport,
  FROM `httparchive.lighthouse.2021_09_01_mobile`
)
WHERE runtime_error_code IS NULL
  AND performance_score IS NOT NULL
  # HTTP Archive sometimes falls back to ancient versions of LH for a handful of runs. Ignore those.
  AND REGEXP_CONTAINS(lh_version, r"^8\.\d\.\d$")

The analysis in this post can then be built on the following query:

SELECT * EXCEPT (url)
FROM (
  SELECT
    url,
    lcp_p75,
    cls_p75,
    fid_p75,
    lcp_p75 <= 2500 AS has_good_lcp,
    cls_p75 <= 0.1 AS has_good_cls,
    # Treat FID as optional if insufficient data at URL-level.
    fid_p75 IS NULL OR fid_p75 <= 100 AS has_good_fid,
  FROM `project.dataset.crux_values_opt_fid_09_2021`
) INNER JOIN (
  SELECT
    final_url AS url,
    CAST(performance_score * 100 AS INT64) AS lh_perf_score,
    has_mobile_viewport,
  FROM `project.dataset.lh_extract_scores_2021_09_01`
)
USING (url)
ORDER BY lh_perf_score


Very interesting, thank you. Why all the disclaimers about generating a composite score from RUM data? If we accept the validity of generating a score from lighthouse metrics, what is so bad about a score based on field data?

Fair question :) I didn’t mean it was a bad idea; I just didn’t want anyone to think this post was a proposal for such a score instead of one attempt to get a handle on the relationship between lab and field data.
