Correlating Lighthouse scores to page-level CrUX data

The pages dataset contains a new field in the JSON-encoded HAR payloads: _CrUX. As you can guess, this represents the real-user field data from the Chrome UX Report (CrUX). This is a really interesting opportunity for web transparency research because, previously, the only queryable dataset with field data from the web at large was the public CrUX dataset on BigQuery. One major limitation of that dataset is that it’s only available at the origin level. This new field in HTTP Archive is at the URL level, made possible by WebPageTest invoking the CrUX API for every URL at test time. So what can we do with URL-level CrUX data in HTTP Archive?

Here’s a query that joins the URL-level CrUX field data with the lab-based Lighthouse data to try to answer questions about the correlations between the Lighthouse performance score and the Core Web Vitals performance.

SELECT
  COUNT(0) AS urls,
  CORR(lh_perf_score, pct_good_lcp) AS lcp_correlation,
  CORR(lh_perf_score, pct_good_fid) AS fid_correlation,
  CORR(lh_perf_score, pct_good_cls) AS cls_correlation
FROM (
  SELECT
    url,
    CAST(JSON_QUERY(report, "$.categories.performance.score") AS FLOAT64) AS lh_perf_score
  FROM
    `httparchive.lighthouse.2021_05_01_mobile`)
JOIN (
  SELECT
    url,
    CAST(JSON_QUERY(payload, "$._CrUX.metrics.largest_contentful_paint.histogram[0].density") AS FLOAT64) AS pct_good_lcp,
    CAST(JSON_QUERY(payload, "$._CrUX.metrics.first_input_delay.histogram[0].density") AS FLOAT64) AS pct_good_fid,
    CAST(JSON_QUERY(payload, "$._CrUX.metrics.cumulative_layout_shift.histogram[0].density") AS FLOAT64) AS pct_good_cls
  FROM
    `httparchive.pages.2021_05_01_mobile`)
USING
  (url)
urls       lcp_correlation   fid_correlation   cls_correlation
7,282,377  0.49              -0.14             0.37

Surprisingly, the correlations for each CWV are of weak to medium strength, and for FID it’s actually negative, meaning that as the Lighthouse score goes up, real-user FID performance actually tends to get slightly worse. Of all the metrics, this makes the most sense for FID because it’s the hardest to emulate in the lab.

This uses the Lighthouse version 7.0 weighting for the performance score, which uses TBT as a lab proxy for FID. The v7.0 score weights TBT and LCP the highest at 25% each, with CLS weighted at just 5%. The v8.0 weighting changes these weights to 30%, 25%, and 15% respectively, and we’ll have the v8.0 results to analyze at the end of June 2021. It’ll be interesting to see how the correlations change across versions.
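To make the weighting shift concrete, here’s a minimal Python sketch of how a weighted performance score responds to the v7 vs. v8 weights. The per-metric 0–1 scores below are made up for illustration, and real Lighthouse first passes raw metric values through log-normal scoring curves, a step omitted here.

```python
# Sketch: the Lighthouse performance score as a weighted average of per-metric
# scores. Metric scores here are hypothetical; the scoring-curve step is omitted.

V7_WEIGHTS = {"fcp": 0.15, "si": 0.15, "lcp": 0.25, "tti": 0.15, "tbt": 0.25, "cls": 0.05}
V8_WEIGHTS = {"fcp": 0.10, "si": 0.10, "lcp": 0.25, "tti": 0.10, "tbt": 0.30, "cls": 0.15}

def perf_score(metric_scores, weights):
    """Weighted average of per-metric scores (each already scaled to [0, 1])."""
    return sum(weights[m] * metric_scores[m] for m in weights)

# A hypothetical page: decent FCP, perfect CLS, middling LCP/TBT.
page = {"fcp": 0.9, "si": 0.8, "lcp": 0.6, "tti": 0.7, "tbt": 0.5, "cls": 1.0}
print(round(perf_score(page, V7_WEIGHTS), 3))  # 0.685
print(round(perf_score(page, V8_WEIGHTS), 3))  # 0.69
```

Note how the same page scores almost identically under both weightings here; pages with poor CLS would see a bigger drop under v8, where CLS is 15% instead of 5%.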

One way to visualize this is with a scatterplot. Here’s a query that takes an evenly spaced sample of 1,000 data points, each representing the Lighthouse score and CWV performance of one page:

SELECT
  * EXCEPT (i, n)
FROM (
  SELECT
    ROW_NUMBER() OVER (ORDER BY lh_perf_score) AS i,
    COUNT(0) OVER () AS n,
    lh_perf_score,
    pct_good_cls,
    pct_good_fid,
    pct_good_lcp
  FROM (
    SELECT
      url,
      CAST(JSON_QUERY(report, "$.categories.performance.score") AS FLOAT64) AS lh_perf_score
    FROM
      `httparchive.lighthouse.2021_05_01_mobile`)
  JOIN (
    SELECT
      url,
      CAST(JSON_QUERY(payload, "$._CrUX.metrics.largest_contentful_paint.histogram[0].density") AS FLOAT64) AS pct_good_lcp,
      CAST(JSON_QUERY(payload, "$._CrUX.metrics.first_input_delay.histogram[0].density") AS FLOAT64) AS pct_good_fid,
      CAST(JSON_QUERY(payload, "$._CrUX.metrics.cumulative_layout_shift.histogram[0].density") AS FLOAT64) AS pct_good_cls
    FROM
      `httparchive.pages.2021_05_01_mobile`)
  USING
    (url)
  WHERE
    lh_perf_score IS NOT NULL AND
    pct_good_cls IS NOT NULL AND
    pct_good_fid IS NOT NULL AND
    pct_good_lcp IS NOT NULL)
WHERE
  MOD(i, CAST(FLOOR(n / 1000) AS INT64)) = 0

View the results in a spreadsheet

The built-in trend line feature in Google Sheets corroborates the Pearson coefficients calculated directly from BigQuery, with the FID trend going down and the LCP trend slightly steeper than that of CLS.

So to sum up, the Lighthouse performance score is most closely correlated with good LCP values, but not very strongly. It correlates even less strongly with CLS, and slightly negatively with FID.

5 Likes

This is some super interesting analysis, @rviscomi!

The super weak FID correlation doesn’t surprise me at all—I’ve seen a very, very weak connection between Total Blocking Time and FID in sites I’ve profiled. Part of that, I’m sure, stems from the fact that FID is a pretty limited and overly optimistic metric at the moment.

The CLS correlation being low is also not too surprising given the big difference between measuring CLS in a synthetic environment (with a very limited time window) versus the CrUX measurement (until page visibility state changes).

One follow-up I think would be super interesting: how do the individual metrics as reported by LH correlate? (E.g. LH-reported CLS correlated to good CLS in the wild.) I’d expect a stronger correlation (since the LH score factors in a handful of different metrics, which probably dilutes the correlation a bit).

1 Like

This is an interesting analysis, but we spotted a few issues that probably make these correlation results less meaningful than they appear at first glance.

Breaking it down, there are three problems: one with the correlation target, one with FID, and one with CLS.

The Correlation Target

There’ve been a lot of correlation studies looking at field and lab, but generally this happens at the metric level. (Like what Tim is getting at above).

The analysis above explores how each of the field metrics individually correlates to the lab Lighthouse score. The Lighthouse score is a composite statistic: a weighted average of six scored metrics, including signals from Speed Index, TTI, and FCP. It’s intended to represent page load performance overall, not individual aspects of performance. Each component of the score is designed to capture a different aspect of page load, so a high correlation with any single metric would signal a failure to differentiate the metrics (otherwise we should replace all those metrics with just the one! :)

Taking CLS for example: as you pointed out, it’s weighted as just 5% of the LH7 score. So the Pearson coefficient of 0.37 is describing how well the field CLS correlates to a score that is 95% composed of non-CLS metric signals. LCP is, admittedly, a fairer example as it’s the highest-weighted metric at 25%, but even there, we’re looking at a correlation of one apple to a fruit basket. :apple::mango::watermelon:

In other words, given two sites with different Lighthouse scores, we don’t expect the site with the lower score to always have a worse LCP (or lower pct_good_lcp). Its LCP could indeed be worse, but the site could also have the same LCP and a worse FCP, CLS, or Speed Index. Looking for a high Pearson coefficient from CORR(lh_perf_score, pct_good_lcp) expects even more: that the Lighthouse score moves in lockstep with pct_good_lcp and that any changes in the other metrics either cancel each other out or are negligible, which we know is not true.

Correlating to CWV assessment

A similar study would use the CWV assessment as the correlation target. (I don’t think it’s terribly meaningful, but it’s a fairly equivalent target.) I ran this and found some interesting results.

The Pearson coefficients for LCP and CLS are surprisingly similar when targeted against the CWV assessment compared to the LH score.

"lcp_crux_to_lhscore_corr":            0.491,  
"lcp_crux_to_cwv_assessment_corr":     0.456,

"cls_crux_to_lhscore_corr":            0.368,  
"cls_crux_to_cwv_assessment_corr":     0.363,

"fid_crux_to_lhscore_corr":           -0.141,  
"fid_crux_to_cwv_assessment_corr":     0.208,  

The FID story is a bit different, but that deserves its own exploration…

Problems with FID

FID in CrUX is a peculiar metric. In Chrome 90, its values were artificially inflated in some cases due to double-tap-to-zoom. It’s also challenging because of how FID’s 75th percentile manifests – e.g. here’s the histogram in that sample of URLs:

Trying to take a Pearson correlation with a bimodal distribution will generally yield unhelpful results. Moreover, correlation coefficients really only make sense given context (see Anscombe’s quartet).
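Anscombe’s quartet is easy to demonstrate with a small pure-Python Pearson implementation: the first two of its four datasets have nearly identical r despite one being roughly linear and the other a clean curve, which is exactly why a coefficient alone tells you so little.

```python
import math

def pearson(xs, ys):
    # Sample Pearson correlation: covariance over the product of std deviations.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Anscombe's quartet, datasets I (roughly linear) and II (a clean curve).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

print(round(pearson(x, y1), 3))  # ≈ 0.816
print(round(pearson(x, y2), 3))  # ≈ 0.816, despite a totally different shape
```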

The negative correlation isn’t surprising, though, since it shows up within CrUX data too: FID also correlates negatively with CrUX FCP and LCP. This makes a certain amount of sense, because one way of pushing off the first user interaction is to delay painting, so that user input lands after the long tasks have finished. Still, we shouldn’t read causation into these correlations without a proper causal analysis.

Different CLSs

CLS’s issue is straightforward enough. The data here is from M90 (and LH7), which predates the windowing adjustment to lab and field CLS. Without that adjustment, the field and lab definitions and results are substantially different. Now, thankfully, we expect to see better correlation between field CLS and lab CLS than before, though even so, the observation windows remain completely different.


Summing up… there are decent reasons why these FID and (old) CLS values don’t correlate well with the lab. And we’re quite comfortable with Rick’s results because they’re about what you’d expect when correlating a field metric to a composite score; a much higher correlation would actually be undesirable.

Thanks Brendan Kenny and Patrick Hulce for their thoughts and contributions. Big thanks to Rick for letting us get in the weeds here. ;)

5 Likes

This is repeating some of the above, but I wanted a chance to weigh in because I was on vacation last week :)

As a first approximation, assuming Lighthouse’s metrics are independent and are measured perfectly, the covariance between each metric and the Lighthouse score would be that metric’s weight. Assuming unit variance for the gathered data, the Pearson correlations would track the weights as well (in LH v7 that would be correlations of 0.25 for LCP, 0.05 for CLS, and 0.25 for TBT).

Of course we don’t have unit variance (and the variance is different for each metric) and the metrics aren’t independent (e.g. TTI and TBT both include long task behavior, albeit different aspects of it), but it gives a starting point for expectations and it’s a good reminder that we don’t want the correlation between e.g. LCP and the LH score to be too close to 1. As Paul wrote above, too close to 1 would indicate the LH score is entirely explainable by a single metric and isn’t capturing multiple aspects of page load.

As an example:

  • site A has a p75 LCP of 2s while site B has a p75 LCP of 1.5s. Looking at just those numbers we’d want site B’s Lighthouse score to be higher to reflect the better LCP.

  • but if site A also had both a p75 FID and CLS of 0 while site B had a p75 FID of 500ms and a p75 CLS of 1, we’d almost certainly want B’s Lighthouse score to be lower, even though that would mean the correlation of LCP with the Lighthouse score would be negative as a result.

The Lighthouse score curves and metric weights are an attempt to systematize that kind of tradeoff and turn a set of heterogeneous metrics into a single scale of “worse” to “better” performance (and correlating with CWV isn’t the only goal of the LH score). There are definite downsides to flattening in that way, but that’s why the full breakdown of metrics/opportunities/diagnostics are important for anyone digging into the performance of a page beyond just the headline score.
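The first-approximation covariance claim above is easy to check numerically. This is a simulation sketch with synthetic data (six independent, unit-variance “metric scores” combined with the LH v7 weights), not HTTP Archive data:

```python
import random

random.seed(42)
WEIGHTS = [0.15, 0.15, 0.25, 0.15, 0.25, 0.05]  # FCP, SI, LCP, TTI, TBT, CLS (LH v7)
N = 100_000

# Independent, unit-variance "metric scores" and the resulting weighted score.
metrics = [[random.gauss(0, 1) for _ in range(N)] for _ in WEIGHTS]
scores = [sum(w * m[i] for w, m in zip(WEIGHTS, metrics)) for i in range(N)]

def cov(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

lcp = metrics[2]
print(round(cov(lcp, scores), 2))  # ≈ 0.25, i.e. the LCP weight
```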

On FID and choice of correlation coefficients

Pearson correlation isn’t a good choice for these samples because it tests only for a linear relationship between variables, and there’s no reason to assume a linear relationship between the percentage of visitors passing a threshold on a single metric and a weighted average of multiple metrics, each with a non-linear scoring transformation applied. I’m not sure we’d even want a linear relationship; we really just want the numbers to generally increase together.

Spearman’s rank correlation doesn’t require a linear relationship or assumptions about the underlying distributions that may not hold here. This is especially helpful with the somewhat degenerate pct_good_fid distribution in this sample (see the histogram in Paul’s post).

Using Spearman’s rank correlation gives approximately the same values for LCP and CLS (~0.53 and ~0.39), but a positive value for FID (about 0.16). Given TBT’s 0.25 weight and the differences between FID and TBT (TBT has no access to data about when users typically interact with the page, which is central to what makes FID FID, and it captures long tasks throughout the load of the page that aren’t part of FID), this seems fairly reasonable.
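For illustration, here’s a small pure-Python comparison of the two coefficients on a monotone but non-linear relationship (the data is synthetic, not the HTTP Archive sample). Spearman only asks “do the values increase together?”, so it returns exactly 1.0 here while Pearson does not:

```python
import math

def pearson(xs, ys):
    # Sample Pearson correlation coefficient.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(
        sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

def spearman(xs, ys):
    # Spearman = Pearson on the ranks (assumes no tied values).
    rank = lambda vs: {v: i for i, v in enumerate(sorted(vs))}
    rx, ry = rank(xs), rank(ys)
    return pearson([rx[v] for v in xs], [ry[v] for v in ys])

x = list(range(1, 21))
y = [math.exp(v / 2) for v in x]  # monotone in x, but strongly non-linear

print(spearman(x, y))       # 1.0: perfectly monotone
print(pearson(x, y) < 1.0)  # True: the curve isn't a straight line
```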

1 Like

I’d like to weigh in from a more practical angle of how the Lighthouse (LH) score actually gets used, at least by us at Wix, and by our customers.

The way we prefer to use the LH score is for performance budgeting. That is, a single number we can use as an indication that the performance of a new build is most likely equal to or better than that of previous builds. In this regard, it does reflect an assumption that an improved LH score likely indicates that actual (field) performance will also improve.

Currently we also budget using total JS download size as we have found that downloading more JS generally translates to performance degradation, all else being equal.

It is possible that we may also decide to budget on a specific performance metric, e.g. LCP, in addition to budgeting on LH score and JS download size. If we do that, it will be in order to specifically focus on improving that particular aspect of our performance.
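As a sketch of what such a budget gate might look like in CI (the thresholds, field names, and function names here are hypothetical, not Wix’s actual tooling):

```python
# Hypothetical CI budget gate: fail the build if the Lighthouse score regresses
# below a floor or total JS bytes grow past budget. Thresholds are illustrative.
BUDGET = {"min_lh_score": 0.80, "max_js_bytes": 350_000}

def check_build(lh_score: float, js_bytes: int) -> list[str]:
    """Return a list of budget violations (empty means the build passes)."""
    failures = []
    if lh_score < BUDGET["min_lh_score"]:
        failures.append(
            f"LH score {lh_score:.2f} below budget {BUDGET['min_lh_score']:.2f}")
    if js_bytes > BUDGET["max_js_bytes"]:
        failures.append(
            f"JS payload {js_bytes} B over budget {BUDGET['max_js_bytes']} B")
    return failures

print(check_build(0.85, 300_000))       # [] — within budget, build passes
print(len(check_build(0.75, 400_000)))  # 2 — both budgets violated
```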

That said, our customers - people who use Wix to build their own websites - often do use the LH / PSI score as the “be-all and end-all” of page performance. They often perform such analysis when they don’t yet have sufficient traffic for field data. But, based on feedback we’ve received, many of them don’t really understand the difference between field and lab data. (And the PSI layout, where the field data sits in between the score and the lab data that determines it, contributes to this misunderstanding IMO.) Moreover, because LH and PSI are Google tools, they often believe LH scores have a direct impact on site ranking.

What we are seeing, in most cases where we compare the two, is that the mobile LH score generally suggests significantly worse performance than what visitors on mobile devices actually experience.

In addition to all of the above, I do think that the FID metric is particularly problematic, both due to the way it’s measured in the field and due to the use of TBT and TTI as lab substitutes.

1 Like