Correlating Lighthouse scores to page-level CrUX data

This is repeating some of the above, but I wanted a chance to weigh in because I was on vacation last week :)

As a first approximation, assuming Lighthouse’s metrics are independent and measured perfectly, the covariance between each metric’s score and the Lighthouse score would be the metric weights. Assuming unit variance for the gathered data, the Pearson correlations would be proportional to the weights as well (in LH v7, relative contributions of 0.25 for LCP, 0.05 for CLS, and 0.25 for TBT).
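
Spelling out that approximation: with S = Σⱼ wⱼ Mⱼ as the composite score and the metric scores Mⱼ independent with unit variance,

```latex
% Covariance of metric i with the composite score S = \sum_j w_j M_j:
\operatorname{Cov}(M_i, S) = w_i \operatorname{Var}(M_i) = w_i
  \quad \text{(unit variance)}
% so the Pearson correlation scales linearly with the weight:
\operatorname{Corr}(M_i, S) = \frac{w_i}{\sigma_S},
  \qquad \sigma_S = \sqrt{\textstyle\sum\nolimits_j w_j^2}
```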

Of course we don’t have unit variance (and the variance differs across metrics), and the metrics aren’t independent (e.g. TTI and TBT both capture long-task behavior, albeit different aspects of it), but it gives a starting point for expectations and is a good reminder that we don’t want the correlation between, say, LCP and the LH score to be too close to 1. As Paul wrote above, a correlation near 1 would indicate the LH score is entirely explainable by a single metric and isn’t capturing multiple aspects of page load.

As an example:

  • site A has a p75 LCP of 2s while site B has a p75 LCP of 1.5s. Looking at just those numbers, we’d want site B’s Lighthouse score to be higher to reflect the better LCP.

  • but if site A also had both a p75 FID and CLS of 0 while site B had a p75 FID of 500ms and a p75 CLS of 1, we’d almost certainly want B’s Lighthouse score to be lower, even though that would make the correlation between LCP and the Lighthouse score negative (see the toy sketch after this list).
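
To make that tradeoff concrete, here’s a toy sketch. The complementary log-normal curve shape matches how Lighthouse maps raw metric values to 0–1 scores, but the control points here are hypothetical and FID is used as a crude stand-in for TBT (which is what Lighthouse actually scores), purely for illustration:

```python
import math

def lognormal_score(value, p10, median):
    """Complementary log-normal CDF: the curve shape Lighthouse uses to
    map a raw metric value onto a 0..1 score (higher is better)."""
    if value <= 0:
        return 1.0
    mu = math.log(median)
    # CDF(p10) must equal 0.1, and the standard normal 10th-percentile
    # z-score is about -1.2816, so solve for sigma from the control points.
    sigma = (math.log(p10) - mu) / -1.2816
    z = (math.log(value) - mu) / sigma
    return 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))

# LH v7 weights for the three metrics discussed here (the real score
# also includes FCP, Speed Index, and TTI, omitted for brevity).
WEIGHTS = {"lcp": 0.25, "tbt": 0.25, "cls": 0.05}

def toy_score(lcp_ms, fid_ms, cls):
    # Hypothetical control points, not Lighthouse's actual calibration.
    scores = {
        "lcp": lognormal_score(lcp_ms, p10=2500, median=4000),
        "tbt": lognormal_score(fid_ms, p10=100, median=600),
        "cls": lognormal_score(cls, p10=0.1, median=0.25),
    }
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[m] * scores[m] for m in WEIGHTS) / total_weight

print(f"site A: {toy_score(2000, 0, 0):.2f}")    # ~0.99
print(f"site B: {toy_score(1500, 500, 1):.2f}")  # ~0.71, lower despite better LCP
```

Across just these two sites, the one with the better LCP gets the lower composite score, which is exactly the negative-correlation case described above.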

The Lighthouse score curves and metric weights are an attempt to systematize that kind of tradeoff and turn a set of heterogeneous metrics into a single scale of “worse” to “better” performance (and correlating with CWV isn’t the only goal of the LH score). There are definite downsides to flattening in that way, but that’s why the full breakdown of metrics/opportunities/diagnostics is important for anyone digging into the performance of a page beyond just the headline score.

On FID and choice of correlation coefficients

Pearson correlation isn’t a good choice for these samples because it tests only for a linear relationship between variables, and there’s no reason to assume a linear relationship between the percentage of visitors passing a threshold on a single metric and a weighted average of multiple metrics, each with a non-linear scoring transformation applied. I’m not sure we’d even want a linear relationship; we really just want the numbers to generally increase together.

Spearman’s rank correlation doesn’t require a linear relationship or assumptions about the underlying distributions that may not hold here. This is especially helpful with the somewhat degenerate pct_good_fid distribution in this sample (see the histogram in Paul’s post).
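
A quick way to see the difference: on data that increases together but non-linearly, Pearson gets dragged below 1, while Spearman, which only compares ranks, stays at exactly 1. A minimal sketch on synthetic data (nothing to do with the CrUX sample):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 500)
y = x**5  # monotonic in x, but strongly non-linear

# Pearson penalizes the curvature even though the relationship is
# perfectly monotonic; Spearman sees identical rank orderings.
print(f"pearson:  {pearsonr(x, y)[0]:.2f}")   # ~0.82
print(f"spearman: {spearmanr(x, y)[0]:.2f}")  # 1.00
```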

Using Spearman’s rank correlation gives approximately the same values for LCP and CLS (~0.53 and ~0.39), but a positive value for FID (about 0.16). Given TBT’s 0.25 weight and the differences between FID and TBT (TBT has no access to data about when users typically interact with the page, which is central to what makes FID FID, and it captures long tasks throughout the load of the page that aren’t part of FID), this seems fairly reasonable.
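
For anyone who wants to reproduce these numbers, scipy makes the swap trivial. A sketch assuming the sample is loaded into a pandas DataFrame with columns pct_good_lcp, pct_good_cls, pct_good_fid, and lh_score (hypothetical names; adjust to match the actual dataset):

```python
import pandas as pd
from scipy.stats import spearmanr

df = pd.read_csv("crux_lh_sample.csv")  # hypothetical filename

for col in ["pct_good_lcp", "pct_good_cls", "pct_good_fid"]:
    rho, p = spearmanr(df[col], df["lh_score"])
    print(f"{col}: rho = {rho:.2f} (p = {p:.2g})")
```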
