I only just recently found out about the HTTP Archive (through the Almanac plans), and I got excited to start digging into the archive/Lighthouse data to see if I could find something nice.
Three ideas:
- Visualize how front-end performance technologies (preload, preconnect, HTTP/2, compression) are adopted over time.
https://docs.google.com/spreadsheets/d/1v5ylvMbIQKAWgXiRerA84hR_kyryo7WStGruegv9SUQ/edit?usp=sharing (has a chart; I only need to understand the Lighthouse scoring system better, see the SQL below)
- Migration away from render-blocking JavaScript in the head over time.
https://docs.google.com/spreadsheets/d/1LQWG4671sjYO0k0VlyCojAAV-ukLCgu54SoYVAi8IMA/edit?usp=sharing (has a chart, but there is not much to see: only ~2% variation; maybe use averages?)
- Visualize whether TTFB + FMP take up a smaller share of the total load time over time (a rough sketch of such a query follows below).
https://docs.google.com/spreadsheets/d/1cqMKGAhWU5kqLLvIO1Y5uRce0mdLtTRGZzmORA9XdCI/edit?usp=sharing (has a chart, but it also still needs some work)
(all three limited to the top 10k hosts)
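For idea #3, a minimal sketch of the direction I have in mind, against the summary_pages tables. This assumes the TTFB, onLoad and rank columns behave the way I expect (and that the mobile tables carry a usable rank); FMP would still have to be pulled out of the Lighthouse report separately.

SELECT
  SUBSTR(_TABLE_SUFFIX, 0, 7) AS date,
  -- TTFB as an average percentage of the onLoad time per crawl
  ROUND(AVG(TTFB / onLoad) * 100, 1) AS ttfb_pct_of_onload
FROM
  `httparchive.summary_pages.*`
WHERE
  ENDS_WITH(_TABLE_SUFFIX, '_mobile')
  AND rank < 10000
  AND TTFB > 0
  AND onLoad > 0
GROUP BY _TABLE_SUFFIX
ORDER BY _TABLE_SUFFIX ASC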
I'll use this topic to keep logging my progress and challenges, which might help out other starters:
Challenges:
- It took me some time to accept the limitation that all the data relates to the homepage only. For flagging technologies, for example, I expected coverage of the whole host, which makes it hard to make a general statement like "technology X powers Y% of the web".
- It took me some time to understand the various BigQuery tables, their contents and, more importantly, their differences.
- I miss the Alexa rank for URLs. While 4M URLs is nice, limiting the data to the top [x] could be more representative. It takes expensive SELECT hacks to get the rank value back (see the sketch after this list).
- Lighthouse’s SEO tests aren’t very insightful.
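About that rank hack: the Lighthouse tables don't have a rank column, so it has to be pulled back in from a summary_pages table, for example by joining on host. A sketch of that pattern (the table dates here are picked arbitrarily):

SELECT
  lh.url,
  sp.rank
FROM
  `httparchive.lighthouse.2018_09_01_mobile` AS lh
JOIN (
  -- recover the rank per host from a summary_pages crawl
  SELECT
    NET.HOST(url) AS host,
    MIN(rank) AS rank
  FROM
    `httparchive.summary_pages.2018_06_01_desktop`
  WHERE
    rank < 10000
  GROUP BY
    host) AS sp
ON
  NET.HOST(lh.url) = sp.host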
Experience:
- BigQuery Standard SQL is actually pretty accessible, and the UI tool works great, especially its debugging details. But getting a "This query will process 3.49 TB when run." notice when you have only just started learning the system is a mental (and financial) threshold (see the example right below).
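What helps keep that number down while experimenting: pointing the query at a single month's table instead of the httparchive.lighthouse.* wildcard, and only touching the columns you need, since BigQuery bills per column scanned. For example, this only scans the url column of one table:

SELECT
  COUNT(DISTINCT url) AS distinct_urls
FROM
  `httparchive.lighthouse.2018_09_01_mobile`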
[WIP]
Idea #1 SQL:
-- Turns one Lighthouse audit (as a JSON string) into a number we can SUM():
-- 1/0 for pass/fail ('binary') audits, the score for 'numeric' audits, 0 otherwise.
CREATE TEMP FUNCTION isFound(json_array STRING)
RETURNS INT64
LANGUAGE js AS """
  var arr = JSON.parse(json_array);
  if (arr) {
    if (arr.scoreDisplayMode == 'binary') {
      return arr.rawValue === true ? 1 : 0;
    } else if (arr.scoreDisplayMode == 'numeric') {
      return arr.score;
    }
  }
  return 0;
""";
SELECT
  SUBSTR(_TABLE_SUFFIX, 0, 7) AS date,
  COUNT(url) AS `total`,
  COUNT(DISTINCT url) AS `distinct_total`,
  SUM(isFound(JSON_EXTRACT(report, '$.audits.uses-rel-preload'))) AS `uses_rel_preload`,
  SUM(isFound(JSON_EXTRACT(report, '$.audits.uses-rel-preconnect'))) AS `uses_rel_preconnect`,
  SUM(isFound(JSON_EXTRACT(report, '$.audits.uses-http2'))) AS `uses_http2`,
  SUM(isFound(JSON_EXTRACT(report, '$.audits.uses-text-compression'))) AS `uses_text_compression`,
  SUM(isFound(JSON_EXTRACT(report, '$.audits.uses-long-cache-ttl'))) AS `uses_long_cache_ttl`,
  SUM(isFound(JSON_EXTRACT(report, '$.audits.uses-responsive-images'))) AS `uses_responsive_images`,
  SUM(isFound(JSON_EXTRACT(report, '$.audits.font-display'))) AS `font_display`,
  SUM(isFound(JSON_EXTRACT(report, '$.audits.unminified-javascript'))) AS `unminified_javascript`,
  SUM(isFound(JSON_EXTRACT(report, '$.audits.unminified-css'))) AS `unminified_css`
FROM
  `httparchive.lighthouse.*`
WHERE
  _TABLE_SUFFIX >= '2018_10_01'
  AND ENDS_WITH(_TABLE_SUFFIX, '_01_mobile')
  AND NET.HOST(url) IN (
    SELECT
      NET.HOST(url) AS host
    FROM
      `httparchive.summary_pages.2018_06_01_desktop`
    WHERE
      rank < 10000
    ORDER BY
      rank ASC)
GROUP BY _TABLE_SUFFIX
ORDER BY _TABLE_SUFFIX ASC