I only recently found out about the HTTP Archive (through the Almanac plans), and got excited to start digging into the archive/Lighthouse data to see if I could find something interesting.
Visualize how front-end performance technologies (preload, prefetch, HTTP/2, compression) are adopted over time.
https://docs.google.com/spreadsheets/d/1v5ylvMbIQKAWgXiRerA84hR_kyryo7WStGruegv9SUQ/edit?usp=sharing (has a chart; I just need to understand the Lighthouse scoring system better, see SQL below)
https://docs.google.com/spreadsheets/d/1LQWG4671sjYO0k0VlyCojAAV-ukLCgu54SoYVAi8IMA/edit?usp=sharing (has a chart, but not much to see: only ~2% variation; maybe use averages?)
Visualize whether TTFB and First Meaningful Paint (FMP) take up a smaller percentage of total load time over time.
https://docs.google.com/spreadsheets/d/1cqMKGAhWU5kqLLvIO1Y5uRce0mdLtTRGZzmORA9XdCI/edit?usp=sharing (has a chart; it also needs some work)
(all three limited to the top 10k hosts)
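For the TTFB/FMP idea, the rough shape of such a query could look like this. This is only a sketch: the `summary_pages` table and the `TTFB`/`fullyLoaded` columns are my assumptions from reading the HTTP Archive BigQuery schema, and the dataset/crawl filtering may need adjusting.

```sql
-- Sketch: median TTFB as a share of fully-loaded time, per desktop crawl.
-- Assumptions: httparchive.summary_pages.* tables with TTFB and
-- fullyLoaded columns in milliseconds; verify against the live schema.
SELECT
  _TABLE_SUFFIX AS crawl,
  APPROX_QUANTILES(TTFB / fullyLoaded, 100)[OFFSET(50)] AS median_ttfb_share
FROM
  `httparchive.summary_pages.*`
WHERE
  ENDS_WITH(_TABLE_SUFFIX, 'desktop')
  AND fullyLoaded > 0
GROUP BY crawl
ORDER BY crawl
```

Plotting `median_ttfb_share` per crawl would show whether the TTFB portion of total load time shrinks over time; FMP would need a similar ratio pulled from the Lighthouse JSON instead.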
This topic will also log my progress and challenges, which may help out other beginners:
- It took me some time to accept the limitation that all data relates to the homepage only. For flagging technologies, for example, I expected whole-host coverage; it's hard to make a general statement like "technology X powers some % of the web" from homepages alone.
- It took me some time to understand the various BigQuery tables, their contents and, more importantly, their differences.
- I miss the Alexa rank for URLs. While 4M URLs is nice, limiting the data to the top [x] could be more representative. It takes expensive SELECT hacks to get the rank value back.
- Lighthouse’s SEO tests aren’t very insightful.
- BigQuery Standard SQL is actually pretty accessible, and the web UI works great, especially its debugging details. But seeing a "This query will process 3.49 TB when run." notice when you have only just started learning a system is a mental (and financial) threshold.
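The "SELECT hack" for rank mentioned above could look roughly like this: join a recent crawl against an older crawl that still carried a rank column. The table names and dates here are purely illustrative assumptions, not a tested query.

```sql
-- Sketch: re-attach the (stale) Alexa rank by joining a recent crawl
-- against an older crawl that still had a rank column, then keep only
-- the top 10k. Table names/dates are illustrative; scanning both
-- tables is what makes this expensive.
SELECT
  recent.url,
  old.rank
FROM
  `httparchive.summary_pages.2018_09_01_desktop` AS recent
JOIN
  `httparchive.summary_pages.2017_09_01_desktop` AS old
USING (url)
WHERE
  old.rank <= 10000
```

The obvious caveat is that the rank is a year stale, so "top 10k" is only approximate.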
Idea #1 SQL: