Tracking Page Weight Over Time

As of July 2017, the “average” page weight is 3MB. @Tammy wrote an excellent blog post about HTTP Archive page stats and trends. Last year @igrigorik published an analysis on page weight using CDF plots. And of course, we can view the trends over time on the HTTP Archive trends page. Since this is all based on HTTP Archive data, I thought I’d start a thread here to continue the discussion on how to gauge the increase in page weight over time.

To avoid falling into the trap of averages, I decided to look at the trend for not just the average, but also the median and various other percentiles. I created a Standard SQL query to extract this data.

But first a few notes about this query:

  • I used a wildcard table name. However, since BigQuery doesn’t support a wildcard of the pattern httparchive.runs.20*_pages, I used httparchive.runs.20* with a WHERE clause on _TABLE_SUFFIX LIKE '%_pages' to limit the amount of data processed.
  • I created a user-defined JavaScript function that converts the _TABLE_SUFFIX variable to a mm/dd/yyyy format. This made it easier to chart the results after exporting the data, without having to spend time manipulating the label field later.
  • I’m specifically querying for results where bytesTotal is > 0 bytes. Since March 2017 there have been ~3000 sites per month logged with 0 byte responses, and I didn’t want that to influence the trend (a quick sanity check for this is sketched right after this list).
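That sanity check is just a minimal sketch that counts the zero-byte rows per crawl, using the same wildcard/suffix pattern as the main query below:

-- Count pages logged with a 0 byte response, per crawl
SELECT _TABLE_SUFFIX AS Run, COUNT(*) AS ZeroBytePages
FROM `httparchive.runs.20*`
WHERE _TABLE_SUFFIX LIKE '%_pages' AND bytesTotal = 0
GROUP BY Run
ORDER BY Run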

Here’s the query:

CREATE TEMPORARY FUNCTION tableid_to_date(tableid STRING)
RETURNS STRING
LANGUAGE js AS """
  try {
    // Convert the table suffix (e.g. "17_07_01_pages") to mm/dd/yyyy
    var parts = tableid.split('_');
    var date_string = parts[1] + '/' + parts[2] + '/20' + parts[0];
    return date_string;
  } catch (e) {
    return '';
  }
""";

SELECT tableid_to_date(_TABLE_SUFFIX) AS Date, COUNT(*) AS Sites,
       APPROX_QUANTILES(bytesTotal/1024, 100)[SAFE_ORDINAL(25)] AS `Pct25th`,
       APPROX_QUANTILES(bytesTotal/1024, 100)[SAFE_ORDINAL(50)] AS `Median`,
       APPROX_QUANTILES(bytesTotal/1024, 100)[SAFE_ORDINAL(75)] AS `Pct75th`,
       APPROX_QUANTILES(bytesTotal/1024, 100)[SAFE_ORDINAL(85)] AS `Pct85th`,
       APPROX_QUANTILES(bytesTotal/1024, 100)[SAFE_ORDINAL(95)] AS `Pct95th`,
       AVG(bytesTotal/1024) AS Average
FROM `httparchive.runs.20*`
WHERE _TABLE_SUFFIX LIKE '%_pages' AND bytesTotal > 0
GROUP BY Date

When viewed on a line chart, it’s interesting to note that on May 1, 2017 the average page weight increased considerably, while the median, 75th percentile, and even the 85th percentile dropped. The 95th percentile continued to rise, indicating that the recent surge over 3MB was largely driven by the largest 5% of pages.
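To verify that divergence on a single crawl, one option is to compare the two adjacent runs directly. Here’s a rough sketch, assuming the mid-April and May 1st 2017 crawls live in the 2017_04_15_pages and 2017_05_01_pages tables:

-- Compare the distribution of the two crawls on either side of the jump
SELECT _TABLE_SUFFIX AS Run,
       APPROX_QUANTILES(bytesTotal/1024, 100)[SAFE_ORDINAL(50)] AS Median,
       APPROX_QUANTILES(bytesTotal/1024, 100)[SAFE_ORDINAL(95)] AS Pct95th,
       AVG(bytesTotal/1024) AS Average
FROM `httparchive.runs.20*`
WHERE _TABLE_SUFFIX IN ('17_04_15_pages', '17_05_01_pages')
  AND bytesTotal > 0
GROUP BY Run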

To explore this deeper I decided to graph the rate of change for each of these metrics from 2016 until now. While the change in page weight over time is very gradual, there have been a few noticeable dates where the average page weight increased. For example, in May 2016 there was an interesting jump in page weight for roughly half of the pages. The jump on May 1st, 2017 continues to be of interest because it seems to have been triggered by an increase in the size of the largest pages.
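For anyone who wants to reproduce the rate-of-change chart without a spreadsheet, it can also be computed directly in SQL. Here’s a rough sketch (not the exact query I used) that wraps the median from the query above in a LAG() window function; the other percentiles can be added the same way:

WITH trend AS (
  SELECT _TABLE_SUFFIX AS Run,
         APPROX_QUANTILES(bytesTotal/1024, 100)[SAFE_ORDINAL(50)] AS Median
  FROM `httparchive.runs.20*`
  WHERE _TABLE_SUFFIX LIKE '%_pages' AND bytesTotal > 0
  GROUP BY Run
)
SELECT Run, Median,
       -- Run sorts chronologically because the suffix is yy_mm_dd_pages
       ROUND(100 * (Median - LAG(Median) OVER (ORDER BY Run))
                 / LAG(Median) OVER (ORDER BY Run), 2) AS MedianChangePct
FROM trend
ORDER BY Run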

If we step back and look at this as a yearly trend, it seems that since 2016 page weight growth for many websites has slowed considerably. In fact, 50% of sites may have actually managed to reduce their page weight slightly. This is fantastic for those sites, and further investigation can be done to see how they are slowing growth (hint: let’s discuss below!). Another ~35% of sites are still growing, just at a slower rate than before. And then there are the 15%'ers…
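One way to start that investigation is to compare the same URLs across two crawls and see how many actually got smaller. Here’s a rough sketch, assuming both pages tables can be joined on their url column, using the May 2016 and May 2017 crawls as an example:

-- Per-site change between two crawls (assumes these tables and a shared url column)
SELECT COUNT(*) AS MatchedSites,
       ROUND(100 * COUNTIF(b.bytesTotal < a.bytesTotal) / COUNT(*), 1) AS PctSmaller,
       ROUND(100 * COUNTIF(b.bytesTotal > a.bytesTotal) / COUNT(*), 1) AS PctLarger
FROM `httparchive.runs.2016_05_01_pages` a
JOIN `httparchive.runs.2017_05_01_pages` b
USING (url)
WHERE a.bytesTotal > 0 AND b.bytesTotal > 0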

There’s so much more that can be done to analyze this data. Let’s continue the discussion below, go beyond the average page weight and extract some deeper insight from this data!


As I said on Twitter, great work. This is really fascinating to me, albeit a tad over my head. Do you think the proliferation of CMSs and eCommerce frameworks (plugins, scripts, etc.) is to blame? Or do you think this is just a symptom of web 2.0?

I wonder if these page jumps are due to large retailers adding popups, scripts, or popup videos to their websites for specials they are running and/or holidays?

A lot of what I do deals with page and site speed, and I just see so many ridiculous examples every day. Pages that have 20 images, each one 2000 x 2000 pixels (displayed at 100x100) and 1MB each, videos, 20 active plugins, etc. It is great seeing AMP and other projects from FB and Google, but I really think it’s all about developers needing to do their part to reduce page sizes. After all, it’s all about global warming, right? :slight_smile:

I suspect both of those could be due to changes (or faults) in our pipeline:

  • HTTP Archive is now powered by Chrome (circa March/May 2016). I have a feeling this can explain the difference: we got different content when fetching with a Chrome UA; sadly, UA sniffing is a thing.
  • ~May 2017 we moved to Linux agents, which resulted in some hiccups.

/cc @rviscomi @patmeenan for sanity check. That aside…


And that’s why it’s critical that we look at the full distribution instead of just the averages. Awesome analysis, Paul, and thanks for digging into this data! Lots more to explore here.


Revisiting this topic with 2021 data. Back when I wrote this query, the pages summary table was in the runs dataset. Now it is stored in summary_pages. I’ve updated this query with the following changes:

  • Pointed the query at the correct summary tables.
  • I’ve simplified how I was extracting the date from the table suffix so that it no longer requires JavaScript.
  • I’ve updated the approximate aggregate function to increase the precision (since I noticed that the results from my previous query do not match what is reported in the curated stats).

Here’s the updated SQL:

SELECT 
  SUBSTR(_TABLE_SUFFIX,12) AS client,
  REPLACE(SUBSTR(_TABLE_SUFFIX, 0, 10), '_', '-') AS yyyymmdd, 
  COUNT(*) AS sites,
  ROUND(APPROX_QUANTILES(bytesTotal, 1001)[OFFSET(101)] / 1024, 2) AS p10,
  ROUND(APPROX_QUANTILES(bytesTotal, 1001)[OFFSET(251)] / 1024, 2) AS p25,
  ROUND(APPROX_QUANTILES(bytesTotal, 1001)[OFFSET(501)] / 1024, 2) AS p50,
  ROUND(APPROX_QUANTILES(bytesTotal, 1001)[OFFSET(751)] / 1024, 2) AS p75,
  ROUND(APPROX_QUANTILES(bytesTotal, 1001)[OFFSET(851)] / 1024, 2) AS p85,
  ROUND(APPROX_QUANTILES(bytesTotal, 1001)[OFFSET(901)] / 1024, 2) AS p90,
  ROUND(APPROX_QUANTILES(bytesTotal, 1001)[OFFSET(951)] / 1024, 2) AS p95
FROM 
  `httparchive.summary_pages.*`
WHERE   
  bytesTotal > 0
GROUP BY
  client,
  yyyymmdd
ORDER BY
  client,
  yyyymmdd

One important thing to note is that the dataset size has changed a few times over the years. Originally, the dataset was based on the Alexa top sites. Back in 2010 there were ~16k desktop sites measured. That increased as testing capacity grew, to upwards of 300,000 sites in 2012 and 500,000 sites in 2014. In 2018 the dataset changed from the Alexa list to the Chrome User Experience Report, which brought us to over 1 million sites. From 2019 onwards, the dataset continued to grow to the point where it’s now over 7 million sites. I’ve written about the growth of the web from a CrUX perspective here as well.

In the 4 years since this post was published, the page weight has continued to increase linearly at each percentile. Here’s a breakdown of Desktop page weight by month.

as well as Mobile

More than 15% of mobile homepages (i.e., roughly 1 million of the ~7 million sites tracked) are larger than 5MB!
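That figure can be checked with a quick one-off query. Here’s a minimal sketch, assuming the most recent mobile crawl at the time is in summary_pages.2021_07_01_mobile:

-- Share of mobile homepages over 5MB in a single crawl (table name is an example)
SELECT COUNT(*) AS Sites,
       COUNTIF(bytesTotal > 5 * 1024 * 1024) AS SitesOver5MB,
       ROUND(100 * COUNTIF(bytesTotal > 5 * 1024 * 1024) / COUNT(*), 1) AS PctOver5MB
FROM `httparchive.summary_pages.2021_07_01_mobile`
WHERE bytesTotal > 0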

Here’s the data and graphs in case you want to explore this some more - HTTP Archive Page Weight Percentiles - Google Sheets
