Tracking Page Weight Over Time


#1

As of July 2017, the “average” page weight is 3MB. @Tammy wrote an excellent blog post about HTTP Archive page stats and trends. Last year @igrigorik published an analysis on page weight using CDF plots. And of course, we can view the trends over time on the HTTP Archive trends page. Since this is all based on HTTP Archive data, I thought I’d start a thread here to continue the discussion on how to gauge the increase in page weight over time.

To avoid falling into the trap of averages, I decided to run a query that will show us the trend for not just the average, but also the median and various percentiles. I created a Standard SQL query to extract this data.

But first a few notes about this query:

  • I used a wildcard table name. However since BigQuery doesn’t support wildcard of the pattern httparchive.runs.20*_pages I decided to use httparchive.runs.20* with a WHERE clause on _TABLE_SUFFIX LIKE '%_pages' to limit the amount of data processed.
  • I created a user defined JavaScript function that converts the _TABLE_SUFFIX variable to a mm/dd/yyyy format. This made it easier to chart the results after exporting the data without having to spend time manipulating the label field later.
  • I’m specifically querying for results where the bytesTotal is > 0 bytes. Since March 2017 there have been ~3000 sites per month that are logged with 0 byte responses and I didn’t want that to influence the trend.

Here’s the query :

CREATE TEMPORARY FUNCTION tableid_to_date(tableid STRING)
RETURNS STRING
LANGUAGE js AS """
  try {
    var parts = tableid.split('_');    
    date_string = parts[1] + '/' + parts[2] + '/20' + parts[0];
    return date_string;
  } catch (e) {
    return '';
  }
""";

SELECT tableid_to_date(_TABLE_SUFFIX) as Date, count(*) as Sites,
       APPROX_QUANTILES(bytesTotal/1024, 100)[SAFE_ORDINAL(25)] AS `Pct25th`,     
       APPROX_QUANTILES(bytesTotal/1024, 100)[SAFE_ORDINAL(50)] AS `Median`,      
       APPROX_QUANTILES(bytesTotal/1024, 100)[SAFE_ORDINAL(75)] AS `Pct75th`,
       APPROX_QUANTILES(bytesTotal/1024, 100)[SAFE_ORDINAL(85)] AS `Pct85th`,
       APPROX_QUANTILES(bytesTotal/1024, 100)[SAFE_ORDINAL(95)] AS `Pct95th`,
       AVG(bytesTotal/1024) as Average
FROM `httparchive.runs.20*`
WHERE  _TABLE_SUFFIX LIKE '%_pages' AND bytesTotal > 0
GROUP BY Date

When viewed on a line chart, it’s interesting to note that on May 1, 2017 the average page weight increased considerably, while the median, 75th percentile and even 85th percentile dropped. The 95th percentile continued to rise, indicating that the recent surge over 3MB was largely a result of 5% of the largest pages.

To explore this deeper I decided to graph the rate of change for each of these metrics from 2016 until now. While change in page weight over time is very gradual, there have been a few noticeable dates where the average page weight increased. For example, in May 2016 there was an interesting jump in page weight for ~ half of the pages. The jump on May 1st, 2017 continues to be of interest because it seems to have been triggered by an increased in the largest pages.
image

If we step back and look at this as a yearly trend, it seems that since 2016 the page weight growth across some websites has increased at a much slower rate. In fact 50% of sites may have actually managed to reduce their page weight slightly. This is fantastic for those sites - and further investigation can be done to see how they are slowing growth (hint: let’s discuss below!). Another ~35% of sites are still showing a slowdown in the rate that their page weights increase. And then there’s the 15%'ers…

There’s so much more that can be done to analyze this data. Let’s continue the discussion below, go beyond the average page weight and extract some deeper insight from this data!


Font transfer size & font requests
#2

as I said on Twitter great work, this is really fascinating to me albeit a tad over my head. Do you think the proliferation of CMS"s and eCommerce frameworks (plugins, scripts) etc is to blame? Or do you think this is just a symptom of web 2.0?

I wonder if these page jumps are due to large retailers adding popups / scripts / or popup videos for example to their websites for specials they are running and/or holidays?

A lot of what I do deals with page and site speed and I just see so many ridiculous examples every day. Pages that have 20 images each one 2000 x 2000 pixels (displayed at 100x100) 1Mb each, vidoes, 20 active plugins etc. It is great seeing AMP and other projects from FB and Google but I really think its all about developers needing to do their part to reduce page sizes. After all, its all about global warming, right? :slight_smile:


#3

I suspect both of those are could be to changes (or faults) in our pipeline:

  • HTTP Archive is now powered by Chrome - circa March/May 2016. Have a feeling this can explain the difference: we got different content when fetching with Chrome UA – sadly, UA sniffing is a thing.
  • ~May 2017 we moved to Linux agents, which resulted in some hiccups.

/cc @rviscomi @patmeenan for sanity check. That aside…


And that’s why it’s critical that we look at the full distribution, instead of just looking at averages. Awesome analysis Paul and thanks for digging into this data! Lots more to explore here.