How many sites need codesplitting?


#1

Yes, codesplitting is a great best practice for sites that combine all their scripts into a single, monolithic script. I wanted to find out: How many sites need codesplitting?

The hard part isn’t writing a query, it’s defining “need codesplitting”. Here are some filters I applied:

  • I only look at 1st party resources. In my experience teams don’t try to combine 3rd party scripts into their single payload.
  • I only look at sites with at least 200K of 1st party JS. I chose this as the threshold where combining into a single file would cause the problems codesplitting solves. Although, it’s easy to change that in the query and run other thresholds.

Given those filters, I created a histogram of the ratio of the largest 1st party script over the total 1st party JS size. That’s hard to understand so let’s look at the query and results and revisit the histogram.

Here’s the query:

SELECT biggest_percent_bucket, round(100*count(*)/303067) as percent_of_sites, count(*) as num
FROM (
  SELECT pageid, 
       max(JS_KB) as biggest_JS_KB, 
       sum(JS_KB) as total_JS_KB, 
       round(100 * max(JS_KB) / sum(JS_KB)) as biggest_percent_of_total,
       floor(0.9 + (10 * (max(JS_KB)/sum(JS_KB))))*10 as biggest_percent_bucket
  FROM (
    SELECT p.pageid as pageid,
       round(respSize/1024) as JS_KB
    FROM `httparchive.summary_pages.2018_10_01_desktop` as p,
         `httparchive.summary_requests.2018_10_01_desktop` as r
    WHERE p.pageid = r.pageid
          AND ( resp_content_type like "%application/javascript%" OR resp_content_type like "%text/javascript%" OR resp_content_type like "%application/x-javascript%" OR resp_content_type like "%application/json" )
          AND round(respSize/1024) > 0
          AND STRPOS(NET.REG_DOMAIN(r.url),REGEXP_EXTRACT(NET.REG_DOMAIN(p.url), r'([\w-]+)'))>0
  )
  GROUP BY pageid
)
WHERE total_JS_KB > 200
GROUP BY biggest_percent_bucket
ORDER BY biggest_percent_bucket asc

Here are the results:
image

The column definitions are:

  • “biggest_percent_bucket” is the percentage of the biggest 1st party script out of the total size of all 1st party scripts. This is broken into buckets 1-10, 11-20, etc.
  • “num” is the number of sites that fall into the bucket
  • “percent_of_sites” is the percent of sites that fall into the bucket

For example, the last row of the table says: 10% of sites have a monolithic 1st party script that represents more than 90% of the 1st party total JS size. The second-to-last row says: 5% of sites have a monolithic script that is 81-90% of the total JS size. Etc.

I’m not sure why my query produces a “0.0” row, but it’s only 11 sites so I didn’t worry about. Tips on doing better bucketing are appreciated.

If you’d like to explore other “JS total size” thresholds just edit the where condition “total_JS_KB > 200”. If you’d like to include 3rd party scripts just remove the where condition “STRPOS(NET.REG_DOMAIN(r.url),REGEXP_EXTRACT(NET.REG_DOMAIN(p.url)”.


#2

If max(JS_KB)/sum(JS_KB) < 0.01,
then (10 * (max(JS_KB)/sum(JS_KB)) < 0.1,
then 0.9 + (10 * (max(JS_KB)/sum(JS_KB))) < 1.0,
then floor(0.9 + (10 * (max(JS_KB)/sum(JS_KB)))) == 0.0


#3

This is fascinating. I wonder how much the distributions would differ if we focused on the top 1000 sites?

(I have a loose hypothesis that the long-tail of primarily blog/content sites don’t load up as much heavy script but could be wrong).


#4

I wish it was easy to do that but the “rank” field in HTTP Archive was disabled awhile back. It’s probably possible (see here) but someone with better SQL chops than I needs to take a swing.