How are JavaScript files being served and how many are there?


#1

How many scripts are being fetched from the non-HTML / non-root domain?

SELECT
 NTH(50, quantiles(num_scripts,101)) median,
 NTH(75, quantiles(num_scripts,101)) seventy_fifth,
 NTH(90, quantiles(num_scripts,101)) ninetieth,
 NTH(95, quantiles(num_scripts,101)) ninety_fifth
FROM (
  SELECT req.pageid, COUNT(req.url) as num_scripts
  FROM [httparchive:runs.latest_requests] as req JOIN (  
    SELECT HOST(url) self, pageid
    FROM [httparchive:runs.latest_pages]
  ) as pages ON pages.pageid = req.pageid
  WHERE HOST(req.url) != pages.self
    AND LOWER(mimeType) CONTAINS "script"
  GROUP BY req.pageid
)

Based on June 2013 data:

We can also ask the reverse (just remove the ! in the WHERE clause), which is to say… How many scripts are being loaded from the same host?

Are sites sharding scripts on sub-hostnames of HTML domain?

SELECT
 NTH(50, quantiles(num_scripts,101)) median,
 NTH(75, quantiles(num_scripts,101)) seventy_fifth,
 NTH(90, quantiles(num_scripts,101)) ninetieth,
 NTH(95, quantiles(num_scripts,101)) ninety_fifth
FROM (
  SELECT req.pageid, COUNT(req.url) as num_scripts
  FROM [httparchive:runs.latest_requests] as req JOIN (  
    SELECT DOMAIN(url) self, pageid
    FROM [httparchive:runs.latest_pages]
  ) as pages ON pages.pageid = req.pageid
  WHERE DOMAIN(req.url) = pages.self
    AND LOWER(mimeType) CONTAINS "script"
  GROUP BY req.pageid
)

A few more scripts in the tail, showing that some sites are indeed sharding on root domain…

So the median website has 13 scripts, 5 from the same host as the main page, and 8 from some other host. We now know the number of scripts, but from how many distinct hosts are they being fetched?

SELECT
 NTH(50, quantiles(num_hosts,101)) median,
 NTH(75, quantiles(num_hosts,101)) seventy_fifth,
 NTH(90, quantiles(num_hosts,101)) ninetieth,
 NTH(95, quantiles(num_hosts,101)) ninety_fifth
FROM (
 SELECT req.pageid, COUNT(DISTINCT HOST(req.url), 1000000) as num_hosts
 FROM [httparchive:runs.latest_requests] as req JOIN (  
   SELECT HOST(url) self, pageid
   FROM [httparchive:runs.latest_pages]
 ) as pages ON pages.pageid = req.pageid
 WHERE HOST(req.url) != pages.self
   AND LOWER(mimeType) CONTAINS "script"
 GROUP BY req.pageid
)

Looking at the median, the 8 scripts requested on hosts different than the main page are requested across 5 different domains. Yikes!


#2

It’s a pity there’s no easy way to dig into the pages to understand, when and how the scripts are being loaded i.e. are they async, are they being loaded late enough in the page to not affect the visitors experience.

I can probably guess the answers to both questions but would like a way of proving it.


#3

@andydavies actually, I think we could get some pretty good estimates. Not sure if the time resolution is good enough, but pages table contains the startedDateTime, plus onload and fullyLoaded. Similarly, we have the startedDateTime on requests… So, in theory, we could use the offsets to determine when the request is being dispatched - e.g. after onload or before.


#4

Sanity check, wouldn’t the median be the 51st record of 101 records?


#5

In retrospect, the whole 101 bit is silly, you should probably ignore it. The context is:
https://cloud.google.com/bigquery/docs/reference/legacy-sql#nth


#6

I think it still makes sense. The QUANTILES docs say to use 101 for percentiles. I’ve typically used NTH(501, QUANTILES(expr, 1001)) for added accuracy.