What is service worker size by cdn?

Trying to take a look at service worker usage by CDN provider, and ran the following queryy:

SELECT
  _cdn_provider,
  respSize
FROM
  `httparchive.almanac.service_workers`
JOIN
  `httparchive.almanac.requests`
USING (page),
  UNNEST([10, 25, 50, 75, 90]) AS percentile
ORDER BY 
_cdn_provider

Unfortunately, I’m having a hard time deciphering the results. It shows many different instances of each CDN with varying values, so I’m not sure how to handle that. Also, is each value the total amount of MB served over service workers by the given CDN?

Also, the USING… percentile portion is there unnecessarily, but I cannot find a substitute to put there so that it runs properly. I started with the PWA report’s query code for SW size by client (desktop or mobile) but I tried altering it to be by CDN here.

Hi @slane. A CDN could serve many hundreds or thousands of service worker files, so your query needs to aggregate these into some kind of summary metric. The percentile field could be used there to see multiple summary metrics of the distribution of service worker file sizes. And when you calculate aggregate metrics, you need to also GROUP BY a field, which in this case is the _cdn_provider field. One other thing is that the url and client of the service worker also needs to be joined in the USING clause so you can get the response size for that specific file.

Also, is each value the total amount of MB served over service workers by the given CDN?

The respSize values are in bytes and represent the transfer size (eg gzipped).

Here’s a query that gets the median service worker response size per CDN:

SELECT
  _cdn_provider,
  APPROX_QUANTILES(respSize, 1000)[OFFSET(500)] AS median_service_worker_size
FROM
  `httparchive.almanac.service_workers`
JOIN
  `httparchive.almanac.requests`
USING
  (client, page, url)
WHERE
  client = 'desktop'
GROUP BY
  _cdn_provider
ORDER BY 
  _cdn_provider

I’ve also limited the results to “desktop” in this case because some service workers may be present in both desktop and mobile results.

_cdn_provider median_service_worker_size
1184
Akamai 350
Amazon CloudFront 1184
Amazon CloudFront, Google 71
Amazon CloudFront, Myra Security CDN, Incapsula 424
Azion 1248
BunnyCDN 8979
CDN 576
CDNetworks 613
Cedexis 7014
ChinaCache 33
ChinaNetCenter 4736
Cloudflare 533
Cloudflare, Amazon CloudFront 1478
Cloudflare, Fastly 470
Cloudflare, Google 1781
Cloudflare, Myra Security CDN, Incapsula 666
Edgecast 630
Edgecast, Google 3087
Edgecast, Myra Security CDN, Incapsula 8912
Fastly 1819
GoCache 102
Google 1239
Highwinds 744
Incapsula 495
Instart Logic 839
KeyCDN 1132
Level 3 444
Limelight 1417
Microsoft Azure 3065
Myra Security CDN 12754
Myra Security CDN, Incapsula 608
Myra Security CDN, Zenedge 898
Netlify 1877
OVH CDN 619
OnApp 7066
Optimal CDN 520
Rackspace 334
StackPath 998
Sucuri Firewall 470
SwiftCDN 399
XLabs Security 263
Zenedge 361
1 Like

Thanks so much for such a detailed response! Definitely clears most of it up. On a related note, could you explain what ‘blinkFeatureFirstUsed.Features.ServiceWorkerControlledPage’ represents?

It’s found in any of the pages files within the payload. I extracted that because I saw it being used in https://httparchive.org/reports/progressive-web-apps for the main plot there. While running a query for this metric by CDN, I get some numbers ranging from roughly -5000 to 10000, so any clarity on what this means would be much appreciated.

Thanks!

Any page with ServiceWorkerControlledPage has a service worker and the corresponding value is the timestamp it was detected. Not sure why it would be negative! @tomayac any idea?

Thanks again for the clarification. And in your first response, I just took a closer look at the output and noticed that many CDNs have multiple results. Shouldn’t each CDN have just one median?

As for the negative results, here is my query for anyone who wishes to reproduce it:
(This certainly isn’t the most optimal way of doing such a query, but it looks like it’s functioning properly)

#standardSQL
SELECT 
t1.url, t2.cdn, t1.service1
FROM
(SELECT
A.url,
JSON_EXTRACT(payload, '$._blinkFeatureFirstUsed.Features.ServiceWorkerControlledPage') as service1,
FROM `httparchive.pages.2018_04_15_desktop` as A) as t1 

INNER JOIN

(SELECT
B.url, B.cdn,
FROM `httparchive.summary_pages.2018_04_15_desktop` as B
) as t2
ON t1.url = t2.url
WHERE
service1 IS NOT NULL 
AND
cdn IS NOT NULL
ORDER BY
service1 ASC

I think this is set by WebPageTest. @patmeenan can confirm but to my knowledge this happens when a resource is served through multiple proxies. I wouldn’t expect these cases to happen too often. For your purposes maybe you only want to look at the edge CDN, which I assume is the first one in the list. You could also add a COUNT(0) AS frequency field to the query to get a sense of the number of service workers per CDN.

Thanks so much again for all your help, much appreciated! I’ll just take the edge for now but the frequency is good to know for future use.