Trying to take a look at service worker usage by CDN provider, and ran the following queryy:
SELECT
_cdn_provider,
respSize
FROM
`httparchive.almanac.service_workers`
JOIN
`httparchive.almanac.requests`
USING (page),
UNNEST([10, 25, 50, 75, 90]) AS percentile
ORDER BY
_cdn_provider
Unfortunately, I’m having a hard time deciphering the results. It shows many different instances of each CDN with varying values, so I’m not sure how to handle that. Also, is each value the total amount of MB served over service workers by the given CDN?
Also, the USING… percentile portion is there unnecessarily, but I cannot find a substitute to put there so that it runs properly. I started with the PWA report’s query code for SW size by client (desktop or mobile) but I tried altering it to be by CDN here.
Hi @slane. A CDN could serve many hundreds or thousands of service worker files, so your query needs to aggregate these into some kind of summary metric. The percentile field could be used there to see multiple summary metrics of the distribution of service worker file sizes. And when you calculate aggregate metrics, you need to also GROUP BY a field, which in this case is the _cdn_provider field. One other thing is that the url and client of the service worker also needs to be joined in the USING clause so you can get the response size for that specific file.
Also, is each value the total amount of MB served over service workers by the given CDN?
The respSize values are in bytes and represent the transfer size (eg gzipped).
Here’s a query that gets the median service worker response size per CDN:
SELECT
_cdn_provider,
APPROX_QUANTILES(respSize, 1000)[OFFSET(500)] AS median_service_worker_size
FROM
`httparchive.almanac.service_workers`
JOIN
`httparchive.almanac.requests`
USING
(client, page, url)
WHERE
client = 'desktop'
GROUP BY
_cdn_provider
ORDER BY
_cdn_provider
I’ve also limited the results to “desktop” in this case because some service workers may be present in both desktop and mobile results.
Thanks so much for such a detailed response! Definitely clears most of it up. On a related note, could you explain what ‘blinkFeatureFirstUsed.Features.ServiceWorkerControlledPage’ represents?
It’s found in any of the pages files within the payload. I extracted that because I saw it being used in https://httparchive.org/reports/progressive-web-apps for the main plot there. While running a query for this metric by CDN, I get some numbers ranging from roughly -5000 to 10000, so any clarity on what this means would be much appreciated.
Any page with ServiceWorkerControlledPage has a service worker and the corresponding value is the timestamp it was detected. Not sure why it would be negative! @tomayac any idea?
Thanks again for the clarification. And in your first response, I just took a closer look at the output and noticed that many CDNs have multiple results. Shouldn’t each CDN have just one median?
As for the negative results, here is my query for anyone who wishes to reproduce it:
(This certainly isn’t the most optimal way of doing such a query, but it looks like it’s functioning properly)
#standardSQL
SELECT
t1.url, t2.cdn, t1.service1
FROM
(SELECT
A.url,
JSON_EXTRACT(payload, '$._blinkFeatureFirstUsed.Features.ServiceWorkerControlledPage') as service1,
FROM `httparchive.pages.2018_04_15_desktop` as A) as t1
INNER JOIN
(SELECT
B.url, B.cdn,
FROM `httparchive.summary_pages.2018_04_15_desktop` as B
) as t2
ON t1.url = t2.url
WHERE
service1 IS NOT NULL
AND
cdn IS NOT NULL
ORDER BY
service1 ASC
I think this is set by WebPageTest. @patmeenan can confirm but to my knowledge this happens when a resource is served through multiple proxies. I wouldn’t expect these cases to happen too often. For your purposes maybe you only want to look at the edge CDN, which I assume is the first one in the list. You could also add a COUNT(0) AS frequency field to the query to get a sense of the number of service workers per CDN.