I recently queried the HTTP Archive to get the number of websites (pages) that serve the HTML with response header cache-control contains no-store .
My query:
SELECT
_TABLE_SUFFIX AS client,
count(DISTINCT pageid) AS pages,
pages_total,
count(DISTINCT pageid) / pages_total AS pages_pct,
FROM
`httparchive.summary_requests.2021_08_01_*`
JOIN
(SELECT _TABLE_SUFFIX, COUNT(DISTINCT pageid) AS pages_total FROM `httparchive.summary_requests.2021_08_01_*` GROUP BY _TABLE_SUFFIX)
USING (_TABLE_SUFFIX)
WHERE
LOWER(resp_cache_control) LIKE '%no-store%' AND
firstHTML
GROUP BY
_TABLE_SUFFIX,
pages_total
Now, I’d like take a next step and get the number of pages with that header, by CDN.
Say the total is 2,000,000 pages with that header’s directive on mobile, how many of those are served by Akamai, Cloudflare, Fastly, etc?
But ideally I can get the % of pages that each CDN serves with that header’s directive (a bfcache killer ), for example:
Akamai:
Total pages: 200,000
Pages with no-store: 100,000
% pages with no-store: 50%
My SQL is rusty at best and I can’t get it to work in BigQuery.
Anybody have some pointers for me? Txs !
I took a stab at this using the new all.pages table for CDN detection and all.requests table for no-store detection. My general approach is to use WITH subqueries for each of these pieces, then bring them together with a JOIN in the final SELECT statement.
WITH no_store AS (
SELECT
page
FROM
`httparchive.all.requests`,
UNNEST(response_headers) AS header
WHERE
date = '2023-08-01' AND
client = 'mobile' AND
is_root_page AND
is_main_document AND
LOWER(header.name) = 'cache-control' AND
REGEXP_CONTAINS(header.value, r'no-store')
),
cdns AS (
SELECT
rank,
page,
JSON_VALUE(summary, '$.cdn') AS cdn
FROM
`httparchive.all.pages`
WHERE
date = '2023-08-01' AND
client = 'mobile' AND
is_root_page AND
JSON_VALUE(summary, '$.cdn') != ''
)
SELECT
cdn,
COUNTIF(no_store.page IS NOT NULL) AS pages_with_no_store,
COUNT(0) AS total_pages,
COUNTIF(no_store.page IS NOT NULL) / COUNT(0) AS pct_no_store
FROM
cdns
LEFT JOIN
no_store
USING
(page)
GROUP BY
cdn
HAVING
total_pages >= 1000
ORDER BY
total_pages DESC
Note that the we’re detecting CDNs based on WebPageTest.
Here are the results for CDNs with 1k+ sites, ordered by popularity:
cdn
pages_with_no_store
total_pages
pct_no_store
Cloudflare
391,675
2,548,966
15.4%
Google
139,629
1,099,485
12.7%
Fastly
12,681
363,678
3.5%
Amazon CloudFront
66,242
297,571
22.3%
Akamai
43,628
115,532
37.8%
Automattic
5
105,874
0.0%
Sucuri Firewall
8,157
53,357
15.3%
Vercel
6,326
47,304
13.4%
Incapsula
11,806
38,333
30.8%
Microsoft Azure
4,926
20,522
24.0%
CDN
1,173
19,184
6.1%
Cloudflare, Google
4,858
16,901
28.7%
OVH CDN
3,669
14,450
25.4%
Netlify
282
13,806
2.0%
Cloudflare, Fastly
665
13,434
5.0%
Amazon CloudFront, Cloudflare
1,123
12,282
9.1%
Highwinds
5,875
11,414
51.5%
Zenedge
10,353
10,700
96.8%
GoCache
8
8,208
0.1%
section.io
4,041
7,820
51.7%
Azion
118
6,581
1.8%
Alibaba
595
5,531
10.8%
QUIC.cloud
405
4,214
9.6%
Fastly, Google
141
3,551
4.0%
Tencent
170
2,698
6.3%
BunnyCDN
27
2,528
1.1%
CDNetworks
314
1,338
23.5%
Here’s another view of the top 10 CDNs by % no-store:
Thanks @rviscomi you are the man!
The query looks great and … so simple
Thanks for including some data in your response, that is very nice.
First comments on the data:
1,099,485 pages served by Google :
That sure is not companies who are using Google Cloud CDN for accelerating their website. I guess there are many, many Google web properties in HTTP Archive (.blogspot.com?)
“Cloudflare, Google”: does that signify multi-cdn?
Here’s the WPT source where that detection is done, if interested:
You’re right, when I look at the most common domain names behind the Google CDN, most of them are blogspot.com:
SELECT
APPROX_TOP_COUNT(NET.REG_DOMAIN(page), 10)
FROM
`httparchive.all.pages`
WHERE
date = '2023-08-01' AND
client = 'mobile' AND
is_root_page AND
JSON_VALUE(summary, '$.cdn') = 'Google'
domain
pages
blogspot.com
249,999
business.site
28,680
softonic.com
19,368
negocio.site
6,744
eatbu.com
5,908
softonic.com.br
3,253
repl.co
1,784
netlify.app
1,721
choiceqr.com
1,487
play-cricket.com
1,467
Yes IIRC that means it’s multi-CDN, with Cloudflare in front. @patmeenan can correct me if I’m wrong.
Thanks @rviscomi and yes, I know of the WPT logic for CDN detection.
That needs a clean up … NetDNA, HiberniaCDN and others should no longer be in there, and I wonder how WPT handles the case where Akamai has a CloudFront origin … which response header wins?
Anyway …