SELECT policy, COUNT(policy) as policy_count,
RATIO_TO_REPORT(policy_count) OVER() as ratio FROM (
SELECT resp_cache_control, resp_expires,
CASE
WHEN resp_cache_control CONTAINS "no-store"
THEN "no-store"
WHEN resp_cache_control CONTAINS "no-cache"
OR resp_cache_control CONTAINS "max-age=0"
OR resp_expires = "-1"
THEN "non-cacheable"
WHEN LENGTH(resp_cache_control) = 0
AND LENGTH(resp_expires) = 0
THEN "undefined"
ELSE "cacheable"
END as policy
FROM [runs.latest_requests]
WHERE firstHtml = true
) GROUP BY policy
ORDER BY policy_count DESC
For 2014-06-15 run, the stats are:
Note that “non-cacheable” means “no-store” was not specified, but the reverse does not hold.
1 Like
Hi, Ilya. Can you clarify what you’re trying to measure? I see “firstHtml = true”. Is this the caching policies only for the main HTML document? If so, the high percentage of “no-store” and “non-cacheable” is understandable. The high percentage of “undefined” is bad - it’s better to specify explicit caching headers rather than leave it up to browser heuristics (which vary by browser).
1 Like
Yep, good catch, updated the title of the post. Was trying to get a feel for CC policies of the HTML documents, and I’m with you on relying on heuristic caching.
A deeper look at the docs that do specify the max-age:
SELECT
INTEGER(REGEXP_EXTRACT(resp_cache_control, r'max-age=(\d+)')) age,
count(pageid) cnt,
RATIO_TO_REPORT(cnt) OVER() as ratio
FROM [httparchive:runs.latest_requests]
WHERE firstHtml = true
AND resp_cache_control CONTAINS 'max-age'
GROUP BY age
HAVING cnt > 500
ORDER BY age asc
50% mark the doc as non-cacheable, followed by spikes at 3s (hmm), 10m, 1hr, and 1 day.