Cache-Control response policy of HTML documents

SELECT policy, COUNT(policy) as policy_count,
RATIO_TO_REPORT(policy_count) OVER() as ratio FROM (
  SELECT resp_cache_control, resp_expires, 
    CASE
      WHEN resp_cache_control CONTAINS "no-store" 
        THEN "no-store"
      WHEN resp_cache_control CONTAINS "no-cache"
        OR resp_cache_control CONTAINS "max-age=0"
        OR resp_expires = "-1" 
        THEN "non-cacheable"
      WHEN LENGTH(resp_cache_control) = 0 
        AND LENGTH(resp_expires) = 0 
        THEN "undefined"
      ELSE "cacheable"
      END as policy
  FROM [httparchive:runs.latest_requests]
  WHERE firstHtml = true
) GROUP BY policy
ORDER BY policy_count DESC

For the 2014-06-15 run, the stats are:

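Note that the CASE branches are evaluated in order, so “non-cacheable” implies “no-store” was not specified. The reverse does not hold: a “no-store” response may also carry “no-cache” or “max-age=0”.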
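
To sanity-check the bucketing locally, here is a rough Python sketch of the same first-match-wins logic as the CASE expression in the query above. The function name classify_policy is made up for illustration, and it assumes missing headers arrive as empty strings and that case-sensitive substring checks are close enough to CONTAINS:

```python
def classify_policy(cache_control: str, expires: str) -> str:
    """Bucket a response the way the CASE expression does (first match wins).

    Assumption: absent headers are passed in as empty strings.
    """
    # "no-store" is checked first, so it wins even if "no-cache" is also present.
    if "no-store" in cache_control:
        return "no-store"
    # Anything that forces revalidation or is already expired.
    if ("no-cache" in cache_control
            or "max-age=0" in cache_control
            or expires == "-1"):
        return "non-cacheable"
    # Neither header present: caching is left to browser heuristics.
    if len(cache_control) == 0 and len(expires) == 0:
        return "undefined"
    return "cacheable"
```

For example, classify_policy("no-store, no-cache", "") lands in "no-store", which is why the "non-cacheable" bucket never contains a "no-store" response.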

Hi, Ilya. Can you clarify what you’re trying to measure? I see “firstHtml = true”. Are these the caching policies for the main HTML document only? If so, the high percentage of “no-store” and “non-cacheable” is understandable. The high percentage of “undefined” is bad - it’s better to specify explicit caching headers than to leave it up to browser heuristics (which vary by browser).

Yep, good catch, updated the title of the post. I was trying to get a feel for the Cache-Control policies of the HTML documents, and I’m with you on not relying on heuristic caching.

A deeper look at the docs that do specify the max-age:

SELECT 
  INTEGER(REGEXP_EXTRACT(resp_cache_control, r'max-age=(\d+)')) age, 
  count(pageid) cnt,
  RATIO_TO_REPORT(cnt) OVER() as ratio
FROM [httparchive:runs.latest_requests]
WHERE firstHtml = true
  AND resp_cache_control CONTAINS 'max-age'
GROUP BY age
HAVING cnt > 500
ORDER BY age asc

50% of the docs that specify max-age set it to 0, i.e. mark the doc as non-cacheable, followed by spikes at 3s (hmm), 10m, 1hr, and 1 day.
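
For reading the raw age values (in seconds) that the query returns, a small Python sketch like the following may help. It reuses the same max-age regex as the query; the helper names max_age_seconds and humanize are hypothetical and not part of any HTTP Archive tooling:

```python
import re

# Same pattern as the REGEXP_EXTRACT call in the query above.
MAX_AGE_RE = re.compile(r"max-age=(\d+)")


def max_age_seconds(cache_control: str):
    """Return the first max-age value as an int, or None if absent."""
    match = MAX_AGE_RE.search(cache_control)
    return int(match.group(1)) if match else None


def humanize(seconds: int) -> str:
    """Render a duration in the largest unit that divides it evenly."""
    for unit, size in (("d", 86400), ("h", 3600), ("m", 60)):
        if seconds >= size and seconds % size == 0:
            return f"{seconds // size}{unit}"
    return f"{seconds}s"
```

With these, humanize(max_age_seconds("public, max-age=600")) gives "10m", and the other spikes correspond to 3, 3600, and 86400 seconds.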