What Percentage of Third Party Content is Cacheable?


When we talk about the cacheability of web content, the discussion is often about content that site operators control (i.e., first party content). But what about third party content? How much of that is cacheable? I was chatting with @yoav about this on Friday, since the answer could help in understanding how much signed exchanges could accelerate third party content. Is it worth delivering cross origin resources on a site’s HTTP/2 connection, avoiding the need to establish a new connection and eliminating bandwidth contention between 3rd party resources and 1st party ones? To answer that, we need to understand how many third party resources are delivered without credentials and can therefore be signed. We will use a resource’s public cacheability as a proxy for that, and try to understand how common such third party resources are.

The concept of serving signed third party resources over the same connection as the main content is described in some more detail here.

In order to query the HTTP Archive for this, we need to:

  • Identify which resources are served from 3rd parties.
  • Determine which resources are considered cacheable.

Identifying 3rd Party Content
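@igrigorik shared a technique in his post about the distribution of 1st vs 3rd party resources. I’ve used it for another analysis, and we’ll use it again here. As it appears in the queries below, a request counts as 3rd party when its host does not contain the first label of the page’s registered domain (for example, "example" for a page on example.com).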
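
To make the host comparison concrete, here is a rough Python sketch of the same test. It is only an approximation of what the queries do: the function name is made up for illustration, and the page's registered domain is assumed to be passed in already computed (BigQuery's NET.REG_DOMAIN does that step, and Python's standard library has no equivalent).

```python
import re

# First run of word characters or hyphens, mirroring r'([\w-]+)' in the queries.
# re.ASCII keeps \w limited to ASCII, which matches how RE2 treats \w.
FIRST_LABEL = re.compile(r"[\w-]+", re.ASCII)


def is_third_party(req_host, page_reg_domain):
    """Hypothetical helper: True when req_host looks like a 3rd party host.

    page_reg_domain is assumed to be the page's registered domain
    (e.g. "example.com"), or None when it could not be determined.
    """
    if page_reg_domain is None:
        return None  # SQL would yield NULL here, so the row is filtered out
    match = FIRST_LABEL.search(page_reg_domain)
    if match is None:
        return None
    label = match.group(0)  # "example" for "example.com"
    # STRPOS(req_host, label) <= 0 means the label is not found in the host.
    return label not in req_host
```

Because this is a substring test, a host such as example-cdn.net is treated as 1st party for a page on example.com. That keeps a site's own CDN and subdomains out of the 3rd party bucket, at the cost of the occasional false match.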

Identifying Cacheable Content
By examining the Cache-Control response header, we can filter out resources that are not publicly cacheable or that require immediate revalidation. That means Cache-Control headers containing one or more of the following directives:

  • no-store
  • no-cache
  • max-age=0
  • s-maxage=0 (written as s-max-age=0 in the queries below)
  • must-revalidate
  • private
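
The directive check can be tried out on a few sample header values with a small Python sketch. The pattern is copied from the queries in this post; the function name and the sample headers are made up for illustration, and None stands in for a missing header.

```python
import re

# Same alternation as the REGEXP_CONTAINS pattern used in the queries.
NON_CACHEABLE = re.compile(
    r"no-store|no-cache|max-age=0|s-max-age=0|must-revalidate|private"
)


def is_noncacheable(cache_control):
    """Hypothetical helper: True when the header contains a listed directive."""
    if cache_control is None:
        return None  # no header value to inspect
    # search() is an unanchored, case-sensitive match, like REGEXP_CONTAINS.
    return NON_CACHEABLE.search(cache_control) is not None


samples = [
    "public, max-age=31536000",
    "no-cache, no-store",
    "private, max-age=600",
    "s-maxage=0",
]
for header in samples:
    print(header, "->", is_noncacheable(header))
```

One thing this makes visible: the standard spelling of the shared-cache directive is s-maxage, so a header of just "s-maxage=0" is not flagged by the pattern as written. The match is also case-sensitive, so a header like "No-Cache" would slip through as well.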

The query for this is below. Based on it, 30% of 3rd party requests are considered non-cacheable or are cached with the private directive.

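-- Count 3rd party requests by whether Cache-Control contains a directive
-- that rules out public caching or forces revalidation.
SELECT REGEXP_CONTAINS(resp_cache_control, r"no-store|no-cache|max-age=0|s-max-age=0|must-revalidate|private") AS is_noncacheable,
       COUNT(*) AS requests
FROM `httparchive.summary_requests.2019_02_01_desktop` requests
JOIN `httparchive.summary_pages.2019_02_01_desktop` pages
ON pages.pageid = requests.pageid
-- 3rd party: the request host does not contain the page's domain label.
WHERE STRPOS(req_host, REGEXP_EXTRACT(NET.REG_DOMAIN(pages.url), r'([\w-]+)')) <= 0
GROUP BY is_noncacheable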

HTTP response codes are another useful dimension here, and it looks like 81% of 3rd party content returned with an HTTP 200 status code is cacheable.

-- Same breakdown as above, split by HTTP status code.
SELECT status,
       REGEXP_CONTAINS(resp_cache_control, r"no-store|no-cache|max-age=0|s-max-age=0|must-revalidate|private") AS is_noncacheable,
       COUNT(*) AS requests
FROM `httparchive.summary_requests.2019_02_01_desktop` requests
JOIN `httparchive.summary_pages.2019_02_01_desktop` pages
ON pages.pageid = requests.pageid
-- 3rd party: the request host does not contain the page's domain label.
WHERE STRPOS(req_host, REGEXP_EXTRACT(NET.REG_DOMAIN(pages.url), r'([\w-]+)')) <= 0
GROUP BY status, is_noncacheable
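
The query returns raw counts per (status, is_noncacheable) pair, so the percentages still have to be derived from them. Here is a minimal Python sketch of that arithmetic. The function name is hypothetical, the rows are made-up numbers rather than HTTP Archive results, and it assumes every row carries a True or False flag.

```python
from collections import defaultdict


def cacheable_share_by_status(rows):
    """Hypothetical helper: percent of requests per status not flagged non-cacheable.

    rows is an iterable of (status, is_noncacheable, requests) tuples,
    shaped like the output of the query above.
    """
    totals = defaultdict(int)
    cacheable = defaultdict(int)
    for status, is_noncacheable, requests in rows:
        totals[status] += requests
        if not is_noncacheable:
            cacheable[status] += requests
    return {status: 100.0 * cacheable[status] / totals[status] for status in totals}


# Made-up counts, for illustration only.
example_rows = [
    (200, False, 750),
    (200, True, 250),
    (302, False, 10),
    (302, True, 30),
]
print(cacheable_share_by_status(example_rows))
```

With these sample rows, status 200 comes out at 75% cacheable and status 302 at 25%.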

Taking this one step further, we can also evaluate the public cacheability of resources from popular third party domains.

-- Per-host breakdown for 3rd party requests that returned HTTP 200,
-- most requested hosts first.
SELECT NET.HOST(requests.url) req_host,
       REGEXP_CONTAINS(resp_cache_control, r"no-store|no-cache|max-age=0|s-max-age=0|must-revalidate|private") AS is_noncacheable,
       COUNT(*) AS requests
FROM `httparchive.summary_requests.2019_02_01_desktop` requests
JOIN `httparchive.summary_pages.2019_02_01_desktop` pages
ON pages.pageid = requests.pageid
-- 3rd party: the request host does not contain the page's domain label.
WHERE STRPOS(req_host, REGEXP_EXTRACT(NET.REG_DOMAIN(pages.url), r'([\w-]+)')) <= 0
      AND status = 200
GROUP BY req_host, is_noncacheable
ORDER BY requests DESC
LIMIT 10000

Based on these results, it seems that a significant amount of third party content is considered publicly cacheable based on its Cache-Control headers. I’m looking forward to seeing how the proposal for signed exchanges impacts the delivery of third party content, especially the resources that sites need to load early in the critical render path.

Thanks to Yoav Weiss for his help with this analysis!
