3rd party content: Who is guarding the cache?

A few months ago, @igrigorik looked at the distribution of 1st and 3rd party content per website. I thought it would be interesting to see, of all the requests in the HTTP Archive, how many are 3rd party vs. 1st party. What are the files being served by 3rd parties? Are the 3rd parties serving us as developers well?

Let’s see what I found out:

Simply modifying Ilya’s query to remove URL reporting, I get:

Wow. Third party requests make up 46.7% of all requests found in the HTTP Archive. @guypod made a good point in the initial thread that this simple query will have false positives (for example, www.cnn.com uses turner.com CDNs for images). These are arguably first party files that are miscategorized by this query (I’ll attempt to address this further on in the post).
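For reference, here is a minimal sketch of what such a party-classification query might look like in (legacy) BigQuery SQL. The table names, the join on pageid, and the way the page origin is extracted from the page URL are my assumptions – not necessarily Ilya’s exact query:

    SELECT party, COUNT(*) as requests
    FROM (
      SELECT
        -- 1 = first party, 3 = third party: does the request host contain
        -- the first label of the page's hostname (e.g. "cnn" for www.cnn.com)?
        IF (requests.req_host CONTAINS
            REGEXP_EXTRACT(pages.url, r'https?://(?:www\.)?([\w-]+)'),
            INTEGER(1), INTEGER(3)) AS party
      FROM [httparchive:runs.latest_requests] requests
      JOIN [httparchive:runs.latest_pages] pages
        ON requests.pageid = pages.pageid
      -- (on the full tables this join may need JOIN EACH / GROUP EACH BY)
    )
    GROUP BY party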
Next, we add a few more lines to the query to discover the amount of data – and the cacheable nature of each of these files.

  ROUND(SUM(respSize/(1024*1024))) as megabytes,  -- how much data (total payload in MB)
  SUM(IF (resp_etag != "" OR resp_cache_control != "", INTEGER(1), INTEGER(0))) as cacheable_count,  -- requests with an ETag or Cache-Control header
  SUM(IF (resp_cache_control != "", INTEGER(1), INTEGER(0))) as cache_control_count,  -- requests with a Cache-Control header
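In context, the extended query might look roughly like this (again just a sketch, under the same assumptions about table names and origin extraction):

    SELECT
      party,
      COUNT(*) as requests,
      ROUND(SUM(respSize/(1024*1024))) as megabytes,
      SUM(IF (resp_etag != "" OR resp_cache_control != "", INTEGER(1), INTEGER(0))) as cacheable_count,
      SUM(IF (resp_cache_control != "", INTEGER(1), INTEGER(0))) as cache_control_count
    FROM (
      SELECT
        IF (requests.req_host CONTAINS
            REGEXP_EXTRACT(pages.url, r'https?://(?:www\.)?([\w-]+)'),
            INTEGER(1), INTEGER(3)) AS party,
        requests.respSize AS respSize,
        requests.resp_etag AS resp_etag,
        requests.resp_cache_control AS resp_cache_control
      FROM [httparchive:runs.latest_requests] requests
      JOIN [httparchive:runs.latest_pages] pages
        ON requests.pageid = pages.pageid
    )
    GROUP BY party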

I’ve broken this down into a table:

Not unexpectedly, 3rd party content accounts for a smaller share of the bandwidth than of the requests: ~40% of the total data vs. ~47% of all requests. The files used by 3rd party services are smaller: 1st party content averages ~20 KB per file, while 3rd party averages ~15 KB per file.
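Those per-file averages are just the total payload divided by the request count; for example, adding a line like this to the SELECT list above would compute them directly (illustrative only):

  ROUND(SUM(respSize) / COUNT(*) / 1024, 1) as avg_kb_per_file,  -- average file size in KB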

How does 3rd party content measure up when it comes to caching?

When it comes to caching, there is good news. Nearly 90% of 3rd party content has cache headers present (of those, 90.2% Cache-Control and 9.8% ETag). I am actually surprised by this number; while there is room for improvement, I did not think that 3rd party caching would be this high.

Unfortunately, it appears that FIRST party content – the data developers ACTUALLY have control over – is LESS likely to have cache headers than 3rd party content. Where cache headers are present, 54% are Cache-Control and 46% are ETag (requiring a round trip for validation). And 14.5% of 1st party files have NO cache headers at all. Clearly, there is room for improvement here.

Back to 3rd party content. It is a very simple thing to group all of these requests by the domain that delivers the content. By adding:

IF (req_host CONTAINS REGEXP_EXTRACT(origin, r'([\w-]+)'), INTEGER(1), INTEGER(3)) AS party,  -- 1 = first party, 3 = third party
ROUND(SUM(IF (resp_etag != "" OR resp_cache_control != "", respSize, INTEGER(0)))/(1024*1024)) as cacheable_MB,  -- cacheable payload, converted to MB

to the search – we gain a list of all requests by domain (and a way to discover the MB of traffic being left on the table). This somewhat addresses the point about false positives raised by Guy: the 500th domain has 2,371 requests, so it is likely a domain used by many websites (I HOPE!). The top domain by request count is profile.ak.fbcdn.net with 456,149 requests (99.9% images). Only 72 of those files are missing cache headers.
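Put together, the per-domain version of the query might look roughly like this – again a sketch, with the same assumed table names and origin extraction, and LIMIT 500 for the top 500 third party domains:

    SELECT
      req_host,
      COUNT(*) as requests,
      ROUND(SUM(respSize/(1024*1024))) as megabytes,
      SUM(IF (resp_etag != "" OR resp_cache_control != "", INTEGER(1), INTEGER(0))) as cacheable_count,
      ROUND(SUM(IF (resp_etag != "" OR resp_cache_control != "", respSize, INTEGER(0)))/(1024*1024)) as cacheable_MB
    FROM (
      SELECT
        requests.req_host AS req_host,
        requests.respSize AS respSize,
        requests.resp_etag AS resp_etag,
        requests.resp_cache_control AS resp_cache_control,
        IF (requests.req_host CONTAINS
            REGEXP_EXTRACT(pages.url, r'https?://(?:www\.)?([\w-]+)'),
            INTEGER(1), INTEGER(3)) AS party
      FROM [httparchive:runs.latest_requests] requests
      JOIN [httparchive:runs.latest_pages] pages
        ON requests.pageid = pages.pageid
    )
    WHERE party = 3          -- third party requests only
    GROUP BY req_host
    ORDER BY requests DESC
    LIMIT 500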

Of the 12.8M 3rd party requests seen in the first query, 8.8M are found in the top 500 domains (~69% of 3rd party requests). Extending the query to 1,000 domains yields 9.5M (74.2%) of 3rd party requests. For simplicity, let’s stick to the top 500 for the remaining analysis.

Overall, the caching for the top 3rd party domains is pretty good:

92.5% of requests and 98.5% of the payload have cache headers. But there is still 1.5 GB of data not cached. How does that break down by domain?

As the graph shows, most of these 3rd party domains are doing great with caching. In fact, 333 (two thirds) of the top 500 domains have over 99.44% of their content labeled with cache headers (that’s pretty pure!), and 394 are over 90% cacheable.

There are 42 domains that have cache headers on <5% of their files. Closer examination shows that they all deal exclusively with small files (all 42 are under 2 KB average file size, and 39 average under 1 KB). You can imagine the tracking beacons seen in waterfall charts: 43 bytes here, 25 bytes there, for GIFs or HTML pages. These domains make up ~3% of all the requests, but their files account for just 0.05% of the bandwidth.

The domains that are of most concern are the ones that fall in the middle. I arbitrarily defined “bad caching domains” as:

There are 11 domains that fall into all 3 categories. They account for 1,394 MB (88.96%) of the non-cached traffic, and they all fall into the middle category (averaging 58% cacheable content). These are tricky to discover in regular testing, as these 3rd parties may pass your tests 50% of the time.
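The three criteria themselves aren’t reproduced here, but as an illustration, a query along these lines could surface those middle-ground domains – the 5%/95% thresholds below are purely illustrative, not the definition used above:

    SELECT
      req_host,
      COUNT(*) as requests,
      ROUND(100 * SUM(IF (resp_etag != "" OR resp_cache_control != "", 1, 0)) / COUNT(*), 1) as pct_cacheable,
      ROUND(SUM(IF (resp_etag = "" AND resp_cache_control = "", respSize, INTEGER(0)))/(1024*1024)) as uncached_MB
    FROM (
      SELECT
        requests.req_host AS req_host,
        requests.respSize AS respSize,
        requests.resp_etag AS resp_etag,
        requests.resp_cache_control AS resp_cache_control,
        IF (requests.req_host CONTAINS
            REGEXP_EXTRACT(pages.url, r'https?://(?:www\.)?([\w-]+)'),
            INTEGER(1), INTEGER(3)) AS party
      FROM [httparchive:runs.latest_requests] requests
      JOIN [httparchive:runs.latest_pages] pages
        ON requests.pageid = pages.pageid
    )
    WHERE party = 3
    GROUP BY req_host
    HAVING pct_cacheable > 5 AND pct_cacheable < 95   -- the "sometimes cached" middle ground
    ORDER BY uncached_MB DESC
    LIMIT 500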

Mobile:

Running the same set of queries for the top 3rd party domains on mobile, we find that the ratio of 3rd to 1st party content is smaller – but the caching ratios are VERY similar to the web (perhaps slightly higher, though the sample size is much smaller).
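Presumably “the same set of queries” just means swapping the desktop tables in the sketches above for their mobile counterparts – the mobile table names here are my assumption:

      FROM [httparchive:runs.latest_requests_mobile] requests
      JOIN [httparchive:runs.latest_pages_mobile] pages
        ON requests.pageid = pages.pageid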

Limiting the search to the top 85 3rd party domains (using an arbitrary cutoff of 200 requests per domain), we see that the percentage of files with cache headers is slightly better than on the web (97.2% vs. 93.5%), but this might just be due to the smaller sample size.

Ok, so what can we conclude from this data? I guess my conclusion would be constant vigilance. It is important to check your 3rd party providers, especially with several different test scenarios. We have seen that the domains with the greatest count and payload of files without cache headers DO have the headers a portion of the time. It might work fine in your test, but fail for 50% (or more) of your customers.
