Hello fellow archivers,
While working on the Cache Digests proposal, I started wondering if we really need it to include the response ETag in the key used for the digests.
The purpose of the proposal is to let the server know what’s in the browser’s cache before it attempts to push resources down to the client (using HTTP/2 server push).
Adding ETags to the key will add significant complexity to the protocol and to the proposal implementation, so I wanted to see if it would make sense to remove it entirely.
The resources that servers would normally push are the render blocking resources of the page, and I had a feeling that these resources are often long-term cacheable.
In order to back that gut feeling with some data, I came up (eventually) with the following query:
CREATE TEMPORARY FUNCTION isLongTermCacheable(cacheControl STRING, expires STRING, responseDate STRING, status INT64)
RETURNS BOOL
LANGUAGE js AS """
  // Responses marked immutable always count as long-term cacheable.
  if (cacheControl.includes("immutable")) {
    return true;
  }
  // Private or uncacheable responses, and anything other than a 200 or 301, never count.
  if (cacheControl.includes("private") || cacheControl.includes("no-store") || cacheControl.includes("no-cache") || (status != 200 && status != 301)) {
    return false;
  }
  var cutoff = 86400; // 24 hours, in seconds
  var reggy = new RegExp("max-age=[ ]*([0-9]+)", "i");
  var arr = reggy.exec(cacheControl);
  if (!arr || !arr.length) {
    // No max-age: fall back to the Expires header.
    if (expires == "") {
      return true;
    }
    var current = new Date();
    var expiry_date = new Date(expires);
    // Date subtraction yields milliseconds, while cutoff is in seconds.
    return ((expiry_date - current) / 1000) > cutoff;
  }
  var maxAge = parseInt(arr[1], 10);
  if (isNaN(maxAge)) {
    return true;
  }
  return maxAge > cutoff;
""";
SELECT
  SUM(CASE WHEN Cacheable = true AND BeforeRenderStart = 1 THEN 1 ELSE 0 END) /
  SUM(CASE WHEN CacheControl NOT LIKE '%private%' AND BeforeRenderStart = 1 THEN 1 ELSE 0 END)
FROM (
  SELECT *,
    IF(req.startedDateTime < (pages.startedDateTime + (pages.renderStart / 1000)), 1, 0) AS BeforeRenderStart,
    isLongTermCacheable(req.resp_cache_control, req.resp_expires, req.resp_date, req.status) AS Cacheable,
    req.resp_cache_control AS CacheControl
  FROM `httparchive.runs.2018_02_15_requests` req
  JOIN (
    SELECT rank, url, pageid, startedDateTime, renderStart
    FROM `httparchive.runs.2018_02_15_pages`
  ) pages ON pages.pageid = req.pageid
)
That query uses @paulcalvano’s definition of “render blocking” from a previous query. It also uses a JS custom function in order to determine a resource’s cacheability and freshness lifetime.
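If you want to poke at the classification logic outside BigQuery, here is a rough standalone sketch that should run in Node.js. It is not the exact function from the query. The `CUTOFF_SECONDS` constant and the `now` parameter are my own additions so that the examples give the same answer whenever they are run, and the Expires difference is converted from milliseconds to seconds before being compared against the cutoff.

```javascript
// Standalone sketch of the long-term cacheability check (assumptions noted below).
// CUTOFF_SECONDS and the `now` parameter are illustrative, not part of the query.
const CUTOFF_SECONDS = 86400; // 24 hours

function isLongTermCacheable(cacheControl, expires, status, now = new Date()) {
  // Immutable responses always count as long-term cacheable.
  if (cacheControl.includes("immutable")) {
    return true;
  }
  // Private or uncacheable responses, and non-200/301 statuses, never count.
  if (
    cacheControl.includes("private") ||
    cacheControl.includes("no-store") ||
    cacheControl.includes("no-cache") ||
    (status !== 200 && status !== 301)
  ) {
    return false;
  }
  const match = /max-age=[ ]*([0-9]+)/i.exec(cacheControl);
  if (!match) {
    if (expires === "") {
      // Mirrors the query: no max-age and no Expires header still counts as cacheable.
      return true;
    }
    // Date subtraction yields milliseconds, so convert to seconds first.
    const remainingSeconds = (new Date(expires) - now) / 1000;
    return remainingSeconds > CUTOFF_SECONDS;
  }
  return parseInt(match[1], 10) > CUTOFF_SECONDS;
}

// Fixed reference time so the Expires example is reproducible.
const now = new Date("2018-02-15T00:00:00Z");
console.log(isLongTermCacheable("public, max-age=31536000", "", 200, now)); // true
console.log(isLongTermCacheable("max-age=3600", "", 200, now)); // false
console.log(isLongTermCacheable("no-cache", "", 200, now)); // false
console.log(isLongTermCacheable("", "Fri, 16 Feb 2018 12:00:00 GMT", 200, now)); // true
```

Note that a `max-age` of exactly 86400 is classified as not long-term cacheable, since the comparison is strictly greater than the cutoff.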
Running this query, it seems that if we stick to a 24-hour freshness lifetime as the definition of long-term cacheable, 77.26% of the render blocking resources are long-term cacheable.
It’s debatable whether that’s enough to justify removing ETags from the proposal, but now we can have that discussion with data to back up our opinions.