If we compare two runs that are a month apart, how many of the same resources do we see? Here “same” means same URL and same content size: many resources don’t carry a revalidation token, so I’m using content size as a “good enough™” hash.
Counting the number of resources for the top 100K sites:
SELECT COUNT(url) as reqs,
       ROUND(SUM(respBodySize)/(1024*1024)) as respBodySizeMB
FROM httparchive:runs.2015_01_15_requests as r JOIN (
  SELECT pageid, rank, url as pageurl
  FROM httparchive:runs.2015_01_15_pages
  WHERE rank <= 100000
) as p ON p.pageid=r.pageid
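To get a feel for what this query computes without touching BigQuery, here is a minimal sketch that runs the same join-and-aggregate shape against an in-memory SQLite database. The table names `pages` and `requests`, the helper names, and every row are made up for illustration; only the column names (`pageid`, `rank`, `url`, `respBodySize`) follow the query above, and the sketch returns raw bytes rather than rounded megabytes.

```python
import sqlite3


def build_toy_db():
    """Create a tiny in-memory stand-in for one HTTP Archive run (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE pages (pageid INTEGER, "rank" INTEGER, url TEXT)')
    conn.execute("CREATE TABLE requests (pageid INTEGER, url TEXT, respBodySize INTEGER)")
    conn.executemany("INSERT INTO pages VALUES (?, ?, ?)", [
        (1, 10, "http://a.example/"),       # inside the rank cutoff
        (2, 250000, "http://b.example/"),   # outside the rank cutoff
    ])
    conn.executemany("INSERT INTO requests VALUES (?, ?, ?)", [
        (1, "http://a.example/app.js", 2048),
        (1, "http://a.example/site.css", 1024),
        (2, "http://b.example/app.js", 4096),
    ])
    return conn


def count_top_requests(conn, max_rank):
    """Return (request count, total body bytes) for pages with rank <= max_rank."""
    row = conn.execute(
        """
        SELECT COUNT(r.url), COALESCE(SUM(r.respBodySize), 0)
        FROM requests AS r
        JOIN (SELECT pageid FROM pages WHERE "rank" <= ?) AS p
          ON p.pageid = r.pageid
        """,
        (max_rank,),
    ).fetchone()
    return row[0], row[1]


conn = build_toy_db()
reqs, total_bytes = count_top_requests(conn, 100000)
print(reqs, total_bytes)
```

Only the two requests belonging to the page inside the rank cutoff are counted; the third request is dropped by the join.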
Compare two runs to identify same resources:
/* s1 = 2015_02_15 run (new), s2 = 2015_01_15 run (old) */
SELECT COUNT(s1.requesturl) as same_resources,
       ROUND(SUM(s1.respSize)/(1024*1024)) as new_respSizeMB,
       ROUND(SUM(s2.respSize)/(1024*1024)) as old_respSizeMB
FROM (
  SELECT pageurl, url as requesturl, respSize
  FROM httparchive:runs.2015_01_15_requests as r2 JOIN (
    SELECT pageid, rank, url as pageurl
    FROM httparchive:runs.2015_01_15_pages
    WHERE rank <= 100000
  ) as p2 ON p2.pageid = r2.pageid) as s2
JOIN EACH (
  SELECT pageurl, url as requesturl, respSize
  FROM httparchive:runs.2015_02_15_requests as r1 JOIN (
    SELECT pageid, rank, url as pageurl
    FROM httparchive:runs.2015_02_15_pages
    WHERE rank <= 100000
  ) as p1 ON p1.pageid = r1.pageid
/* same page, same URL */
) as s1 ON s1.pageurl = s2.pageurl AND s1.requesturl = s2.requesturl
/* response size is the same (~same resource; not 100% but close enough) */
WHERE s1.respSize == s2.respSize
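The matching rule in this query (same page, same request URL, then compare sizes) can be sketched in plain Python. This is only an illustration on made-up data: `compare_runs` is a hypothetical helper, and it assumes each (page URL, request URL) pair appears at most once per run, whereas the SQL join would pair up every duplicate.

```python
def compare_runs(old_run, new_run):
    """Split requests seen in both runs into same-size and different-size groups.

    Each run maps (page_url, request_url) -> response size in bytes.
    Response size stands in for a content hash, as in the query above.
    """
    same, changed = [], []
    for key, new_size in new_run.items():
        if key not in old_run:
            continue  # URL was not requested by this page in the earlier run
        if old_run[key] == new_size:
            same.append(key)
        else:
            changed.append(key)
    return same, changed


page = "http://a.example/"
old_run = {
    (page, "http://a.example/app.js"): 2048,
    (page, "http://a.example/site.css"): 1024,
    (page, "http://a.example/old.png"): 500,
}
new_run = {
    (page, "http://a.example/app.js"): 2048,    # same URL, same size
    (page, "http://a.example/site.css"): 1100,  # same URL, different size
    (page, "http://a.example/new.png"): 700,    # URL only in the new run
}

same, changed = compare_runs(old_run, new_run)
print(len(same), len(changed))
```

Requests that appear in only one of the two runs fall out of both groups, which mirrors what the inner join does.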
- ~48% of resource requests end up requesting the same URL. Of those…
  - ~84% fetch the same content (~40% of all requests and ~33% of total bytes)
  - ~16% fetch different content (~8% of all requests and ~9% of total bytes)
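As a quick sanity check on the request percentages above, the "of all requests" figures are just the same-URL share multiplied by each sub-share. The function name below is my own, and the byte percentages can't be derived this way because they depend on the actual response sizes.

```python
def share_of_all_requests(same_url_share, sub_share):
    """Convert a share of same-URL requests into a share of all requests."""
    return same_url_share * sub_share


same_url = 0.48  # requests that hit the same URL in both runs

same_content = share_of_all_requests(same_url, 0.84)
different_content = share_of_all_requests(same_url, 0.16)

print(f"{same_content:.0%}")       # 40%
print(f"{different_content:.0%}")  # 8%
```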