How many resources persist across a month's period?

If we compare two runs that are a month apart, how many of the same resources do we see? Here “same” means same URL and same content size: many resources don’t carry a revalidation token, so I’m using content size as a “good enough™” hash.

Counting the number of resources for the Top 100K sites:

SELECT COUNT(url) as reqs, 
    ROUND(SUM(respBodySize)/(1024*1024)) as respBodySizeMB 
FROM httparchive:runs.2015_01_15_requests as r JOIN (
  SELECT pageid, rank, url as pageurl 
  FROM httparchive:runs.2015_01_15_pages 
  WHERE rank <= 100000
) as p ON p.pageid=r.pageid

Comparing two runs to identify the same resources:

SELECT COUNT(s1.requesturl) as same_resources,
    ROUND(SUM(s1.respSize)/(1024*1024)) as new_respSizeMB,
    ROUND(SUM(s2.respSize)/(1024*1024)) as old_respSizeMB
FROM (
  SELECT pageurl, url as requesturl, respSize
  FROM httparchive:runs.2015_01_15_requests as r2 JOIN (
    SELECT pageid, rank, url as pageurl 
    FROM httparchive:runs.2015_01_15_pages 
    WHERE rank <= 100000
  ) as p2 ON p2.pageid = r2.pageid) as s2
  JOIN EACH (
    SELECT pageurl, url as requesturl, respSize
    FROM httparchive:runs.2015_02_15_requests as r1 JOIN (
      SELECT pageid, rank, url as pageurl 
      FROM httparchive:runs.2015_02_15_pages
      WHERE rank <= 100000
    ) as p1 ON p1.pageid = r1.pageid
    /*  same page, same URL */ 
  ) as s1 ON s1.pageurl = s2.pageurl AND s1.requesturl = s2.requesturl
/* response size is the same (~same resource; not 100% but close enough);
   flip the comparison to != to count resources whose size changed */
WHERE s1.respSize = s2.respSize
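
To make the matching rule concrete, here is a minimal Python sketch of the same idea on toy data: join two runs on (page URL, request URL), then split the matches by whether the response size is unchanged. The `Request` tuple, the `compare_runs` helper, and the example URLs are all made up for illustration. Unlike the SQL join, the dictionary keeps only one entry per (page, URL) pair, so treat it as a simplification rather than a faithful port of the query.

```python
from collections import namedtuple

# Hypothetical record shape: one row per request, mirroring the joined columns.
Request = namedtuple("Request", ["pageurl", "requesturl", "resp_size"])

def compare_runs(old_run, new_run):
    """Split new_run requests that also appear in old_run into same/changed."""
    # Index the old run by (page URL, request URL); duplicates collapse to one entry.
    old_sizes = {(r.pageurl, r.requesturl): r.resp_size for r in old_run}
    same, changed = [], []
    for r in new_run:
        key = (r.pageurl, r.requesturl)
        if key not in old_sizes:
            continue  # URL not requested in the old run, so it drops out of the join
        if old_sizes[key] == r.resp_size:
            same.append(r)      # same URL, same size: treated as the same resource
        else:
            changed.append(r)   # same URL, different size: treated as updated
    return same, changed

old_run = [
    Request("http://a.example/", "http://a.example/app.js", 1200),
    Request("http://a.example/", "http://a.example/logo.png", 5000),
    Request("http://a.example/", "http://a.example/old.css", 300),
]
new_run = [
    Request("http://a.example/", "http://a.example/app.js", 1350),    # size changed
    Request("http://a.example/", "http://a.example/logo.png", 5000),  # unchanged
    Request("http://a.example/", "http://a.example/new.css", 310),    # new URL
]

same, changed = compare_runs(old_run, new_run)
print(len(same), len(changed))  # 1 1
```

Equal size does not prove equal bytes, which is why this is only a stand-in for a real content hash.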

  • ~48% of resource requests are for a URL that was also requested a month earlier. Of those…
    • ~84% fetch the same content (~40% of all requests and ~33% of total bytes)
    • ~16% fetch different content (~8% of all requests and ~9% of total bytes)
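
As a quick sanity check on how the request percentages above relate to each other, the "of all requests" shares are just the same-URL share multiplied by each conditional share. The variable names below are mine, and the inputs are the rounded figures from the list, so the results are approximate. The byte shares can't be derived this way because they depend on the sizes of the individual responses.

```python
# Rounded figures taken from the list above.
same_url_share = 0.48          # requests whose URL also appeared a month earlier
same_content_given_url = 0.84  # of those, share with an unchanged response size
changed_given_url = 0.16       # of those, share with a different response size

same_content_of_all = same_url_share * same_content_given_url
changed_of_all = same_url_share * changed_given_url

print(f"same content: {same_content_of_all:.1%} of all requests")  # 40.3%, i.e. ~40%
print(f"changed:      {changed_of_all:.1%} of all requests")       # 7.7%, i.e. ~8%
```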