Resource Churn Across Crawls

How many resources change from one crawl to another?

The raw stats are in this Google Spreadsheet.

We get these stats by counting the resources that are repeated between two crawls and then taking the complement of that percentage. A resource is considered “repeated” if the request’s URL, parent URL (i.e., main HTML document), and Last-Modified response header are all the same in both crawls. This is an approximation since ~10% of responses have a blank Last-Modified header. (We don’t have the response bodies so we can’t compare them.) In this example query the result is 35,870 repeated resources for the top 1K URLs for desktop from Aug 1 2013 to Aug 15 2013:

SELECT COUNT(s1.requesturl)
FROM ( SELECT pageurl, url AS requesturl, resp_last_modified
       FROM httparchive:runs.2013_08_01_requests AS r2
       JOIN (SELECT pageid, rank, url AS pageurl FROM httparchive:runs.2013_08_01_pages WHERE rank <= 1000) AS p2
       ON p2.pageid=r2.pageid) AS s2
     JOIN EACH ( SELECT pageurl, url AS requesturl, resp_last_modified
            FROM httparchive:runs.2013_08_15_requests AS r1
            JOIN (SELECT pageid, rank, url AS pageurl FROM httparchive:runs.2013_08_15_pages WHERE rank <= 1000) AS p1
            ON p1.pageid=r1.pageid) AS s1
     ON s1.pageurl=s2.pageurl AND s1.requesturl=s2.requesturl
     WHERE s1.resp_last_modified = s2.resp_last_modified

Then we find the total number of resources. We have to join the requests and pages tables in order to apply the rank restriction. For this query the result is 95,542 total resources in the Top 1K URLs:

SELECT COUNT(url) FROM httparchive:runs.2013_08_15_requests AS r2
       JOIN (SELECT pageid, rank, url AS pageurl FROM httparchive:runs.2013_08_15_pages WHERE rank <= 1000) AS p2
       ON p2.pageid=r2.pageid;

The repeat rate is 35,870 / 95,542 = 37.54%. The churn rate is 100%-37.54% = 62.46%, as shown in the chart above.
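
The arithmetic can be double-checked with a few lines of Python. This is only a sketch: the function name `churn_rate` is made up for illustration and is not part of any HTTP Archive tooling. The two counts are the query results quoted above.

```python
def churn_rate(repeated: int, total: int) -> float:
    """Return churn as a percentage: 100% minus the repeat rate."""
    if total <= 0:
        raise ValueError("total must be positive")
    repeat_rate = 100.0 * repeated / total
    return 100.0 - repeat_rate


# Counts quoted above for the Top 1K desktop URLs, Aug 1 vs. Aug 15 2013.
churn = churn_rate(35870, 95542)
print(f"repeat rate: {100.0 - churn:.2f}%  churn: {churn:.2f}%")
```

Plugging in the counts from a different rank restriction or crawl pair gives the churn for that slice.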

To change the rank restriction, just substitute a new value for “1000”. To run the query on mobile, just append “_mobile” to the table names. For example, the Top 5K on mobile query is:

SELECT COUNT(s1.requesturl)
FROM ( SELECT pageurl, url AS requesturl, resp_last_modified
       FROM httparchive:runs.2013_08_01_requests_mobile AS r2
       JOIN (SELECT pageid, rank, url AS pageurl FROM httparchive:runs.2013_08_01_pages_mobile WHERE rank <= 5000) AS p2
       ON p2.pageid=r2.pageid) AS s2
     JOIN EACH ( SELECT pageurl, url AS requesturl, resp_last_modified
            FROM httparchive:runs.2013_08_15_requests_mobile AS r1
            JOIN (SELECT pageid, rank, url AS pageurl FROM httparchive:runs.2013_08_15_pages_mobile WHERE rank <= 5000) AS p1
            ON p1.pageid=r1.pageid) AS s1
     ON s1.pageurl=s2.pageurl AND s1.requesturl=s2.requesturl
     WHERE s1.resp_last_modified = s2.resp_last_modified
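
Since the variants differ only in the crawl labels, the rank cutoff, and the “_mobile” suffix, the substitution can be scripted. Here is a rough Python sketch, assuming the table naming pattern shown in the queries above. The helper `build_repeat_query` and its parameters are hypothetical names for illustration; it only builds the query text and does not run anything against BigQuery.

```python
def build_repeat_query(first_crawl: str, second_crawl: str,
                       max_rank: int = 1000, mobile: bool = False) -> str:
    """Build the repeated-resources query for crawl labels like '2013_08_01'."""
    suffix = "_mobile" if mobile else ""
    max_rank = int(max_rank)  # make sure only a number ends up in the SQL text

    def crawl_subquery(crawl: str) -> str:
        # Requests joined to pages so the rank restriction can be applied.
        return (
            "SELECT pageurl, url AS requesturl, resp_last_modified\n"
            f"  FROM httparchive:runs.{crawl}_requests{suffix} AS r\n"
            "  JOIN (SELECT pageid, rank, url AS pageurl\n"
            f"        FROM httparchive:runs.{crawl}_pages{suffix}\n"
            f"        WHERE rank <= {max_rank}) AS p\n"
            "  ON p.pageid = r.pageid"
        )

    return (
        "SELECT COUNT(s1.requesturl)\n"
        f"FROM ( {crawl_subquery(first_crawl)} ) AS s2\n"
        f"JOIN EACH ( {crawl_subquery(second_crawl)} ) AS s1\n"
        "ON s1.pageurl = s2.pageurl AND s1.requesturl = s2.requesturl\n"
        "WHERE s1.resp_last_modified = s2.resp_last_modified"
    )


# Top 5K on mobile, Aug 1 vs. Aug 15 2013.
query = build_repeat_query("2013_08_01", "2013_08_15", max_rank=5000, mobile=True)
print(query)
```

The generated text uses the aliases `r` and `p` inside each subquery instead of `r1`/`p1`/`r2`/`p2`, which should not change the result since the aliases are local to each subquery.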

@stevesoudersorg curious: if we look at ETags instead of Last-Modified, does the picture change? In theory, we should allow for both in this analysis, since if the ETag is the same, then we know the resource has not changed.

Ilya: Here’s the chart for “the same” using Last-Modified AND ETag equality:

Churn is higher by a few percentage points. See the raw data in the “Resource Churn ETag” worksheet. The sample query is:

SELECT COUNT(s1.requesturl)
FROM ( SELECT pageurl, url AS requesturl, resp_last_modified, resp_etag
       FROM httparchive:runs.2013_08_01_requests AS r2
       JOIN (SELECT pageid, rank, url AS pageurl FROM httparchive:runs.2013_08_01_pages WHERE rank <= 1000) AS p2
       ON p2.pageid=r2.pageid) AS s2
     JOIN EACH ( SELECT pageurl, url AS requesturl, resp_last_modified, resp_etag
            FROM httparchive:runs.2013_08_15_requests AS r1
            JOIN (SELECT pageid, rank, url AS pageurl FROM httparchive:runs.2013_08_15_pages WHERE rank <= 1000) AS p1
            ON p1.pageid=r1.pageid) AS s1
     ON s1.pageurl=s2.pageurl AND s1.requesturl=s2.requesturl
     WHERE (s1.resp_last_modified = s2.resp_last_modified AND s1.resp_etag = s2.resp_etag)
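
Requiring both headers to match is a stricter test than Last-Modified alone, so the repeated count can only stay the same or drop, and churn can only stay the same or rise. A toy Python sketch of the two matching rules follows. The `Resource` type, the `count_repeated` helper, and the sample rows are all invented for illustration, and the sketch assumes each (page URL, request URL) pair appears at most once per crawl.

```python
from typing import Iterable, NamedTuple


class Resource(NamedTuple):
    page_url: str
    request_url: str
    last_modified: str
    etag: str


def count_repeated(earlier: Iterable[Resource], later: Iterable[Resource],
                   use_etag: bool = False) -> int:
    """Count resources in `later` that match one in `earlier` on page URL,
    request URL, Last-Modified, and (optionally) ETag."""
    def key(r: Resource) -> tuple:
        base = (r.page_url, r.request_url, r.last_modified)
        return base + (r.etag,) if use_etag else base

    earlier_keys = {key(r) for r in earlier}
    return sum(1 for r in later if key(r) in earlier_keys)


# Made-up sample data standing in for two crawls of one page.
page = "http://a.example/"
aug01 = [
    Resource(page, page + "app.js", "Mon, 01 Jul 2013 00:00:00 GMT", '"v1"'),
    Resource(page, page + "logo.png", "Tue, 02 Jul 2013 00:00:00 GMT", '"x1"'),
    Resource(page, page + "style.css", "Wed, 03 Jul 2013 00:00:00 GMT", '"s1"'),
]
aug15 = [
    Resource(page, page + "app.js", "Mon, 01 Jul 2013 00:00:00 GMT", '"v1"'),    # unchanged
    Resource(page, page + "logo.png", "Tue, 02 Jul 2013 00:00:00 GMT", '"x2"'),  # ETag differs
    Resource(page, page + "style.css", "Sat, 10 Aug 2013 00:00:00 GMT", '"s2"'), # both differ
]

by_last_modified = count_repeated(aug01, aug15)
by_both = count_repeated(aug01, aug15, use_etag=True)
print(by_last_modified, by_both)
```

In this toy data, `logo.png` counts as repeated under the Last-Modified rule but not under the combined rule, which is the kind of case that pushes churn up in the ETag chart.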