Resource Churn Across Crawls

stevesoudersorg · September 29, 2013, 11:51pm

How many resources change from one crawl to another?

The raw stats are in this Google Spreadsheet.

We get these stats by counting the resources that are repeated, and then taking the complement of that percentage. A resource is considered “repeated” if the request’s URL, parent URL (i.e., main HTML document), and Last-Modified response header are all the same. This is an approximation since ~10% of responses have a blank Last-Modified header. (We don’t have the response bodies so we can’t compare them.) In this example query the result is 35,870 repeated resources for the top 1K URLs for desktop from Aug 1 2013 to Aug 15 2013:

select count(s1.requesturl)
FROM ( SELECT pageurl, url as requesturl, resp_last_modified
       FROM httparchive:runs.2013_08_01_requests as r2
       JOIN (SELECT pageid, rank, url as pageurl FROM httparchive:runs.2013_08_01_pages where rank <= 1000) as p2
       ON p2.pageid=r2.pageid) as s2
     JOIN EACH ( SELECT pageurl, url as requesturl, resp_last_modified
            FROM httparchive:runs.2013_08_15_requests as r1
            JOIN (SELECT pageid, rank, url as pageurl FROM httparchive:runs.2013_08_15_pages where rank <= 1000) as p1
            ON p1.pageid=r1.pageid) as s1
     ON s1.pageurl=s2.pageurl and s1.requesturl=s2.requesturl
     WHERE s1.resp_last_modified = s2.resp_last_modified
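
For anyone who wants to see the matching rule without BigQuery, here is a minimal Python sketch of what the join above counts. The function name `count_repeated`, the sample rows, and the example.com URLs are all made up for illustration; only the three-field comparison comes from the query. It also assumes a blank Last-Modified is stored as an empty string, which may not match how the real tables store it.

```python
from collections import Counter

def count_repeated(earlier, later):
    """Count row pairs that agree on page URL, request URL and Last-Modified.

    Each crawl is a list of (page_url, request_url, last_modified) tuples.
    Like the inner join in the query, a row that appears more than once
    contributes one match per matching pair.
    """
    earlier_rows = Counter(earlier)
    return sum(earlier_rows[row] for row in later)

# Hypothetical sample data, not taken from the HTTP Archive.
aug_01 = [
    ("http://a.example/", "http://a.example/app.js", "Mon, 01 Jul 2013 00:00:00 GMT"),
    ("http://a.example/", "http://a.example/logo.png", "Tue, 02 Jul 2013 00:00:00 GMT"),
    ("http://b.example/", "http://b.example/site.css", "Wed, 03 Jul 2013 00:00:00 GMT"),
]
aug_15 = [
    ("http://a.example/", "http://a.example/app.js", "Mon, 01 Jul 2013 00:00:00 GMT"),   # unchanged
    ("http://a.example/", "http://a.example/logo.png", "Fri, 09 Aug 2013 00:00:00 GMT"), # modified
    ("http://b.example/", "http://b.example/site.css", "Wed, 03 Jul 2013 00:00:00 GMT"), # unchanged
    ("http://b.example/", "http://b.example/new.js", "Sat, 10 Aug 2013 00:00:00 GMT"),   # new
]

repeated = count_repeated(aug_01, aug_15)
print(repeated)
```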

Then we find the total number of resources in the later crawl (Aug 15 2013). We have to do a join between requests & pages in order to incorporate the rank restriction. For this query the result is 95,542 total resources in the Top 1K URLs:

SELECT count(url) from httparchive:runs.2013_08_15_requests as r2
       JOIN (SELECT pageid, rank, url as pageurl FROM httparchive:runs.2013_08_15_pages where rank <= 1000) as p2
       ON p2.pageid=r2.pageid;

The repeat rate is 35,870 / 95,542 = 37.54%. The churn rate is 100%-37.54% = 62.46%, as shown in the chart above.
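
The same arithmetic as a short Python check, using the two counts from the queries above (the variable names are just for illustration):

```python
repeated = 35_870   # result of the "repeated resources" query
total = 95_542      # result of the "total resources" query

repeat_rate = repeated / total * 100
churn_rate = 100 - repeat_rate

print(f"repeat rate: {repeat_rate:.2f}%")
print(f"churn rate:  {churn_rate:.2f}%")
```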

To increase the rank restriction, substitute a new value for “1000”. To run the query on mobile, append “_mobile” to the table names. For example, the Top 5K on mobile query is:

select count(s1.requesturl)
FROM ( SELECT pageurl, url as requesturl, resp_last_modified
       FROM httparchive:runs.2013_08_01_requests_mobile as r2
       JOIN (SELECT pageid, rank, url as pageurl FROM httparchive:runs.2013_08_01_pages_mobile where rank <= 5000) as p2
       ON p2.pageid=r2.pageid) as s2
     JOIN EACH ( SELECT pageurl, url as requesturl, resp_last_modified
            FROM httparchive:runs.2013_08_15_requests_mobile as r1
            JOIN (SELECT pageid, rank, url as pageurl FROM httparchive:runs.2013_08_15_pages_mobile where rank <= 5000) as p1
            ON p1.pageid=r1.pageid) as s1
     ON s1.pageurl=s2.pageurl and s1.requesturl=s2.requesturl
     WHERE s1.resp_last_modified = s2.resp_last_modified
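
If you are running many rank/device combinations, the substitutions can be scripted. This is only a sketch: the helper `build_repeated_query` and its parameters are hypothetical, and it simply fills the table names and rank limit into the query text shown above (whitespace differs slightly, the SQL is the same).

```python
# Template for one crawl's side of the join; placeholders are filled per crawl.
SUBQUERY = """SELECT pageurl, url as requesturl, resp_last_modified
  FROM httparchive:runs.{date}_requests{suffix} as {r}
  JOIN (SELECT pageid, rank, url as pageurl
        FROM httparchive:runs.{date}_pages{suffix} where rank <= {max_rank}) as {p}
  ON {p}.pageid={r}.pageid"""

def build_repeated_query(earlier, later, max_rank=1000, mobile=False):
    """Return the 'repeated resources' query text for two crawl dates.

    earlier/later are crawl labels such as "2013_08_01" (hypothetical helper).
    """
    suffix = "_mobile" if mobile else ""
    limit = int(max_rank)  # keep the rank limit a plain integer
    s2 = SUBQUERY.format(date=earlier, suffix=suffix, max_rank=limit, r="r2", p="p2")
    s1 = SUBQUERY.format(date=later, suffix=suffix, max_rank=limit, r="r1", p="p1")
    return (
        "select count(s1.requesturl)\n"
        f"FROM ( {s2} ) as s2\n"
        f"JOIN EACH ( {s1} ) as s1\n"
        "ON s1.pageurl=s2.pageurl and s1.requesturl=s2.requesturl\n"
        "WHERE s1.resp_last_modified = s2.resp_last_modified"
    )

# Top 5K on mobile, Aug 1 vs Aug 15 2013.
query = build_repeated_query("2013_08_01", "2013_08_15", max_rank=5000, mobile=True)
print(query)
```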

igrigorik · September 30, 2013, 6:42am

@stevesoudersorg curious, if we look at ETags instead of Last-Modified, does the picture change? In theory, we should allow for both in this analysis, since if the ETag is the same, then we know the resource has not changed.

stevesoudersorg · September 30, 2013, 5:41pm

Ilya: Here’s the chart for “the same” using Last-Modified AND ETag equality:

Churn is higher by a few percentage points. See the raw data in the “Resource Churn ETag” worksheet. The sample query is:

select count(s1.requesturl)
FROM ( SELECT pageurl, url as requesturl, resp_last_modified, resp_etag
       FROM httparchive:runs.2013_08_01_requests as r2
       JOIN (SELECT pageid, rank, url as pageurl FROM httparchive:runs.2013_08_01_pages where rank <= 1000) as p2
       ON p2.pageid=r2.pageid) as s2
     JOIN EACH ( SELECT pageurl, url as requesturl, resp_last_modified, resp_etag
            FROM httparchive:runs.2013_08_15_requests as r1
            JOIN (SELECT pageid, rank, url as pageurl FROM httparchive:runs.2013_08_15_pages where rank <= 1000) as p1
            ON p1.pageid=r1.pageid) as s1
     ON s1.pageurl=s2.pageurl and s1.requesturl=s2.requesturl
     WHERE (s1.resp_last_modified = s2.resp_last_modified AND s1.resp_etag = s2.resp_etag)
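
Requiring both headers to match can only shrink the repeated count (and so raise churn), since every pair that passes the stricter test also passes the Last-Modified-only test. A small Python sketch with made-up data shows the effect; the dictionary keys and sample values are hypothetical stand-ins for the resp_last_modified and resp_etag columns.

```python
def same_last_modified(a, b):
    """Original rule: Last-Modified values are equal."""
    return a["last_modified"] == b["last_modified"]

def same_both(a, b):
    """Stricter rule from the query above: Last-Modified AND ETag are equal."""
    return same_last_modified(a, b) and a["etag"] == b["etag"]

# Hypothetical (Aug 1, Aug 15) pairs for the same page URL and request URL.
pairs = [
    # Both validators unchanged.
    ({"last_modified": "Mon, 01 Jul 2013 00:00:00 GMT", "etag": '"abc"'},
     {"last_modified": "Mon, 01 Jul 2013 00:00:00 GMT", "etag": '"abc"'}),
    # Same Last-Modified, different ETag.
    ({"last_modified": "Tue, 02 Jul 2013 00:00:00 GMT", "etag": '"def"'},
     {"last_modified": "Tue, 02 Jul 2013 00:00:00 GMT", "etag": '"xyz"'}),
    # Both validators changed.
    ({"last_modified": "Wed, 03 Jul 2013 00:00:00 GMT", "etag": '"111"'},
     {"last_modified": "Fri, 09 Aug 2013 00:00:00 GMT", "etag": '"222"'}),
]

repeated_lm = sum(same_last_modified(a, b) for a, b in pairs)
repeated_both = sum(same_both(a, b) for a, b in pairs)
print(repeated_lm, repeated_both)
```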
