How many resources change from one crawl to another?
The raw stats are in this Google Spreadsheet.
We get these stats by finding the number of resources that are repeated, and then taking the inverse percentage (100% minus the repeat rate). A resource is considered “repeated” if the request’s URL, parent URL (i.e., main HTML document), and Last-Modified response header are all the same in both crawls. This is an approximation since ~10% of responses have a blank Last-Modified header. (We don’t have the response bodies so we can’t compare them.) In this example query the result is 35,870 repeated resources for the top 1K URLs for desktop from Aug 1 2013 to Aug 15 2013:
SELECT COUNT(s1.requesturl)
FROM (SELECT pageurl, url AS requesturl, resp_last_modified
      FROM httparchive:runs.2013_08_01_requests AS r2
      JOIN (SELECT pageid, rank, url AS pageurl FROM httparchive:runs.2013_08_01_pages WHERE rank <= 1000) AS p2
      ON p2.pageid = r2.pageid) AS s2
JOIN EACH (SELECT pageurl, url AS requesturl, resp_last_modified
      FROM httparchive:runs.2013_08_15_requests AS r1
      JOIN (SELECT pageid, rank, url AS pageurl FROM httparchive:runs.2013_08_15_pages WHERE rank <= 1000) AS p1
      ON p1.pageid = r1.pageid) AS s1
ON s1.pageurl = s2.pageurl AND s1.requesturl = s2.requesturl
WHERE s1.resp_last_modified = s2.resp_last_modified
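The matching rule behind the query above can be sketched in a few lines of Python. This is only an illustration of the “repeated” definition, not how the stats were actually computed: the `Resource` type, the `count_repeated` function, and the sample rows are all made up for this example. The sketch treats a blank Last-Modified header as an empty string that matches another blank; whether the query behaves the same way depends on how blanks are stored in the tables. It also counts each resource in the later crawl at most once, whereas a SQL join would count duplicate rows multiple times.

```python
# Illustrative sketch only: the rows below are invented, not HTTP Archive data.
from typing import Iterable, NamedTuple


class Resource(NamedTuple):
    page_url: str       # parent URL (main HTML document)
    request_url: str    # URL of the resource itself
    last_modified: str  # Last-Modified response header ("" when blank)


def count_repeated(earlier: Iterable[Resource], later: Iterable[Resource]) -> int:
    """Count resources in `later` whose (page URL, request URL,
    Last-Modified) triple also appears in `earlier`."""
    seen = set(earlier)
    return sum(1 for resource in later if resource in seen)


aug_01 = [
    Resource("http://a.example/", "http://a.example/app.js", "Mon, 01 Jul 2013 00:00:00 GMT"),
    Resource("http://a.example/", "http://a.example/style.css", "Tue, 02 Jul 2013 00:00:00 GMT"),
]
aug_15 = [
    # Same URLs and same Last-Modified: repeated.
    Resource("http://a.example/", "http://a.example/app.js", "Mon, 01 Jul 2013 00:00:00 GMT"),
    # Same URLs but a newer Last-Modified: changed.
    Resource("http://a.example/", "http://a.example/style.css", "Fri, 09 Aug 2013 00:00:00 GMT"),
    # Not present in the earlier crawl: new.
    Resource("http://a.example/", "http://a.example/logo.png", "Wed, 03 Jul 2013 00:00:00 GMT"),
]

print(count_repeated(aug_01, aug_15))  # prints 1
```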
Then we find the total number of resources. We have to join the requests and pages tables in order to incorporate the rank restriction. For this query the result is 95,542 total resources in the Top 1K URLs:
SELECT COUNT(url) FROM httparchive:runs.2013_08_15_requests AS r2
JOIN (SELECT pageid, rank, url AS pageurl FROM httparchive:runs.2013_08_15_pages WHERE rank <= 1000) AS p2
ON p2.pageid = r2.pageid;
The repeat rate is 35,870 / 95,542 = 37.54%. The churn rate is 100%-37.54% = 62.46%, as shown in the chart above.
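As a quick check of the arithmetic, here is a minimal Python sketch. The `churn_rate` function name is just for illustration; the two counts are the query results quoted above.

```python
def churn_rate(repeated: int, total: int) -> float:
    """Return the churn rate as a percentage: 100% minus the repeat rate."""
    if total <= 0:
        raise ValueError("total must be positive")
    repeat_rate = 100.0 * repeated / total
    return 100.0 - repeat_rate


# 35,870 repeated resources out of 95,542 total (Top 1K, desktop).
print(f"{churn_rate(35_870, 95_542):.2f}%")  # prints 62.46%
```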
To increase the rank restriction, just substitute a new value for “1000”. To run the query on mobile, just append “_mobile” to the table names. For example, the Top 5K query on mobile is:
SELECT COUNT(s1.requesturl)
FROM (SELECT pageurl, url AS requesturl, resp_last_modified
      FROM httparchive:runs.2013_08_01_requests_mobile AS r2
      JOIN (SELECT pageid, rank, url AS pageurl FROM httparchive:runs.2013_08_01_pages_mobile WHERE rank <= 5000) AS p2
      ON p2.pageid = r2.pageid) AS s2
JOIN EACH (SELECT pageurl, url AS requesturl, resp_last_modified
      FROM httparchive:runs.2013_08_15_requests_mobile AS r1
      JOIN (SELECT pageid, rank, url AS pageurl FROM httparchive:runs.2013_08_15_pages_mobile WHERE rank <= 5000) AS p1
      ON p1.pageid = r1.pageid) AS s1
ON s1.pageurl = s2.pageurl AND s1.requesturl = s2.requesturl
WHERE s1.resp_last_modified = s2.resp_last_modified
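If you plan to run these queries for several ranks or for both desktop and mobile, the substitutions described above can be scripted. The following is a rough sketch in Python, assuming the table naming pattern used in the queries in this post. The `table_name` and `total_resources_query` helpers are hypothetical and only build the query text; they don't talk to BigQuery.

```python
# Hypothetical helpers for illustration; not part of HTTP Archive or BigQuery.
def table_name(crawl_date: str, kind: str, mobile: bool = False) -> str:
    """Build a table name such as httparchive:runs.2013_08_15_requests_mobile."""
    if kind not in ("pages", "requests"):
        raise ValueError("kind must be 'pages' or 'requests'")
    suffix = "_mobile" if mobile else ""
    return f"httparchive:runs.{crawl_date}_{kind}{suffix}"


def total_resources_query(crawl_date: str, max_rank: int, mobile: bool = False) -> str:
    """Return the text of the total-resources query for one crawl."""
    requests = table_name(crawl_date, "requests", mobile)
    pages = table_name(crawl_date, "pages", mobile)
    return (
        f"SELECT count(url) FROM {requests} AS r2\n"
        f"JOIN (SELECT pageid, rank, url AS pageurl FROM {pages} "
        f"WHERE rank <= {int(max_rank)}) AS p2\n"
        f"ON p2.pageid=r2.pageid;"
    )


# Total resources for the Top 5K on mobile, Aug 15 2013 crawl.
print(total_resources_query("2013_08_15", 5000, mobile=True))
```

The repeated-resources query can be generated the same way by calling `table_name` once per crawl date.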