Is expiration window commensurate with the resource change rate?

I am trying to compare the delta between last-modified date and the date header versus the expiration time to see if they are aggressive or conservative

The first challenge was to convert a RFC 1123 style date format to microseconds and then convert it to seconds so as to enable comparison with the already existing expAge field which does a decent job on the TTL value for a resource

Second, I chose to group by MIME type as each format changes at a different rate.

Finally, I chose median over average but its as easy to compare average

Here’s the final huge query that does the job (am sure this could be rewritten to make it much more efficient)

		SELECT count(requestid) AS ct,
               lower(regexp_extract(regexp_extract(resp_content_type,r'(^[\w\/\-]+)'),r'\/(?:x-)?(.*)')) AS content_type,
               NTH(50, quantiles(expAge)) AS median_TTL,
               NTH(50, quantiles(INTEGER((PARSE_UTC_USEC( regexp_replace( regexp_replace( regexp_replace( regexp_replace( regexp_replace( regexp_replace( regexp_replace( regexp_replace( regexp_replace( regexp_replace( regexp_replace( regexp_replace( regexp_replace( regexp_replace( regexp_replace(resp_date, r'(Mon|Tue|Wed|Thu|Fri|Sat|Sun)[,] ',''), r' GMT',''), r'Jan', '01'), r'Feb', '02'), r'Mar', '03'), r'Apr', '04'), r'May', '05'), r'Jun', '06'), r'Jul', '07'), r'Aug', '08'), r'Sep', '09'), r'Oct', '10'), r'Nov', '11'), r'Dec', '12'), r'(\S+)\s(\S+)\s(\S+)\s(\S+)',r'\3-\2-\1 \4')) - PARSE_UTC_USEC( regexp_replace( regexp_replace( regexp_replace( regexp_replace( regexp_replace( regexp_replace( regexp_replace( regexp_replace( regexp_replace( regexp_replace( regexp_replace( regexp_replace( regexp_replace( regexp_replace( regexp_replace(resp_last_modified, r'(Mon|Tue|Wed|Thu|Fri|Sat|Sun)[,] ',''), r' GMT',''), r'Jan', '01'), r'Feb', '02'), r'Mar', '03'), r'Apr', '04'), r'May', '05'), r'Jun', '06'), r'Jul', '07'), r'Aug', '08'), r'Sep', '09'), r'Oct', '10'), r'Nov', '11'), r'Dec', '12'), r'(\S+)\s(\S+)\s(\S+)\s(\S+)',r'\3-\2-\1 \4')))/POW(10,6)))) AS median_last_change,
        FROM httparchive:runs.latest_requests
        WHERE resp_last_modified != "null"
          AND resp_expires !="null"
          AND resp_date != "null"
          AND resp_last_modified IS NOT NULL
          AND resp_expires IS NOT NULL
          AND resp_date IS NOT NULL
          AND resp_content_type IS NOT NULL
          AND mimeType IS NOT NULL
          AND mimeType != ""
        GROUP BY content_type HAVING ct > 100
        ORDER BY ct DESC LIMIT 10;

Desktop:

Mobile:

My interpretation of the above results:

  • Atleast 50% of the HTML and JSON requests are marked uncacheable whereas the median rate of change happens to be 4.5 days. Maybe we can cache more of HTML and JSON than we think
  • Interestingly most gifs in this dataset are tracking gifs and hence have no cacheability headers and probably the last modified does not apply to such endpoints
  • JPEGs are most cacheable but they change more on mobile and also are more cacheable on mobile. For example the median rate of change on desktop is around 64 days but on mobile its only 10 days while being only cacheable for 1 day on desktop compared to a week on mobile
  • PNGs expire fast on Desktop (20 min) compared to mobile (1 day) which might be due to the deployment of sprites mostly on desktop versus mobile
  • Javascript TTL on mobile is twice that of desktop compared to closer rates of change (40+ days on desktop and 27 odd days on mobile)
  • CSS is way more cacheable on mobile than on desktop probably due to sprites again ( 30 min on desktop versus a day on mobile)
  • EOT fonts are popular on desktop whereas not so on mobile but they have a really long expiry window. WOFF seems to be popular on mobile albeit cacheable for a day
  • I dont see any reason why icons’ median expiry time should be 0
  • Plain text to some degree can be understood to have a 0 median probably due to missing mime type of some kind of application debugging

Let me know if I misinterpreted something or something more can be gleaned from this. I plan to play around by choosing a different metric like avg or mode/p75, loosen the filter on count or look at the tail instead of the current top 10 or even do the same analysis grouped by domain etc. I shall post anything interesting here.

@stevesoudersorg: It was nice meeting you today at the meetup.

This query is an alternative answer to your original blogspot at http://www.stevesouders.com/blog/2012/10/09/webperfdays-performance-tools/ in section Missing Tools bullet point #1

Sitespeed.io already does this in its analyses and its the only tool that gives me this kind of data