I just wrote a Fastly blog post where I talk about using the HTTP Archive data to uncover better gzip defaults. Here are some of the queries I used.
I did a lot of analysis to see the connection between file extensions and Content-Types. Here’s the query and results for “.json” responses. We see that 3% of JSON responses don’t even have a Content-Type. Only 16% are “application/json”; most are “application/javascript”. There are also some weird Content-Types like “application/octet-stream” and “js”.
SELECT regexp_extract(resp_content_type, r'^([^;]*)') as stat, count(*) as num, RATIO_TO_REPORT(num) OVER() AS ratio
FROM [httparchive:runs.2014_09_01_requests]
WHERE ( regexp_match(url, r'\.(json)$') OR regexp_match(url, r'\.(json)\?') )
GROUP BY stat
ORDER BY num desc
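To make the query's logic concrete, here is a minimal Python sketch of the same grouping, run over a few made-up rows rather than the HTTP Archive table. The function names and the sample data are hypothetical; only the URL patterns and the Content-Type regex are taken from the query.

```python
import re
from collections import Counter

def has_extension(url, ext):
    # Same idea as the two regexp_match calls: the extension sits at the
    # end of the URL or immediately before a "?".
    return re.search(r'\.(' + re.escape(ext) + r')(\?|$)', url) is not None

def base_content_type(header):
    # Mirrors regexp_extract(resp_content_type, r'^([^;]*)'):
    # keep everything before the first ";".
    return re.match(r'^([^;]*)', header).group(1)

def content_type_ratios(rows, ext):
    """rows: iterable of (url, content_type) pairs.
    Returns [(content_type, count, ratio)] ordered by count, descending."""
    counts = Counter(
        base_content_type(ct) for url, ct in rows if has_extension(url, ext)
    )
    total = sum(counts.values())
    return [(ct, n, n / total) for ct, n in counts.most_common()]

# Hypothetical sample rows, not real HTTP Archive data.
sample_rows = [
    ("http://example.com/a.json", "application/json; charset=utf-8"),
    ("http://example.com/b.json?v=2", "application/javascript"),
    ("http://example.com/c.json", "application/javascript"),
    ("http://example.com/d.json", ""),
    ("http://example.com/json/e.html", "text/html"),
]

for row in content_type_ratios(sample_rows, "json"):
    print(row)
```

Note that, like the query, this sketch does not lowercase or trim the Content-Type, so oddly cased or padded values are counted as separate types.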
Using the same query but replacing “json” with “eot” gives the following breakdown: 71% of EOTs are accurately typed as “font/eot”, but “application/vnd.ms-fontobject” is in second place with 17%, followed by “application/octet-stream” at 6%, “text/plain” at 3%, and “text/html” at 1%. It’s clear that EOT fonts do not always have accurate Content-Types.
One question about EOT fonts is whether they should be gzipped. An EOT font file may already be compressed, but that is an extra step in the file creation process and is not widely adopted. The following query looks at the Content-Encoding header for EOT responses, where a response counts as EOT if either its URL or its Content-Type says so. The results show that 79% of them are gzipped. Assuming the website owner knows best, it’s good to gzip EOT responses by default.
SELECT resp_content_encoding as stat, count(*) as num, RATIO_TO_REPORT(num) OVER() as ratio
FROM [httparchive:runs.2014_09_01_requests]
WHERE ( regexp_match(url, r'\.(eot)$') OR regexp_match(url, r'\.(eot)\?') )
OR
( regexp_match(lower(resp_content_type), r'^(font/eot)\s*$') OR regexp_match(lower(resp_content_type), r'^(font/eot)\s*;') )
GROUP BY stat
ORDER BY num desc
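The selection in this query (EOT by URL extension or by Content-Type) can be sketched in Python as follows. This is only an illustration over invented rows: the names `is_eot` and `encoding_ratios` are hypothetical, and the sketch assumes that Content-Type parameters, when present, follow the media type after a ";".

```python
import re
from collections import Counter

# URL ends in ".eot" or has ".eot" right before a "?".
URL_EOT = re.compile(r'\.(eot)(\?|$)')
# Assumption: parameters such as "; charset=..." are introduced by ";".
TYPE_EOT = re.compile(r'^(font/eot)\s*(;|$)')

def is_eot(url, content_type):
    # A response counts as EOT if either the URL or the Content-Type says so.
    return bool(URL_EOT.search(url) or TYPE_EOT.search(content_type.lower()))

def encoding_ratios(rows):
    """rows: iterable of (url, content_type, content_encoding) tuples.
    Returns {content_encoding: share of EOT responses}."""
    counts = Counter(enc for url, ct, enc in rows if is_eot(url, ct))
    total = sum(counts.values())
    return {enc: n / total for enc, n in counts.items()}

# Hypothetical sample rows, not real HTTP Archive data.
sample_rows = [
    ("http://example.com/f1.eot", "application/vnd.ms-fontobject", "gzip"),
    ("http://example.com/f2.eot?#iefix", "font/eot", "gzip"),
    ("http://example.com/font?id=3", "Font/EOT; charset=binary", "gzip"),
    ("http://example.com/f4.eot", "application/octet-stream", ""),
    ("http://example.com/app.js", "application/javascript", "gzip"),
]

print(encoding_ratios(sample_rows))
```

The third sample row shows why the Content-Type branch matters: its URL has no ".eot" extension, so only the header identifies it as an EOT font.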
Generally we assume that any Content-Type that begins with “image/” should not be compressed, but that’s not always true. “.ico” images and “image/svg+xml” responses benefit from being gzipped. We can quantify this using the _gzip_save column. The first query below shows that “.ico” images that are not already gzipped can be reduced by 53% using gzip, and the second query shows that “image/svg+xml” responses can be reduced by 47%.
SELECT count(*), round(100*avg(_gzip_save) / avg(respBodySize)) as percent
FROM [httparchive:runs.2014_09_01_requests]
WHERE ( regexp_match(url, r'\.(ico)$') OR regexp_match(url, r'\.(ico)\?') )
AND not resp_content_encoding contains "gzip"
SELECT count(*), round(100*avg(_gzip_save) / avg(respBodySize)) as percent
FROM [httparchive:runs.2014_09_01_requests]
WHERE ( regexp_match(lower(resp_content_type), r'^(image/svg\+xml)\s*$') OR regexp_match(lower(resp_content_type), r'^(image/svg\+xml)\s*;') )
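The percentage in these queries is the average number of bytes gzip would save divided by the average response body size. Here is a rough Python sketch of that calculation, assuming _gzip_save means "bytes that gzip would remove from the response body" (my reading of the column, not something the queries spell out). The helper names and the synthetic SVG body are hypothetical.

```python
import gzip

def gzip_save(body):
    """Bytes saved by gzipping body; 0 if gzip would not shrink it."""
    return max(0, len(body) - len(gzip.compress(body)))

def percent_saved(bodies):
    # avg(save) / avg(size) over the same rows equals sum(save) / sum(size),
    # so large responses weigh more than small ones.
    total_save = sum(gzip_save(b) for b in bodies)
    total_size = sum(len(b) for b in bodies)
    return round(100 * total_save / total_size)

# A synthetic, highly repetitive SVG body; real files will compress less well.
svg = (
    b'<svg xmlns="http://www.w3.org/2000/svg">'
    + b'<circle cx="10" cy="10" r="5"/>' * 200
    + b'</svg>'
)

print(percent_saved([svg]))
```

Because the ratio is computed from averages rather than by averaging per-response percentages, a handful of large responses can dominate the result.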