Better gzip settings

I just wrote a Fastly blog post where I talk about using the HTTP Archive data to uncover better gzip defaults. Here are some of the queries I used.

I did a lot of analysis to see the connection between file extensions and Content-Types. Here’s the query and results for “.json” responses. We see that 3% of JSON responses don’t even have a Content-Type. 16% are “application/json”. Most of them are “application/javascript”. And there are some weird Content-Types like “application/octet-stream” and “js”.

SELECT regexp_extract(resp_content_type, r'^([^;]*)') as stat, count(*) as num, RATIO_TO_REPORT(num) OVER() AS ratio
FROM [httparchive:runs.2014_09_01_requests]
WHERE ( regexp_match(url, r'\.(json)$') OR regexp_match(url, r'\.(json)\?') )
GROUP BY stat
ORDER BY num desc
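For anyone who wants to run the same tally on an exported list of headers instead of in BigQuery, here is a minimal Python sketch of what the query computes. The function names and the sample values are hypothetical, not HTTP Archive data. The regular expression is the same one the query passes to regexp_extract.

```python
import re
from collections import Counter

# Same pattern as the query's regexp_extract(resp_content_type, r'^([^;]*)'):
# keep everything before the first ';' so parameters such as charset are dropped.
_BARE_TYPE = re.compile(r'^([^;]*)')


def bare_content_type(header):
    """Return the Content-Type value up to the first ';', or None if the header is missing."""
    if header is None:
        return None
    return _BARE_TYPE.match(header).group(1)


def content_type_ratios(headers):
    """Return (content_type, num, ratio) rows, most common first."""
    counts = Counter(bare_content_type(h) for h in headers)
    total = sum(counts.values())
    return [(ctype, num, num / total) for ctype, num in counts.most_common()]


# Hypothetical sample values for illustration only.
sample = [
    "application/json; charset=utf-8",
    "application/json",
    "application/javascript",
    None,  # response with no Content-Type header
]
rows = content_type_ratios(sample)
```

Like the query, this does not trim whitespace or lowercase the value, so "Application/JSON" and "application/json" would be counted separately.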

Using the same query but replacing “json” with “eot” gives the results below. 71% of EOTs are accurately typed as “font/eot”, but “application/” is in second place with 17%, followed by “application/octet-stream” at 6%, “text/plain” at 3%, and “text/html” at 1%. It’s clear that many EOT fonts do not have accurate Content-Types.

A question about EOT fonts is whether or not they should be gzipped. It’s possible that the EOT font file is already compressed, but this is an extra step in the file creation process that’s not widely adopted. The following query looks at the Content-Encoding header for EOT responses. The results show that 79% of them are gzipped. Assuming the website owner knows best, it’s good to gzip EOT responses by default.

SELECT resp_content_encoding as stat, count(*) as num, RATIO_TO_REPORT(num) OVER() as ratio
FROM [httparchive:runs.2014_09_01_requests]
WHERE ( regexp_match(url, r'\.(eot)$') OR regexp_match(url, r'\.(eot)\?') )
      OR ( regexp_match(lower(resp_content_type), r'^(font/eot)\s*$') OR regexp_match(lower(resp_content_type), r'^(font/eot)\s*;') )
GROUP BY stat
ORDER BY num desc

Generally we assume that any Content-Type that begins with “image/” should not be compressed, but that’s not always true. “.ico” images and “image/svg+xml” responses benefit from being gzipped. We can quantify this using the _gzip_save column. The first query below shows that “.ico” images can be reduced by 53% using gzip; the second shows that “image/svg+xml” responses are reduced by 47%.

SELECT count(*), round(100*avg(_gzip_save) / avg(respBodySize)) as percent
FROM [httparchive:runs.2014_09_01_requests]
WHERE ( regexp_match(url, r'\.(ico)$') OR regexp_match(url, r'\.(ico)\?') )
      AND not resp_content_encoding contains "gzip"

SELECT count(*), round(100*avg(_gzip_save) / avg(respBodySize)) as percent
FROM [httparchive:runs.2014_09_01_requests]
WHERE ( regexp_match(lower(resp_content_type), r'^(image/svg\+xml)\s*$') OR regexp_match(lower(resp_content_type), r'^(image/svg\+xml)\s*;') )
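The rule of thumb and its two exceptions can be written down as a small check. This is only a sketch in Python: the function names are hypothetical, and returning None for non-image responses is an assumption meaning "this rule has no opinion". The second helper mirrors the "percent" expression in the queries.

```python
from urllib.parse import urlsplit


def bare_type(content_type):
    """Lowercased Content-Type without parameters; empty string if the header is missing."""
    return (content_type or "").split(";", 1)[0].strip().lower()


def gzip_image_by_default(url, content_type):
    """Apply the rule described above.

    True  -> '.ico' URL or 'image/svg+xml' response (the two exceptions)
    False -> any other 'image/*' response
    None  -> not an image, so this rule does not apply (assumption)
    """
    ctype = bare_type(content_type)
    path = urlsplit(url).path.lower()  # the query string is not part of .path
    if path.endswith(".ico") or ctype == "image/svg+xml":
        return True
    if ctype.startswith("image/"):
        return False
    return None


def gzip_save_percent(gzip_saves, body_sizes):
    """Same arithmetic as round(100*avg(_gzip_save) / avg(respBodySize))."""
    avg_save = sum(gzip_saves) / len(gzip_saves)
    avg_size = sum(body_sizes) / len(body_sizes)
    return round(100 * avg_save / avg_size)
```

For example, gzip_image_by_default("http://example.com/favicon.ico?v=2", "image/x-icon") comes out True, while a URL ending in ".png" served as "image/png" comes out False.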

@stevesoudersorg awesome analysis - kudos. I’ve been digging into font compression data as well… just need to tidy up the data and publish the results. That said, a quick note:

Examining the 460K .eot responses in the HTTP Archive shows that the website owner specifically gzipped them 79% of the time. Therefore, gzipping EOT fonts is a good default.

I think it’s actually much worse… To start, I’d recommend excluding font CDNs from the above analysis. Google Fonts + Typekit have their act together, and due to their popularity drive the compression numbers up. But, the moment you start looking at self-hosted solutions, you quickly realize that we have a lot of room for optimization - e.g. it seems like very few servers are configured to gzip “application/”.

As an experiment, I pulled out raw fontobject (eot) URLs, downloaded the files, compressed them locally with gzip and got the following results:

tl;dr… Enabling gzip for “application/” results in:

  • 1kb+ savings for 43.06% of requests
  • 5kb+ savings for 42.98% of requests
  • 10kb+ savings for 29.14% of requests
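An experiment of this kind could be reproduced roughly as follows. This is a Python sketch, not the script used for the numbers above. It assumes the font bodies are already downloaded as bytes, that 1kb means 1024 bytes, and that gzip runs at compression level 6. The function names are hypothetical.

```python
import gzip

KB = 1024  # assumption: "1kb" is taken as 1024 bytes


def gzip_savings(body, level=6):
    """Bytes saved by gzipping `body`; negative if gzip makes it larger.

    Level 6 is an assumption; the original experiment does not state one.
    """
    return len(body) - len(gzip.compress(body, compresslevel=level))


def share_saving_at_least(savings, thresholds_kb=(1, 5, 10)):
    """Percentage of responses whose savings reach each threshold (in KB)."""
    total = len(savings)
    return {
        kb: 100.0 * sum(1 for s in savings if s >= kb * KB) / total
        for kb in thresholds_kb
    }
```

Feeding the per-file results of gzip_savings into share_saving_at_least gives the same kind of "Nkb+ savings for X% of requests" breakdown as the list above.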

Further, I’m suspicious of WPT “gzip savings” numbers… (well, actually, I don’t trust them at all at the moment):

Pulling up the run in HA shows that we report 0’s for gzip savings for all of those files - hmm? I’ve checked, and it’s not the import… the HAR files report 0’s. Digging further:

Something’s amiss… I think our current HA gzip savings numbers are severely undercounting the actual gains. /cc @pmeenan

Here’s min, max, and avg for “_gzip_save” broken out by Content-Type. We see some Content-Types have all zeroes for _gzip_save. It’s unusual that there wouldn’t be at least one response that had a positive savings from gzip across thousands of responses. I wonder if _gzip_save is skipped for certain Content-Types.

SELECT regexp_extract(resp_content_type, r'^([^;]*)') as content_type, count(*) as num, 
       min(_gzip_save) as min, max(_gzip_save) as max, round(avg(_gzip_save)) as avg
FROM [httparchive:runs.2014_09_01_requests]
WHERE not resp_content_encoding contains "gzip" AND respBodySize > 1024
GROUP BY content_type HAVING num > 1000 ORDER BY max asc
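The same sanity check can be run on exported rows to list the suspicious Content-Types directly. This is a Python sketch with hypothetical function names, assuming each row is a (content_type, gzip_save) pair that already passed the query's WHERE filter.

```python
from collections import defaultdict


def gzip_save_stats(rows, min_num=1000):
    """Group (content_type, gzip_save) rows and return {type: (num, min, max, avg)}.

    Only types with more than `min_num` rows are kept, like HAVING num > 1000.
    """
    grouped = defaultdict(list)
    for ctype, save in rows:
        grouped[ctype].append(save)
    return {
        ctype: (len(saves), min(saves), max(saves), round(sum(saves) / len(saves)))
        for ctype, saves in grouped.items()
        if len(saves) > min_num
    }


def all_zero_types(stats):
    """Content-Types whose largest reported saving is 0, i.e. gzip never helped on paper."""
    return sorted(ctype for ctype, (_, _, max_save, _) in stats.items() if max_save == 0)
```

Any type returned by all_zero_types across thousands of uncompressed responses is a candidate for "_gzip_save was never computed" rather than "gzip never helps".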

Just heard from @pmeenan that he pushed the fix today to HTTP Archive. Thanks Pat!!