HPACK is the header compression scheme used with HTTP/2. RFC 7541 defines a "static table", shared by all implementations, which encodes 61 "common" HTTP header entries; that selection was based on measurements taken during the design process.
I am wondering whether the picture has changed with today's web. This could be analysed by listing HTTP header usage by frequency, then deriving a compression ratio in a subsequent analysis step. Other topics take a deep dive into particular headers, but I'm interested in a broad look.
Some ideas for queries that might expose the popularity of request and response header fields (separate tables for request and response):

- Count unique key-value pairs (e.g. `Accept-Encoding: gzip` and `Accept-Encoding: br,gzip` count as different items).
- Count the frequency of header field keys.
- Count the frequency of header field values.
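As a concrete sketch of the counting I have in mind (function name mine, assuming the HAR layout where each entry has `request.headers = [{name, value}, ...]`):

```javascript
// Illustrative sketch: count header field keys and unique key-value pairs
// across a list of HAR entries.
function countHeaders(entries) {
  const nameFreq = {};   // frequency of header field keys
  const pairFreq = {};   // frequency of unique key-value pairs
  for (const entry of entries) {
    for (const h of entry.request.headers) {
      const name = h.name.toLowerCase();
      // "accept-encoding: gzip" and "accept-encoding: br,gzip" are
      // intentionally counted as different items here.
      const pair = name + ': ' + h.value;
      nameFreq[name] = (nameFreq[name] || 0) + 1;
      pairFreq[pair] = (pairFreq[pair] || 0) + 1;
    }
  }
  return { nameFreq, pairFreq };
}
```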
I’m not familiar with BigQuery and was hoping someone might be kind enough to get me off to a start.
Hi @LPardue. It is possible to extract this data by analyzing the HAR files for all HTTP requests/responses. The SQL for this is a bit tricky, and uses a user-defined function in JavaScript to parse the JSON from the HAR files. Here's the query, and I'll explain how it works below -
CREATE TEMPORARY FUNCTION getHeaderNames(payload STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
try {
  var $ = JSON.parse(payload);
  var headers = $.request.headers;
  var headernames = [];
  for (var i in headers) {
    // header values commented out due to memory constraints
    headernames[i] = headers[i].name; // + ': ' + headers[i].value;
  }
  return headernames;
} catch (e) {
  // return an empty array for rows whose payload fails to parse
  return [];
}
""";
SELECT headername,COUNT(*) freq
FROM (
SELECT getHeaderNames(lower(payload)) headernames
FROM `httparchive.har.2017_11_15_chrome_requests`
)
CROSS JOIN
UNNEST(headernames) AS headername
GROUP BY headername
ORDER BY freq DESC
This query processes 170GB of data (so ~17% of the monthly free tier quota). Here's how it works:
The user-defined function at the beginning of the query extracts all HTTP request header names from the HAR JSON, returning an array of header names for each request. The SQL query calls the function, then unnests the array so each header name can be grouped and counted.
This ran against 48,452,989 HTTP requests, which is the total number of requests in the httparchive.har.2017_11_15_chrome_requests table.
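To make the UDF's behavior concrete, here is a stand-alone sketch of what it does for a single row (the payload below is a hypothetical, heavily trimmed HAR object, not real Archive data):

```javascript
// Stand-alone reproduction of the UDF body for one row.
function getHeaderNames(payload) {
  try {
    const har = JSON.parse(payload);
    return har.request.headers.map(h => h.name);
  } catch (e) {
    return []; // unparseable payloads contribute nothing
  }
}

// A hypothetical, heavily trimmed HAR request payload:
const payload = JSON.stringify({
  request: {
    headers: [
      { name: 'user-agent', value: 'example' },
      { name: 'accept-encoding', value: 'gzip' }
    ]
  }
});
// getHeaderNames(payload) → ['user-agent', 'accept-encoding']
```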
The top 11 results below account for 99% of the request headers -
The long tail on this contains lots of GET and POST headers, but you can see the top 50 request headers below:
| headername | freq |
| --- | ---: |
| user-agent | 48452783 |
| accept-encoding | 47627263 |
| accept-language | 47543174 |
| accept | 47496982 |
| referer | 45986708 |
| host | 32229273 |
| connection | 32226439 |
| cookie | 23433939 |
| origin | 3735390 |
| upgrade-insecure-requests | 2774537 |
| content-type | 778686 |
| content-length | 625993 |
| x-requested-with | 306126 |
| intervention | 192905 |
| cache-control | 186065 |
| purpose | 75940 |
| range | 75114 |
| access-control-request-method | 72488 |
| access-control-request-headers | 71987 |
| x-youtube-client-version | 66288 |
| x-youtube-client-name | 66288 |
| if-modified-since | 56419 |
| get /ibs | 49092 |
| get /getuid?https | 47859 |
| if-none-match | 44268 |
| pe-token | 42908 |
| x-goog-visitor-id | 42021 |
| pragma | 34946 |
| sec-websocket-version | 32782 |
| upgrade | 32782 |
| sec-websocket-extensions | 32779 |
| sec-websocket-key | 32779 |
| if-range | 24343 |
| get /demconf.jpg?et | 20505 |
| dpr | 19495 |
| viewport-width | 17104 |
| get /css?family=open+sans | 16956 |
| get /getuid?http | 11547 |
| x-sumo-auth | 8847 |
| x-newrelic-id | 8431 |
| get /css?family=roboto | 7578 |
| authorization | 6678 |
| x-csrf-token | 6314 |
| get /css?family=lato | 5664 |
| get /entry/image/http | 5276 |
| get /sync/img?redir=https | 3863 |
| get /adserver/pug?vcode=bz0yjnr5cgu9mszjb2rlptm2miz0bd00mziwma==&piggybackcookie=uid | 3386 |
| get /bh/rtset?pid=561498&ev=1&rurl=http | 3246 |
| get /pm_match?https | 3199 |
| get /sync/img?redir=http | 3184 |
That query can also be modified to look at HTTP response headers. For example, the top 11 response headers below account for 64% of all response headers.
The top 50 of these account for 95% of all HTTP response headers. Here’s a list of the top 100 from the results -
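For reference, the only substantive change the UDF needs for the response side is which object it reads from - a sketch (function name mine, same assumed HAR shape):

```javascript
// Response-side variant of the UDF body: identical to the request version
// except that it reads $.response.headers instead of $.request.headers.
function getResponseHeaderNames(payload) {
  try {
    const har = JSON.parse(payload);
    return har.response.headers.map(h => h.name);
  } catch (e) {
    return [];
  }
}
```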
Thank you! This is way beyond anything I had initially hoped for.
I notice that you comment out header values due to memory constraints. Is this a blocker to processing the data, or just something that requires suitable resource scaling? (Asking to understand what processing I might need to do, not requesting you to run it again).
The values are of interest because the HPACK static table doesn’t include many values. Anecdotally, there have been comments that a revised table that included more values could be useful.
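For context, a handful of the RFC 7541 static table entries do carry values. Here is a small subset from memory (Appendix A of the RFC has the full 61 entries), with a trivial exact-match check to illustrate what a "static table hit" on a name/value pair would mean:

```javascript
// A few RFC 7541 static table entries that include values (subset only;
// see Appendix A of the RFC for the full table).
const staticPairs = new Set([
  ':method: GET',
  ':method: POST',
  ':status: 200',
  ':status: 404',
  'accept-encoding: gzip, deflate'
]);

// Does an observed "name: value" pair exactly match a static table entry?
function hitsStaticTable(pair) {
  return staticPairs.has(pair);
}
```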
When I was writing this query, I was trying to get the name/value pairs. I ran a test on a smaller table containing only 1000 requests, to refine my query w/o wasting too much data processing time/quota. So I knew that the query would run - but when I ran it against the larger dataset I got the following error -
I suspect that the number of unique name/value pairs in the data caused the UNNEST function to hit a memory limit. Since it's doubtful that this will be of much value in aggregate, it might be best to take the analysis for the values out of BigQuery.
The subquery from my example returned 48 million arrays with 27GB of data.
SELECT getHeaderNames(lower(payload)) headernames
FROM `httparchive.har.2017_11_15_chrome_requests`
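If the value analysis does move out of BigQuery, a streaming tally over the exported arrays would avoid materializing everything at once - a hypothetical sketch (function names mine):

```javascript
// Stream exported rows (one array of "name: value" strings per request)
// into a single frequency map, so nothing like a 48M-array UNNEST is needed.
function tallyPairs(rows, pairFreq = new Map()) {
  for (const headers of rows) {
    for (const pair of headers) {
      pairFreq.set(pair, (pairFreq.get(pair) || 0) + 1);
    }
  }
  return pairFreq;
}

// Pull the n most frequent pairs out of the map.
function topN(pairFreq, n) {
  return [...pairFreq.entries()].sort((a, b) => b[1] - a[1]).slice(0, n);
}
```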
Really interesting data… Thanks! I notice several entries in the top 50 are "get ", which presumably means you’re pulling in part of the request line here. We’d probably want to filter those out to leave actual headers.
For HTTP/2 and HTTP/QUIC, we’d actually want to include the pseudo-headers (:method, :authority, :path), though I think we can guess likely front-runners here and they’re already in the static table.
Also, would it be possible to do the query in two stages to deal with the memory constraints; first find the 50 most common header names, then go find the ~5 most common values for that header?
Also worth noting that these are first views, and if-modified-since and if-none-match headers will be under-represented.
I tried @MikeBishop’s idea about removing the “get” and “post” entries. I also removed a few others that I expected to contain mostly unique data: host, referer, cookie, origin and content-length. The modified query returned 148,188 unique name/value pairs for request headers. You can see a summary below -
The query I used to remove the headers is below. Since the SQL query is working with an array, and the memory limit was reached while unpacking that array, I got around the issue by filtering out the headers mentioned above in JavaScript while creating the array. With a smaller array to work with, the SQL query was able to sort and unnest it.
CREATE TEMPORARY FUNCTION getHeaderNames(payload STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
try {
  var $ = JSON.parse(payload);
  var headers = $.request.headers;
  var headernames = [];
  var j = 0;
  for (var i in headers) {
    // skip request lines and headers expected to hold mostly unique values
    if (
      headers[i].name.indexOf('get /') === -1
      && headers[i].name.indexOf('post /') === -1
      && headers[i].name.indexOf('host') === -1
      && headers[i].name.indexOf('referer') === -1
      && headers[i].name.indexOf('cookie') === -1
      && headers[i].name.indexOf('origin') === -1
      && headers[i].name.indexOf('content-length') === -1
    ) {
      headernames[j] = headers[i].name + ': ' + headers[i].value;
      j++;
    }
  }
  return headernames;
} catch (e) {
  // return an empty array for rows whose payload fails to parse
  return [];
}
""";
SELECT headername,COUNT(*) freq
FROM (
SELECT getHeaderNames(lower(payload)) headernames
FROM `httparchive.har.2017_11_15_chrome_requests`
)
CROSS JOIN
UNNEST(headernames) AS headername
GROUP BY headername
ORDER BY freq DESC