HPACK is the header compression scheme used with HTTP/2. RFC 7541 defines a "static table", shared by all implementations, which encodes 61 "common" HTTP header entries; that selection was based on measurements taken during the design process.
I am wondering whether the picture has changed with today's web. This could be analysed by listing HTTP header usage by frequency, then deriving a compression ratio in a subsequent analysis step. Other topics take a deep dive into particular headers, but I'm interested in a broad look.
Some ideas for queries that might expose the popularity of request and response header fields (separate tables for request and response):

- Count unique key-value pairs (e.g. `Accept-Encoding: gzip` and `Accept-Encoding: br,gzip` count as different items).
- Count the frequency of header field keys.
- Count the frequency of header field values.
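As a concrete sketch of the counting I have in mind (function name mine, assuming the HAR layout where each entry has `request.headers = [{name, value}, ...]`):

```javascript
// Illustrative sketch: count header field keys and unique key-value pairs
// across a list of HAR entries.
function countHeaders(entries) {
  const nameFreq = {};   // frequency of header field keys
  const pairFreq = {};   // frequency of unique key-value pairs
  for (const entry of entries) {
    for (const h of entry.request.headers) {
      const name = h.name.toLowerCase();
      // "accept-encoding: gzip" and "accept-encoding: br,gzip" are
      // intentionally counted as different items here.
      const pair = name + ': ' + h.value;
      nameFreq[name] = (nameFreq[name] || 0) + 1;
      pairFreq[pair] = (pairFreq[pair] || 0) + 1;
    }
  }
  return { nameFreq, pairFreq };
}
```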
I’m not familiar with BigQuery and was hoping someone might be kind enough to get me off to a start.
Hi @LPardue. It is possible to extract this data by analyzing the HAR files for all HTTP requests/responses. The SQL for this is a bit tricky, and uses a user-defined function in JavaScript to parse the JSON from the HAR files. Here's the query, and I'll explain how it works below -
CREATE TEMPORARY FUNCTION getHeaderNames(payload STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
try {
  var $ = JSON.parse(payload);
  var headers = $.request.headers;
  var headernames = [];
  for (var i in headers) {
    // header values commented out due to memory constraints
    headernames[i] = headers[i].name; // + ': ' + headers[i].value;
  }
  return headernames;
} catch (e) {
  // return an empty array for rows whose payload fails to parse
  return [];
}
""";
SELECT headername,COUNT(*) freq
FROM (
SELECT getHeaderNames(lower(payload)) headernames
FROM `httparchive.har.2017_11_15_chrome_requests`
)
CROSS JOIN
UNNEST(headernames) AS headername
GROUP BY headername
ORDER BY freq DESC
This query processes 170GB of data (so ~17% of the monthly free tier quota). Here's how it works:
The user-defined function at the beginning of the query extracts all HTTP request header names from the HAR JSON, returning an array of header names for each request. The SQL query calls the function, then unnests the array so each header name can be grouped and counted.
This ran against 48,452,989 HTTP requests, which is the total number of requests in the httparchive.har.2017_11_15_chrome_requests table.
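To make the UDF's behavior concrete, here is a stand-alone sketch of what it does for a single row (the payload below is a hypothetical, heavily trimmed HAR object, not real Archive data):

```javascript
// Stand-alone reproduction of the UDF body for one row.
function getHeaderNames(payload) {
  try {
    const har = JSON.parse(payload);
    return har.request.headers.map(h => h.name);
  } catch (e) {
    return []; // unparseable payloads contribute nothing
  }
}

// A hypothetical, heavily trimmed HAR request payload:
const payload = JSON.stringify({
  request: {
    headers: [
      { name: 'user-agent', value: 'example' },
      { name: 'accept-encoding', value: 'gzip' }
    ]
  }
});
// getHeaderNames(payload) → ['user-agent', 'accept-encoding']
```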
The top 11 results below account for 99% of the request headers -
The long tail on this contains lots of GET and POST headers, but you can see the top 50 request headers below:
| headername | freq |
| --- | ---: |
| user-agent | 48452783 |
| accept-encoding | 47627263 |
| accept-language | 47543174 |
| accept | 47496982 |
| referer | 45986708 |
| host | 32229273 |
| connection | 32226439 |
| cookie | 23433939 |
| origin | 3735390 |
| upgrade-insecure-requests | 2774537 |
| content-type | 778686 |
| content-length | 625993 |
| x-requested-with | 306126 |
| intervention | 192905 |
| cache-control | 186065 |
| purpose | 75940 |
| range | 75114 |
| access-control-request-method | 72488 |
| access-control-request-headers | 71987 |
| x-youtube-client-version | 66288 |
| x-youtube-client-name | 66288 |
| if-modified-since | 56419 |
| get /ibs | 49092 |
| get /getuid?https | 47859 |
| if-none-match | 44268 |
| pe-token | 42908 |
| x-goog-visitor-id | 42021 |
| pragma | 34946 |
| sec-websocket-version | 32782 |
| upgrade | 32782 |
| sec-websocket-extensions | 32779 |
| sec-websocket-key | 32779 |
| if-range | 24343 |
| get /demconf.jpg?et | 20505 |
| dpr | 19495 |
| viewport-width | 17104 |
| get /css?family=open+sans | 16956 |
| get /getuid?http | 11547 |
| x-sumo-auth | 8847 |
| x-newrelic-id | 8431 |
| get /css?family=roboto | 7578 |
| authorization | 6678 |
| x-csrf-token | 6314 |
| get /css?family=lato | 5664 |
| get /entry/image/http | 5276 |
| get /sync/img?redir=https | 3863 |
| get /adserver/pug?vcode=bz0yjnr5cgu9mszjb2rlptm2miz0bd00mziwma==&piggybackcookie=uid | 3386 |
| get /bh/rtset?pid=561498&ev=1&rurl=http | 3246 |
| get /pm_match?https | 3199 |
| get /sync/img?redir=http | 3184 |
That query can also be modified to look at HTTP response headers. For example, the top 11 response headers below account for 64% of all response headers.
The top 50 of these account for 95% of all HTTP response headers. Here’s a list of the top 100 from the results -
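For reference, the only substantive change the UDF needs for the response side is which object it reads from - a sketch (function name mine, same assumed HAR shape):

```javascript
// Response-side variant of the UDF body: identical to the request version
// except that it reads $.response.headers instead of $.request.headers.
function getResponseHeaderNames(payload) {
  try {
    const har = JSON.parse(payload);
    return har.response.headers.map(h => h.name);
  } catch (e) {
    return [];
  }
}
```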
Thank you! This is way beyond anything I had initially hoped for.
I notice that you comment out header values due to memory constraints. Is this a blocker to processing the data, or just something that requires suitable resource scaling? (Asking to understand what processing I might need to do, not requesting you to run it again).
The values are of interest because the HPACK static table doesn’t include many values. Anecdotally, there have been comments that a revised table that included more values could be useful.
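For context, a handful of the RFC 7541 static table entries do carry values. Here is a small subset from memory (Appendix A of the RFC has the full 61 entries), with a trivial exact-match check to illustrate what a "static table hit" on a name/value pair would mean:

```javascript
// A few RFC 7541 static table entries that include values (subset only;
// see Appendix A of the RFC for the full table).
const staticPairs = new Set([
  ':method: GET',
  ':method: POST',
  ':status: 200',
  ':status: 404',
  'accept-encoding: gzip, deflate'
]);

// Does an observed "name: value" pair exactly match a static table entry?
function hitsStaticTable(pair) {
  return staticPairs.has(pair);
}
```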
When I was writing this query, I was trying to get the name/value pairs. I ran a test on a smaller table containing only 1000 requests, to refine my query w/o wasting too much data processing time/quota. So I knew that the query would run - but when I ran it against the larger dataset I got the following error -
I suspect that the number of unique name/value pairs in the data caused the UNNEST function to hit a memory limit. Since it's doubtful that this will be of much value in aggregate, it might be best to take the analysis for the values out of BigQuery.
The subquery from my example returned 48 million arrays with 27GB of data.
SELECT getHeaderNames(lower(payload)) headernames
FROM `httparchive.har.2017_11_15_chrome_requests`
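If the value analysis does move out of BigQuery, a streaming tally over the exported arrays would avoid materializing everything at once - a hypothetical sketch (function names mine):

```javascript
// Stream exported rows (one array of "name: value" strings per request)
// into a single frequency map, so nothing like a 48M-array UNNEST is needed.
function tallyPairs(rows, pairFreq = new Map()) {
  for (const headers of rows) {
    for (const pair of headers) {
      pairFreq.set(pair, (pairFreq.get(pair) || 0) + 1);
    }
  }
  return pairFreq;
}

// Pull the n most frequent pairs out of the map.
function topN(pairFreq, n) {
  return [...pairFreq.entries()].sort((a, b) => b[1] - a[1]).slice(0, n);
}
```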
Really interesting data… Thanks! I notice several entries in the top 50 are "get ", which presumably means you’re pulling in part of the request line here. We’d probably want to filter those out to leave actual headers.
For HTTP/2 and HTTP/QUIC, we’d actually want to include the pseudo-headers (:method, :authority, :path), though I think we can guess likely front-runners here and they’re already in the static table.
Also, would it be possible to do the query in two stages to deal with the memory constraints; first find the 50 most common header names, then go find the ~5 most common values for that header?
Also worth noting that these are first views, and if-modified-since and if-none-match headers will be under-represented.
I tried @MikeBishop’s idea about removing the “get” and “post” entries. I also removed a few others that I expected to contain mostly unique data: host, referer, cookie, origin and content-length. The modified query returned 148,188 unique name/value pairs for request headers. You can see a summary below -
The query I used to remove the headers is below. Since the SQL query is working with an array, and the memory limit was reached while unpacking that array, I got around the issue by filtering out the headers mentioned above in JavaScript while creating the array. With a smaller array to work with, the SQL query was able to sort and unnest it.
CREATE TEMPORARY FUNCTION getHeaderNames(payload STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
try {
  var $ = JSON.parse(payload);
  var headers = $.request.headers;
  var headernames = [];
  var j = 0;
  for (var i in headers) {
    // skip request lines and headers expected to hold mostly unique values
    if (
      headers[i].name.indexOf('get /') === -1
      && headers[i].name.indexOf('post /') === -1
      && headers[i].name.indexOf('host') === -1
      && headers[i].name.indexOf('referer') === -1
      && headers[i].name.indexOf('cookie') === -1
      && headers[i].name.indexOf('origin') === -1
      && headers[i].name.indexOf('content-length') === -1
    ) {
      headernames[j] = headers[i].name + ': ' + headers[i].value;
      j++;
    }
  }
  return headernames;
} catch (e) {
  // return an empty array for rows whose payload fails to parse
  return [];
}
""";
SELECT headername,COUNT(*) freq
FROM (
SELECT getHeaderNames(lower(payload)) headernames
FROM `httparchive.har.2017_11_15_chrome_requests`
)
CROSS JOIN
UNNEST(headernames) AS headername
GROUP BY headername
ORDER BY freq DESC