Working with CSV dumps

Hey,

I’ve been processing the CSV dump of the archive for a little while, to do things that I can’t do in BigQuery (e.g., parsing response headers).

I found the dump on https://legacy.httparchive.org/downloads.php

However, I’ve noticed a couple of issues.

First, it looks like headers (e.g., Cache-Control) are concatenated if there are multiple instances, without inserting a comma. This makes them syntactically invalid, so it’s difficult to judge how prevalent correct headers are on the Web.

To give an example, a header like:

Cache-Control: foo
Cache-Control: bar

will show up in resp_cache_control as “foo bar”, not “foo, bar”.

Second, fields like resp_keep_alive don’t seem to follow conventions used elsewhere; while all other resp_ fields are the value of the corresponding header, resp_keep_alive seems to be a date (and its not clear what its semantic is), or the letter “N” (instead of “\N” which is used elsewhere to indicate null).

I realise it’s probably too late to change the schema, but could we have a bit of documentation as to what each field carries?

1 Like

Hey @mnot! These definitely do sound like usability issues with the CSV data. FWIW the HAR data should contain the headers in a more structured way and parsing them in BigQuery isn’t too difficult. I have lots of examples of this in the Web Almanac queries, especially for the headers analyzed in the Security chapter:

Does that help?

Stepping back from the tactical side of your question, I’d love to hear more about your use case and see if there’s anything we can do to alleviate the need for querying all together. The Almanac itself is trying to answer the web’s common questions and if you’re finding yourself constantly analyzing the same type of thing, maybe we can do a better job of surfacing that as first class stats on httparchive.org. Similarly, we’ve also built special purpose dashboards like the one on Chrome Status to make feature counter data more accessible. How can we make this easier on you? :grin:

Hey!

My use case is that I want to feed headers from the wild into the parsing algorithms for Structured Headers (specifically, my Python implementation) to see how often common headers can successfully be handled that way.

Where can I get the HARs?

Hi Mark,

The HARs are available for each test from webpagetest.org. If you want
some Python code for getting them then you might be able to make use of
the port I did of the website to Python. If you’re planning a massive
grab then you might want to check with Pat Meenan that you don’t get
rate limited as I think the HARs have to come from storage each time.

Sample code:

https://bitbucket.org/charlie_x/python-httparchive/src/default/httparchive/views/harviewer.py

Charlie

The HARs in queryable form are in bigquery. Those get REALLY expensive to query though.

The requests tables also have the raw headers in JSON format (extracted from the HARs):

"response":{
    "status":200,
    "statusText":"",
    "headersSize":586,
    "bodySize":1736,
    "headers":[
    {
        "name":"status",
        "value":"200"
    },
    {
        "name":"expect-ct",
      "value":"max-age=604800, report-uri=\"https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct\""
   },
   {
      "name":"strict-transport-security",
      "value":"max-age=31536000"
   },
   {
      "name":"content-encoding",
      "value":"br"
   },
   {
      "name":"cf-cache-status",
      "value":"HIT"
   },
   {
      "name":"expires",
      "value":"Tue, 12 Nov 2019 09:19:05 GMT"
   },
   {
      "name":"vary",
      "value":"Accept-Encoding"
...

and also (duplicated):

"_headers":{
   "request":[
      "Referer: https://fanatec.com/us-en",
      "User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36 PTST/191105.201111",
      ":method: GET",
      ":authority: fanatec.com",
      ":scheme: https",
      ":path: /themes/Frontend/Fanatec/frontend/_public/src/img/endor-logo_white.svg",
      "user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36 PTST/191105.201111",
      "accept: image/webp,image/apng,image/*,*/*;q=0.8",
      "sec-fetch-site: same-origin",
      "sec-fetch-mode: no-cors",
      "referer: https://fanatec.com/us-en",
      "accept-encoding: gzip, deflate, br",
      "accept-language: en-US,en;q=0.9",
      "cookie: [128 bytes were stripped]"
   ],
   "response":[
      "status: 200",
      "expect-ct: max-age=604800, report-uri=\"https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct\"",
      "strict-transport-security: max-age=31536000",
      "content-encoding: br",
      "cf-cache-status: HIT",
      "expires: Tue, 12 Nov 2019 09:19:05 GMT",
      "vary: Accept-Encoding",
      "server: cloudflare",
      "last-modified: Tue, 05 Nov 2019 05:19:35 GMT",
      "etag: W/\"5dc10667-12de\"",
      "cache-control: public, max-age=604800, must-revalidate, proxy-revalidate",
      "date: Wed, 06 Nov 2019 15:53:50 GMT",
      "cf-ray: 53183e17af5c6d82-SJC",
      "alt-svc: h3-23=\":443\"; ma=86400",
      "content-type: image/svg+xml",
      "age: 110085",
      ":status: 200"
   ]
},

The cookie/auth headers were stripped by Chrome’s netlog capture and were just fixed so those won’t be accurate but everything else should be (though automated parsing of them in bigquery can be a pain).

1 Like

Thanks. From what I can see, the problem regarding combining headers is also in bigQuery; e.g.,

SELECT resp_cache_control
FROM `httparchive.summary_requests.2019_10_01_desktop` 
WHERE url = "http://rpg.lv/sites/default/files/images/skola.jpg"

… and compare with https://redbot.org/?uri=http%3A%2F%2Frpg.lv%2Fsites%2Fdefault%2Ffiles%2Fimages%2Fskola.jpg

(I can’t seem to get the raw HAR from the requests tables… known issue or operator error? I didn’t have much time to look into this before running out of credit on the free tier)

So, this looks like a crawl data munging problem to me…

WRT using the HAR – I need to get it all, so it doesn’t sound practical (or friendly).

The summary tables are from the csv and have the same problem. The requests table has the raw json structure:

SELECT JSON_EXTRACT(payload, '$._headers.response') FROM [httparchive:requests.2019_10_01_desktop] WHERE url = 'http://rpg.lv/sites/default/files/images/skola.jpg'

Produces the separate cache-control headers as expected:


[
"HTTP/1.1 200 OK",
"Server: nginx/1.2.0",
"Date: Tue, 01 Oct 2019 04:43:26 GMT",
"Content-Type: image/jpeg",
"Content-Length: 35441",
"Connection: keep-alive",
"Last-Modified: Thu, 20 Jun 2019 08:28:57 GMT",
"Expires: Thu, 31 Oct 2019 04:43:26 GMT",
"Cache-Control: max-age=2592000",
"Cache-Control: public",
"Accept-Ranges: bytes"
]

The should be examples for headers-based analysis queries sprinkled around here and from the almanac but my bigquery foo isn’t that strong so I wouldn’t necessarily know how to pull what you are looking for (Rick and Paul have mad BQ skills though).

Cool, thanks. I don’t think getting the JSON from BQ is workable (if only for budget :), but if the CSV problem could be fixed at some point, that would be very helpful (it isn’t a blocker). If someone can point me to the source for whatever does that munging, I could create a PR.

I’m pretty sure it’s in here but we’re moving away from the legacy php and csv as much as possible and it may not be supported for long.

If there’s a specific format you’re looking for we may be able to help and run the query for you, generating a text or csv file of some kind.

I think the issue is here

$hHeaders[$column] . " $value"

should be

$hHeaders[$column] . “, $value”

The current CSV is good for me delta that one change.

Fixed for both the request and response headers and updated the server. We’re in the middle of the January crawl now so it will only be partially fixed. February’s crawl (ready at the end of February) will be the first full crawl with the fix.

If you actually want them as-delivered from the origin (repeating headers, etc) then the json requests data is the only way to get those currently.

1 Like