Working with CSV dumps

mnot · January 8, 2020, 6:53am

Hey,

I’ve been processing the CSV dump of the archive for a little while, to do things that I can’t do in BigQuery (e.g., parsing response headers).

I found the dump on https://legacy.httparchive.org/downloads.php

However, I’ve noticed a couple of issues.

First, it looks like headers (e.g., Cache-Control) are concatenated if there are multiple instances, without inserting a comma. This makes them syntactically invalid, so it’s difficult to judge how prevalent correct headers are on the Web.

To give an example, a header like:

Cache-Control: foo
Cache-Control: bar

will show up in resp_cache_control as “foo bar”, not “foo, bar”.

Second, fields like resp_keep_alive don’t seem to follow conventions used elsewhere; while all other resp_ fields are the value of the corresponding header, resp_keep_alive seems to be a date (and its not clear what its semantic is), or the letter “N” (instead of “\N” which is used elsewhere to indicate null).

I realise it’s probably too late to change the schema, but could we have a bit of documentation as to what each field carries?

rviscomi · January 8, 2020, 7:10am

Hey @mnot! These definitely do sound like usability issues with the CSV data. FWIW the HAR data should contain the headers in a more structured way and parsing them in BigQuery isn’t too difficult. I have lots of examples of this in the Web Almanac queries, especially for the headers analyzed in the Security chapter:

github.com

HTTPArchive/almanac.httparchive.org/blob/master/sql/2019/08_Security/08_29.sql

#standardSQL
# 08_29: Groupings of "x-content-type-options" values
CREATE TEMPORARY FUNCTION extractHeader(payload STRING, name STRING)
RETURNS STRING LANGUAGE js AS '''
try {
  var $ = JSON.parse(payload);
  var header = $._headers.response.find(h => h.toLowerCase().startsWith(name.toLowerCase()));
  if (!header) {
    return null;
  }
  return header.substr(header.indexOf(':') + 1).trim();
} catch (e) {
  return null;
}
''';

SELECT
  client,
  NORMALIZE_AND_CASEFOLD(extractHeader(payload, 'X-Content-Type-Options')) AS value,
  COUNT(0) as freq,

This file has been truncated. show original

Does that help?

rviscomi · January 8, 2020, 7:16am

Stepping back from the tactical side of your question, I’d love to hear more about your use case and see if there’s anything we can do to alleviate the need for querying all together. The Almanac itself is trying to answer the web’s common questions and if you’re finding yourself constantly analyzing the same type of thing, maybe we can do a better job of surfacing that as first class stats on httparchive.org. Similarly, we’ve also built special purpose dashboards like the one on Chrome Status to make feature counter data more accessible. How can we make this easier on you?

mnot · January 8, 2020, 10:21am

Hey!

My use case is that I want to feed headers from the wild into the parsing algorithms for Structured Headers (specifically, my Python implementation) to see how often common headers can successfully be handled that way.

Where can I get the HARs?

charlie.clark · January 8, 2020, 10:40am

Hi Mark,

The HARs are available for each test from webpagetest.org. If you want
some Python code for getting them then you might be able to make use of
the port I did of the website to Python. If you’re planning a massive
grab then you might want to check with Pat Meenan that you don’t get
rate limited as I think the HARs have to come from storage each time.

Sample code:

https://bitbucket.org/charlie_x/python-httparchive/src/default/httparchive/views/harviewer.py

Charlie

patmeenan · January 8, 2020, 1:27pm

The HARs in queryable form are in bigquery. Those get REALLY expensive to query though.

The requests tables also have the raw headers in JSON format (extracted from the HARs):

"response":{
    "status":200,
    "statusText":"",
    "headersSize":586,
    "bodySize":1736,
    "headers":[
    {
        "name":"status",
        "value":"200"
    },
    {
        "name":"expect-ct",
      "value":"max-age=604800, report-uri=\"https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct\""
   },
   {
      "name":"strict-transport-security",
      "value":"max-age=31536000"
   },
   {
      "name":"content-encoding",
      "value":"br"
   },
   {
      "name":"cf-cache-status",
      "value":"HIT"
   },
   {
      "name":"expires",
      "value":"Tue, 12 Nov 2019 09:19:05 GMT"
   },
   {
      "name":"vary",
      "value":"Accept-Encoding"
...

and also (duplicated):

"_headers":{
   "request":[
      "Referer: https://fanatec.com/us-en",
      "User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36 PTST/191105.201111",
      ":method: GET",
      ":authority: fanatec.com",
      ":scheme: https",
      ":path: /themes/Frontend/Fanatec/frontend/_public/src/img/endor-logo_white.svg",
      "user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36 PTST/191105.201111",
      "accept: image/webp,image/apng,image/*,*/*;q=0.8",
      "sec-fetch-site: same-origin",
      "sec-fetch-mode: no-cors",
      "referer: https://fanatec.com/us-en",
      "accept-encoding: gzip, deflate, br",
      "accept-language: en-US,en;q=0.9",
      "cookie: [128 bytes were stripped]"
   ],
   "response":[
      "status: 200",
      "expect-ct: max-age=604800, report-uri=\"https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct\"",
      "strict-transport-security: max-age=31536000",
      "content-encoding: br",
      "cf-cache-status: HIT",
      "expires: Tue, 12 Nov 2019 09:19:05 GMT",
      "vary: Accept-Encoding",
      "server: cloudflare",
      "last-modified: Tue, 05 Nov 2019 05:19:35 GMT",
      "etag: W/\"5dc10667-12de\"",
      "cache-control: public, max-age=604800, must-revalidate, proxy-revalidate",
      "date: Wed, 06 Nov 2019 15:53:50 GMT",
      "cf-ray: 53183e17af5c6d82-SJC",
      "alt-svc: h3-23=\":443\"; ma=86400",
      "content-type: image/svg+xml",
      "age: 110085",
      ":status: 200"
   ]
},

The cookie/auth headers were stripped by Chrome’s netlog capture and were just fixed so those won’t be accurate but everything else should be (though automated parsing of them in bigquery can be a pain).

mnot · January 8, 2020, 9:15pm

Thanks. From what I can see, the problem regarding combining headers is also in bigQuery; e.g.,

SELECT resp_cache_control
FROM `httparchive.summary_requests.2019_10_01_desktop` 
WHERE url = "http://rpg.lv/sites/default/files/images/skola.jpg"

… and compare with https://redbot.org/?uri=http%3A%2F%2Frpg.lv%2Fsites%2Fdefault%2Ffiles%2Fimages%2Fskola.jpg

(I can’t seem to get the raw HAR from the requests tables… known issue or operator error? I didn’t have much time to look into this before running out of credit on the free tier)

So, this looks like a crawl data munging problem to me…

WRT using the HAR – I need to get it all, so it doesn’t sound practical (or friendly).

patmeenan · January 8, 2020, 10:10pm

The summary tables are from the csv and have the same problem. The requests table has the raw json structure:

SELECT JSON_EXTRACT(payload, '$._headers.response') FROM [httparchive:requests.2019_10_01_desktop] WHERE url = 'http://rpg.lv/sites/default/files/images/skola.jpg'

Produces the separate cache-control headers as expected:


[
"HTTP/1.1 200 OK",
"Server: nginx/1.2.0",
"Date: Tue, 01 Oct 2019 04:43:26 GMT",
"Content-Type: image/jpeg",
"Content-Length: 35441",
"Connection: keep-alive",
"Last-Modified: Thu, 20 Jun 2019 08:28:57 GMT",
"Expires: Thu, 31 Oct 2019 04:43:26 GMT",
"Cache-Control: max-age=2592000",
"Cache-Control: public",
"Accept-Ranges: bytes"
]

The should be examples for headers-based analysis queries sprinkled around here and from the almanac but my bigquery foo isn’t that strong so I wouldn’t necessarily know how to pull what you are looking for (Rick and Paul have mad BQ skills though).

mnot · January 8, 2020, 10:23pm

Cool, thanks. I don’t think getting the JSON from BQ is workable (if only for budget :), but if the CSV problem could be fixed at some point, that would be very helpful (it isn’t a blocker). If someone can point me to the source for whatever does that munging, I could create a PR.

patmeenan · January 8, 2020, 10:40pm

I’m pretty sure it’s in here but we’re moving away from the legacy php and csv as much as possible and it may not be supported for long.

If there’s a specific format you’re looking for we may be able to help and run the query for you, generating a text or csv file of some kind.

mnot · January 8, 2020, 10:50pm

I think the issue is here –

$hHeaders[$column] . " $value"

should be

$hHeaders[$column] . “, $value”

The current CSV is good for me delta that one change.

patmeenan · January 9, 2020, 1:06pm

Fixed for both the request and response headers and updated the server. We’re in the middle of the January crawl now so it will only be partially fixed. February’s crawl (ready at the end of February) will be the first full crawl with the fix.

If you actually want them as-delivered from the origin (repeating headers, etc) then the json requests data is the only way to get those currently.

mnot · February 27, 2020, 12:25am

I don’t suppose it’d be possible to get a dump of the JSON on the site at some point? I’m hitting quota limits in BQ…

briankardell · February 27, 2020, 12:42am

Keep forgetting to mention this but I had also mentioned to Rick that the json/csv summaries generated for the tool I made for analysing and sharing info about markup would be really handy to have updates on so we could track trends and history… they’re very small…

rviscomi · March 6, 2020, 6:56pm

Yes @briankardell, we definitely still need to do something to make markup analysis easier. I’ve got some ideas and I’ll put a proposal together.

Topic		Replies	Views
Does BigQuery contain HAR Archive or cookies of crawled webpages?	13	2955	June 11, 2021
Help finding list of home pages with specific http response header Analysis	7	924	June 7, 2023
Is HPACK static table fit for today's web?	9	3582	December 12, 2017
Help regenerating lost summary_requests data Analysis	0	758	March 28, 2019
Video walkthrough of getting started with HTTP Archive on BigQuery Meta	3	1970	March 4, 2018

Working with CSV dumps

Related topics