Data collection in HTTPArchive


#1

We extracted URLs from the httparchive.runs* tables (as in Adoption of HTTP Security Headers on the Web) to study security headers in HTTPArchive data. During this exercise, we noticed that the number of URLs on 5 specific dates (a few months apart) does not follow a consistent trend: the number of URLs in the httparchive.runs* tables does not always increase over time. As shown below, the number of URLs dropped to 1262020 on 2018_12_01, compared to 1277461 in the previous data dump on 2018_08_01.

date URLs
2017_07_01 475122
2018_01_01 490241
2018_05_01 504672
2018_08_01 1277461
2018_12_01 1262020

Therefore, we would like to know the following:

  • Shouldn’t the number of URLs (in the httparchive.runs* tables) increase with time? If not, why?
  • What procedure is used to extract headers? (e.g., are all redirects followed and the headers of the final landing page used when a page load involves multiple redirects?)

Thanks
Naya


#2

See Changes to the HTTP Archive corpus for more info on when and why the number of URLs changed.

There is a firstHtml boolean field in the summary_requests dataset that indicates whether the request returns the root document of the page, after redirects. You can use it like this:

SELECT
  respOtherHeaders
FROM
  `httparchive.latest.summary_requests_desktop`
WHERE
  firstHtml
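
As a rough sketch of how this could feed a security-header count, the query below tallies root documents whose respOtherHeaders mentions Strict-Transport-Security. It assumes respOtherHeaders holds the remaining response headers as text, header names included, which is worth checking against a few rows first. The header name is only an example, and the output column names are made up for illustration:

```sql
SELECT
  -- Root documents whose other response headers mention HSTS by name (assumed format).
  COUNTIF(REGEXP_CONTAINS(LOWER(respOtherHeaders), r'strict-transport-security')) AS pages_with_hsts,
  -- All root documents in the latest desktop crawl.
  COUNT(0) AS total_pages
FROM
  `httparchive.latest.summary_requests_desktop`
WHERE
  firstHtml
```

Swapping the pattern for another header name (for example x-frame-options) should give the equivalent count for that header, under the same assumption.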

Or if you need access to the raw JSON request object, you can join the tables together:

SELECT
  page,
  url,
  JSON_EXTRACT(payload, '$._headers.response') AS all_response_headers
FROM
  `httparchive.latest.summary_requests_desktop`
JOIN
  `httparchive.latest.requests_desktop`
USING
  (url)
WHERE
  firstHtml