Data collection in HTTPArchive

nayanamanas · January 15, 2019, 2:15am

We extracted URLs from the httparchive.runs* tables (as in Adoption of HTTP Security Headers on the Web) to study security headers in HTTP Archive data. During this exercise, we noticed that the number of URLs on 5 specific days (a few months apart) does not follow a consistent trend; for example, the number of URLs in the httparchive.runs* tables does not always increase over time. As shown below, the number of URLs dropped on 2018-12-01 (1262020) compared to the previous data dump on 2018-08-01 (1277461).

date	URLs
2017_07_01	475122
2018_01_01	490241
2018_05_01	504672
2018_08_01	1277461
2018_12_01	1262020

Therefore, we would like to know the following:

Shouldn’t the number of URLs (in the httparchive.runs* tables) increase with time? If not, why?
What procedure is used to extract headers? (e.g., are all redirects followed and the headers taken from the final landing page when multiple redirects are involved in a page load?)

Thanks
Naya

rviscomi · January 15, 2019, 6:19pm

See Changes to the HTTP Archive corpus for more info on when and why the number of URLs changed.

There is a firstHtml boolean field in the summary_requests dataset that indicates whether the request returns the root document of the page, after redirects. You can use it like this:

SELECT
  respOtherHeaders
FROM
  `httparchive.latest.summary_requests_desktop`
WHERE
  firstHtml

Or if you need access to the raw JSON request object, you can join the tables together:

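SELECT
  page,
  url,
  -- Raw response headers from the JSON request object
  JSON_EXTRACT(payload, '$._headers.response') AS all_response_headers
FROM
  `httparchive.latest.summary_requests_desktop`
JOIN
  `httparchive.latest.requests_desktop`
USING
  (url)
WHERE
  -- Keep only the root document of each page, after redirects
  firstHtml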
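
For the security-header use case, here is a minimal sketch that builds on the first query above and counts how many root documents carry a given header. It assumes that respOtherHeaders is a plain string containing header names and values, and that Strict-Transport-Security ends up in that field rather than in a dedicated column. Both are assumptions worth checking against the table schema before relying on the numbers.

```sql
-- Sketch: count root documents whose response includes an HSTS header.
-- Assumption: respOtherHeaders is a STRING that contains the header name.
SELECT
  COUNT(*) AS root_documents,
  COUNTIF(REGEXP_CONTAINS(LOWER(respOtherHeaders), r'strict-transport-security')) AS with_hsts
FROM
  `httparchive.latest.summary_requests_desktop`
WHERE
  firstHtml
```

Lowercasing the field before matching keeps the check case-insensitive, since header names may appear in any case. Rows where respOtherHeaders is NULL are not counted in with_hsts.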
