Pages and Requests later than Apr 2022 in gs://httparchive/downloads/

marty90 · November 15, 2023, 5:15pm

Hello,
I used to download the list of pages and requests from gs://httparchive/downloads/, where there are files like: httparchive_Sep_1_2021_requests.gz
However, there is no data newer than April 2022. Has the data been moved, or is it no longer offered?
Thank you in advance!
M

rviscomi · November 23, 2023, 7:01pm

Hi @marty90, yeah unfortunately HTTP Archive no longer provides downloadable summaries on GCS. Would you be able to tell us a bit more about your use case?

All of the data is still available on BigQuery, so as a workaround, you can export the data manually.

Note that this may be an expensive operation, given the size of the tables.

1. Go to https://console.cloud.google.com/bigquery
2. Navigate to a summary table like httparchive.summary_pages.2023_10_01_desktop
3. Select “Export to GCS” and select your project’s own GCS bucket

Once it’s on GCS you should be able to process the data like before.
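
The console steps above could also be scripted. This is only a minimal sketch, assuming the `google-cloud-bigquery` Python client is installed and authenticated. The bucket `my-bucket`, the project `my-project`, and the helper names `destination_uri` and `export_table` are hypothetical placeholders, not anything HTTP Archive provides. CSV is assumed here to resemble the old downloads; it only works for tables without nested or repeated columns. The wildcard in the URI lets BigQuery split a large table across several files.

```python
def destination_uri(bucket: str, table_id: str) -> str:
    """Build a wildcard GCS URI so a large export can be split into shards."""
    # e.g. httparchive.summary_pages.2023_10_01_desktop -> 2023_10_01_desktop
    table_name = table_id.split(".")[-1]
    return f"gs://{bucket}/httparchive/{table_name}-*.csv.gz"


def export_table(table_id: str, bucket: str, billing_project: str) -> None:
    """Export one BigQuery table to gzip-compressed CSV shards on GCS."""
    # Imported lazily so the URI helper works without the client installed.
    from google.cloud import bigquery

    # The export job is billed to your own project (hypothetical name below).
    client = bigquery.Client(project=billing_project)
    job_config = bigquery.ExtractJobConfig(
        destination_format=bigquery.DestinationFormat.CSV,
        compression=bigquery.Compression.GZIP,
    )
    job = client.extract_table(
        table_id,
        destination_uri(bucket, table_id),
        job_config=job_config,
    )
    job.result()  # Block until the export job finishes.


# Example call (placeholders, not run here):
# export_table("httparchive.summary_pages.2023_10_01_desktop",
#              bucket="my-bucket", billing_project="my-project")
```

The bucket is assumed to be one you own, in the same location as the dataset.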

marty90 · November 23, 2023, 9:26pm

Dear @rviscomi thank you for your help.
Actually, we are working on studying the pervasiveness of Consent Management Platforms (e.g., Iubenda) over time.
So having the summary of pages and requests is fine for us.
I can try to export the tables using BigQuery.
But is it a paid service? Is there any free option to get the data?
Thank You
Ciao!
M

rviscomi · November 25, 2023, 8:32pm

GCP does offer a free tier which may or may not be enough for what you’re trying to do.

If you’re comfortable exploring the data directly on BigQuery, you can try some of these techniques to minimize query costs.

For example, the free tier of BigQuery includes 1 TB/month and this query consumes only 40 MB:

-- Top CMPs used on the 100k most popular websites' home pages
SELECT
  t.technology,
  COUNT(0) AS sites
FROM
  `httparchive.all.pages`,
  UNNEST(technologies) AS t,
  UNNEST(t.categories) AS category
WHERE
  -- Only the October 2023 dataset
  date = '2023-10-01' AND
  -- Only mobile pages
  client = 'mobile' AND
  -- Only home pages
  is_root_page AND
  -- Only websites ranked in the top 100k
  rank <= 100000 AND
  category = 'Cookie compliance'
GROUP BY
  technology
ORDER BY
  sites DESC

technology	sites
OneTrust	5,364
Funding Choices	4,720
Didomi	1,128
Cookiebot	1,037
iubenda	404
Conversant Consent Tool	401
Usercentrics	400
CookieYes	356
AdRoll CMP System	241
TrustArc	220
…	…
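
The amount of data a query will scan can be checked before running it, using a dry run. This is a rough sketch, assuming the `google-cloud-bigquery` Python client and a hypothetical billing project; the helper names are made up for illustration. The 1 TB monthly allowance is the figure quoted above, treated here as 10^12 bytes.

```python
FREE_TIER_BYTES = 10**12  # 1 TB/month as quoted above (decimal units assumed)


def free_tier_share(bytes_processed: int) -> float:
    """Fraction of the monthly free allowance that one query would use."""
    return bytes_processed / FREE_TIER_BYTES


def estimate_bytes(sql: str, billing_project: str) -> int:
    """Dry-run a query: BigQuery validates it and reports the bytes it
    would process, without actually running it."""
    # Imported lazily so free_tier_share works without the client installed.
    from google.cloud import bigquery

    client = bigquery.Client(project=billing_project)  # hypothetical project
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)
    return job.total_bytes_processed
```

For the 40 MB query above, `free_tier_share(40 * 10**6)` comes to 0.00004, so roughly 25,000 queries of that size would fit in one month's allowance.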

If you’re only interested in tracking CMP adoption, then the Core Web Vitals Tech Report should get you what you need for free. For example, this dashboard produces results that are effectively the same as the direct query. (The differences are a side effect of joining HTTP Archive with the CrUX dataset.)

Technology	Origins
OneTrust	5,739
Funding Choices	5,139
Didomi	1,204
Cookiebot	1,138
Conversant Consent Tool	457
Usercentrics	428
iubenda	411
CookieYes	353
AdRoll CMP System	255
TrustArc	241
…	…
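
To see how close the two result sets are, the relative differences can be computed from the counts in the two tables above. This is a small illustrative sketch; the variable and function names are made up, and only the ten rows shown are included.

```python
# Counts copied from the two tables above (direct query vs. dashboard).
direct_query = {
    "OneTrust": 5364, "Funding Choices": 4720, "Didomi": 1128,
    "Cookiebot": 1037, "iubenda": 404, "Conversant Consent Tool": 401,
    "Usercentrics": 400, "CookieYes": 356, "AdRoll CMP System": 241,
    "TrustArc": 220,
}
dashboard = {
    "OneTrust": 5739, "Funding Choices": 5139, "Didomi": 1204,
    "Cookiebot": 1138, "iubenda": 411, "Conversant Consent Tool": 457,
    "Usercentrics": 428, "CookieYes": 353, "AdRoll CMP System": 255,
    "TrustArc": 241,
}


def relative_difference(query_count: int, dashboard_count: int) -> float:
    """Dashboard count relative to the direct query count (0.07 means +7%)."""
    return (dashboard_count - query_count) / query_count


# Print one signed percentage per technology.
for tech, count in direct_query.items():
    diff = relative_difference(count, dashboard[tech])
    print(f"{tech:<24} {diff:+.1%}")
```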

Anything more complicated and you’ll probably need to query the data directly on BigQuery. If you need to stay within the free tier, there are some tricks/tradeoffs you can make to optimize the queries for efficiency.

marty90 · November 27, 2023, 8:17am

Thank you.
It is a bit surprising that there is no way to download the data without paying…
We will try the queries you suggested.
Question: you use Wappalyzer to extract the “Technology” list from a website, right?

rviscomi · November 27, 2023, 5:16pm

Yes, although we’re using a somewhat stale copy of the Wappalyzer detections as of their final open source commit a few months ago.

marty90 · November 27, 2023, 5:56pm

Perfect, thank you, we will try using SQL queries on Google Cloud!
