Hello,
I used to download the list of pages and requests from gs://httparchive/downloads/ , where there are files like: httparchive_Sep_1_2021_requests.gz
However, there is no data newer than April 2022. Has the data been moved, or is it no longer offered?
Thank you in advance!
M
Hi @marty90, yeah unfortunately HTTP Archive no longer provides downloadable summaries on GCS. Would you be able to tell us a bit more about your use case?
All of the data is still available on BigQuery, so as a workaround, you can export the data manually.
Note that this may be an expensive operation, given the size of the tables.
- Go to https://console.cloud.google.com/bigquery
- Navigate to a summary table like
httparchive.summary_pages.2023_10_01_desktop
- Select “Export to GCS” and select your project’s own GCS bucket
Once it’s on GCS you should be able to process the data like before.
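For anyone who would rather script the export than click through the console, here is a minimal sketch of the same steps. It assumes the `google-cloud-bigquery` Python client is installed and authenticated. The project `my-project` and bucket `my-bucket` are hypothetical placeholders, the `US` location is an assumption about where the dataset is stored, and CSV output assumes the summary table has no nested columns.

```python
def destination_uri(bucket: str, table_id: str) -> str:
    """Build a wildcard GCS URI so a large export can be split into shards."""
    table_name = table_id.split(".")[-1]
    return f"gs://{bucket}/{table_name}/part-*.csv.gz"


def export_table(table_id: str, bucket: str, billing_project: str) -> str:
    """Export a BigQuery table to gzipped CSV files in your own GCS bucket."""
    # Imported here so destination_uri() works without the client installed.
    from google.cloud import bigquery

    client = bigquery.Client(project=billing_project)

    job_config = bigquery.ExtractJobConfig()
    job_config.destination_format = bigquery.DestinationFormat.CSV
    job_config.compression = bigquery.Compression.GZIP

    uri = destination_uri(bucket, table_id)
    # location="US" is an assumption; it must match the dataset's location.
    job = client.extract_table(
        table_id, uri, location="US", job_config=job_config
    )
    job.result()  # Block until the export job finishes.
    return uri


# Example call (hypothetical project and bucket names):
# export_table(
#     "httparchive.summary_pages.2023_10_01_desktop",
#     bucket="my-bucket",
#     billing_project="my-project",
# )
```

The wildcard in the URI lets BigQuery write several files, which it requires for exports larger than 1 GB.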
Dear @rviscomi thank you for your help.
Actually, we are working on studying the pervasiveness of Consent Management Platforms (e.g., Iubenda) over time.
So having the summary of pages and requests is fine for us.
I can try to export the tables using BigQuery.
But is it a paid service? Is there a free option to get this data?
Thank You
Ciao!
M
GCP does offer a free tier which may or may not be enough for what you’re trying to do.
If you’re comfortable exploring the data directly on BigQuery, you can try some of these techniques to minimize query costs.
For example, the free tier of BigQuery includes 1 TB/month and this query consumes only 40 MB:
-- Top CMPs used on the 100k most popular websites' home pages
SELECT
  t.technology,
  COUNT(0) AS sites
FROM
  `httparchive.all.pages`,
  UNNEST(technologies) AS t,
  UNNEST(t.categories) AS category
WHERE
  -- Only the October 2023 dataset
  date = '2023-10-01' AND
  -- Only mobile pages
  client = 'mobile' AND
  -- Only home pages
  is_root_page AND
  -- Only websites ranked in the top 100k
  rank <= 100000 AND
  category = 'Cookie compliance'
GROUP BY
  technology
ORDER BY
  sites DESC
technology | sites |
---|---|
OneTrust | 5,364 |
Funding Choices | 4,720 |
Didomi | 1,128 |
Cookiebot | 1,037 |
iubenda | 404 |
Conversant Consent Tool | 401 |
Usercentrics | 400 |
CookieYes | 356 |
AdRoll CMP System | 241 |
TrustArc | 220 |
… | … |
If you’re only interested in tracking CMP adoption, then the Core Web Vitals Tech Report should get you what you need for free. For example, this dashboard produces results that are effectively the same as the direct query. (The differences are a side effect of joining HTTP Archive with the CrUX dataset.)
Technology | Origins |
---|---|
OneTrust | 5,739 |
Funding Choices | 5,139 |
Didomi | 1,204 |
Cookiebot | 1,138 |
Conversant Consent Tool | 457 |
Usercentrics | 428 |
iubenda | 411 |
CookieYes | 353 |
AdRoll CMP System | 255 |
TrustArc | 241 |
… | … |
Anything more complicated and you’ll probably need to query the data directly on BigQuery. If you need to stay within the free tier, there are some tricks/tradeoffs you can make to optimize the queries for efficiency.
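One way to stay within the free tier is to check how much data a query would scan before running it. The sketch below uses a BigQuery dry run for that, assuming the `google-cloud-bigquery` Python client is installed and authenticated. The project name `my-project` is a hypothetical placeholder, and the quota is taken as 10**12 bytes, which is an assumption based on the "1 TB/month" figure mentioned above.

```python
# Assumption: the free tier's "1 TB/month" is treated as 10**12 bytes.
FREE_TIER_BYTES = 10**12


def fraction_of_free_tier(num_bytes: int, quota: int = FREE_TIER_BYTES) -> float:
    """Return the share of the monthly quota that a query would consume."""
    return num_bytes / quota


def estimate_query_bytes(sql: str, billing_project: str) -> int:
    """Return the bytes a query would process, without actually running it."""
    # Imported here so fraction_of_free_tier() works without the client.
    from google.cloud import bigquery

    client = bigquery.Client(project=billing_project)
    # A dry run validates the query and reports its size at no charge.
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=job_config)
    return job.total_bytes_processed


# Example call (hypothetical project name):
# scanned = estimate_query_bytes("SELECT ...", billing_project="my-project")
# print(f"{fraction_of_free_tier(scanned):.4%} of the monthly free tier")
```

Under that assumption, a 40 MB query like the one above would use about 0.004% of the monthly allowance.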
Thank you.
It is a bit surprising that there is no way to download the data without paying…
We will try the queries you suggested.
Question: you use Wappalyzer to extract the “Technology” list from a website, right?
Yes, although we’re using a somewhat stale copy of the Wappalyzer detections as of their final open source commit a few months ago.
Perfect, thank you, we will try running the SQL queries on Google Cloud!