New Release: `httparchive.crawl` Dataset

max_ostapenko · November 20, 2024, 3:35pm

We’re thrilled to announce the launch of the new httparchive.crawl dataset, now the default and only dataset for HTTP Archive crawl data. This update simplifies data access and brings significant improvements.

After November 2024, this will be the only dataset updated with our monthly crawls. After February 2025, we plan to delete the older datasets to avoid confusion.

What’s New:

Complete History: All HTTP Archive crawl data since June 2011 is now in a single dataset.
Lean & Fast: Cleanup and deduplication make crawl.pages 40% smaller and crawl.requests 15% smaller on average.
Secondary Pages: Since 2022, we’ve been testing secondary pages (is_root_page = FALSE), offering deeper insights.
Partitioned & Clustered: Optimized performance with partitioning by date and up to 4 clustered columns (e.g., client, rank, is_root_page).
Efficient Queries:
- Drill into specific pages or domains (e.g., example.com) in seconds, scanning just ~10MB.
- Focus on specific data types (e.g., images) with clustered type columns in crawl.requests.
Streamlined Data:
- HTTP headers are now easy-to-use RECORD types.
- JSON data is stored as native types — no need for JSON_QUERY or JSON_EXTRACT_SCALAR!

SELECT
  page,
  summary.bytesHtml
FROM httparchive.crawl.pages
WHERE
  date = '2024-10-01' AND
  rank <= 1000

Popular custom_metrics columns are now separate, slashing query costs (e.g., 90GB vs. 9TB for some queries).

SELECT
  client,
  rank,
  page,
  custom_metrics.cookies
FROM httparchive.crawl.pages
WHERE
  date = '2024-10-01'
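
The request-level points above (the clustered type column and headers as RECORD types) could be combined roughly as follows. This is a sketch under assumptions: it assumes crawl.requests exposes a `url` column and a repeated `response_headers` record with `name` and `value` fields, and that 'mobile' and 'image' are valid values for `client` and `type`. Please verify these against the current schema before relying on it.

```sql
-- Sketch: Content-Type response header of image requests on root pages.
-- Assumed (not stated in this post): the url and response_headers columns,
-- the name/value fields, and the 'mobile' and 'image' values.
SELECT
  page,
  url,
  header.value AS content_type
FROM httparchive.crawl.requests,
  UNNEST(response_headers) AS header
WHERE
  date = '2024-10-01' AND
  client = 'mobile' AND
  is_root_page AND
  type = 'image' AND
  LOWER(header.name) = 'content-type'
LIMIT 100
```

Note that LIMIT only trims the result; it does not reduce the bytes scanned, so the date, client, and type filters are what keep the cost down.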

What’s Next?

Explore: Review our migration guide and test-drive the updated data!
Update Your Queries: Migrate your queries to crawl.pages and crawl.requests from these legacy datasets:
- all
- pages
- summary_pages
- requests
- summary_requests
- response_bodies
- lighthouse
- technologies
Plan Ahead:
- Legacy datasets won’t update after November 2024.
- Public access to legacy datasets ends February 2025.
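
As a rough illustration of what updating a query might involve, the sketch below shows a legacy-style query as a comment and a possible equivalent against crawl.pages. The legacy table name and its `url`, `bytesHtml`, and `rank` columns are assumptions made for comparison only, as is the 'mobile' client value; the migration guide is the authoritative reference.

```sql
-- Legacy form (assumed table and column names, shown for comparison only):
--   SELECT url, bytesHtml
--   FROM `httparchive.summary_pages.2024_10_01_mobile`
--   WHERE rank <= 1000

-- Possible equivalent against the new dataset:
SELECT
  page,
  summary.bytesHtml
FROM httparchive.crawl.pages
WHERE
  date = '2024-10-01' AND    -- replaces the date in the legacy table name
  client = 'mobile' AND      -- replaces the client suffix in the table name
  is_root_page AND           -- excludes secondary pages
  rank <= 1000
```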

Questions?

We’re here to help! Join the discussion on Slack, in this forum, or on the GitHub project for support.

Let’s unlock more accessible insights!

— HTTP Archive Maintainers Team
