New Release: `httparchive.crawl` Dataset

We’re thrilled to announce the launch of the new httparchive.crawl dataset, now the default and only dataset for HTTP Archive crawl data. This update simplifies data access and brings significant improvements.

After November 2024, this will be the only dataset updated with our monthly crawls. After February 2025, we plan to delete the older datasets to avoid confusion.

:rocket: What’s New:

  • Complete History: All HTTP Archive crawl data since June 2011 is now in a single dataset.
  • Lean & Fast: Cleanup and deduplication make crawl.pages 40% smaller and crawl.requests 15% smaller on average.
  • Secondary Pages: Since 2022, the crawl has also tested secondary pages (is_root_page = FALSE), offering insights beyond the root page alone.
  • Partitioned & Clustered: Tables are partitioned by date and clustered on up to four columns (e.g., client, rank, is_root_page) for better query performance.
  • Efficient Queries:
    • Drill into specific pages or domains (e.g., example.com) in seconds, scanning just ~10 MB.
    • Focus on specific data types (e.g., images) with clustered type columns in crawl.requests.
  • Streamlined Data:
    • HTTP headers are now easy-to-use RECORD types.
    • JSON data is stored as native types — no need for JSON_QUERY or JSON_EXTRACT_SCALAR!
-- HTML byte size of pages ranked in the top 1,000, from the October 2024 crawl
SELECT
  page,
  summary.bytesHtml
FROM httparchive.crawl.pages
WHERE
  date = '2024-10-01' AND
  rank <= 1000
    • Popular custom_metrics columns are now separate, slashing query costs (e.g., 90 GB vs. 9 TB for some queries).
-- Read only the cookies custom metric instead of the whole custom_metrics payload
SELECT
  client,
  rank,
  page,
  custom_metrics.cookies
FROM httparchive.crawl.pages
WHERE
  date = '2024-10-01'
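
As a rough illustration of the points above, the sketches below combine the date partition filter with clustered columns. Only date, client, rank, is_root_page, page, and the type column are mentioned in this post. The url, response_headers (with name and value fields) columns, the 'image' and 'mobile' values, and the exact form of the page URL are assumptions to check against the table schema before relying on them.

```sql
-- Sketch 1: how many root vs. secondary pages each client crawled on one date.
SELECT
  client,
  is_root_page,
  COUNT(0) AS pages
FROM httparchive.crawl.pages
WHERE
  date = '2024-10-01'
GROUP BY
  client,
  is_root_page
ORDER BY
  client,
  is_root_page;

-- Sketch 2: image requests made by a single page.
-- Assumes crawl.requests has `page` and `url` columns and that `type` uses the value 'image'.
SELECT
  url
FROM httparchive.crawl.requests
WHERE
  date = '2024-10-01' AND
  client = 'mobile' AND
  is_root_page AND
  type = 'image' AND
  page = 'https://example.com/';

-- Sketch 3: reading one response header from the RECORD-typed headers.
-- Assumes `response_headers` is a repeated RECORD with `name` and `value` fields.
SELECT
  url,
  header.value AS content_type
FROM httparchive.crawl.requests,
  UNNEST(response_headers) AS header
WHERE
  date = '2024-10-01' AND
  client = 'mobile' AND
  is_root_page AND
  page = 'https://example.com/' AND
  LOWER(header.name) = 'content-type';
```

Keeping the date filter in every query matters most, since it limits the scan to a single partition; the client and is_root_page filters then narrow it further within that partition.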

:date: What’s Next?

  1. Explore: Review our migration guide and test-drive the updated data!

  2. Update Your Queries: Migrate your queries to crawl.pages and crawl.requests from the legacy datasets:

    • all
    • pages
    • summary_pages
    • requests
    • summary_requests
    • response_bodies
    • lighthouse
    • technologies

  3. Plan Ahead:

    • Legacy datasets won’t update after November 2024.
    • Public access to legacy datasets ends February 2025.
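
To make step 2 more concrete, here is a minimal before-and-after sketch. The legacy table name and its columns in the comment are assumptions about the old per-crawl layout, shown only for comparison. The 'desktop' client value and the is_root_page filter (to approximate the scope of a legacy root-page table) are also assumptions. The migration guide remains the authority on column mappings.

```sql
-- Before (legacy, assumed per-crawl and per-client table; for comparison only):
--   SELECT url, bytesHtml
--   FROM `httparchive.summary_pages.2024_10_01_desktop`

-- After: a single table, with the crawl date and client expressed as filters.
SELECT
  page,
  summary.bytesHtml
FROM httparchive.crawl.pages
WHERE
  date = '2024-10-01' AND   -- replaces the date in the legacy table name
  client = 'desktop' AND    -- replaces the client suffix in the legacy table name
  is_root_page;             -- assumed to match the pages the legacy table covered
```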

:speech_balloon: Questions?

We’re here to help! For support, join the discussion on Slack, in this forum, or in the GitHub project.

Let’s unlock more accessible insights!

HTTP Archive Maintainers Team