Hi @pawelgrzybek, welcome to the HTTP Archive community!
The size of the dataset may seem intimidating at first, but there are a few techniques to minimize the amount of processing and costs involved.
- Use the `httparchive.all.requests` table with `WHERE is_main_document` to ignore everything that's not an HTML response
- Use `WHERE is_root_page` to ignore secondary pages, which I'm guessing are usually similar to home (root) pages in terms of feed metadata
- Use `WHERE client = 'mobile'` (or `'desktop'`), assuming there are no major differences between clients
- Use `TABLESAMPLE` to reduce the number of rows processed, e.g. `TABLESAMPLE SYSTEM (10 PERCENT)` will “only” process 234 GB (everyone gets 1 TB free per month); a cheap sanity check using these filters is sketched after this list
- Set up cost controls to ensure you never unexpectedly pay a cent out of pocket
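For example, before touching `response_body` at all, a quick count like this only scans the small filter columns, so it should cost next to nothing. This is just a sketch using the same filters and date as the full query below:

```sql
-- Sketch: check how many main-document rows a 10% sample yields for this date,
-- without reading response_body (only the small filter columns are billed).
SELECT
  COUNT(0) AS sampled_pages
FROM
  `httparchive.all.requests` TABLESAMPLE SYSTEM (10 PERCENT)
WHERE
  date = '2023-05-01' AND
  client = 'mobile' AND
  is_main_document AND
  is_root_page
```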
All that said, here’s a query that processes 234 GB and gives you a sample of ~10% of pages:
```sql
WITH pages AS (
  SELECT
    REGEXP_EXTRACT_ALL(response_body, r'(?i)(application/rss\+xml|application/atom\+xml|application/feed\+json)') AS feed_types
  FROM
    `httparchive.all.requests` TABLESAMPLE SYSTEM (10 PERCENT)
  WHERE
    date = '2023-05-01' AND
    client = 'mobile' AND
    is_main_document AND
    is_root_page
)

SELECT
  LOWER(feed_type) AS feed_type,
  COUNT(0) AS pages
FROM
  pages
LEFT JOIN
  UNNEST(feed_types) AS feed_type
GROUP BY
  feed_type
ORDER BY
  pages DESC
```
Here are the results:
| feed_type | # pages | % pages |
|---|---|---|
| application/rss+xml | 1,214,353 | 50% |
| (none) | 1,045,274 | 43% |
| application/atom+xml | 189,469 | 8% |
| application/feed+json | 112 | 0% |
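The `% pages` column above is just each count divided by the total row count. If you’d rather have BigQuery compute it, a window function in the final `SELECT` works; here’s a minimal sketch reusing the same `pages` CTE (note that a page declaring more than one feed type is counted once per type, so strictly these are shares of rows):

```sql
-- Sketch: same final SELECT as above, with a percentage column added.
SELECT
  LOWER(feed_type) AS feed_type,
  COUNT(0) AS pages,
  ROUND(100 * COUNT(0) / SUM(COUNT(0)) OVER (), 1) AS pct_pages
FROM
  pages
LEFT JOIN
  UNNEST(feed_types) AS feed_type
GROUP BY
  feed_type
ORDER BY
  pages DESC
```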
So if the sample is representative, we can say that RSS is used by 50% of pages, Atom by 8%, and JSON Feed by <1%. And about 58% of pages use a feed of any kind.
Hope that helps!