Hi @pawelgrzybek, welcome to the HTTP Archive community!
The size of the dataset may seem intimidating at first, but there are a few techniques to minimize the amount of processing and costs involved.
- Use the
httparchive.all.requests
table withWHERE is_main_document
set to ignore everything that’s not an HTML response - Use
WHERE is_root_page
to ignore secondary pages, which I’m guessing are usually similar to home (root) pages in terms of feed metadata - Use
WHERE client = 'mobile'
(or'desktop'
) assuming there are no major differences between clients - Use
TABLESAMPLE
to reduce the number of rows processed, ieTABLESAMPLE SYSTEM (10 PERCENT)
will “only” process 234 GB (everyone gets 1 TB free per month) - Set up cost controls to ensure you never pay a cent out of pocket unexpectedly
All that said, here’s a query you can run that is 234 GB and will give you a sample of ~10% of pages:
WITH pages AS (
SELECT
REGEXP_EXTRACT_ALL(response_body, r'(?i)(application/rss\+xml|application/atom\+xml|application/feed\+json)') AS feed_types
FROM
`httparchive.all.requests` TABLESAMPLE SYSTEM (10 PERCENT)
WHERE
date = '2023-05-01' AND
client = 'mobile' AND
is_main_document AND
is_root_page
)
SELECT
LOWER(feed_type) AS feed_type,
COUNT(0) AS pages
FROM
pages
LEFT JOIN
UNNEST(feed_types) AS feed_type
GROUP BY
feed_type
ORDER BY
pages DESC
Here are the results:
feed_type | # pages | % pages |
---|---|---|
application/rss+xml |
1,214,353 | 50% |
(none) | 1,045,274 | 43% |
application/atom+xml |
189,469 | 8% |
application/feed+json |
112 | 0% |
So if the sample is representative, we can say that RSS is used by 50% of pages, Atom on 8%, and JSON Feed on <1%. And about 58% of pages use any feed type at all.
Hope that helps!