What is the most popular feed format: rss, atom, or JSON?

Hi @pawelgrzybek, welcome to the HTTP Archive community! :wave:

The size of the dataset may seem intimidating at first, but there are a few techniques to minimize the amount of processing and costs involved.

  • Use the httparchive.all.requests table with WHERE is_main_document set to ignore everything that’s not an HTML response
  • Use WHERE is_root_page to ignore secondary pages, which I’m guessing are usually similar to home (root) pages in terms of feed metadata
  • Use WHERE client = 'mobile' (or 'desktop') assuming there are no major differences between clients
  • Use TABLESAMPLE to reduce the number of rows processed, ie TABLESAMPLE SYSTEM (10 PERCENT) will “only” process 234 GB (everyone gets 1 TB free per month)
  • Set up cost controls to ensure you never pay a cent out of pocket unexpectedly

All that said, here’s a query you can run that is 234 GB and will give you a sample of ~10% of pages:

WITH pages AS (
  SELECT
    REGEXP_EXTRACT_ALL(response_body, r'(?i)(application/rss\+xml|application/atom\+xml|application/feed\+json)') AS feed_types
  FROM
    `httparchive.all.requests` TABLESAMPLE SYSTEM (10 PERCENT)
  WHERE
    date = '2023-05-01' AND
    client = 'mobile' AND
    is_main_document AND
    is_root_page
)

SELECT
  LOWER(feed_type) AS feed_type,
  COUNT(0) AS pages
FROM
  pages
LEFT JOIN
  UNNEST(feed_types) AS feed_type
GROUP BY
  feed_type
ORDER BY
  pages DESC

Here are the results:

feed_type # pages % pages
application/rss+xml 1,214,353 50%
(none) 1,045,274 43%
application/atom+xml 189,469 8%
application/feed+json 112 0%

So if the sample is representative, we can say that RSS is used by 50% of pages, Atom on 8%, and JSON Feed on <1%. And about 58% of pages use any feed type at all.

Hope that helps!