What is the most popular feed format: rss, atom, or JSON?

rviscomi · May 26, 2023, 3:16pm

Hi @pawelgrzybek, welcome to the HTTP Archive community!

The size of the dataset may seem intimidating at first, but there are a few techniques to minimize the amount of processing and costs involved.

Use the httparchive.all.requests table with WHERE is_main_document set to ignore everything that’s not an HTML response
Use WHERE is_root_page to ignore secondary pages, which I’m guessing are usually similar to home (root) pages in terms of feed metadata
Use WHERE client = 'mobile' (or 'desktop') assuming there are no major differences between clients
Use TABLESAMPLE to reduce the number of rows processed, ie TABLESAMPLE SYSTEM (10 PERCENT) will “only” process 234 GB (everyone gets 1 TB free per month)
Set up cost controls to ensure you never pay a cent out of pocket unexpectedly

All that said, here’s a query you can run that is 234 GB and will give you a sample of ~10% of pages:

WITH pages AS (
  SELECT
    REGEXP_EXTRACT_ALL(response_body, r'(?i)(application/rss\+xml|application/atom\+xml|application/feed\+json)') AS feed_types
  FROM
    `httparchive.all.requests` TABLESAMPLE SYSTEM (10 PERCENT)
  WHERE
    date = '2023-05-01' AND
    client = 'mobile' AND
    is_main_document AND
    is_root_page
)

SELECT
  LOWER(feed_type) AS feed_type,
  COUNT(0) AS pages
FROM
  pages
LEFT JOIN
  UNNEST(feed_types) AS feed_type
GROUP BY
  feed_type
ORDER BY
  pages DESC

Here are the results:

feed_type	# pages	% pages
`application/rss+xml`	1,214,353	50%
(none)	1,045,274	43%
`application/atom+xml`	189,469	8%
`application/feed+json`	112	0%

So if the sample is representative, we can say that RSS is used by 50% of pages, Atom on 8%, and JSON Feed on <1%. And about 58% of pages use any feed type at all.

Hope that helps!

Topic		Replies	Views
Where to find http headers in the http archive datasets Analysis	2	1428	April 7, 2022
New Release: `httparchive.crawl` Dataset Meta	1	173	November 20, 2024
Tracking Page Weight Over Time Analysis	3	8566	March 3, 2021
Help finding list of home pages with specific http response header Analysis	7	924	June 7, 2023
Quickstart guide to exploring the HTTP Archive FAQ	0	19186	March 1, 2016

What is the most popular feed format: rss, atom, or JSON?

Related topics