Hi all.
I am new to the HTTP Archive and BigQuery world, and I am trying to determine the popularity of feed formats (RSS/Atom/JSON).

The first idea that comes to my mind is to scan the body of `httparchive.response_bodies` for `application/rss+xml`, `application/atom+xml` or `application/feed+json`. Unfortunately that would require scanning 40+ TB of data and may cost me a fortune.
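For reference, this is roughly the query I have in mind (just a sketch, and I am guessing the exact table partition and column names):

```sql
-- Naive full scan (what I'd like to avoid): reads the entire `body`
-- column, which is where the 40+ TB estimate comes from.
SELECT
  COUNT(DISTINCT page) AS pages
FROM
  `httparchive.response_bodies.2023_05_01_mobile`
WHERE
  REGEXP_CONTAINS(body, r'(?i)application/(rss\+xml|atom\+xml|feed\+json)')
```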
Can you please help me find a solution to my problem without spending a fortune? I am trying to determine two things:
- How many websites use feeds at all.
- How popular the individual formats are (RSS/Atom/JSON).
Your help will be much appreciated.
Hi @pawelgrzybek, welcome to the HTTP Archive community!
The size of the dataset may seem intimidating at first, but there are a few techniques to minimize the amount of processing and costs involved.
- Use the `httparchive.all.requests` table with `WHERE is_main_document` set to ignore everything that's not an HTML response
- Use `WHERE is_root_page` to ignore secondary pages, which I'm guessing are usually similar to home (root) pages in terms of feed metadata
- Use `WHERE client = 'mobile'` (or `'desktop'`) assuming there are no major differences between clients
- Use `TABLESAMPLE` to reduce the number of rows processed, e.g. `TABLESAMPLE SYSTEM (10 PERCENT)` will "only" process 234 GB (everyone gets 1 TB free per month)
- Set up cost controls to ensure you never pay a cent out of pocket unexpectedly
All that said, here's a query you can run that processes 234 GB and will give you a sample of ~10% of pages:
```sql
WITH pages AS (
  SELECT
    REGEXP_EXTRACT_ALL(response_body, r'(?i)(application/rss\+xml|application/atom\+xml|application/feed\+json)') AS feed_types
  FROM
    `httparchive.all.requests` TABLESAMPLE SYSTEM (10 PERCENT)
  WHERE
    date = '2023-05-01' AND
    client = 'mobile' AND
    is_main_document AND
    is_root_page
)
SELECT
  LOWER(feed_type) AS feed_type,
  COUNT(0) AS pages
FROM
  pages
LEFT JOIN
  UNNEST(feed_types) AS feed_type
GROUP BY
  feed_type
ORDER BY
  pages DESC
```
Here are the results:
| feed_type | # pages | % pages |
|---|---|---|
| application/rss+xml | 1,214,353 | 50% |
| (none) | 1,045,274 | 43% |
| application/atom+xml | 189,469 | 8% |
| application/feed+json | 112 | 0% |
So if the sample is representative, we can say that RSS is used by 50% of pages, Atom by 8%, and JSON Feed by <1%. And about 58% of pages use any feed type at all.
Hope that helps!
I was looking at this too, and was using a different approach.
I'm not sure how valid this assumption is, but if sites use this format (like mine does!):

```html
<link rel="alternate" type="application/rss+xml" title="TuneTheWeb RSS feed" href="/rss.xml" />
```

then we can find it, since we grab all the `<link>` attributes in a custom metric and can query just that, without looking at the whole, expensive, response bodies:
```sql
-- Parse the almanac custom metric and return every <link rel="alternate">
-- node as a (name, href, type) struct.
CREATE TEMPORARY FUNCTION getLinksByType(payload STRING)
RETURNS ARRAY<STRUCT<name STRING, href STRING, type STRING>>
LANGUAGE js AS '''
var types = new Set(['alternate']);
try {
  var almanac = JSON.parse(payload);
  return almanac['link-nodes'].nodes.reduce((results, link) => {
    var type = link.rel.toLowerCase();
    if (!types.has(type)) {
      return results;
    }
    results.push({
      name: type,
      href: link.href,
      type: link.type
    });
    return results;
  }, []);
} catch (e) {
  return [];
}
''';

SELECT
  client,
  type,
  COUNT(DISTINCT page) AS pages,
  total,
  COUNT(DISTINCT page) / total AS pct_pages
FROM (
  -- One row per distinct (page, alternate-link type) pair.
  SELECT DISTINCT
    client,
    page,
    link.type AS type
  FROM
    `httparchive.all.pages`
  LEFT JOIN
    UNNEST(getLinksByType(JSON_QUERY(custom_metrics, '$.almanac'))) AS link
  WHERE
    date = '2023-04-01' AND
    is_root_page AND
    client = 'mobile'
)
INNER JOIN (
  -- Total page count, used as the percentage denominator.
  SELECT
    client,
    COUNT(DISTINCT page) AS total
  FROM
    `httparchive.all.pages`
  WHERE
    date = '2023-04-01' AND
    is_root_page AND
    client = 'mobile'
  GROUP BY
    client
)
USING (client)
GROUP BY
  client,
  type,
  total
ORDER BY
  pct_pages DESC
```
That costs 1.88 TB (for mobile only) but does query the full dataset. Or you could limit it to the top million sites by adding `AND rank <= 1000000`, or by sampling as Rick said above. Pro tip: set `rank = 1000` while testing your query, as it makes it much cheaper!
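For example, here's a rough sketch of a cheap test run (the `rank` values are bucketed, so `rank = 1000` selects roughly the top 1k sites):

```sql
-- Quick sanity check while developing: cap the scan to the top-1k
-- rank bucket, then remove the cap for the real run.
SELECT
  COUNT(DISTINCT page) AS pages
FROM
  `httparchive.all.pages`
WHERE
  date = '2023-04-01' AND
  client = 'mobile' AND
  is_root_page AND
  rank = 1000
```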
Anyway, here are the top 10 results of that query:
| client | type | pages | total | pct_pages |
|---|---|---|---|---|
| mobile | | 10,494,367 | 16,539,169 | 63.45% |
| mobile | application/rss+xml | 6,250,048 | 16,539,169 | 37.79% |
| mobile | application/json+oembed | 4,334,594 | 16,539,169 | 26.21% |
| mobile | text/xml+oembed | 4,279,566 | 16,539,169 | 25.88% |
| mobile | application/json | 3,866,316 | 16,539,169 | 23.38% |
| mobile | application/atom+xml | 750,083 | 16,539,169 | 4.54% |
| mobile | text/calendar | 155,971 | 16,539,169 | 0.94% |
| mobile | text/html | 48,519 | 16,539,169 | 0.29% |
| mobile | application/xml+oembed | 42,638 | 16,539,169 | 0.26% |
| mobile | text/xml | 22,330 | 16,539,169 | 0.14% |
(Note some pages may have multiple `<link rel="alternate" ...>` tags.)
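And if you want a single "any feed at all" number from this approach, here's a rough sketch (it assumes the same temporary `getLinksByType()` function from the query above is defined in the same script):

```sql
-- Count pages declaring at least one of the three feed MIME types
-- via <link rel="alternate">. Run in the same script as the
-- CREATE TEMPORARY FUNCTION statement above.
SELECT
  COUNT(DISTINCT page) AS pages_with_any_feed
FROM
  `httparchive.all.pages`,
  UNNEST(getLinksByType(JSON_QUERY(custom_metrics, '$.almanac'))) AS link
WHERE
  date = '2023-04-01' AND
  client = 'mobile' AND
  is_root_page AND
  LOWER(link.type) IN ('application/rss+xml', 'application/atom+xml', 'application/feed+json')
```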
Nice! I forgot about the `link-nodes` custom metric, that should be way more accurate.

FWIW the equivalent query over the `all.requests` table isn't that much more expensive:
```sql
/* This query will process 2.24 TB when run. */
SELECT
  response_body
FROM
  `httparchive.all.requests`
WHERE
  date = '2023-04-01' AND
  client = 'mobile' AND
  is_main_document AND
  is_root_page
```
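So if you want exact numbers rather than a ~10% sample, you can drop the `TABLESAMPLE` and reuse the same aggregation from my earlier query, something like this sketch:

```sql
-- Full-crawl version of the sampled query above: identical logic, just
-- no TABLESAMPLE, over the 2023-04-01 mobile crawl (~2.24 TB processed).
WITH pages AS (
  SELECT
    REGEXP_EXTRACT_ALL(response_body, r'(?i)(application/rss\+xml|application/atom\+xml|application/feed\+json)') AS feed_types
  FROM
    `httparchive.all.requests`
  WHERE
    date = '2023-04-01' AND
    client = 'mobile' AND
    is_main_document AND
    is_root_page
)
SELECT
  LOWER(feed_type) AS feed_type,
  COUNT(0) AS pages
FROM
  pages
LEFT JOIN
  UNNEST(feed_types) AS feed_type
GROUP BY
  feed_type
ORDER BY
  pages DESC
```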