What is the most popular feed format: RSS, Atom, or JSON?

Hi all.

I am new to the HTTP Archive and BigQuery world, and I am trying to determine the popularity of feed formats (RSS/Atom/JSON).

The first idea that comes to mind is to scan the bodies in httparchive.response_bodies for application/rss+xml, application/atom+xml or application/feed+json. Unfortunately, that would require scanning 40+ TB of data and could cost me a fortune :frowning:

Can you please help me find a solution to my problem without spending a fortune? I am trying to determine two things.

  1. How many websites use feeds at all.
  2. How popular individual formats are (RSS/Atom/JSON).

Your help will be much appreciated.

Hi @pawelgrzybek, welcome to the HTTP Archive community! :wave:

The size of the dataset may seem intimidating at first, but there are a few techniques to minimize the amount of processing and costs involved.

  • Use the httparchive.all.requests table with WHERE is_main_document set to ignore everything that’s not an HTML response
  • Use WHERE is_root_page to ignore secondary pages, which I’m guessing are usually similar to home (root) pages in terms of feed metadata
  • Use WHERE client = 'mobile' (or 'desktop') assuming there are no major differences between clients
  • Use TABLESAMPLE to reduce the number of rows processed, e.g. TABLESAMPLE SYSTEM (10 PERCENT) will “only” process 234 GB (everyone gets 1 TB free per month)
  • Set up cost controls to ensure you never pay a cent out of pocket unexpectedly

All that said, here’s a query you can run that processes 234 GB and will give you a sample of ~10% of pages:

WITH pages AS (
  SELECT
    REGEXP_EXTRACT_ALL(response_body, r'(?i)(application/rss\+xml|application/atom\+xml|application/feed\+json)') AS feed_types
  FROM
    `httparchive.all.requests` TABLESAMPLE SYSTEM (10 PERCENT)
  WHERE
    date = '2023-05-01' AND
    client = 'mobile' AND
    is_main_document AND
    is_root_page
)

SELECT
  LOWER(feed_type) AS feed_type,
  COUNT(0) AS pages
FROM
  pages
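-- LEFT JOIN keeps pages with no feed matches, which surface as the NULL / (none) row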
LEFT JOIN
  UNNEST(feed_types) AS feed_type
GROUP BY
  feed_type
ORDER BY
  pages DESC

Here are the results:

feed_type              # pages    % pages
application/rss+xml    1,214,353  50%
(none)                 1,045,274  43%
application/atom+xml   189,469    8%
application/feed+json  112        0%

So if the sample is representative, we can say that RSS is used by 50% of pages, Atom by 8%, and JSON Feed by <1%, and that about 58% of pages use a feed of any type at all.
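
If you want a single number for your first question (how many pages use any feed at all), a small variation on the same sampled query should give it. This is just a sketch reusing the CTE above; it scans the same columns, so the cost should be roughly the same:

WITH pages AS (
  SELECT
    REGEXP_EXTRACT_ALL(response_body, r'(?i)(application/rss\+xml|application/atom\+xml|application/feed\+json)') AS feed_types
  FROM
    `httparchive.all.requests` TABLESAMPLE SYSTEM (10 PERCENT)
  WHERE
    date = '2023-05-01' AND
    client = 'mobile' AND
    is_main_document AND
    is_root_page
)

SELECT
  COUNT(0) AS total_pages,
  -- Pages whose main document mentions at least one of the three feed MIME types
  COUNTIF(ARRAY_LENGTH(feed_types) > 0) AS pages_with_feeds,
  COUNTIF(ARRAY_LENGTH(feed_types) > 0) / COUNT(0) AS pct_pages_with_feeds
FROM
  pages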

Hope that helps!

I was looking at this too, and was using a different approach.

I’m not sure how valid this assumption is, but if sites use this format (like mine does!):

<link rel="alternate" type="application/rss+xml" title="TuneTheWeb RSS feed" href="/rss.xml" />

Then we grab all the <link> attributes in a custom metric, so we can query just that without scanning the whole, expensive response bodies:

CREATE TEMPORARY FUNCTION getLinksByType(payload STRING)
RETURNS ARRAY < STRUCT < name STRING, href STRING, type STRING >>
LANGUAGE js AS '''
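// Only <link rel="alternate"> elements are kept; other rel values are skipped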
var types = new Set(['alternate']);
try {
    var almanac = JSON.parse(payload);
    return almanac['link-nodes'].nodes.reduce((results, link) => {
        var type = link.rel.toLowerCase();
        if (!types.has(type)) {
            return results;
        }
        results.push({
            name: type,
            href: link.href,
            type: link.type
        });
        return results;
    }, []);
} catch (e) {
    return [];
}
''';
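-- The UDF above parses the almanac custom metric and returns one struct per
-- <link rel="alternate"> node, keeping its href and type attributes.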
SELECT
  client,
  type,
  COUNT(DISTINCT page) AS pages,
  total,
  COUNT(DISTINCT page) / total AS pct_pages
FROM (
  SELECT DISTINCT
    client,
    page,
    link.type AS type
  FROM
    `httparchive.all.pages`
  LEFT JOIN
    UNNEST(getLinksByType(JSON_QUERY(custom_metrics, '$.almanac'))) AS link
  WHERE
    date = '2023-04-01' AND
    is_root_page AND
    client = 'mobile'
)
INNER JOIN (
  SELECT
    client,
    COUNT(DISTINCT page) AS total
  FROM
    `httparchive.all.pages`
  WHERE
    date = '2023-04-01' AND
    is_root_page AND
    client = 'mobile'
  GROUP BY
    client
)
USING (client)
GROUP BY
  client,
  type,
  total
ORDER BY
  pct_pages DESC

That costs 1.88 TB (for mobile only) but does query the full dataset. Or you could limit it to the top million sites by adding AND rank <= 1000000, or by sampling as Rick said above. Pro tip: set rank = 1000 while testing your query, as it makes it much cheaper!
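
For example, here’s a sketch of what that rank limit looks like applied to the "total" subquery above (the same condition would go in the other subquery too). Rank is bucketed in the HTTP Archive tables, so rank = 1000 means the top 1k sites:

SELECT
  client,
  COUNT(DISTINCT page) AS total
FROM
  `httparchive.all.pages`
WHERE
  date = '2023-04-01' AND
  is_root_page AND
  client = 'mobile' AND
  rank = 1000  -- use rank <= 1000000 for the top million sites
GROUP BY
  client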

Anyway, here are the top 10 results of the full query:

client  type                     pages       total       pct_pages
mobile  (none)                   10,494,367  16,539,169  63.45%
mobile  application/rss+xml      6,250,048   16,539,169  37.79%
mobile  application/json+oembed  4,334,594   16,539,169  26.21%
mobile  text/xml+oembed          4,279,566   16,539,169  25.88%
mobile  application/json         3,866,316   16,539,169  23.38%
mobile  application/atom+xml     750,083     16,539,169  4.54%
mobile  text/calendar            155,971     16,539,169  0.94%
mobile  text/html                48,519      16,539,169  0.29%
mobile  application/xml+oembed   42,638      16,539,169  0.26%
mobile  text/xml                 22,330      16,539,169  0.14%

(note that some pages may have multiple <link rel="alternate" ...> tags, so the percentages can overlap)

Nice! I forgot about the link-nodes custom metric; that should be way more accurate.

FWIW the equivalent query over the all.requests table isn’t that much more expensive:

/* This query will process 2.24 TB when run. */
SELECT
  response_body
FROM
  `httparchive.all.requests`
WHERE
  date = '2023-04-01' AND
  client = 'mobile' AND
  is_main_document AND
  is_root_page
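
So if you did want to go the response-body route over the full dataset, a sketch combining that scan with the regex aggregation from the earlier reply (no sampling, same 2023-04-01 crawl) would look something like this, and should process about the same amount of data since response_body dominates the bytes scanned:

WITH pages AS (
  SELECT
    REGEXP_EXTRACT_ALL(response_body, r'(?i)(application/rss\+xml|application/atom\+xml|application/feed\+json)') AS feed_types
  FROM
    `httparchive.all.requests`
  WHERE
    date = '2023-04-01' AND
    client = 'mobile' AND
    is_main_document AND
    is_root_page
)

SELECT
  LOWER(feed_type) AS feed_type,
  COUNT(0) AS pages
FROM
  pages
LEFT JOIN
  UNNEST(feed_types) AS feed_type
GROUP BY
  feed_type
ORDER BY
  pages DESC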