Hi all.
I am new to the HTTP Archive and BigQuery world, and I am trying to determine the popularity of feed formats (RSS/Atom/JSON).

The first idea that comes to my mind is to scan the body of `httparchive.response_bodies` for `application/rss+xml`, `application/atom+xml` or `application/feed+json`. Unfortunately that would require scanning 40+ TB of data and may cost me a fortune.
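For reference, this is roughly the query I have in mind (just a sketch, and I am guessing the exact table partition and column names):

```sql
-- Naive full scan (what I'd like to avoid): reads the entire `body`
-- column, which is where the 40+ TB estimate comes from.
SELECT
  COUNT(DISTINCT page) AS pages
FROM
  `httparchive.response_bodies.2023_05_01_mobile`
WHERE
  REGEXP_CONTAINS(body, r'(?i)application/(rss\+xml|atom\+xml|feed\+json)')
```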
Can you please help me find a solution to my problem without spending a fortune? I am trying to determine two things:
- How many websites use feeds at all.
- How popular the individual formats are (RSS/Atom/JSON).
Your help will be much appreciated.
Hi @pawelgrzybek, welcome to the HTTP Archive community!
The size of the dataset may seem intimidating at first, but there are a few techniques to minimize the amount of processing and costs involved.
- Use the `httparchive.all.requests` table with `WHERE is_main_document` set to ignore everything that's not an HTML response
- Use `WHERE is_root_page` to ignore secondary pages, which I'm guessing are usually similar to home (root) pages in terms of feed metadata
- Use `WHERE client = 'mobile'` (or `'desktop'`) assuming there are no major differences between clients
- Use `TABLESAMPLE` to reduce the number of rows processed, e.g. `TABLESAMPLE SYSTEM (10 PERCENT)` will "only" process 234 GB (everyone gets 1 TB free per month)
- Set up cost controls to ensure you never pay a cent out of pocket unexpectedly
All that said, here's a query you can run that processes 234 GB and will give you a sample of ~10% of pages:
```sql
WITH pages AS (
  SELECT
    REGEXP_EXTRACT_ALL(response_body, r'(?i)(application/rss\+xml|application/atom\+xml|application/feed\+json)') AS feed_types
  FROM
    `httparchive.all.requests` TABLESAMPLE SYSTEM (10 PERCENT)
  WHERE
    date = '2023-05-01' AND
    client = 'mobile' AND
    is_main_document AND
    is_root_page
)
SELECT
  LOWER(feed_type) AS feed_type,
  COUNT(0) AS pages
FROM
  pages
LEFT JOIN
  UNNEST(feed_types) AS feed_type
GROUP BY
  feed_type
ORDER BY
  pages DESC
```
Here are the results:
| feed_type | # pages | % pages |
|---|---|---|
| application/rss+xml | 1,214,353 | 50% |
| (none) | 1,045,274 | 43% |
| application/atom+xml | 189,469 | 8% |
| application/feed+json | 112 | 0% |
So if the sample is representative, we can say that RSS is used by 50% of pages, Atom by 8%, and JSON Feed by <1%. And about 58% of pages use any feed type at all.
Hope that helps!
I was looking at this too, and was using a different approach.
I'm not sure how valid this assumption is, but if sites use this format (like mine does!):

```html
<link rel="alternate" type="application/rss+xml" title="TuneTheWeb RSS feed" href="/rss.xml" />
```

then we can find it, since we grab all the `<link>` attributes in a custom metric and can query just that, without looking at the whole, expensive, response bodies:
```sql
-- Parse the almanac custom metric and return every <link rel="alternate">
-- node as a (name, href, type) struct.
CREATE TEMPORARY FUNCTION getLinksByType(payload STRING)
RETURNS ARRAY<STRUCT<name STRING, href STRING, type STRING>>
LANGUAGE js AS '''
var types = new Set(['alternate']);
try {
  var almanac = JSON.parse(payload);
  return almanac['link-nodes'].nodes.reduce((results, link) => {
    var type = link.rel.toLowerCase();
    if (!types.has(type)) {
      return results;
    }
    results.push({
      name: type,
      href: link.href,
      type: link.type
    });
    return results;
  }, []);
} catch (e) {
  return [];
}
''';

SELECT
  client,
  type,
  COUNT(DISTINCT page) AS pages,
  total,
  COUNT(DISTINCT page) / total AS pct_pages
FROM (
  -- One row per distinct (page, alternate-link type) pair.
  SELECT DISTINCT
    client,
    page,
    link.type AS type
  FROM
    `httparchive.all.pages`
  LEFT JOIN
    UNNEST(getLinksByType(JSON_QUERY(custom_metrics, '$.almanac'))) AS link
  WHERE
    date = '2023-04-01' AND
    is_root_page AND
    client = 'mobile'
)
INNER JOIN (
  -- Total page count, used as the percentage denominator.
  SELECT
    client,
    COUNT(DISTINCT page) AS total
  FROM
    `httparchive.all.pages`
  WHERE
    date = '2023-04-01' AND
    is_root_page AND
    client = 'mobile'
  GROUP BY
    client
)
USING (client)
GROUP BY
  client,
  type,
  total
ORDER BY
  pct_pages DESC
```
That costs 1.88 TB (for mobile only) but does query the full dataset. Or you could limit it to the top million sites by adding `AND rank <= 1000000`, or by sampling as Rick said above. Pro tip: set `rank = 1000` while testing your query, as it makes it much cheaper!
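For example, here's a rough sketch of a cheap test run (the `rank` values are bucketed, so `rank = 1000` selects roughly the top 1k sites):

```sql
-- Quick sanity check while developing: cap the scan to the top-1k
-- rank bucket, then remove the cap for the real run.
SELECT
  COUNT(DISTINCT page) AS pages
FROM
  `httparchive.all.pages`
WHERE
  date = '2023-04-01' AND
  client = 'mobile' AND
  is_root_page AND
  rank = 1000
```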
Anyway, here are the top 10 results of that query:
| client | type | pages | total | pct_pages |
|---|---|---|---|---|
| mobile | | 10,494,367 | 16,539,169 | 63.45% |
| mobile | application/rss+xml | 6,250,048 | 16,539,169 | 37.79% |
| mobile | application/json+oembed | 4,334,594 | 16,539,169 | 26.21% |
| mobile | text/xml+oembed | 4,279,566 | 16,539,169 | 25.88% |
| mobile | application/json | 3,866,316 | 16,539,169 | 23.38% |
| mobile | application/atom+xml | 750,083 | 16,539,169 | 4.54% |
| mobile | text/calendar | 155,971 | 16,539,169 | 0.94% |
| mobile | text/html | 48,519 | 16,539,169 | 0.29% |
| mobile | application/xml+oembed | 42,638 | 16,539,169 | 0.26% |
| mobile | text/xml | 22,330 | 16,539,169 | 0.14% |
(Note some pages may have multiple `<link rel="alternate" ...>` tags.)
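And if you want a single "any feed at all" number from this approach, here's a rough sketch (it assumes the same temporary `getLinksByType()` function from the query above is defined in the same script):

```sql
-- Count pages declaring at least one of the three feed MIME types
-- via <link rel="alternate">. Run in the same script as the
-- CREATE TEMPORARY FUNCTION statement above.
SELECT
  COUNT(DISTINCT page) AS pages_with_any_feed
FROM
  `httparchive.all.pages`,
  UNNEST(getLinksByType(JSON_QUERY(custom_metrics, '$.almanac'))) AS link
WHERE
  date = '2023-04-01' AND
  client = 'mobile' AND
  is_root_page AND
  LOWER(link.type) IN ('application/rss+xml', 'application/atom+xml', 'application/feed+json')
```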
Nice! I forgot about the `link-nodes` custom metric, that should be way more accurate.

FWIW the equivalent query over the `all.requests` table isn't that much more expensive:
```sql
/* This query will process 2.24 TB when run. */
SELECT
  response_body
FROM
  `httparchive.all.requests`
WHERE
  date = '2023-04-01' AND
  client = 'mobile' AND
  is_main_document AND
  is_root_page
```
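So if you want exact numbers rather than a ~10% sample, you can drop the `TABLESAMPLE` and reuse the same aggregation from my earlier query, something like this sketch:

```sql
-- Full-crawl version of the sampled query above: identical logic, just
-- no TABLESAMPLE, over the 2023-04-01 mobile crawl (~2.24 TB processed).
WITH pages AS (
  SELECT
    REGEXP_EXTRACT_ALL(response_body, r'(?i)(application/rss\+xml|application/atom\+xml|application/feed\+json)') AS feed_types
  FROM
    `httparchive.all.requests`
  WHERE
    date = '2023-04-01' AND
    client = 'mobile' AND
    is_main_document AND
    is_root_page
)
SELECT
  LOWER(feed_type) AS feed_type,
  COUNT(0) AS pages
FROM
  pages
LEFT JOIN
  UNNEST(feed_types) AS feed_type
GROUP BY
  feed_type
ORDER BY
  pages DESC
```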