Two questions: how can we identify which websites serve news content, and which CMS does each of these sites use?
To identify a news site, we could use heuristics like looking for the “news” keyword somewhere in the URL, title, or meta description. I think it’d be more robust (and language-agnostic) to look at structured data. There’s a NewsMediaOrganization @type that sites could define in their JSON-LD.
The Structured Data chapter of the Web Almanac implemented a custom metric that extracts the values of different types of metadata, including JSON-LD. There’s an existing query we could use to look for this particular @type, but for simplicity we could just find any website that includes the text anywhere in the data object:
SELECT
  url
FROM
  `httparchive.pages.2022_02_01_mobile`
WHERE
  REGEXP_CONTAINS(payload, r'NewsMediaOrganization')
This gives us a list of about 7,000 URLs that include the NewsMediaOrganization keyword in their data object. That seems very low, so I’d love to hear other suggestions for ways to improve the detection.
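One way to put that number in context is to count the matches against every page in the crawl. The sketch below reuses the same table and regex. The column aliases (`news_pages`, `total_pages`, `pct_news`) are names made up for illustration, and since it reads the same `payload` column it should cost roughly as much as the query above.

```sql
-- Sketch: how many pages mention NewsMediaOrganization, out of all pages?
-- Aliases are illustrative; table and regex are the same as above.
SELECT
  COUNTIF(REGEXP_CONTAINS(payload, r'NewsMediaOrganization')) AS news_pages,
  COUNT(0) AS total_pages,
  ROUND(
    COUNTIF(REGEXP_CONTAINS(payload, r'NewsMediaOrganization')) * 100 / COUNT(0),
    2) AS pct_news
FROM
  `httparchive.pages.2022_02_01_mobile`
```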
Assuming this list is representative, we could join the URLs with the technologies dataset to see which CMSs are used:
SELECT
  APPROX_TOP_COUNT(app, 50)
FROM
  `httparchive.technologies.2022_02_01_mobile`
WHERE
  url IN (
    SELECT
      url
    FROM
      `httparchive.pages.2022_02_01_mobile`
    WHERE
      REGEXP_CONTAINS(payload, r'NewsMediaOrganization'))
  AND category = 'CMS'
CMS | Pages | % share |
---|---|---|
WordPress | 4,672 | 93.4% |
Arc XP | 146 | 2.9% |
Drupal | 57 | 1.1% |
Joomla | 22 | 0.4% |
DM Polopoly | 18 | 0.4% |
Botble CMS | 15 | 0.3% |
Adobe Experience Manager | 12 | 0.2% |
Craft CMS | 10 | 0.2% |
Brightspot | 8 | 0.2% |
1C-Bitrix | 8 | 0.2% |
Wix | 6 | 0.1% |
SPIP | 4 | 0.1% |
October CMS | 4 | 0.1% |
Sanity | 3 | 0.1% |
BoldGrid | 3 | 0.1% |
Wagtail | 2 | 0.0% |
Squarespace | 2 | 0.0% |
DataLife Engine | 2 | 0.0% |
Varbase | 1 | 0.0% |
TYPO3 CMS | 1 | 0.0% |
Prismic | 1 | 0.0% |
Methode | 1 | 0.0% |
Kentico CMS | 1 | 0.0% |
HubSpot CMS Hub | 1 | 0.0% |
Total | 5,000 | 100% |
The results show exactly 5,000 news sites with a detected CMS, or 70.6% of the list. WordPress is by far the most popular, used by 93.4% of these sites.
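The counts and percentage shares can also be computed in a single query rather than by hand. This is a minimal sketch, assuming the same tables and filter as above; the `news_pages` CTE and the `cms`, `pages`, and `pct_share` aliases are illustrative names, and it returns exact counts instead of the approximate ones from `APPROX_TOP_COUNT`.

```sql
-- Sketch: per-CMS page counts and their share of all CMS detections
-- on pages that mention NewsMediaOrganization.
WITH news_pages AS (
  SELECT
    url
  FROM
    `httparchive.pages.2022_02_01_mobile`
  WHERE
    REGEXP_CONTAINS(payload, r'NewsMediaOrganization')
)

SELECT
  app AS cms,
  COUNT(0) AS pages,
  -- The window SUM over the grouped counts gives the overall total.
  ROUND(COUNT(0) * 100 / SUM(COUNT(0)) OVER (), 1) AS pct_share
FROM
  `httparchive.technologies.2022_02_01_mobile`
JOIN
  news_pages
USING
  (url)
WHERE
  category = 'CMS'
GROUP BY
  cms
ORDER BY
  pages DESC
```

Note that this counts detections, so a page with two detected CMSs would contribute a row to each.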
The inverse stat is also interesting; about 30% of news sites do not use a known CMS. Could these be homegrown CMSs, headless CMSs, or something else entirely?
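To check the CMS coverage directly, one option is to count distinct news pages with and without any CMS detection. The sketch below assumes the same tables as above; the CTE names (`news_pages`, `cms_pages`) and output aliases are hypothetical, and the result may differ slightly from the figures quoted here if some pages have more than one CMS detected.

```sql
-- Sketch: share of news pages with at least one detected CMS.
WITH news_pages AS (
  SELECT
    url
  FROM
    `httparchive.pages.2022_02_01_mobile`
  WHERE
    REGEXP_CONTAINS(payload, r'NewsMediaOrganization')
),

cms_pages AS (
  -- DISTINCT so a page with several CMS detections is counted once.
  SELECT DISTINCT
    url
  FROM
    `httparchive.technologies.2022_02_01_mobile`
  WHERE
    category = 'CMS'
)

SELECT
  COUNT(0) AS total_news_pages,
  COUNTIF(cms_pages.url IS NOT NULL) AS with_cms,
  COUNTIF(cms_pages.url IS NULL) AS without_cms,
  ROUND(COUNTIF(cms_pages.url IS NOT NULL) * 100 / COUNT(0), 1) AS pct_with_cms
FROM
  news_pages
LEFT JOIN
  cms_pages
ON
  news_pages.url = cms_pages.url
```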
I’m open to any suggestions to improve this analysis. Please leave a reply!