How are news sites published?

rviscomi · March 15, 2022, 8:32pm

Two questions: how can we identify which websites serve news content, and which CMS does each of these sites use?

To identify a news site, we could use heuristics like looking for the “news” keyword somewhere in the URL, title, or meta description. I think it’d be more robust (and language-agnostic) to look at structured data. There’s a NewsMediaOrganization @type that sites can declare in their JSON-LD.

The Structured Data chapter of the Web Almanac implemented a custom metric that extracts the values of different types of metadata, including JSON-LD. There’s an existing query we could use to look for this particular @type, but for simplicity we could just find any website that includes the text anywhere in the data object:

SELECT
  url
FROM
  `httparchive.pages.2022_02_01_mobile`
WHERE
  REGEXP_CONTAINS(payload, r'NewsMediaOrganization')

This gives us a list of about 7,000 URLs that include the NewsMediaOrganization keyword in their data object. That seems very low, so I’d love to hear other suggestions for ways to improve the detection.
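
One possible refinement, sketched here in Python rather than BigQuery so it can be run locally: instead of matching the keyword anywhere in the payload, parse the JSON-LD and check only the @type values. This is an illustration under assumptions, not the Web Almanac's custom metric. The function names `jsonld_types` and `has_type` are hypothetical, and the input is assumed to be the raw text of a single JSON-LD block.

```python
import json


def jsonld_types(node):
    """Recursively collect every @type value found in parsed JSON-LD."""
    found = set()
    if isinstance(node, dict):
        declared = node.get("@type")
        if isinstance(declared, str):
            found.add(declared)
        elif isinstance(declared, list):
            # @type may also be a list of type names.
            found.update(t for t in declared if isinstance(t, str))
        for value in node.values():
            found |= jsonld_types(value)
    elif isinstance(node, list):
        for item in node:
            found |= jsonld_types(item)
    return found


def has_type(jsonld_text, wanted):
    """Return True if the JSON-LD text declares any @type in `wanted`."""
    try:
        data = json.loads(jsonld_text)
    except json.JSONDecodeError:
        # Malformed JSON-LD is treated as "no match" in this sketch.
        return False
    return bool(jsonld_types(data) & set(wanted))
```

Unlike the regex, this would not count a page that only mentions NewsMediaOrganization in a description string, so it can only shrink the list; it addresses false positives, not the low overall count. It also compares bare type names, so a @type written as a full IRI would need extra handling.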

Assuming this list is representative, we could join the URLs with the technologies dataset to see which CMSs are used:

SELECT
  APPROX_TOP_COUNT(app, 50)
FROM
  `httparchive.technologies.2022_02_01_mobile`
WHERE
  url IN (
    SELECT
      url
    FROM
      `httparchive.pages.2022_02_01_mobile`
    WHERE
      REGEXP_CONTAINS(payload, r'NewsMediaOrganization'))
  AND category = 'CMS'

CMS	Pages	% share
WordPress	4,672	93.4%
Arc XP	146	2.9%
Drupal	57	1.1%
Joomla	22	0.4%
DM Polopoly	18	0.4%
Botble CMS	15	0.3%
Adobe Experience Manager	12	0.2%
Craft CMS	10	0.2%
Brightspot	8	0.2%
1C-Bitrix	8	0.2%
Wix	6	0.1%
SPIP	4	0.1%
October CMS	4	0.1%
Sanity	3	0.1%
BoldGrid	3	0.1%
Wagtail	2	0.0%
Squarespace	2	0.0%
DataLife Engine	2	0.0%
Varbase	1	0.0%
TYPO3 CMS	1	0.0%
Prismic	1	0.0%
Methode	1	0.0%
Kentico CMS	1	0.0%
HubSpot CMS Hub	1	0.0%
Total	5,000	100%

The results show that exactly 5,000 of these news sites (70.6%) use a detected CMS. The most popular CMS is WordPress, which is used by 93.4% of the sites that have one.

The inverse stat is also interesting: about 30% of news sites do not use a known CMS. Could these be homegrown CMSs, headless CMSs, or something else entirely?

I’m open to any suggestions to improve this analysis, so please leave a reply!

rviscomi · March 15, 2022, 8:52pm

Added a bit more complexity to the query to generate the % stats and added NewsArticle to the regex, which seems to be the more popular way to denote a news page.

SELECT
  app AS cms,
  COUNT(0) AS pages,
  COUNT(0) / SUM(COUNT(0)) OVER () AS pct,
  COUNTIF(app IS NOT NULL) / SUM(COUNTIF(app IS NOT NULL)) OVER () AS pct_of_cms
FROM (
  SELECT
    url
  FROM
    `httparchive.pages.2022_02_01_mobile`
  WHERE
    REGEXP_CONTAINS(payload, r'(NewsMediaOrganization|NewsArticle)'))
LEFT JOIN (
  SELECT
    url,
    app
  FROM
    `httparchive.technologies.2022_02_01_mobile`
  WHERE
    category = 'CMS')
USING
  (url)
GROUP BY
  cms
ORDER BY
  pages DESC

CMS	Pages	% of total	% of CMS
(none)	12,475	54.1%	0.0%
WordPress	9,176	39.8%	86.6%
Drupal	273	1.2%	2.6%
DNN	189	0.8%	1.8%
Arc XP	157	0.7%	1.5%
Joomla	115	0.5%	1.1%
1C-Bitrix	104	0.5%	1.0%
TYPO3 CMS	93	0.4%	0.9%
Tilda	39	0.2%	0.4%
Duopana	34	0.1%	0.3%
Adobe Experience Manager	33	0.1%	0.3%
Kentico CMS	30	0.1%	0.3%
Plone	29	0.1%	0.3%
Microsoft SharePoint	26	0.1%	0.2%
Craft CMS	25	0.1%	0.2%
SPIP	24	0.1%	0.2%
Sitecore	19	0.1%	0.2%
DM Polopoly	18	0.1%	0.2%
DataLife Engine	18	0.1%	0.2%
Wix	16	0.1%	0.2%
Botble CMS	15	0.1%	0.1%
Contentful	14	0.1%	0.1%
Jalios	13	0.1%	0.1%
October CMS	11	0.0%	0.1%
Liferay	10	0.0%	0.1%
Contao	9	0.0%	0.1%
Wagtail	8	0.0%	0.1%
Brightspot	8	0.0%	0.1%
Sitefinity	8	0.0%	0.1%
ExpressionEngine	7	0.0%	0.1%
MODX	6	0.0%	0.1%
HubSpot CMS Hub	5	0.0%	0.0%
Methode	5	0.0%	0.0%
Dynamicweb	4	0.0%	0.0%
Smartstore Page Builder	4	0.0%	0.0%
Squarespace	4	0.0%	0.0%
Sanity	3	0.0%	0.0%
Megagroup CMS.S3	3	0.0%	0.0%
Thelia	3	0.0%	0.0%
SDL Tridion	3	0.0%	0.0%
BoldGrid	3	0.0%	0.0%
Jahia DX	2	0.0%	0.0%
Neos CMS	2	0.0%	0.0%
Orchard CMS	2	0.0%	0.0%
Pimcore	2	0.0%	0.0%
Django CMS	2	0.0%	0.0%
Contensis	2	0.0%	0.0%
Statamic	2	0.0%	0.0%
SilverStripe	2	0.0%	0.0%
Prismic	2	0.0%	0.0%
Concrete CMS	2	0.0%	0.0%
Bolt CMS	1	0.0%	0.0%
Varbase	1	0.0%	0.0%
Ghost	1	0.0%	0.0%
Proximis Unified Commerce	1	0.0%	0.0%
Roadiz CMS	1	0.0%	0.0%
webEdition	1	0.0%	0.0%
eZ Platform	1	0.0%	0.0%
eZ Publish	1	0.0%	0.0%
Total	23,067	100.0%	100.0%
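
To make the difference between the two percentage columns concrete, here is a minimal Python sketch of the same arithmetic. It is only an illustration: the first two counts are copied from the table above, the third lumps every other CMS together to keep the example short, and the helper name `shares` is hypothetical.

```python
# Page counts per CMS. None stands for rows where the LEFT JOIN found no CMS.
# "all other CMSs" is derived from the table: 23,067 - 12,475 - 9,176 = 1,416.
counts = {None: 12_475, "WordPress": 9_176, "all other CMSs": 1_416}


def shares(counts):
    """Return {cms: (share of all pages, share of pages with a CMS)}."""
    total = sum(counts.values())  # like SUM(COUNT(0)) OVER ()
    # like SUM(COUNTIF(app IS NOT NULL)) OVER ()
    with_cms = sum(n for cms, n in counts.items() if cms is not None)
    rows = {}
    for cms, n in counts.items():
        pct = n / total
        # Pages without a CMS contribute nothing to the CMS-only share.
        pct_of_cms = n / with_cms if cms is not None else 0.0
        rows[cms] = (pct, pct_of_cms)
    return rows


result = shares(counts)
for cms, (pct, pct_of_cms) in result.items():
    print(f"{cms or '(none)'}: {pct:.1%} of total, {pct_of_cms:.1%} of CMS")
```

The WordPress row comes out at 39.8% of all pages but 86.6% of pages with a detected CMS, because the second denominator excludes the 12,475 pages in the (none) row.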

tunetheweb · March 16, 2022, 9:38am

I think our home page-only limitation may be affecting us here: many news websites I’ve checked (Guardian, CNN, NBC News, Fox News) use neither NewsMediaOrganization nor NewsArticle on the home page, though many do on the articles themselves.

tomvangoethem · March 16, 2022, 10:21am

You could try to complement the set of news websites with information from Wikidata. For instance, to get all news websites, I think this query works (disclaimer: I have never queried Wikidata before):

SELECT ?item ?itemLabel ?website
WHERE 
{
  ?item wdt:P31/wdt:P279* wd:Q11032. # Must be a (subclass of) newspaper
  ?item wdt:P856 ?website # get the official website
        
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

That list (9,761 results) probably needs to be filtered, as it contains some odd values; keeping only the URLs without a path seems like an adequate filter. After filtering and matching with sites in HTTP Archive, I get 4,054 unique news sites:

SELECT
  COUNT(DISTINCT url) AS total
FROM
  `<mydb>.temp.newssites`
INNER JOIN
  `httparchive.pages.2022_02_01_mobile`
ON NET.HOST(website) = NET.HOST(url)
WHERE
  REGEXP_CONTAINS(website, r'^https?://[^/]+/?$')
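
The filter-and-match step can be tried out locally with a small Python sketch. This is an approximation under assumptions: `urlsplit(...).hostname` stands in for `NET.HOST` and may differ in edge cases such as letter case, the function names are hypothetical, and the `.example` URLs are made up for illustration.

```python
import re
from urllib.parse import urlsplit

# Same pattern as the REGEXP_CONTAINS filter: scheme and host, optional
# trailing slash, and no path.
NO_PATH = re.compile(r"^https?://[^/]+/?$")


def is_bare_origin(url):
    """True if the URL has no path beyond an optional trailing slash."""
    return NO_PATH.match(url) is not None


def host(url):
    """Host part of the URL (rough stand-in for BigQuery's NET.HOST)."""
    return urlsplit(url).hostname


def matched_urls(websites, crawled_urls):
    """Crawled URLs whose host equals the host of a path-less website."""
    wanted_hosts = {host(w) for w in websites if is_bare_origin(w)}
    return {u for u in crawled_urls if host(u) in wanted_hosts}


# Hypothetical sample data.
websites = [
    "https://news.example/",
    "http://paper.example",
    "https://portal.example/section/news",  # dropped: has a path
]
crawled = [
    "https://news.example/",
    "https://paper.example/",
    "https://portal.example/",
    "https://www.news.example/",
]
matches = matched_urls(websites, crawled)
```

Because the comparison is an exact host match, a site listed as news.example would not match a crawl of www.news.example, so the count of matched sites may understate the real overlap.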
