How are news sites published?

Two questions: how can we identify which websites serve news content, and which CMS does each of these sites use?

To identify a news site, we could use heuristics like the “news” keyword somewhere in the URL, title, or meta description. I think it’d be more robust (and language-agnostic) to look at structured data. There’s a NewsMediaOrganization @type that sites could define in their JSON-LD.
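To make the difference between a keyword match and a real @type check concrete, here is a minimal Python sketch, assuming the JSON-LD block is already available as a string. The helper names `iter_types` and `declares_news_type` are made up for illustration and are not part of any HTTP Archive or Web Almanac tooling.

```python
import json

# Assumed default: only the type mentioned above. Pass a different set to widen it.
DEFAULT_TYPES = frozenset({"NewsMediaOrganization"})


def iter_types(node):
    """Yield every @type value found anywhere in a parsed JSON-LD structure."""
    if isinstance(node, dict):
        types = node.get("@type")
        # @type may be a single string or a list of strings.
        if isinstance(types, str):
            types = [types]
        elif not isinstance(types, list):
            types = []
        for t in types:
            if isinstance(t, str):
                yield t
        # Recurse into nested objects, including @graph arrays.
        for value in node.values():
            yield from iter_types(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_types(item)


def declares_news_type(jsonld_text, wanted=DEFAULT_TYPES):
    """Return True if the JSON-LD text declares one of the wanted @type values."""
    try:
        data = json.loads(jsonld_text)
    except json.JSONDecodeError:
        return False  # malformed JSON-LD is treated as "no match"
    return any(t in wanted for t in iter_types(data))


# Hypothetical inputs for illustration only.
typed = '{"@context": "https://schema.org", "@graph": [{"@type": ["Organization", "NewsMediaOrganization"], "name": "Example"}]}'
mention_only = '{"@type": "WebPage", "description": "Not a NewsMediaOrganization"}'
print(declares_news_type(typed), declares_news_type(mention_only))
```

The second input contains the keyword but does not declare the type, so a plain text match would count it while this check would not.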

The Structured Data chapter of the Web Almanac implemented a custom metric that extracts the values of different types of metadata, including JSON-LD. There’s an existing query we could use to look for this particular @type, but for simplicity we could just find any website that includes the text anywhere in the data object:

  SELECT
    url
  FROM
    `httparchive.pages.2022_02_01_mobile`
  WHERE
    REGEXP_CONTAINS(payload, r'NewsMediaOrganization')

This gives us a list of about 7,000 URLs that include the NewsMediaOrganization keyword in their data object. That seems very low, so I’d love to hear other suggestions for ways to improve the detection.

Assuming this list is representative, we could join the URLs with the technologies dataset to see which CMSs are used:

SELECT
  APPROX_TOP_COUNT(app, 50)
FROM
  `httparchive.technologies.2022_02_01_mobile`
WHERE
  url IN (
    SELECT
      url
    FROM
      `httparchive.pages.2022_02_01_mobile`
    WHERE
      REGEXP_CONTAINS(payload, r'NewsMediaOrganization'))
  AND category = 'CMS'

| CMS | Pages | % share |
|---|---|---|
| WordPress | 4,672 | 93.4% |
| Arc XP | 146 | 2.9% |
| Drupal | 57 | 1.1% |
| Joomla | 22 | 0.4% |
| DM Polopoly | 18 | 0.4% |
| Botble CMS | 15 | 0.3% |
| Adobe Experience Manager | 12 | 0.2% |
| Craft CMS | 10 | 0.2% |
| Brightspot | 8 | 0.2% |
| 1C-Bitrix | 8 | 0.2% |
| Wix | 6 | 0.1% |
| SPIP | 4 | 0.1% |
| October CMS | 4 | 0.1% |
| Sanity | 3 | 0.1% |
| BoldGrid | 3 | 0.1% |
| Wagtail | 2 | 0.0% |
| Squarespace | 2 | 0.0% |
| DataLife Engine | 2 | 0.0% |
| Varbase | 1 | 0.0% |
| TYPO3 CMS | 1 | 0.0% |
| Prismic | 1 | 0.0% |
| Methode | 1 | 0.0% |
| Kentico CMS | 1 | 0.0% |
| HubSpot CMS Hub | 1 | 0.0% |
| Total | 5,000 | 100% |

The results show that exactly 5,000 sites on the list use a known CMS (70.6%). The most popular CMS is WordPress, which is used by 93.4% of these sites.

The inverse stat is also interesting; about 30% of news sites do not use a known CMS. Could these be homegrown CMSs, headless CMSs, or something else entirely?

I’m open to any suggestions to improve this analysis; please leave a reply!

Added a bit more complexity to the query to generate the % stats and added NewsArticle to the regex, which seems to be the more popular way to denote a news page.

SELECT
  app AS cms,
  COUNT(0) AS pages,
  COUNT(0) / SUM(COUNT(0)) OVER () AS pct,
  COUNTIF(app IS NOT NULL) / SUM(COUNTIF(app IS NOT NULL)) OVER () AS pct_of_cms
FROM (
  SELECT
    url
  FROM
    `httparchive.pages.2022_02_01_mobile`
  WHERE
    REGEXP_CONTAINS(payload, r'(NewsMediaOrganization|NewsArticle)'))
LEFT JOIN (
  SELECT
    url,
    app
  FROM
    `httparchive.technologies.2022_02_01_mobile`
  WHERE
    category = 'CMS')
USING
  (url)
GROUP BY
  cms
ORDER BY
  pages DESC

| CMS | Pages | % of total | % of CMS |
|---|---|---|---|
| (none) | 12,475 | 54.1% | 0.0% |
| WordPress | 9,176 | 39.8% | 86.6% |
| Drupal | 273 | 1.2% | 2.6% |
| DNN | 189 | 0.8% | 1.8% |
| Arc XP | 157 | 0.7% | 1.5% |
| Joomla | 115 | 0.5% | 1.1% |
| 1C-Bitrix | 104 | 0.5% | 1.0% |
| TYPO3 CMS | 93 | 0.4% | 0.9% |
| Tilda | 39 | 0.2% | 0.4% |
| Duopana | 34 | 0.1% | 0.3% |
| Adobe Experience Manager | 33 | 0.1% | 0.3% |
| Kentico CMS | 30 | 0.1% | 0.3% |
| Plone | 29 | 0.1% | 0.3% |
| Microsoft SharePoint | 26 | 0.1% | 0.2% |
| Craft CMS | 25 | 0.1% | 0.2% |
| SPIP | 24 | 0.1% | 0.2% |
| Sitecore | 19 | 0.1% | 0.2% |
| DM Polopoly | 18 | 0.1% | 0.2% |
| DataLife Engine | 18 | 0.1% | 0.2% |
| Wix | 16 | 0.1% | 0.2% |
| Botble CMS | 15 | 0.1% | 0.1% |
| Contentful | 14 | 0.1% | 0.1% |
| Jalios | 13 | 0.1% | 0.1% |
| October CMS | 11 | 0.0% | 0.1% |
| Liferay | 10 | 0.0% | 0.1% |
| Contao | 9 | 0.0% | 0.1% |
| Wagtail | 8 | 0.0% | 0.1% |
| Brightspot | 8 | 0.0% | 0.1% |
| Sitefinity | 8 | 0.0% | 0.1% |
| ExpressionEngine | 7 | 0.0% | 0.1% |
| MODX | 6 | 0.0% | 0.1% |
| HubSpot CMS Hub | 5 | 0.0% | 0.0% |
| Methode | 5 | 0.0% | 0.0% |
| Dynamicweb | 4 | 0.0% | 0.0% |
| Smartstore Page Builder | 4 | 0.0% | 0.0% |
| Squarespace | 4 | 0.0% | 0.0% |
| Sanity | 3 | 0.0% | 0.0% |
| Megagroup CMS.S3 | 3 | 0.0% | 0.0% |
| Thelia | 3 | 0.0% | 0.0% |
| SDL Tridion | 3 | 0.0% | 0.0% |
| BoldGrid | 3 | 0.0% | 0.0% |
| Jahia DX | 2 | 0.0% | 0.0% |
| Neos CMS | 2 | 0.0% | 0.0% |
| Orchard CMS | 2 | 0.0% | 0.0% |
| Pimcore | 2 | 0.0% | 0.0% |
| Django CMS | 2 | 0.0% | 0.0% |
| Contensis | 2 | 0.0% | 0.0% |
| Statamic | 2 | 0.0% | 0.0% |
| SilverStripe | 2 | 0.0% | 0.0% |
| Prismic | 2 | 0.0% | 0.0% |
| Concrete CMS | 2 | 0.0% | 0.0% |
| Bolt CMS | 1 | 0.0% | 0.0% |
| Varbase | 1 | 0.0% | 0.0% |
| Ghost | 1 | 0.0% | 0.0% |
| Proximis Unified Commerce | 1 | 0.0% | 0.0% |
| Roadiz CMS | 1 | 0.0% | 0.0% |
| webEdition | 1 | 0.0% | 0.0% |
| eZ Platform | 1 | 0.0% | 0.0% |
| eZ Publish | 1 | 0.0% | 0.0% |
| Total | 23,067 | 100.0% | 100.0% |
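
For anyone who wants to see how the two percentage columns relate, here is a small Python sketch of the same idea: a left join that keeps sites without a detected CMS, followed by a share of all pages and a share of CMS pages only. The URLs and the `cms_shares` helper are hypothetical toy values, not HTTP Archive data, and the sketch assumes at most one CMS per URL, which the real technologies table does not guarantee.

```python
from collections import Counter

# Hypothetical toy inputs, not real HTTP Archive data.
news_urls = [
    "https://a.example/",
    "https://b.example/",
    "https://c.example/",
    "https://d.example/",
]
cms_by_url = {
    "https://a.example/": "WordPress",
    "https://b.example/": "WordPress",
    "https://c.example/": "Drupal",
}


def cms_shares(urls, cms_lookup):
    """Mimic the LEFT JOIN: URLs without a detected CMS are counted under None."""
    counts = Counter(cms_lookup.get(url) for url in urls)
    total = sum(counts.values())
    with_cms = sum(n for cms, n in counts.items() if cms is not None)
    rows = []
    for cms, pages in counts.most_common():
        pct = pages / total  # share of all news pages
        # Share among pages with a detected CMS; the None row gets 0.
        pct_of_cms = pages / with_cms if cms is not None and with_cms else 0.0
        rows.append((cms, pages, pct, pct_of_cms))
    return rows


rows = cms_shares(news_urls, cms_by_url)
for cms, pages, pct, pct_of_cms in rows:
    print(f"{cms or '(none)'}: {pages} pages, {pct:.1%} of total, {pct_of_cms:.1%} of CMS")
```

The first share divides by every matched page, while the second divides only by pages with a detected CMS, which is why the (none) row shows 0.0% in the last column.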

I think our homepage-only limitation may be affecting us here: many news websites I’ve checked (The Guardian, CNN, NBC News, Fox News) use neither NewsMediaOrganization nor NewsArticle on the home page, though many do on the articles themselves.

You could try to complement the set of news websites with information from Wikidata. For instance, to get all newspaper websites, I think this query works (disclaimer: I’ve never queried Wikidata before):

SELECT ?item ?itemLabel ?website
WHERE 
{
  ?item wdt:P31/wdt:P279* wd:Q11032. # Must be a (subclass of) newspaper
  ?item wdt:P856 ?website # get the official website
        
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

That list (9,761 results) probably needs to be filtered, as it contains some odd values; keeping only the URLs without a path seems like an adequate filter. After filtering and matching with sites in HTTP Archive, I get 4,054 unique news sites:

SELECT
  COUNT(DISTINCT url) AS total
FROM
  `<mydb>.temp.newssites`
INNER JOIN
  `httparchive.pages.2022_02_01_mobile`
ON
  NET.HOST(website) = NET.HOST(url)
WHERE
  REGEXP_CONTAINS(website, r'^https?://[^/]+/?$')
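
To check the filter and the host matching on a few rows before running them in BigQuery, something like the following Python sketch might help. It reuses the same regular expression; `host` is only a rough stand-in for `NET.HOST` based on `urllib.parse`, and the function names and sample URLs are made up for illustration.

```python
import re
from urllib.parse import urlsplit

# Same pattern as the SQL filter: scheme, host, optional trailing slash, no path.
NO_PATH = re.compile(r"^https?://[^/]+/?$")


def host(url):
    """Rough stand-in for NET.HOST: the lowercased host name of a URL."""
    return (urlsplit(url).hostname or "").lower()


def match_news_sites(wikidata_websites, crawled_urls):
    """Return crawled URLs whose host equals the host of a path-less Wikidata website."""
    news_hosts = {host(w) for w in wikidata_websites if NO_PATH.match(w)}
    news_hosts.discard("")  # ignore entries where no host could be parsed
    return sorted({u for u in crawled_urls if host(u) in news_hosts})


# Hypothetical sample rows for illustration only.
websites = [
    "https://www.example-news.test/",
    "http://paper.test",
    "https://portal.test/section/paper",  # has a path, so it is filtered out
]
crawled = [
    "https://www.example-news.test/",
    "https://paper.test/",
    "https://portal.test/",
]
matched = match_news_sites(websites, crawled)
print(matched)
```

One thing this makes visible: the comparison is an exact host match, so in this sketch `example.test` and `www.example.test` would not be treated as the same site.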