Two questions: how can we identify which websites serve news content, and which CMS does each of these sites use?
To identify a news site, we could use heuristics like the “news” keyword somewhere in the URL, title, or meta description. I think it’d be more robust (and language-agnostic) to look at structured data. There’s a NewsMediaOrganization @type that sites could define in their JSON-LD.
The Structured Data chapter of the Web Almanac implemented a custom metric that extracts the values of different types of metadata, including JSON-LD. There’s an existing query we could use to look for this particular @type, but for simplicity we could just find any website that includes the text NewsMediaOrganization anywhere in the data object:
SELECT url FROM `httparchive.pages.2022_02_01_mobile` WHERE REGEXP_CONTAINS(payload, r'NewsMediaOrganization')
This gives us a list of about 7,000 URLs that include the NewsMediaOrganization keyword in their data object. That seems very low, so I’d love to hear other suggestions for ways to improve the detection.
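One possible way to tighten the detection is to parse the JSON-LD and check the @type values directly, rather than matching the text anywhere in the payload. Here is a minimal Python sketch of that idea. It assumes the JSON-LD blocks have already been extracted from each page as raw strings; the function names `declares_type` and `is_news_site` are made up for illustration and are not part of any HTTP Archive tooling.

```python
import json

TARGET_TYPE = "NewsMediaOrganization"


def declares_type(node, target=TARGET_TYPE):
    """Return True if any object in a parsed JSON-LD tree has @type == target."""
    if isinstance(node, dict):
        types = node.get("@type", [])
        # @type may be a single string or a list of strings.
        if isinstance(types, str):
            types = [types]
        if isinstance(types, list) and target in types:
            return True
        # Recurse into nested values such as @graph or publisher.
        return any(declares_type(value, target) for value in node.values())
    if isinstance(node, list):
        return any(declares_type(item, target) for item in node)
    return False


def is_news_site(jsonld_blocks):
    """Hypothetical helper: jsonld_blocks is a list of raw JSON-LD strings from one page."""
    for raw in jsonld_blocks:
        try:
            parsed = json.loads(raw)
        except (TypeError, ValueError):
            continue  # skip malformed or missing blocks
        if declares_type(parsed):
            return True
    return False
```

Unlike the plain text match, this ignores pages that only mention the keyword inside a description or other string value. It would not by itself raise the count, so it addresses precision rather than the low number of matches.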
Assuming this list is representative, we could join the URLs with the technologies dataset to see which CMSs are used:
SELECT APPROX_TOP_COUNT(app, 50) FROM `httparchive.technologies.2022_02_01_mobile` WHERE url IN ( SELECT url FROM `httparchive.pages.2022_02_01_mobile` WHERE REGEXP_CONTAINS(payload, r'NewsMediaOrganization')) AND category = 'CMS'
| CMS | Sites | Percent |
| --- | --- | --- |
| Adobe Experience Manager | 12 | 0.2% |
| HubSpot CMS Hub | 1 | 0.0% |
The results show that about 5,000 sites use a CMS (70.6%). The most popular CMS is WordPress, which is used by 93.4% of these sites.
The inverse stat is also interesting; about 30% of news sites do not use a known CMS. Could these be homegrown CMSs, headless CMSs, or something else entirely?
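As a rough sanity check on these figures, here is a small Python sketch that recomputes the shares. It assumes the percentages in the table are relative to the 5,000 or so CMS sites, which isn't stated explicitly above; the `share` helper is hypothetical.

```python
def share(count, total):
    """Percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Assumption: approximate number of news sites with a detected CMS.
CMS_SITES = 5000

adobe = share(12, CMS_SITES)         # Adobe Experience Manager row
hubspot = share(1, CMS_SITES)        # HubSpot CMS Hub row
no_known_cms = round(100 - 70.6, 1)  # inverse of the 70.6% CMS share

print(adobe, hubspot, no_known_cms)  # prints: 0.2 0.0 29.4
```

Under that assumption the two table rows match the reported percentages, and the inverse share comes out at 29.4%, in line with the "about 30%" above.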
I’m open to any suggestions to improve this analysis, so please leave a reply!