Higher level stats for structured data

I looked up the Structured Data chapter to cite data on what % of sites use structured data at all, and what % specify any type of metadata like description, image etc, but it’s currently far more granular than that, missing the forest for the trees a bit.

Higher level stats I think would be useful (and have needed myself):

  • Here there is a graph that shows what % of sites use each structured data tech. What is the union? How many websites use structured data? One would expect the chapter to start with this.
  • Same with social media sharing metadata (OG, Twitter, etc.). There is a detailed breakdown, but no stats on the union: how many websites specify any social media sharing metadata? This would be very useful for any API that consumes URLs and fetches metadata. This could also be broken down by type of metadata: image, description, etc. (title wouldn’t be that useful, because technically nearly every website specifies a <title>)
  • Individual technologies are broken down by type of data (e.g. JSON-LD) but it would be more useful to see this across all structured data technologies: what types of data (entities, properties) do websites need to specify most frequently?

A significant issue is that, at that time, the report was based only on home page data, meaning it missed the majority of pages that typically have structured data on them.

I believe the crawl does now include more than the home pages, but even then, it would be a limited snapshot to work from.

I’m aware (I’ve been chapter lead and co-analyst for the CSS chapter in the 2020 Almanac), but that’s a confound that applies to all stats.

Seeing how many homepages have these is still quite useful, especially since there are services (e.g. iframely) that just give up if no structured data is present.

I took a quick look at the first question, using the latest (October 2023) dataset and considering a page as having structured data if any of the detections in the rendered.present object are positive. I count a site as having structured data if either its home page or its (optional) secondary page has a positive detection.

Edit: there’s a bug in this query. See @tunetheweb’s reply below for the correct version.

CREATE TEMP FUNCTION HAS_STRUCTURED_DATA(data STRING) RETURNS BOOLEAN LANGUAGE js AS '''
try {
  return Object.values(data).some(i => i);
} catch {
  return false;
}
''';

WITH sd AS (
  SELECT
    client,
    root_page,
    JSON_QUERY(custom_metrics, '$.structured-data.structured_data.rendered.present') AS data,
    HAS_STRUCTURED_DATA(JSON_QUERY(custom_metrics, '$.structured-data.structured_data.rendered.present')) AS has_structured_data
  FROM
    `httparchive.all.pages`
  WHERE
    date = '2023-10-01'
)

SELECT
  client,
  COUNT(DISTINCT IF(has_structured_data, root_page, NULL)) / COUNT(DISTINCT root_page) AS pct
FROM
  sd
GROUP BY
  client

Results:

client pct
mobile 99.7%
desktop 99.8%

This seems high, considering that no individual structured data type is found on more than 62% of home pages (according to the 2022 chapter), but it is not impossible if, for example, sites tend to use exclusively either RDFa or Open Graph.

I seem to recall something really common “technically” counted as one of the structured data types.

Something like a meta tag with a property value was technically RDFa, or something like that.

That might explain why the number is so high.

Please fix this issue.

The RDFa check for 2022 is for these attributes: [vocab],[typeof],[property],[resource],[prefix].

I’d classify RDFa as structured data, but maybe meta tags could be excluded. A modification to Rick’s HAS_STRUCTURED_DATA function could do that. Or even a query grouped by types.
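A rough sketch of what a breakdown grouped by type could compute, written as plain JavaScript rather than a BigQuery query. It assumes each page yields the rendered.present object as a JSON string of booleans; the sample rows and the "rdfa" key are made up for illustration, and the helper name countByType is hypothetical.

```javascript
// Hypothetical sample rows: one JSON string per page, or null when the
// custom metric produced nothing for that page.
const samplePages = [
  '{"json_ld":true,"microdata":false,"rdfa":true}',
  '{"json_ld":false,"microdata":false,"rdfa":true}',
  '{"json_ld":false,"microdata":false,"rdfa":false}',
  null,
];

// Count, per structured data type, how many pages report a positive detection.
function countByType(rows) {
  const counts = {};
  for (const row of rows) {
    let present;
    try {
      present = JSON.parse(row);
    } catch {
      continue; // skip rows that are not valid JSON
    }
    if (present === null || typeof present !== 'object') continue;
    for (const [type, found] of Object.entries(present)) {
      counts[type] = (counts[type] || 0) + (found ? 1 : 0);
    }
  }
  return counts;
}

console.log(countByType(samplePages));
```

Dividing each count by the number of pages would give per-type percentages that can be compared directly against the union figure.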

It would be interesting to see the current home page numbers compared to secondary pages. Maybe there has been a big growth in OG tags.

Thank you Rick for looking into it! There is no way 99.8% could be accurate; we should dig in further to see what’s gaming it.

If this is actually what is being tested (we should check!):

The RDFa check for 2022 is for these attributes: [vocab],[typeof],[property],[resource],[prefix]

We could look into what attribute-value pair is most popular, and hopefully that should show us what the problem is straight away (and is an interesting stat in itself).

I’d classify RDFa as structured data, but maybe meta tags could be excluded.

RDFa is absolutely structured data! But in RDFa you can use a <meta> for any data you don’t want to visually render, so excluding them would exclude a ton of true positives. I think it’s probably a very specific type of metadata that is messing with this, so digging in further will reveal it.

A lot of OG implementations use a meta tag with the ‘property’ attribute, which looks like it would register as RDFa.

It would register as RDFa because it is RDFa. But per the chart above, OG is used on 59% of websites, not 99.8%. 99.8% is absurd; it’s even higher than the share of websites that use an <html>, <head>, <body> or <title> element!

Agreed.

I found a bug in Rick’s original query above. We need to JSON.parse the string or it’s just an array of letters and returns true for any non-null values.

I.e.

{"json_ld":false,"microdata":false...}

is treated by Object.values as the following array of single-character strings:

[{,",j,s,o,n,_,l,d,",:,f,a,l,s,e,,,",m,i,c,r,o,d,a,t,a,",:,f,a,l,s,e,...}]

which always returns True for some(i => i) except when the string is null (i.e. when the custom metric failed completely).
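The behaviour can be reproduced outside BigQuery with a few lines of plain JavaScript. This is only an illustration: the sample payloads complete the truncated example above with an assumed "rdfa" key, and the two function names are hypothetical wrappers around the UDF bodies from this thread.

```javascript
// Sample payloads; the "rdfa" key is an assumption for illustration.
const allFalse = '{"json_ld":false,"microdata":false,"rdfa":false}';
const oneTrue = '{"json_ld":true,"microdata":false,"rdfa":false}';

// Original logic: Object.values() on a string yields its characters,
// and every single-character string is truthy.
function hasStructuredDataBuggy(data) {
  try {
    return Object.values(data).some(i => i);
  } catch {
    return false;
  }
}

// Parsing first means Object.values() sees the booleans themselves.
function hasStructuredDataFixed(data) {
  try {
    return Object.values(JSON.parse(data)).some(i => i);
  } catch {
    return false;
  }
}

console.log(hasStructuredDataBuggy(allFalse)); // true (false positive)
console.log(hasStructuredDataFixed(allFalse)); // false
console.log(hasStructuredDataFixed(oneTrue));  // true
console.log(hasStructuredDataBuggy(null));     // false (Object.values(null) throws)
```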

Using JSON.parse to convert the passed-in string back to an object lets Object.values work properly:

CREATE TEMP FUNCTION HAS_STRUCTURED_DATA(data STRING) RETURNS BOOLEAN LANGUAGE js AS '''
try {
  return Object.values(JSON.parse(data)).some(i => i);
} catch {
  return false;
}
''';

WITH sd AS (
  SELECT
    client,
    root_page,
    JSON_QUERY(custom_metrics, '$.structured-data.structured_data.rendered.present') AS data,
    HAS_STRUCTURED_DATA(JSON_QUERY(custom_metrics, '$.structured-data.structured_data.rendered.present')) AS has_structured_data
  FROM
    `httparchive.all.pages`
  WHERE
    date = '2023-10-01'
)


SELECT
  client,
  COUNT(DISTINCT IF(has_structured_data, root_page, NULL)) / COUNT(DISTINCT root_page) AS pct
FROM
  sd
GROUP BY
  client

It gives the following results, which seem much more reasonable and in keeping with the Web Almanac’s split-out results:

client pct
desktop 71.8%
mobile 72.6%
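For readers less familiar with the COUNT(DISTINCT IF(...)) / COUNT(DISTINCT ...) pattern, here is a small JavaScript sketch of the same site-level rollup: a site counts once per client, and it counts as positive if any of its crawled pages is positive. The rows and URLs are invented sample data, and the function name is hypothetical.

```javascript
// Hypothetical rows mirroring the sd CTE: one row per crawled page.
const sampleRows = [
  { client: 'mobile', root_page: 'https://a.example/', has_structured_data: false }, // home page
  { client: 'mobile', root_page: 'https://a.example/', has_structured_data: true },  // secondary page
  { client: 'mobile', root_page: 'https://b.example/', has_structured_data: false },
  { client: 'desktop', root_page: 'https://a.example/', has_structured_data: true },
];

// Share of distinct sites per client with at least one positive page.
function pctSitesWithStructuredData(rows) {
  const allSites = new Map(); // client -> Set of every root_page seen
  const hitSites = new Map(); // client -> Set of root_pages with a positive page
  for (const { client, root_page, has_structured_data } of rows) {
    if (!allSites.has(client)) {
      allSites.set(client, new Set());
      hitSites.set(client, new Set());
    }
    allSites.get(client).add(root_page);
    if (has_structured_data) hitSites.get(client).add(root_page);
  }
  const result = {};
  for (const [client, sites] of allSites) {
    result[client] = hitSites.get(client).size / sites.size;
  }
  return result;
}

console.log(pctSitesWithStructuredData(sampleRows)); // { mobile: 0.5, desktop: 1 }
```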

Still trying to recall what that really common thing was that I was thinking about, and whether we fixed it or not… <meta... property="..."> might have been it (in which case it’s still there, and it’s correct that it is), but I thought it was something even more common than that (most regularly used <meta> elements use name and content attributes rather than property).

D’oh, good catch @tunetheweb.

Yoast on a home page: [image]

It is RDFa. Maybe we need to separate out simple RDFa meta tags from RDFa structured data.

The script does capture rdfa_vocabs, rdfa_prefixes and rdfa_typeofs. Maybe their presence could be used to determine if it is structured.
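One possible way to express that heuristic, as a sketch only: treat a page as having "structured" RDFa when any of rdfa_vocabs, rdfa_prefixes or rdfa_typeofs is non-empty. The field names come from the script mentioned above, but their shape is an assumption here (each is taken to be an array of values), and the function name is hypothetical.

```javascript
// Assumption: each captured field is an array of the values found on the page.
const RDFA_STRUCTURE_FIELDS = ['rdfa_vocabs', 'rdfa_prefixes', 'rdfa_typeofs'];

// True when at least one of the fields is a non-empty array.
function hasStructuredRdfa(rendered) {
  return RDFA_STRUCTURE_FIELDS.some(
    field => Array.isArray(rendered?.[field]) && rendered[field].length > 0
  );
}

// A page with only <meta property="..."> tags would have all three empty.
console.log(hasStructuredRdfa({ rdfa_vocabs: [], rdfa_prefixes: [], rdfa_typeofs: [] })); // false
console.log(hasStructuredRdfa({ rdfa_vocabs: [], rdfa_prefixes: [], rdfa_typeofs: ['Product'] })); // true
```

Whether this split matches how people think about "real" structured data would still need checking against actual pages.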

In terms of actual types of metadata, it may be useful to have a query for description, with the varying ways of specifying it (e.g. <meta name=description>, OG, twitter:description etc), and same for image.
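As a sketch of how such a query might define "has a description" or "has an image", the snippet below checks a list of equivalent keys in order and reports the first one that is set. The input shape (a flat map from meta name/property to content) and the function name are assumptions for illustration; the candidate lists are a starting point rather than an exhaustive set.

```javascript
// Candidate keys per metadata kind, checked in order.
const METADATA_CANDIDATES = {
  description: ['description', 'og:description', 'twitter:description'],
  image: ['og:image', 'twitter:image'],
};

// Hypothetical input: { 'og:description': '...', 'description': '...' }.
// Returns the first non-empty value and where it came from, or null.
function pickMetadata(meta, kind) {
  for (const key of METADATA_CANDIDATES[kind] || []) {
    const value = meta[key];
    if (typeof value === 'string' && value.trim() !== '') {
      return { source: key, value: value.trim() };
    }
  }
  return null;
}

const sampleMeta = { 'og:description': 'A demo page', 'og:image': 'https://a.example/card.png' };
console.log(pickMetadata(sampleMeta, 'description')); // { source: 'og:description', value: 'A demo page' }
console.log(pickMetadata({}, 'image')); // null
```

Counting pages where the result is non-null would give the union stat, and grouping by source would show which mechanism supplies it.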
