Sampling JSON-LD

Researching the state of structured data on the web in preparation for an upcoming Lighthouse audit. Came up with this query to extract a sample of JSON-LD snippets from the top sites:

SELECT
  rank,
  url,
  data
FROM (
  SELECT
    url,
    JSON_EXTRACT(
      REGEXP_EXTRACT(body,
      '(?i)<script type=[\'"]?application/ld\\+json[\'"]?>(.*)</script>'), '$') AS data
  FROM
    `httparchive.response_bodies.2018_04_15_desktop`
  WHERE
    body LIKE '%application/ld+json%')
JOIN
  `httparchive.summary_pages.2018_04_15_desktop`
USING
  (url)
WHERE
  data IS NOT NULL AND
  rank IS NOT NULL
ORDER BY
  rank
LIMIT
  100

Warning: running queries like this that process response bodies will consume your entire free monthly quota (1 TB)

The results are in this sheet.

Using this sample, we can see what the popular JSON-LD properties are and what kinds of validation would be most impactful in a Lighthouse audit.

1 Like