Researching the state of structured data on the web in preparation for an upcoming Lighthouse audit. Came up with this query to extract a sample of JSON-LD snippets from the top sites:
SELECT
rank,
url,
data
FROM (
SELECT
url,
JSON_EXTRACT(
REGEXP_EXTRACT(body,
'(?i)<script type=[\'"]?application/ld\\+json[\'"]?>(.*)</script>'), '$') AS data
FROM
`httparchive.response_bodies.2018_04_15_desktop`
WHERE
body LIKE '%application/ld+json%')
JOIN
`httparchive.summary_pages.2018_04_15_desktop`
USING
(url)
WHERE
data IS NOT NULL AND
rank IS NOT NULL
ORDER BY
rank
LIMIT
100
Warning: running queries like this that process response bodies will consume your entire free monthly quota (1 TB)
The results are in this sheet.
Using this sample, we can see what the popular JSON-LD properties are and what kinds of validation would be most impactful in a Lighthouse audit.