By ask on twitter https://twitter.com/briankardell/status/1100081446528708611
I ran
#standardSQL
SELECT * FROM (
SELECT
COUNT(*) AS num,
REGEXP_EXTRACT(r.body, r"(?i)(<[a-z]+-[a-z0-9_-]*\s+(?:[^>/]|/[^>])+>)") AS match
FROM
`httparchive.response_bodies.2019_02_01_desktop` AS r
WHERE
page = url
GROUP BY
match
ORDER BY
num DESC
) WHERE match IS NOT NULL
See https://gist.github.com/zcorpan/56d1040e1afaa883b610c342c2e7a437
1 Like
I’m not skilled with this and it seems potentially pricey to try to learn on the fly but would it make sense to do
#standardSQL
SELECT * FROM (
SELECT
COUNT(*) AS frequency,
REGEXP_EXTRACT(r.body, r"(?i)(<[a-z]+-[a-z0-9_-]*\s+(?:[^>/]|/[^>])+>)") AS match,
COUNT(DISTINCT url) AS urls
FROM
`httparchive.response_bodies.2019_02_01_desktop` AS r
WHERE
page = url
GROUP BY
match
ORDER BY
frequency DESC
) WHERE match IS NOT NULL
So that this matches the same basic format/data of better one in Use of HTML elements ? Maybe not entirely as this needs further processing still to get just the tag name and join up the counts… It feels like these two ‘ideas’ tho would be good ones to run/publish every so often in a place that is easy to find/discuss.