Can’t resist.
I took the list of deprecated elements and created a regex pattern (with the list selected in DevTools as `$0`): `Array.from($0.querySelectorAll('dfn code')).map(e => e.innerText).join('|')`, which results in: `"applet|acronym|bgsound|dir|noframes|isindex|keygen|listing|menuitem|nextid|noembed|plaintext|rb|rtc|strike|xmp|basefont|big|blink|center|font|multicol|nobr|spacer|tt|marquee"`.
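As a standalone sketch of that step, assuming the element names are already in a plain array rather than read from the page (the names `deprecatedTags` and `buildTagPattern` are hypothetical, not part of the console snippet), the same alternation can be built and given the `<` prefix like this:

```javascript
// Hypothetical shortened sample; the real list has 26 names.
const deprecatedTags = ['applet', 'acronym', 'bgsound', 'marquee'];

// Join the names into an alternation and anchor it on "<", case-insensitively.
function buildTagPattern(tags) {
  return new RegExp('<(' + tags.join('|') + ')', 'gi');
}

const pattern = buildTagPattern(deprecatedTags);
const html = '<MARQUEE>hi</MARQUEE><p>text</p><applet code="x"></applet>';

// Closing tags start with "</", so only the opening tags are captured.
const found = Array.from(html.matchAll(pattern), m => m[1].toLowerCase());
console.log(found); // [ 'marquee', 'applet' ]
```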
I’m matching elements if the tag name is preceded by a `<`. Not super robust, but a good place to start.
    #standardSQL
    SELECT
      LOWER(tag) AS tag,
      COUNT(0) AS frequency,
      COUNT(DISTINCT url) AS urls
    FROM (
      SELECT
        url,
        REGEXP_EXTRACT_ALL(body, '(?i)<(applet|acronym|bgsound|dir|noframes|isindex|keygen|listing|menuitem|nextid|noembed|plaintext|rb|rtc|strike|xmp|basefont|big|blink|center|font|multicol|nobr|spacer|tt|marquee)') AS tags
      FROM
        `httparchive.response_bodies.2018_07_01_desktop`),
      UNNEST(tags) AS tag
    GROUP BY
      tag
    ORDER BY
      frequency DESC
WARNING: this query processes 2.5 TB!
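Given that cost, it may be worth sanity-checking the logic on a handful of pages first. Here is a minimal sketch of the same aggregation in JavaScript, assuming a hypothetical in-memory `pages` array standing in for the `url` and `body` columns, with a shortened tag list for brevity:

```javascript
// Hypothetical sample rows standing in for the url/body columns.
const pages = [
  { url: 'https://a.example/', body: '<font><FONT><center>' },
  { url: 'https://b.example/', body: '<font size="2"><p>' },
];

// Shortened pattern; same shape as the one in the query.
const pattern = /<(font|center|marquee)/gi;

// tag -> { frequency, urls }, mirroring COUNT(0) and COUNT(DISTINCT url).
const stats = new Map();
for (const { url, body } of pages) {
  for (const match of body.matchAll(pattern)) {
    const tag = match[1].toLowerCase();
    if (!stats.has(tag)) stats.set(tag, { frequency: 0, urls: new Set() });
    const entry = stats.get(tag);
    entry.frequency += 1;
    entry.urls.add(url);
  }
}

// Sort by frequency, descending, like ORDER BY frequency DESC.
const rows = [...stats.entries()]
  .map(([tag, { frequency, urls }]) => ({ tag, frequency, urls: urls.size }))
  .sort((a, b) => b.frequency - a.frequency);

console.log(rows);
```

With this sample, `font` is counted three times across two URLs and `center` once on one URL.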
| tag | frequency | urls |
|---|---|---|
| font | 5143594 | 223473 |
| center | 1070270 | 217903 |
| tt | 333876 | 41528 |
| nobr | 225042 | 30122 |
| big | 101907 | 18438 |
| strike | 98386 | 12867 |
| menuitem | 78198 | 9896 |
| xmp | 78162 | 62446 |
| plaintext | 65701 | 57276 |
| marquee | 49669 | 25423 |
| rb | 41385 | 4954 |
| dir | 32802 | 4690 |
| basefont | 27476 | 646 |
| acronym | 25012 | 2687 |
| applet | 12842 | 5128 |
| blink | 11097 | 4689 |
| noframes | 9850 | 4700 |
| spacer | 8141 | 688 |
| noembed | 7659 | 852 |
| bgsound | 5692 | 781 |
| listing | 2886 | 640 |
| rtc | 230 | 53 |
| nextid | 193 | 165 |
| keygen | 109 | 40 |
| isindex | 73 | 64 |
| multicol | 19 | 16 |
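One caveat when reading these numbers: matching on `<` plus the name also matches longer tag names that merely begin with a deprecated name. Below is a hedged sketch of a stricter pattern, assuming a lookahead that requires whitespace, `/`, or `>` right after the name. The sample markup, including `<fontawesome-icon>` and `<bigger>`, is made up for illustration, and the sketch does not measure how much this affects the counts above.

```javascript
// Loose: same shape as the query's pattern (shortened tag list).
const loose = /<(font|big|dir)/gi;
// Strict: require whitespace, "/" or ">" right after the tag name.
const strict = /<(font|big|dir)(?=[\s\/>])/gi;

// Made-up markup mixing real deprecated tags with lookalike names.
const html = '<font color="red"><fontawesome-icon><big><bigger><dir>';

const names = (re) => Array.from(html.matchAll(re), m => m[1].toLowerCase());

console.log(names(loose));  // [ 'font', 'font', 'big', 'big', 'dir' ]
console.log(names(strict)); // [ 'font', 'big', 'dir' ]
```

BigQuery's regex engine (RE2) does not support lookaheads, so in the query the trailing character would have to be consumed by a character class placed after the capturing group instead.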
Anyone else want to look into the deprecated attributes?
Edit: @HenriHelvetica pointed out that the original query didn’t include marquee. I’ve rerun the query and updated the table.