Use of HTML elements

I took rviscomi’s query from “Use of deprecated HTML features on the web?” (post #2) and also included the elements from the HTML standard’s element index.

#standardSQL
SELECT
  LOWER(tag) AS tag,
  COUNT(0) AS frequency,
  COUNT(DISTINCT url) AS urls
FROM (
  SELECT
    url,
    REGEXP_EXTRACT_ALL(body,
     r'(?i)<(a|abbr|address|area|article|aside|audio|b|base|bdi|bdo|blockquote|body|br|button|canvas|caption|cite|code|col|colgroup|data|datalist|dd|del|details|dfn|dialog|div|dl|dt|em|embed|fieldset|figcaption|figure|footer|form|h1|h2|h3|h4|h5|h6|head|header|hgroup|hr|html|i|iframe|img|input|ins|kbd|label|legend|li|link|main|map|mark|math|menu|meta|meter|nav|noscript|object|ol|optgroup|option|output|p|param|picture|pre|progress|q|rp|rt|ruby|s|samp|script|section|select|slot|small|source|span|strong|style|sub|summary|sup|svg|table|tbody|td|template|textarea|tfoot|th|thead|time|title|tr|track|u|ul|var|video|wbr|applet|acronym|bgsound|dir|noframes|isindex|keygen|listing|menuitem|nextid|noembed|plaintext|rb|rtc|strike|xmp|basefont|big|blink|center|font|multicol|nobr|spacer|tt|marquee)(?:\s|/?>)') AS tags
  FROM
    `httparchive.response_bodies.2018_08_01_desktop`
  WHERE
    page = url),
  UNNEST(tags) AS tag
GROUP BY
  tag
ORDER BY
  frequency DESC

Result:

tag frequency urls
div 220195805 1211109
a 199458214 1200656
span 114503663 1112859
li 103969319 1058955
img 43772141 1159543
br 37696955 919076
p 34059594 1049666
script 28445225 1219710
td 21821910 379919
ul 20357473 1058381
option 19794469 225487
meta 17181194 1243561
link 15729968 1203501
i 14679869 507695
input 11293319 963068
tr 9006510 381240
strong 8277464 533532
h2 7810122 824400
h3 7713738 687682
b 6416492 365688
h4 3979458 391438
label 3559358 465348
table 3239199 383133
button 2943780 528577
font 2824141 114445
svg 2781271 174392
style 2744027 771295
section 2735708 355809
article 2655492 239268
dd 2460003 128956
em 2363419 181172
form 2173133 913408
h1 2141903 792153
noscript 2059937 657878
html 1801751 1254018
header 1778951 594533
time 1639966 100274
dt 1628304 137094
title 1546053 1243039
body 1340941 1251391
figure 1313226 90936
head 1310642 1258649
dl 1262478 144319
h5 1187033 139380
iframe 1168314 480719
hr 1130165 213642
footer 1129654 589297
nav 1116152 522793
th 1103263 94127
small 1081972 126187
source 728478 43378
aside 699185 220425
select 682147 243226
tbody 677575 169824
center 647241 126940
h6 554572 72875
abbr 512469 67101
sup 499779 45797
ins 488111 145038
u 470117 71011
ol 427367 122008
blockquote 333552 80347
picture 286900 22599
figcaption 284306 28687
fieldset 272000 108874
area 260413 22924
code 213435 9862
textarea 188809 108455
video 169210 58101
cite 167805 16468
dfn 160542 3198
main 157028 146292
pre 152872 24701
param 147404 19957
nobr 125292 10118
base 117838 114609
del 111279 8155
thead 102634 45434
address 85313 49030
s 74612 9097
optgroup 71642 10139
wbr 71276 3749
legend 70593 31788
col 70233 7857
sub 61539 6689
big 58883 8717
strike 57023 5489
kbd 55335 26240
object 50800 25941
map 49969 25427
hgroup 35623 15428
var 33507 1307
embed 33089 18822
caption 32155 24192
summary 26044 5249
marquee 25276 15068
menu 22740 6459
canvas 21707 8300
colgroup 19611 8885
template 19535 3054
bdi 19211 563
q 18785 3234
audio 18748 5507
tt 16842 1652
tfoot 13625 10410
mark 11541 2277
details 8623 3585
samp 7889 450
acronym 7810 1588
blink 5461 1554
rp 5388 161
spacer 4158 255
rt 3731 254
ruby 3726 264
noframes 3481 3052
bdo 3025 267
rb 1987 146
data 1958 286
meter 1422 85
menuitem 1334 52
output 1284 380
xmp 1181 176
slot 1090 210
plaintext 1075 960
progress 1029 557
dialog 831 611
applet 812 710
math 758 41
datalist 689 503
track 583 322
basefont 561 233
dir 546 122
bgsound 349 325
noembed 270 199
nextid 8 1
multicol 5 2
listing 2 2

The number of pages is:

#standardSQL
SELECT COUNT(0) AS num FROM `httparchive.response_bodies.2018_08_01_desktop` WHERE page = url

→ 1294654

Ordering by urls and adding a percentage column (urls as a share of the 1294654 pages):

tag frequency urls percent
head 1310642 1258649 97.22
html 1801751 1254018 96.86
body 1340941 1251391 96.66
meta 17181194 1243561 96.05
title 1546053 1243039 96.01
script 28445225 1219710 94.21
div 220195805 1211109 93.55
link 15729968 1203501 92.96
a 199458214 1200656 92.74
img 43772141 1159543 89.56
span 114503663 1112859 85.96
li 103969319 1058955 81.79
ul 20357473 1058381 81.75
p 34059594 1049666 81.08
input 11293319 963068 74.39
br 37696955 919076 70.99
form 2173133 913408 70.55
h2 7810122 824400 63.68
h1 2141903 792153 61.19
style 2744027 771295 59.58
h3 7713738 687682 53.12
noscript 2059937 657878 50.81
header 1778951 594533 45.92
footer 1129654 589297 45.52
strong 8277464 533532 41.21
button 2943780 528577 40.83
nav 1116152 522793 40.38
i 14679869 507695 39.21
iframe 1168314 480719 37.13
label 3559358 465348 35.94
h4 3979458 391438 30.23
table 3239199 383133 29.59
tr 9006510 381240 29.45
td 21821910 379919 29.35
b 6416492 365688 28.25
section 2735708 355809 27.48
select 682147 243226 18.79
article 2655492 239268 18.48
option 19794469 225487 17.42
aside 699185 220425 17.03
hr 1130165 213642 16.5
em 2363419 181172 13.99
svg 2781271 174392 13.47
tbody 677575 169824 13.12
main 157028 146292 11.3
ins 488111 145038 11.2
dl 1262478 144319 11.15
h5 1187033 139380 10.77
dt 1628304 137094 10.59
dd 2460003 128956 9.96
center 647241 126940 9.8
small 1081972 126187 9.75
ol 427367 122008 9.42
base 117838 114609 8.85
font 2824141 114445 8.84
fieldset 272000 108874 8.41
textarea 188809 108455 8.38
time 1639966 100274 7.75
th 1103263 94127 7.27
figure 1313226 90936 7.02
blockquote 333552 80347 6.21
h6 554572 72875 5.63
u 470117 71011 5.48
abbr 512469 67101 5.18
video 169210 58101 4.49
address 85313 49030 3.79
sup 499779 45797 3.54
thead 102634 45434 3.51
source 728478 43378 3.35
legend 70593 31788 2.46
figcaption 284306 28687 2.22
kbd 55335 26240 2.03
object 50800 25941 2.0
map 49969 25427 1.96
pre 152872 24701 1.91
caption 32155 24192 1.87
area 260413 22924 1.77
picture 286900 22599 1.75
param 147404 19957 1.54
embed 33089 18822 1.45
cite 167805 16468 1.27
hgroup 35623 15428 1.19
marquee 25276 15068 1.16
tfoot 13625 10410 0.8
optgroup 71642 10139 0.78
nobr 125292 10118 0.78
code 213435 9862 0.76
s 74612 9097 0.7
colgroup 19611 8885 0.69
big 58883 8717 0.67
canvas 21707 8300 0.64
del 111279 8155 0.63
col 70233 7857 0.61
sub 61539 6689 0.52
menu 22740 6459 0.5
audio 18748 5507 0.43
strike 57023 5489 0.42
summary 26044 5249 0.41
wbr 71276 3749 0.29
details 8623 3585 0.28
q 18785 3234 0.25
dfn 160542 3198 0.25
template 19535 3054 0.24
noframes 3481 3052 0.24
mark 11541 2277 0.18
tt 16842 1652 0.13
acronym 7810 1588 0.12
blink 5461 1554 0.12
var 33507 1307 0.1
plaintext 1075 960 0.07
applet 812 710 0.05
dialog 831 611 0.05
bdi 19211 563 0.04
progress 1029 557 0.04
datalist 689 503 0.04
samp 7889 450 0.03
output 1284 380 0.03
bgsound 349 325 0.03
track 583 322 0.02
data 1958 286 0.02
bdo 3025 267 0.02
ruby 3726 264 0.02
spacer 4158 255 0.02
rt 3731 254 0.02
basefont 561 233 0.02
slot 1090 210 0.02
noembed 270 199 0.02
xmp 1181 176 0.01
rp 5388 161 0.01
rb 1987 146 0.01
dir 546 122 0.01
meter 1422 85 0.01
menuitem 1334 52 0.0
math 758 41 0.0
multicol 5 2 0.0
listing 2 2 0.0
nextid 8 1 0.0
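
The percent column above can be reproduced from the urls column and the page count. The following is a minimal Python sketch of that arithmetic, not the method actually used to build the table; the function name `percent_of_pages` and the three sample rows are chosen here for illustration, and rounding to two decimals is an assumption inferred from the values shown.

```python
# Total number of pages, taken from the page-count query above.
TOTAL_PAGES = 1294654


def percent_of_pages(urls: int, total: int = TOTAL_PAGES) -> float:
    """Share of pages containing a tag, as a percentage rounded to 2 decimals."""
    return round(100 * urls / total, 2)


# A few (tag, frequency, urls) rows copied from the result table.
rows = [
    ("div", 220195805, 1211109),
    ("head", 1310642, 1258649),
    ("hr", 1130165, 213642),
]

# Order by the number of URLs, descending, and append the percentage.
by_urls = sorted(rows, key=lambda row: row[2], reverse=True)
for tag, frequency, urls in by_urls:
    print(tag, frequency, urls, percent_of_pages(urls))
```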

Hmm, not all elements were included in the result (e.g. body and input are missing). Not sure why.

Ah, it’s because “b” and “i” already match as prefixes of “body” and “input”. I can fix the regexp and run again. (Done; the query and results above reflect the fix.)
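
To illustrate the prefix problem, here is a small Python sketch. It uses Python's `re` module rather than BigQuery, so treat it as an analogy: I would expect BigQuery's regex engine to pick alternatives the same way in this case, but that is an assumption. The shortened tag list and the sample HTML are made up for the example; only the trailing `(?:\s|/?>)` delimiter is taken from the query above.

```python
import re

# Shortened, hypothetical tag list. As in the full query, "b" comes before
# "body" and "i" comes before "input" in the alternation.
TAGS = "a|b|base|body|i|iframe|img|input"

# No delimiter after the name: the first alternative that matches wins,
# so "<body" is reported as "b" and "<input" as "i".
naive = re.compile(rf"(?i)<({TAGS})")

# Delimiter from the query above: the name must be followed by whitespace,
# ">" or "/>", which forces the regex to try the longer alternatives.
fixed = re.compile(rf"(?i)<({TAGS})(?:\s|/?>)")

html = '<body><input type="text"><b>bold</b><i>x</i><img src="a.png"/></body>'

naive_tags = [tag.lower() for tag in naive.findall(html)]
fixed_tags = [tag.lower() for tag in fixed.findall(html)]

print(naive_tags)  # ['b', 'i', 'b', 'i', 'i']
print(fixed_tags)  # ['body', 'input', 'b', 'i', 'img']
```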

Really interesting to see more H2 than H1. Not just in frequency, where the roughly 4:1 ratio is interesting in itself, but also in the number of URLs. That means there are pages that have no H1 and go straight to H2. I wonder what impact, if any, that might have on a11y or SEO.
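
As a rough check of that reasoning, the numbers from the table give a lower bound: every page counted for h2 beyond the h1 page count must lack an h1. This is only a sketch of the arithmetic; the variable names are made up here, and the true number of pages without an h1 is larger, since it also includes pages with neither heading.

```python
# Rows copied from the result table above.
h1 = {"frequency": 2141903, "urls": 792153}
h2 = {"frequency": 7810122, "urls": 824400}

# How many h2 start tags there are for each h1 start tag.
frequency_ratio = round(h2["frequency"] / h1["frequency"], 2)

# At least this many pages contain an h2 but no h1, because the h2 pages
# cannot all fit inside the smaller set of h1 pages.
min_pages_with_h2_but_no_h1 = h2["urls"] - h1["urls"]

print(frequency_ratio)              # 3.65
print(min_pages_with_h2_but_no_h1)  # 32247
```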

Some assistive technologies have keyboard shortcuts to navigate headings by their rank, so users wouldn’t find any “1” heading. I don’t know how/whether search engines react.

I suspect it might be because many people adding content via a CMS presume the title automatically becomes the H1, when the CMS doesn’t actually add it. Or the sitewide H1 wrapped around the logo, which WordPress theme suppliers used to add automatically (and maybe still do), got stripped out (as it should be) and was then never replaced by a page-specific H1 in the copy.

I know that I’ve done that before (skipped h1 and gone straight to h2) because the styling in use (the site theme, or just the user agent’s default stylesheet) makes the text feel way too big, or makes it look far more important than the author intends, and “oh no, if I adjusted h1 to not be quite so large, I’d need to change the rest as well”. I suspect that’s what we’re seeing here.

A modification of this query that used a pattern rather than a specific list (a-z or dash, maybe) would be helpful and enlightening, I think, as we could learn which tags are gaining some popularity. Unfortunately, there can be overlap where two vastly different things share the same name, but it is at least a starting point for research that seems hard to discuss right now.
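
A minimal sketch of what such a pattern-based count could look like, written in Python rather than as a BigQuery query. The pattern itself (a letter followed by letters, digits or dashes) is an assumption about what “a-z or dash” would mean in practice, and `count_tags` and the sample markup are hypothetical. It reuses the trailing delimiter from the query at the top of the thread.

```python
import re
from collections import Counter

# Assumed pattern: a tag name starts with a letter and continues with
# letters, digits or dashes, then is followed by whitespace, ">" or "/>".
TAG_PATTERN = re.compile(r"<([a-z][a-z0-9-]*)(?:\s|/?>)", re.IGNORECASE)


def count_tags(body: str) -> Counter:
    """Count start tags by lowercased name, without a fixed list of names."""
    return Counter(name.lower() for name in TAG_PATTERN.findall(body))


# Made-up markup with two unknown (custom) element names.
sample = '<div><my-widget data-x="1"></my-widget><P>Hi</P><amp-img src="a.png"/></div>'
counts = count_tags(sample)

for tag, frequency in counts.most_common():
    print(tag, frequency)
```

As noted above, a name-only count cannot tell apart unrelated things that happen to share a name, so the output is a starting list to investigate rather than a final answer.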