As part of the almanac efforts, we have redone the way we collect information about which elements are used. We’re now working from the actual DOM tree: we iterate over the elements, collect each one’s .tagName, and increment a counter. @rviscomi created new queries for this: https://github.com/HTTPArchive/almanac.httparchive.org/pull/115/files#diff-87dbe22d3cdba14d71ea8853ffb7bbd2
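For anyone who wants to picture the counting step, here is a minimal sketch in Python. It is not the code the crawl actually runs (that works on the live DOM in the browser); it uses the standard library's html.parser as a stand-in, and the names TagCounter and count_tags are made up for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how often each element name appears in a document.

    Hypothetical helper for illustration only; not the almanac's code.
    """

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lower case, whereas a real DOM
        # reports .tagName in upper case for HTML elements.
        self.counts[tag] += 1


def count_tags(html: str) -> Counter:
    """Parse one document and return a tag-name -> occurrence counter."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Made-up sample markup, including one custom element.
sample = "<div><p>one</p><p>two</p><my-widget></my-widget></div>"
tag_counts = count_tags(sample)
print(tag_counts.most_common())
```

Keep in mind that html.parser does none of the error recovery a browser does, so odd markup may well be reported differently here than by .tagName in a real DOM.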
The results can be seen in https://docs.google.com/spreadsheets/d/1WnDKLar_0Btlt9UgT53Giy2229bpV4IM2D_v6OM_WzA/edit?usp=sharing
In the sheet 03_02b, you will note that a number of strange elements occur: lots and lots and lots of numbers as tag names.
- <0> and <1> occur on over 12,000 urls…
- 8, 9, 2, 7, 5, 3, 4 and 6 all occur on 1624 urls
- 14, 11, 21, 22, 17, 13, 20, 12, 10, 16, 15, 18 and 19 all occur on 1623 urls
And it just keeps going: ranges of numbers occurring on the same number of urls, up to… well, basically we don’t know, because we had to cap it at 5k per set. So, that seems curious: first that they are numbers, yes, but also that they occur in patterns like that across lots of websites. A few of us are curious whether there is something to it and would like to investigate, but we would need some samples of urls that contain numbered elements. Obviously we won’t be able to look at all of them, but having links to know where they came from would be handy.
In fact, for a lot of these it would be very handy. For some, because we’d like to see if we can figure out what custom element it is, and for others, because I’d like to understand how 386 urls wind up with an element whose .tagName is <align=“center”>. There are lots of weird ones like <id=“su-post-891”> in fact, but most of those occur on only 1 or 2 pages… I’m not sure how you do that, but it’s far less interesting than a mistake that appears to repeat across a significant number of pages.
If anyone could help us think of a handy way to easily get some representative sample urls, it would be great. / @slightlylate
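To make the ask a bit more concrete, here is one possible shape for it, purely as an illustration: a small Python sketch that keeps a handful of randomly chosen urls for each tag name using reservoir sampling. It assumes we can get rows of (url, tag name) pairs with each pair appearing once; the function name sample_urls_per_tag, the parameters k and seed, and the example.com urls are all made up, and nothing here reflects how the data is actually stored or queried.

```python
import random
from collections import defaultdict


def sample_urls_per_tag(rows, k=5, seed=0):
    """Keep at most k uniformly sampled urls for each tag name.

    rows: iterable of (url, tag_name) pairs, assumed deduplicated.
    Reservoir sampling lets us stream the rows once without holding
    every url in memory.
    """
    rng = random.Random(seed)
    seen = defaultdict(int)      # tag -> number of urls seen so far
    samples = defaultdict(list)  # tag -> current reservoir of urls
    for url, tag in rows:
        seen[tag] += 1
        reservoir = samples[tag]
        if len(reservoir) < k:
            # Reservoir not full yet: always keep the url.
            reservoir.append(url)
        else:
            # Keep the new url with probability k / seen[tag],
            # replacing a randomly chosen existing entry.
            j = rng.randrange(seen[tag])
            if j < k:
                reservoir[j] = url
    return dict(samples)


# Made-up input: 100 pages with a <0> element, one page with an odd tag.
rows = [(f"https://example.com/page{i}", "0") for i in range(100)]
rows.append(("https://example.org/", 'align="center"'))
picked = sample_urls_per_tag(rows, k=3)
print(picked)
```

Something along these lines would give us a few links per strange tag name to poke at by hand, without needing the full list of urls for each one.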