Representative URLs for common unusual markup for investigation

As part of the Almanac effort, we have redone the way we collect information about which elements are used. We now walk the DOM tree, collect the .tagName of each element, and increment a counter per name. @rviscomi created new queries for this: https://github.com/HTTPArchive/almanac.httparchive.org/pull/115/files#diff-87dbe22d3cdba14d71ea8853ffb7bbd2

The results can be seen in https://docs.google.com/spreadsheets/d/1WnDKLar_0Btlt9UgT53Giy2229bpV4IM2D_v6OM_WzA/edit?usp=sharing

In sheet 03_02b, you will note that a number of strange elements occur: lots and lots of numbers as tag names.

  • <0> and <1> on over 12,000 URLs…
  • 8, 9, 2, 7, 5, 3, 4 and 6 all occur on 1624 URLs
  • 14, 11, 21, 22, 17, 13, 20, 12, 10, 16, 15, 18 and 19 all occur on 1623 URLs

And it just keeps going: ranges of numbers occurring on the same number of URLs, up to… well, basically we don’t know, because we had to cap it at 5k per set. So that seems curious. First that they are numbers, yes, but then that they also occur in patterns like that on lots of websites. A few of us were curious whether there is something to it and would like to investigate, but we would need some samples of URLs that contain numbered elements. Obviously we won’t be able to look at all of them, but having links to know where they came from would be handy.

In fact, for a lot of these it would be very handy. For some because we’d like to see if we can figure out what custom element it is, and for others because I’d like to understand how 386 URLs wind up with an element whose .tagName is <align="center">. There are lots of weird ones like <id="su-post-891">, in fact, but most of those occur on only 1 or 2 pages. I’m not sure how you do that, but it’s far less interesting than the same mistake repeated across a significant number of pages.

If anyone could help us think of a handy way to easily get some representative sample URLs, it would be great. / @slightlylate

The tables the queries run against have the page URL as a top-level column (next to payload). Rick should be able to modify one of the queries to target the elements you are looking for fairly easily.

While investigating this, it turns out to be a bug in our custom metric, or at least the way my query handles it. Here’s the original query:

#standardSQL
# 03_02b: Top elements
CREATE TEMPORARY FUNCTION getElements(payload STRING)
RETURNS ARRAY<STRING> LANGUAGE js AS '''
try {
  var $ = JSON.parse(payload);
  var elements = JSON.parse($._element_count);
  return Object.keys(elements);
} catch (e) {
  return [];
}
''';

SELECT
  _TABLE_SUFFIX AS client,
  element,
  COUNT(0) AS freq,
  SUM(COUNT(0)) OVER (PARTITION BY _TABLE_SUFFIX) AS total,
  ROUND(COUNT(0) * 100 / SUM(COUNT(0)) OVER (PARTITION BY _TABLE_SUFFIX), 2) AS pct
FROM
  `httparchive.pages.2019_07_01_*`,
  UNNEST(getElements(payload)) AS element
GROUP BY
  client,
  element
ORDER BY
  freq / total DESC,
  client

Sometimes the _element_count custom metric has bogus values. For example, whatever this is:

_element_count: "[{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{"jQuery18209417830235755686":5},{"jQuery18209417830235755686":3},{"jQuery18209417830235755686":1}]"

So Object.keys will produce a numeric key ("0", "1", "2", and so on) for each index in the array, because it treats an array as an object. The fix is to throw out anything that’s not actually a plain object:

if (Array.isArray(elements) || typeof elements != 'object') return [];

Here’s the updated query:

#standardSQL
# 03_02b: Top elements
CREATE TEMPORARY FUNCTION getElements(payload STRING)
RETURNS ARRAY<STRING> LANGUAGE js AS '''
try {
  var $ = JSON.parse(payload);
  var elements = JSON.parse($._element_count);
  if (Array.isArray(elements) || typeof elements != 'object') return [];
  return Object.keys(elements);
} catch (e) {
  return [];
}
''';

SELECT
  _TABLE_SUFFIX AS client,
  element,
  COUNT(0) AS freq,
  SUM(COUNT(0)) OVER (PARTITION BY _TABLE_SUFFIX) AS total,
  ROUND(COUNT(0) * 100 / SUM(COUNT(0)) OVER (PARTITION BY _TABLE_SUFFIX), 2) AS pct
FROM
  `httparchive.pages.2019_07_01_*`,
  UNNEST(getElements(payload)) AS element
GROUP BY
  client,
  element
ORDER BY
  freq / total DESC,
  client
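
To see why the guard matters, here is a small standalone JavaScript sketch of the same logic outside BigQuery. The function names and the sample JSON strings are made up for illustration; only the guard line is taken from the query above.

```javascript
// Without the guard: Object.keys() treats an array like any other object
// and returns its indices as strings.
function keysWithoutGuard(elementCountJson) {
  try {
    return Object.keys(JSON.parse(elementCountJson));
  } catch (e) {
    return [];
  }
}

// With the guard from the corrected query: arrays and non-objects yield [].
function keysWithGuard(elementCountJson) {
  try {
    const elements = JSON.parse(elementCountJson);
    if (Array.isArray(elements) || typeof elements != 'object') return [];
    return Object.keys(elements);
  } catch (e) {
    return [];
  }
}

// Illustrative inputs (not real rows from the dataset).
const wellFormed = '{"html":1,"body":1,"div":12}';
const malformed = '[{},{},{"jQuery18209417830235755686":5}]';

console.log(keysWithoutGuard(wellFormed)); // html, body, div
console.log(keysWithoutGuard(malformed));  // "0", "1", "2" (the bogus numeric tag names)
console.log(keysWithGuard(wellFormed));    // html, body, div
console.log(keysWithGuard(malformed));     // empty array
```

Well-formed payloads are unaffected by the guard; only the array-shaped ones are dropped.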

I’ve updated the sheet with the corrected results.

Is there an example URL that was producing this for the custom metric? I’m kind of curious to see how it happens and what the real implication is.

Added a new sheet with some example URLs of malformed _element_count objects. Most are empty arrays but some contain the odd [{},{},{},{}]-style results.

Here’s the actual script of the custom metric:

return JSON.stringify(
  Array.from(document.querySelectorAll('*'))
    .reduce((acc, el) => {
      let tag = el.tagName.toLowerCase()
      acc[tag] = (typeof acc[tag] !== 'undefined') ? acc[tag] : 0
      acc[tag]++
      return acc
    }, {})
)

Browsing some of the URLs in the list, I can consistently reproduce the results. I think this is the culprit:

[image: screenshot not preserved]

It seems they’ve overridden the built-in behavior of Array.prototype.reduce, which causes the custom metric to generate bogus data.
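
Here is a rough sketch of how an overridden reduce could turn the custom metric’s output into the array-shaped data seen above. The `legacyReduce` function is an assumption for illustration: a stand-in that ignores its callback and returns the array itself, not code copied from any affected site. `FakeElement` is a hypothetical substitute for a DOM element so the sketch runs outside a browser.

```javascript
// ASSUMPTION: a legacy-style reduce that ignores its callback and returns the
// array itself when it holds more than one item.
function legacyReduce() {
  return this.length > 1 ? this : this[0];
}

// Same counting logic as the custom metric, with the reduce implementation
// passed in so the global Array.prototype is left untouched.
function countElements(elements, reduceImpl) {
  return JSON.stringify(
    reduceImpl.call(elements, (acc, el) => {
      const tag = el.tagName.toLowerCase();
      acc[tag] = (typeof acc[tag] !== 'undefined') ? acc[tag] : 0;
      acc[tag]++;
      return acc;
    }, {})
  );
}

// Stand-in for a DOM element: tagName is exposed through a getter, so it is
// not an own enumerable property and JSON.stringify() skips it.
class FakeElement {
  #tagName;
  constructor(tagName) { this.#tagName = tagName; }
  get tagName() { return this.#tagName; }
}

const page = [
  new FakeElement('HTML'),
  new FakeElement('BODY'),
  new FakeElement('DIV'),
  new FakeElement('DIV'),
];
// An expando property like the ones jQuery attaches to elements.
page[3].jQuery18209417830235755686 = 5;

const nativeResult = countElements(page, Array.prototype.reduce);
const legacyResult = countElements(page, legacyReduce);

console.log(nativeResult); // {"html":1,"body":1,"div":2}
console.log(legacyResult); // [{},{},{},{"jQuery18209417830235755686":5}]
```

With the stand-in override, the metric serializes the list of elements instead of the tag counts, which is consistent with the mostly empty objects and the occasional jQuery key in the malformed values.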

#standardSQL
CREATE TEMPORARY FUNCTION isMalformed(payload STRING)
RETURNS BOOLEAN LANGUAGE js AS '''
try {
  var $ = JSON.parse(payload);
  var elements = JSON.parse($._element_count);
  return (Array.isArray(elements) || typeof elements != 'object');
} catch (e) {
  return false;
}
''';

SELECT
  url,
  ARRAY_AGG(STRUCT(category, app)),
  JSON_EXTRACT_SCALAR(payload, '$._element_count') AS element_count
FROM
  `httparchive.pages.2019_07_01_desktop`
JOIN
  `httparchive.technologies.2019_07_01_desktop`
USING
  (url)
WHERE
  isMalformed(payload)
GROUP BY
  url, payload
LIMIT 100

On a hunch, I also queried for the Wappalyzer-detected technologies to see if they had anything in common. Based on an eyeball test, 100% of the malformed URLs include the Prototype JS framework.

A perusal of their source code turned up some interesting backstory. Apparently they used to support an Array.prototype.reduce method before it was natively implemented. A changelog entry from 2009 mentions removing it for this reason. In fact, one of its tests still exists and matches the implementation I see on one of the affected URLs.

So to close the case on this, it seems these websites are using an old (< 2009) version of Prototype which overrides the built-in behavior of Array.prototype.reduce. And this is why you don’t override global objects!!!
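
For what it’s worth, one way a counter like this could sidestep the problem is to avoid Array.prototype methods altogether and use a plain indexed loop. This is only a sketch under that assumption, not a change that was made to the custom metric; `countTagNames` is a hypothetical name, and it only protects against the one override discussed here.

```javascript
// Counts tag names with an indexed for loop, so the result does not depend
// on whatever Array.prototype.reduce happens to be on the page.
// `nodes` can be any array-like value, e.g. document.querySelectorAll('*').
function countTagNames(nodes) {
  // A null-prototype object avoids collisions with inherited keys
  // such as "constructor".
  const counts = Object.create(null);
  for (let i = 0; i < nodes.length; i++) {
    const tag = nodes[i].tagName.toLowerCase();
    counts[tag] = (counts[tag] || 0) + 1;
  }
  return JSON.stringify(counts);
}

// Plain objects stand in for DOM elements outside a browser.
const sampleNodes = [{ tagName: 'HTML' }, { tagName: 'DIV' }, { tagName: 'DIV' }];
console.log(countTagNames(sampleNodes)); // {"html":1,"div":2}
```

In a browser this would be called as `countTagNames(document.querySelectorAll('*'))`.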

There are definitely some other surprising things in here: non-standard elements that rank pretty high, considerably higher than some standard elements. It would be nice if I could run some kind of query as I look at these and find out more about where they come from. Two examples:

  • ym-measure, which a few of us discussed already and think is some Yandex thing, though we can’t seem to figure out what it is, is used by over 1% of websites. That puts it in the top 100. Only 83 elements from HTML are even in the top 100, because SVG is also very popular. <code>, for comparison, is used by 0.57%, and <datalist> by only 0.04%.

  • jdiv is also unusually high, again more popular than a whole bunch of HTML and SVG elements, but the best I can find on jdiv is a Java library that hasn’t been updated in ~6 years, and it’s not clear to me what that’s about. Would be real interesting to see… there are a bunch of others.

Modified 03_02b to extract example URLs from the 1K random sample table:

#standardSQL
CREATE TEMPORARY FUNCTION getElements(payload STRING)
RETURNS ARRAY<STRING> LANGUAGE js AS '''
try {
  var $ = JSON.parse(payload);
  var elements = JSON.parse($._element_count);
  if (Array.isArray(elements) || typeof elements != 'object') return [];
  return Object.keys(elements);
} catch (e) {
  return [];
}
''';

SELECT
  _TABLE_SUFFIX AS client,
  element,
  url
FROM
  `httparchive.almanac.pages_*`,
  UNNEST(getElements(payload)) AS element
WHERE
  element IN ('ym-measure', 'jdiv')
ORDER BY
  client,
  element,
  url

client element url
desktop_1k jdiv https://interkom-l.ru/
desktop_1k jdiv https://ip-pro.com.ua/
desktop_1k jdiv https://marhabaautoauction.com/
desktop_1k jdiv https://proxy.market/
desktop_1k jdiv https://wodolei.ru/
desktop_1k jdiv https://www.loja.reislixeiras.com.br/
desktop_1k ym-measure http://1.ci74.ru/
desktop_1k ym-measure http://99-kopeek.ru/
desktop_1k ym-measure http://banevo.ru/
desktop_1k ym-measure http://kinostrana.online/
desktop_1k ym-measure http://vollarinfo.com/
desktop_1k ym-measure https://24narko.ru/
desktop_1k ym-measure https://boodet.online/
desktop_1k ym-measure https://health.yandex.ru/
desktop_1k ym-measure https://interkom-l.ru/
desktop_1k ym-measure https://ip-pro.com.ua/
desktop_1k ym-measure https://ksr.ru/
desktop_1k ym-measure https://moldova.mid.ru/
desktop_1k ym-measure https://nizhnevartovsk.superjob.ru/
desktop_1k ym-measure https://prowebber.ru/
desktop_1k ym-measure https://uabrides.com/
desktop_1k ym-measure https://zippack.ru/
mobile_1k jdiv http://www.rollmaster.ru/
mobile_1k jdiv https://coobachy.ru/
mobile_1k jdiv https://marktstein.ru/
mobile_1k jdiv https://ucsol.ru/
mobile_1k jdiv https://www.minskline.by/
mobile_1k ym-measure http://www.nlv.ru/
mobile_1k ym-measure http://общество-защиты-прав.org/
mobile_1k ym-measure https://chessrussian.ru/
mobile_1k ym-measure https://coobachy.ru/
mobile_1k ym-measure https://dizainexpert.ru/
mobile_1k ym-measure https://mz-gallery.ru/
mobile_1k ym-measure https://www.minskline.by/

Perfect – I downloaded some of the data from the summary on 03_02b as CSV and have been programmatically querying and exploring a bunch of stuff… I have a list of what are probably parsing errors, and a few of those jump out at me as well – here’s one:
https://docs.google.com/spreadsheets/d/1WnDKLar_0Btlt9UgT53Giy2229bpV4IM2D_v6OM_WzA/edit#gid=1427309584&range=B510
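
As a rough illustration of this kind of triage, here is a sketch that sorts element names into buckets. The bucket names and rules are assumptions made up for this example, and this is not necessarily how the list above was built. The idea is that a name containing `=` or a quote probably comes from markup where the whitespace between the tag name and its first attribute is missing, since the HTML parser keeps reading the tag name until it reaches whitespace, `/`, or `>`.

```javascript
// Heuristic buckets for element names; rules and labels are illustrative.
function classifyTagName(name) {
  // Pure digits: the Object.keys() array-index artifact.
  if (/^\d+$/.test(name)) return 'array-index';
  // "=" or a quote in the name: attribute text fused into the tag name.
  if (/[="'\u201C\u201D]/.test(name)) return 'likely-parse-error';
  // Lowercase letter first plus a hyphen: shaped like a custom element name
  // (simplified to ASCII).
  if (/^[a-z][a-z0-9._]*-[a-z0-9._-]*$/.test(name)) return 'custom-element-like';
  return 'other';
}

const sampleNames = [
  '0',
  '14',
  'align="center"',
  'pclass="ddc-font-size-large"',
  'ym-measure',
  'jdiv',
];

for (const name of sampleNames) {
  console.log(name, '->', classifyTagName(name));
}
```

Names that land in "other" (like jdiv) still need a manual look, since a standard element and an unknown one are indistinguishable by shape alone.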

I tried querying this via the above but it returned 0 results – possibly an escaping issue?

I think that’s because the 1k sample set is too small to have representation from all unusual elements.

I queried the full dataset and came up with these similar-looking websites:

client element url
desktop pclass="ddc-font-size-large" http://www.acuraofcolumbus.com/
desktop pclass="ddc-font-size-large" http://www.americanautomart.net/
desktop pclass="ddc-font-size-large" http://www.atlantaautobrokers.com/
desktop pclass="ddc-font-size-large" http://www.baronhonda.com/
desktop pclass="ddc-font-size-large" http://www.bmwofmacon.com/
desktop pclass="ddc-font-size-large" http://www.buyatbayird.com/
desktop pclass="ddc-font-size-large" http://www.ennisford.net/
desktop pclass="ddc-font-size-large" http://www.epicautosalestx.com/
desktop pclass="ddc-font-size-large" http://www.foxvalleyvw.com/
desktop pclass="ddc-font-size-large" http://www.griffithfordsanmarcos.com/
desktop pclass="ddc-font-size-large" http://www.hamiltonfordcountry.com/
desktop pclass="ddc-font-size-large" http://www.herrnstein.com/
desktop pclass="ddc-font-size-large" http://www.hyundaiofyuma.com/
desktop pclass="ddc-font-size-large" http://www.jacksonford.com/
desktop pclass="ddc-font-size-large" http://www.jannellford.com/
desktop pclass="ddc-font-size-large" http://www.jordanmotorcars.com/
desktop pclass="ddc-font-size-large" http://www.jwford.com/
desktop pclass="ddc-font-size-large" http://www.machaikkilleen.com/
desktop pclass="ddc-font-size-large" http://www.mccombssuperiorhyundai.com/
desktop pclass="ddc-font-size-large" http://www.reedchevystj.com/
desktop pclass="ddc-font-size-large" http://www.rohrmanhyundai.com/
desktop pclass="ddc-font-size-large" http://www.sharpebmw.com/
desktop pclass="ddc-font-size-large" http://www.sterlingautogroup.net/
desktop pclass="ddc-font-size-large" http://www.temeculahyundai.com/
desktop pclass="ddc-font-size-large" http://www.vidaliaford.com/
desktop pclass="ddc-font-size-large" http://www.weinscanada.com/
desktop pclass="ddc-font-size-large" http://www.yarmonford.com/

Sampling one of the websites, sure enough we see <pclass="ddc-font-size-large"> in the markup. Here’s the query I used:

#standardSQL
CREATE TEMPORARY FUNCTION getElements(payload STRING)
RETURNS ARRAY<STRING> LANGUAGE js AS '''
try {
  var $ = JSON.parse(payload);
  var elements = JSON.parse($._element_count);
  if (Array.isArray(elements) || typeof elements != 'object') return [];
  return Object.keys(elements);
} catch (e) {
  return [];
}
''';

SELECT
  _TABLE_SUFFIX AS client,
  element,
  url
FROM
  `httparchive.pages.2019_07_01_*`,
  UNNEST(getElements(payload)) AS element
WHERE
  element IN ('pclass="ddc-font-size-large"')
ORDER BY
  client,
  element,
  url

This query processes 390 GB, so it’ll eat up a lot of the free monthly quota.