Representative URLs for common unusual markup for investigation

As part of the Almanac efforts, we have redone the way we collect information about which elements are used: we now walk the DOM tree, collect each element’s .tagName, and increment a counter. @rviscomi created new queries for this: https://github.com/HTTPArchive/almanac.httparchive.org/pull/115/files#diff-87dbe22d3cdba14d71ea8853ffb7bbd2

The results can be seen in https://docs.google.com/spreadsheets/d/1WnDKLar_0Btlt9UgT53Giy2229bpV4IM2D_v6OM_WzA/edit?usp=sharing

In the sheet 03_02b, you will note that a number of strange elements occur: lots and lots of numbers as tag names.

  • <0> and <1> occur on over 12,000 URLs…
  • 8, 9, 2, 7, 5, 3, 4, 6 all occur on 1,624 URLs
  • 14, 11, 21, 22, 17, 13, 20, 12, 10, 16, 15, 18 and 19 all occur on 1,623 URLs

and it just keeps going: ranges of numbers occurring on the same number of URLs up to… well, basically we don’t know, because we had to cap it at 5k per set. So that seems curious. First because they are numbers, yes, but also because they occur in patterns like that across lots of websites. A few of us are curious whether there is something to it and would like to investigate, but we would need some sample URLs that contain these numbered elements. Obviously we won’t be able to look at all of them, but having links to where they came from would be handy.

In fact, sample URLs would be very handy for a lot of these: for some because we’d like to figure out which custom element it is, and for others because I’d like to understand how 386 URLs wind up with an element whose .tagName is <align="center">. There are lots of weird ones like <id="su-post-891">, in fact, but most of those occur on only one or two pages… I’m not sure how you even do that, but it’s far less interesting than the same mistake happening with some kind of significant repeat.

If anyone could help us think of a handy way to easily get some representative sample URLs, that would be great. / @slightlylate


The tables the queries run against have the page URL as a top-level column (next to payload). Rick should be able to modify one of the queries to target the elements you are looking for fairly easily.
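For example, something along these lines should pull sample URLs for a given element name (an untested sketch reusing the same tables and the _element_count custom metric; hasElement is just a made-up helper name, not one of the existing Almanac queries):

#standardSQL
# Sketch: sample URLs whose _element_count contains a given tag name.
CREATE TEMPORARY FUNCTION hasElement(payload STRING, target STRING)
RETURNS BOOLEAN LANGUAGE js AS '''
try {
  var $ = JSON.parse(payload);
  var elements = JSON.parse($._element_count);
  return Object.keys(elements).indexOf(target) > -1;
} catch (e) {
  return false;
}
''';

SELECT
  url
FROM
  `httparchive.pages.2019_07_01_desktop`
WHERE
  hasElement(payload, '0')  # e.g. to chase down the mysterious <0> element
LIMIT 100

Swap in whichever element name you’re curious about and raise the LIMIT for a bigger sample.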

While investigating this, I found that it turns out to be a bug in our custom metric, or at least in the way my query handles it. Here’s the original query:

#standardSQL
# 03_02b: Top elements
CREATE TEMPORARY FUNCTION getElements(payload STRING)
RETURNS ARRAY<STRING> LANGUAGE js AS '''
try {
  var $ = JSON.parse(payload);
  var elements = JSON.parse($._element_count);
  return Object.keys(elements);
} catch (e) {
  return [];
}
''';

SELECT
  _TABLE_SUFFIX AS client,
  element,
  COUNT(0) AS freq,
  SUM(COUNT(0)) OVER (PARTITION BY _TABLE_SUFFIX) AS total,
  ROUND(COUNT(0) * 100 / SUM(COUNT(0)) OVER (PARTITION BY _TABLE_SUFFIX), 2) AS pct
FROM
  `httparchive.pages.2019_07_01_*`,
  UNNEST(getElements(payload)) AS element
GROUP BY
  client,
  element
ORDER BY
  freq / total DESC,
  client

Sometimes the _element_count custom metric has bogus values. For example, whatever this is:

_element_count: "[{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{"jQuery18209417830235755686":5},{"jQuery18209417830235755686":3},{"jQuery18209417830235755686":1}]"

So Object.keys returns the array indices as strings ("0", "1", "2", and so on), because an array is itself an object.
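A quick illustration of the difference (standalone JavaScript, not part of the query):

// Arrays are objects too, so their indices come back as keys.
Object.keys([{}, {}, {}])          // ["0", "1", "2"]  -> counted as "tag names"
Object.keys({ div: 12, span: 4 })  // ["div", "span"]  -> the expected shape

The fix is to throw out anything that’s not actually a plain object: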

if (Array.isArray(elements) || typeof elements != 'object') return [];

Here’s the full corrected query:

#standardSQL
# 03_02b: Top elements
CREATE TEMPORARY FUNCTION getElements(payload STRING)
RETURNS ARRAY<STRING> LANGUAGE js AS '''
try {
  var $ = JSON.parse(payload);
  var elements = JSON.parse($._element_count);
  if (Array.isArray(elements) || typeof elements != 'object') return [];
  return Object.keys(elements);
} catch (e) {
  return [];
}
''';

SELECT
  _TABLE_SUFFIX AS client,
  element,
  COUNT(0) AS freq,
  SUM(COUNT(0)) OVER (PARTITION BY _TABLE_SUFFIX) AS total,
  ROUND(COUNT(0) * 100 / SUM(COUNT(0)) OVER (PARTITION BY _TABLE_SUFFIX), 2) AS pct
FROM
  `httparchive.pages.2019_07_01_*`,
  UNNEST(getElements(payload)) AS element
GROUP BY
  client,
  element
ORDER BY
  freq / total DESC,
  client

I’ve updated the sheet with the corrected results.

Is there an example URL which was producing this for the custom metric? I’m kind of curious to see how it happens and what the real implication is.

Added a new sheet with some example URLs of malformed _element_count objects. Most are empty arrays but some contain the odd [{},{},{},{}]-style results.

Here’s the actual script of the custom metric:

return JSON.stringify(
  Array.from(document.querySelectorAll('*'))
    .reduce((acc, el) => {
      let tag = el.tagName.toLowerCase()
      acc[tag] = (typeof acc[tag] !== 'undefined') ? acc[tag] : 0
      acc[tag]++
      return acc
    }, {})
)
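On a well-behaved page this produces an object keyed by tag name, something like this (illustrative values only):

{"html":1,"head":1,"body":1,"div":42,"span":17}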

Browsing some of the URLs in the list, I can consistently reproduce the results. I think this is the culprit:

[screenshot: the page’s own (overridden) Array.prototype.reduce implementation]

It seems they’ve overridden the built-in behavior of Array.prototype.reduce, which causes the custom metric to generate bogus data.
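For context, the pre-ES5 Prototype Array#reduce behaved roughly like the snippet below (reconstructed from memory of the old source, so treat the exact code as an approximation). It ignores the callback and the seed value entirely:

// Approximation of the old (pre-2009) Prototype.js Array#reduce:
// one-element arrays are "reduced" to their single element,
// everything else is returned untouched.
Array.prototype.reduce = function() {
  return this.length > 1 ? this : this[0];
};

With that override in place, the custom metric’s reduce call returns the array of DOM elements itself instead of a {tagName: count} object, and JSON.stringify serializes each element as {} (plus any expando properties such as jQuery’s, which would explain the jQuery18209417830235755686 keys above). The query below pulls the malformed pages along with their Wappalyzer detections: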

#standardSQL
CREATE TEMPORARY FUNCTION isMalformed(payload STRING)
RETURNS BOOLEAN LANGUAGE js AS '''
try {
  var $ = JSON.parse(payload);
  var elements = JSON.parse($._element_count);
  return (Array.isArray(elements) || typeof elements != 'object');
} catch (e) {
  return false;
}
''';

SELECT
  url,
  ARRAY_AGG(STRUCT(category, app)),
  JSON_EXTRACT_SCALAR(payload, '$._element_count') AS element_count
FROM
  `httparchive.pages.2019_07_01_desktop`
JOIN
  `httparchive.technologies.2019_07_01_desktop`
USING
  (url)
WHERE
  isMalformed(payload)
GROUP BY
  url, payload
LIMIT 100

On a hunch, I also queried for the Wappalyzer-detected technologies to see if they had anything in common. Based on an eyeball test, 100% of the malformed URLs include the Prototype JS framework.
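To put a number on that beyond an eyeball test, something like this should work (another untested sketch; it assumes Wappalyzer reports the framework under the app name 'Prototype'):

#standardSQL
# Sketch: how many malformed pages also have a Prototype detection?
CREATE TEMPORARY FUNCTION isMalformed(payload STRING)
RETURNS BOOLEAN LANGUAGE js AS '''
try {
  var $ = JSON.parse(payload);
  var elements = JSON.parse($._element_count);
  return (Array.isArray(elements) || typeof elements != 'object');
} catch (e) {
  return false;
}
''';

SELECT
  COUNT(DISTINCT url) AS malformed_pages,
  COUNT(DISTINCT IF(app = 'Prototype', url, NULL)) AS with_prototype
FROM
  `httparchive.pages.2019_07_01_desktop`
JOIN
  `httparchive.technologies.2019_07_01_desktop`
USING
  (url)
WHERE
  isMalformed(payload)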

A perusal of their source code turned up some interesting backstory. Apparently they used to ship an Array.prototype.reduce method of their own, before it was natively implemented. A changelog entry from 2009 mentions removing it for this reason. In fact, one of its tests still exists and matches the implementation I see on one of the affected URLs.

So to close the case on this, it seems these websites are using an old (< 2009) version of Prototype which overrides the built-in behavior of Array.prototype.reduce. And this is why you don’t override global objects!!!