Which pages have duplicate scripts?

Some widgets (such as the Like, Tweet, and +1 buttons) are included multiple times in a single HTML page. Depending on the construction of the snippet, this may cause performance problems. Given the most popular scripts, we can construct a query that looks for pages that use that script multiple times. For example, here’s the query for “plusone.js”:

SELECT rank, numjs, page
FROM [httparchive:runs.2014_08_15_pages] as p
JOIN ( 
  SELECT page, ROUND((LENGTH(body) - LENGTH(REGEXP_REPLACE(body, r'plusone.js', '')))/LENGTH('plusone.js')) as numjs
  FROM [httparchive:runs.2014_08_15_requests_body]
  WHERE page=url AND mimeType="text/html" AND body contains 'plusone.js'
  HAVING numjs >= 2
  ) as b
ON b.page = p.url
WHERE rank > 0
ORDER BY rank asc

Let me explain this query:

The inner SELECT uses Ilya’s new response bodies table. We restrict our search to the main HTML document by specifying “page=url” - this means the URL of the response body is the same as the main page’s URL and is therefore the main page. The “numjs” value is based on removing all occurrences of the script name (“plusone.js”), and then finding the difference in length between the original and modified HTML body. Dividing this different by the length of the script name reveals how many times that script was included in the page. This is a tricky query and might need modification depending on how the script is specified.

The outer SELECT joins with the “pages” table to give us the rank of the website.

1 Like