How many pages include resources from web.archive.org

After the recent Twitter discussion of Barclays including a script from web.archive.org, here’s a query that attempts to discover how many other sites are doing something similar:

SELECT
  pages.pageid,
  pages.url AS pages_url,
  requests.url AS requests_url
FROM
  `httparchive.summary_pages.2020_05_01_mobile` AS pages
JOIN
  `httparchive.summary_requests.2020_05_01_mobile` AS requests
ON
  pages.pageid = requests.pageid
WHERE
  requests.url LIKE "https://web.archive.org%"
  AND pages.url NOT LIKE "https://web.archive.org%"
GROUP BY
  pages.pageid,
  pages.url,
  requests.url

The result set has 4,364 resources requested from web.archive.org, and some pages include multiple resources from there: https://docs.google.com/spreadsheets/d/1Y-TLGPRlupaLKPYncF_MSw0x3rwCGgZDp63fC7x7Sb8/edit?usp=sharing
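
To see which pages pull in the most resources from there, something like the following might work. It's only a sketch that reuses the same tables and filters as above; the `archive_requests` alias is my own naming, and I haven't run it against the dataset, so I'm not quoting any numbers for it.

```sql
-- Count the distinct web.archive.org resources each page requests,
-- listing the heaviest users first.
SELECT
  pages.url AS pages_url,
  COUNT(DISTINCT requests.url) AS archive_requests
FROM
  `httparchive.summary_pages.2020_05_01_mobile` AS pages
JOIN
  `httparchive.summary_requests.2020_05_01_mobile` AS requests
ON
  pages.pageid = requests.pageid
WHERE
  requests.url LIKE "https://web.archive.org%"
  AND pages.url NOT LIKE "https://web.archive.org%"
GROUP BY
  pages.url
ORDER BY
  archive_requests DESC
```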

Deduplicating the pages, there are 838 sites making the requests:

SELECT
  DISTINCT pages.url
FROM
  `httparchive.summary_pages.2020_05_01_mobile` AS pages
JOIN
  `httparchive.summary_requests.2020_05_01_mobile` AS requests
ON
  pages.pageid = requests.pageid
WHERE
  requests.url LIKE "https://web.archive.org%"
  AND pages.url NOT LIKE "https://web.archive.org%"

Results: https://docs.google.com/spreadsheets/d/12J_mrvR7t0fxU-t-SuduYhTYR94S0G6k1awuyrEMmQY/edit?usp=sharing

Both queries only check for resources requested over HTTPS, so there may be more that are requested via HTTP.
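
One possible way to cover both schemes is to compare the host instead of the URL prefix. This is a sketch rather than something I've run against these tables: it relies on BigQuery's `NET.HOST` function, which extracts the hostname from a URL. It's also stricter than the `LIKE` prefix, since it won't match a hostname that merely starts with web.archive.org, so the counts could differ from the ones above.

```sql
-- Match on hostname so both http:// and https:// requests are included.
SELECT
  DISTINCT pages.url
FROM
  `httparchive.summary_pages.2020_05_01_mobile` AS pages
JOIN
  `httparchive.summary_requests.2020_05_01_mobile` AS requests
ON
  pages.pageid = requests.pageid
WHERE
  NET.HOST(requests.url) = "web.archive.org"
  -- Exclude pages that are themselves served from web.archive.org.
  AND NET.HOST(pages.url) != "web.archive.org"
```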
