Hi @nman. There are 2 ways I can think of to identify Magento sites:
- Many Magento sites have requests for resources containing the string
/skin/frontend/
, which we can search for in the requests tables - Most Magento sites I’ve seen set a Cookie named
frontend
, so we can look for the presence of those cookies in the HAR files.
Let’s take a look at the resources containing /skin/frontend/
first. This simple query shows us examples of some of the requests matching this pattern:
SELECT url
FROM `httparchive.runs.2018_02_15_requests` r
WHERE r.url LIKE "%/skin/frontend/%"
LIMIT 10
When we look at the results of it, I can see lots of Magento URLs -
The requests table has a pageid
field, which we can join against the pages table to summarize this. For example, the following query does the same match against the request URL, but joins the pages table and counts the number of requests matching the pattern per page.
SELECT rank, p.url, count(*) frontendreqs
FROM `httparchive.runs.2018_02_15_requests` r
INNER JOIN `httparchive.runs.2018_02_15_pages` p
ON r.pageid=p.pageid
WHERE r.url LIKE "%/skin/frontend/%"
GROUP BY rank, p.url
ORDER BY rank ASC
The results look like this - and there are 5897 sites -
The HAR file approach is a bit more complex because we need to extract the cookie names from the JSON, and there are often more than 1 cookie set per page. I’m using a JavaScript function that @rviscomi shared with me a while back that outputs headers to an array (and then can be UNNESTed to summarize them later). The query counts the number of cookies that are specifically named frontend
, and returned 5129 results.
(Note: This query processes 187GB of data, 1/5th of the monthly free tier)
CREATE TEMPORARY FUNCTION getHeaderNames(payload STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
var $ = JSON.parse(payload);
var headers = $.response.headers;
var headernames=[];
j=0;
for (i in headers) {
if (headers[i].name.indexOf('set-cookie') != -1) {
headernames[j] = headers[i].name + ': ' + headers[i].value;
j++;
}
i++;
}
try {
return headernames;
} catch (e) {
return [];
}
""";
SELECT page, COUNT(*) magento_cookies
FROM (
SELECT page, url, getHeaderNames(lower(payload)) cookienamevalues
FROM `httparchive.har.2018_02_01_chrome_requests`
)
CROSS JOIN
UNNEST(cookienamevalues) AS cookienamevalue
WHERE cookienamevalue LIKE "set-cookie: frontend=%"
GROUP BY page
When I combined the results from both queries and filtered out the duplicate URLs, I found that there are 6165 sites in the HTTP Archive that are likley using Magento.