How to find how many websites are using Magento?

How to find how many websites are using Magento?

Hi @nman. There are 2 ways I can think of to identify Magento sites:

  • Many Magento sites have requests for resources containing the string /skin/frontend/, which we can search for in the requests tables
  • Most Magento sites I’ve seen set a Cookie named frontend, so we can look for the presence of those cookies in the HAR files.

Let’s take a look at the resources containing /skin/frontend/ first. This simple query shows us examples of some of the requests matching this pattern:

SELECT url
FROM `httparchive.runs.2018_02_15_requests` r
WHERE r.url LIKE "%/skin/frontend/%"
LIMIT 10

When we look at the results of it, I can see lots of Magento URLs -
image

The requests table has a pageid field, which we can join against the pages table to summarize this. For example, the following query does the same match against the request URL, but joins the pages table and counts the number of requests matching the pattern per page.

SELECT rank, p.url, count(*) frontendreqs
FROM `httparchive.runs.2018_02_15_requests` r
INNER JOIN `httparchive.runs.2018_02_15_pages` p 
ON r.pageid=p.pageid
WHERE r.url LIKE "%/skin/frontend/%"
GROUP BY rank, p.url
ORDER BY rank ASC

The results look like this - and there are 5897 sites -
image

The HAR file approach is a bit more complex because we need to extract the cookie names from the JSON, and there are often more than 1 cookie set per page. I’m using a JavaScript function that @rviscomi shared with me a while back that outputs headers to an array (and then can be UNNESTed to summarize them later). The query counts the number of cookies that are specifically named frontend, and returned 5129 results.

(Note: This query processes 187GB of data, 1/5th of the monthly free tier)

CREATE TEMPORARY FUNCTION getHeaderNames(payload STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
  var $ = JSON.parse(payload);
  var headers = $.response.headers;
  
  var headernames=[];
  j=0;
  for (i in headers) { 
    if (headers[i].name.indexOf('set-cookie') != -1) {
        headernames[j] = headers[i].name + ': ' + headers[i].value;
        j++;
      }
    i++;
   }
  
  try {
    return headernames;
  } catch (e) {
    return [];
  }
""";

SELECT page, COUNT(*) magento_cookies
FROM (
  SELECT page, url, getHeaderNames(lower(payload)) cookienamevalues
  FROM `httparchive.har.2018_02_01_chrome_requests`
)
CROSS JOIN 
   UNNEST(cookienamevalues) AS cookienamevalue
WHERE cookienamevalue LIKE "set-cookie: frontend=%"
GROUP BY page

When I combined the results from both queries and filtered out the duplicate URLs, I found that there are 6165 sites in the HTTP Archive that are likley using Magento.

1 Like

There’s also the Wappalyzer detection that should have kicked in with the 4/1 crawl but I don’t see the data for 4/1 available yet.

2 Likes

Cool. Will that be in the pages HAR table?

Yep. Should looks something like this:

"_detected": {
    "JavaScript Frameworks": "Moment.js,RequireJS,Underscore.js,jQuery,jQuery Migrate,jQuery UI",
    "Ecommerce": "Magento 2",
    "Web Frameworks": "Bootstrap",
    "Web Servers": "Apache",
    "Programming Languages": "PHP"
},
"_detected_apps": {
    "jQuery": "",
    "Bootstrap": "",
    "jQuery UI": "",
    "jQuery Migrate": "",
    "Moment.js": "",
    "Underscore.js": "",
    "Magento": "2",
    "Apache": "",
    "PHP": "",
    "RequireJS": ""
},

Identifying sites with Magento should be as easy as using JSON_EXTRACT for _detected_apps.Magento and filtering for cases where it is not NULL (if the version isn’t detected it will be an empty string).

3 Likes

Looks like the 4/1 dataset just landed yesterday! Running the following query returned 19 different Magento versions, totalling 4192 sites.

SELECT JSON_EXTRACT(payload,"$._detected.Ecommerce") ecommerce_vendor, count(*) freq
FROM `httparchive.pages.2018_04_01_desktop`
WHERE JSON_EXTRACT(payload,"$._detected.Ecommerce") LIKE "%Magento%"
GROUP BY ecommerce_vendor
ORDER BY freq DESC

image

1 Like

Also, an even simpler way to query this now would be:

SELECT count(*) freq
FROM `httparchive.pages.2018_04_01_desktop`
WHERE JSON_EXTRACT(payload,"$._detected_apps.Magento") IS NOT NULL

Which returns 4192 sites.

3 Likes