Using Wappalyzer to Analyze CPU Times Across JS Frameworks


#1

Last month WebPageTest added support for Wappalyzer, which makes it super easy to uncover technologies used on websites. And now a month later the data from it is available in the HTTP Archive!

Prior to this integration, if you wanted to search the HTTP Archive for specific technologies then you would need to know exactly what to look for based on HTTP response headers, URL patterns, embedded objects, payloads in JS or CSS, etc. Now it’s simply a matter of parsing JSON out of a HAR file from the pages dataset. This opens up so many new areas for analyzing HTTP Archive data!

How to Extract Wappalyzer Info from HTTP Archive
The latest pages HAR file, httparchive.pages.2018_04_01_desktop, now contains JSON objects named _detected and _detected_apps (thanks @pmeenan). Here’s an example of what these objects look like. Note that the categories and apps will vary per site, so you may need to explore a bit to find what you are looking for.

"_detected": {
    "JavaScript Frameworks": "Moment.js,RequireJS,Underscore.js,jQuery,jQuery Migrate,jQuery UI",
    "Ecommerce": "Magento 2",
    "Web Frameworks": "Bootstrap",
    "Web Servers": "Apache",
    "Programming Languages": "PHP"
},
"_detected_apps": {
    "jQuery": "",
    "Bootstrap": "",
    "jQuery UI": "",
    "jQuery Migrate": "",
    "Moment.js": "",
    "Underscore.js": "",
    "Magento": "2",
    "Apache": "",
    "PHP": "",
    "RequireJS": ""
},

Using JSON_EXTRACT(payload,"$._detected.Ecommerce") we can get a list of all ecommerce vendors, or JSON_EXTRACT(payload,"$._detected_apps.Magento") can detect a specific Ecommerce vendor. You can see an example of this in another discussion thread on the HTTP Archive forum, where I identified sites based on URL patterns and cookie names and then later simplified the query using Wappalyzer data.
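To make the two lookups concrete, here is a minimal Python sketch that performs the same extraction locally on a trimmed copy of the sample payload above. The `json_extract` helper is a hypothetical stand-in written for this example, not BigQuery's implementation.

```python
import json

# Trimmed copy of the sample Wappalyzer output shown above.
payload = json.dumps({
    "_detected": {
        "JavaScript Frameworks": "Moment.js,RequireJS,Underscore.js,jQuery,jQuery Migrate,jQuery UI",
        "Ecommerce": "Magento 2",
    },
    "_detected_apps": {"jQuery": "", "Magento": "2", "Apache": ""},
})


def json_extract(payload_text, *keys):
    """Hypothetical helper: walk nested keys, returning None when a key is absent."""
    node = json.loads(payload_text)
    for key in keys:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


# All ecommerce platforms detected on the page (a comma-separated string).
ecommerce = json_extract(payload, "_detected", "Ecommerce")
# Version string for one specific app; None means the app was not detected.
magento = json_extract(payload, "_detected_apps", "Magento")
shopify = json_extract(payload, "_detected_apps", "Shopify")

print(ecommerce, magento, shopify)
```

One difference to keep in mind: BigQuery's JSON_EXTRACT returns the value as a JSON-formatted string, so string values keep their surrounding double quotes. That is why the queries further down strip the `"` character.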

So let’s look at a simple example of how you can use this, and then expand on what else we can learn by incorporating other data from the HTTP Archive…

What are the most popular JavaScript Frameworks?
To summarize by JavaScript framework previously, you would have had to know exactly which frameworks to look for, the URL patterns for each of them, and then query the HTTP Archive requests table to identify them. Then you would need to join that to the pages table to summarize the results. That’s a fair amount of work.

Below is a query that simply extracts the “JavaScript Frameworks” variable from the Wappalyzer results and summarizes by the number of sites using them:

SELECT JSON_EXTRACT(payload,"$._detected.JavaScript Frameworks") jsframework,
       count(*) freq
FROM `httparchive.pages.2018_04_01_desktop`
GROUP BY jsframework
ORDER BY freq DESC

When we run this, we can see that jQuery is by far the most frequently used JS framework. There are also a lot of different versions, so some of the other frameworks might be buried under the noise of all the version numbers.

I wanted to summarize this without the version numbers, so I used a simple regular expression to strip them out. In the example below, I removed spaces, periods, digits, and double quotes from the list of frameworks:

SELECT  REGEXP_REPLACE(
              JSON_EXTRACT(payload,"$._detected.JavaScript Frameworks"),
                            r"([0-9.\"\s]+)", 
                            "") jsframework,
        count(*) freq
FROM `httparchive.pages.2018_04_01_desktop`
GROUP BY jsframework
ORDER BY freq DESC
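
As a quick local check of what that pattern does, the sketch below applies the same character class with Python's `re` module to a few made-up values shaped like JSON_EXTRACT output. It strips every digit, period, double quote, and whitespace character, so it also alters names that legitimately contain them.

```python
import re

# Same character class as in the query: digits, periods, double quotes, whitespace.
VERSION_NOISE = re.compile(r'([0-9.\"\s]+)')


def normalize(raw):
    """Strip version numbers, quotes, and spaces from a framework list."""
    return VERSION_NOISE.sub("", raw)


# Made-up inputs; JSON_EXTRACT returns strings with their JSON quotes intact.
samples = ['"jQuery 1.12.4"', '"jQuery 3.2.1"', '"Moment.js,jQuery UI 1.11.4"']
for raw in samples:
    print(raw, "->", normalize(raw))
```

The first two inputs both collapse to `jQuery`, which is the goal. The third becomes `Momentjs,jQueryUI`, so expect names like Moment.js and jQuery UI to appear in the results without their period or space.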

I was actually really surprised to see that React was so much more commonly used than Angular. Also, there appear to be more sites using jQuery than sites not using any framework at all!

image

Analyzing CPU Time Patterns Across JS Frameworks
So now that we’ve identified the most popular JavaScript frameworks, let’s try to assess their cost. Last year @addyosmani wrote a fantastic article titled “The Cost of JavaScript”, in which he demonstrated how the increasing complexity of JavaScript has a direct impact on the user experience. A notable quote from his article:

Spending a long time parsing/compiling code can heavily delay how soon a user can interact with your site. The more JavaScript you send, the longer it will take to parse & compile it before your site is interactive.

In the same HAR file from which we extracted the Wappalyzer info, we can also see the wall time stats for JavaScript execution. I decided to take a look at 3 of these times:

  • Compile
  • Parse (cpuEvaluateScript)
  • Execute (cpuFunctionCall)

This query may seem a bit more complex than the earlier example, so here is what changed from the previous query:

  • This uses the same regular expression and JSON_EXTRACT() to extract the JavaScript framework
  • It uses JSON_EXTRACT() to extract the CPU times we are interested in, and then CAST() converts the string value to an integer.
  • The framework identification is moved to a subquery that extracts all of this information along with the timings. Then we are able to aggregate the results in the outer query.
SELECT jsframework,
       count(*) freq,
       APPROX_QUANTILES(compile, 100)[SAFE_ORDINAL(50)] compile,  
       APPROX_QUANTILES(cpuEvaluateScript, 100)[SAFE_ORDINAL(50)] cpuEvaluateScript,
       APPROX_QUANTILES(cpuFunctionCall, 100)[SAFE_ORDINAL(50)] cpuFunctionCall
FROM (
  SELECT  REGEXP_REPLACE(
              JSON_EXTRACT(payload,"$._detected.JavaScript Frameworks"),
              r"([0-9.\"\s]+)", 
              "") jsframework,
        CAST(JSON_EXTRACT(payload, "$['_cpu.v8.compile']") as INT64) compile,
        CAST(JSON_EXTRACT(payload, "$['_cpu.FunctionCall']") as INT64) cpuFunctionCall,
        CAST(JSON_EXTRACT(payload, "$['_cpu.EvaluateScript']") as INT64) cpuEvaluateScript
  FROM `httparchive.pages.2018_04_01_desktop`
)
GROUP BY jsframework
ORDER BY freq DESC
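
The subquery-then-aggregate structure can be mirrored locally. The Python sketch below runs over a handful of invented page payloads (the timings are illustrative, not HTTP Archive data). It reads the flat keys whose names contain dots, such as `_cpu.v8.compile`, which is why the query needs the bracketed `$['…']` path, casts them to integers, and then takes the median per framework. The function names are hypothetical.

```python
import json
import re
from collections import defaultdict
from statistics import median

# Invented payloads for illustration; real rows come from the pages table.
rows = [
    '{"_detected": {"JavaScript Frameworks": "React 16.2"}, "_cpu.v8.compile": 40, "_cpu.EvaluateScript": 900}',
    '{"_detected": {"JavaScript Frameworks": "React"}, "_cpu.v8.compile": 60, "_cpu.EvaluateScript": 700}',
    '{"_detected": {"JavaScript Frameworks": "jQuery 1.12.4"}, "_cpu.v8.compile": 20, "_cpu.EvaluateScript": 300}',
    '{"_detected": {"JavaScript Frameworks": "jQuery 3.2.1"}, "_cpu.v8.compile": 30, "_cpu.EvaluateScript": 100}',
    '{"_detected": {"JavaScript Frameworks": "jQuery"}, "_cpu.v8.compile": 10, "_cpu.EvaluateScript": 200}',
]


def inner_query(payload_text):
    """Per-page step: framework name without versions, plus integer timings."""
    page = json.loads(payload_text)
    raw = page.get("_detected", {}).get("JavaScript Frameworks", "")
    framework = re.sub(r"[0-9.\s]+", "", raw)
    # "_cpu.v8.compile" is a single key whose name contains dots, not a nested path.
    return framework, int(page["_cpu.v8.compile"]), int(page["_cpu.EvaluateScript"])


def outer_query(payload_texts):
    """Aggregate step: page count and median timings per framework."""
    groups = defaultdict(list)
    for text in payload_texts:
        framework, compile_time, evaluate_time = inner_query(text)
        groups[framework].append((compile_time, evaluate_time))
    return {
        name: {
            "freq": len(timings),
            "compile": median(t[0] for t in timings),
            "cpuEvaluateScript": median(t[1] for t in timings),
        }
        for name, timings in groups.items()
    }


summary = outer_query(rows)
for name, stats in summary.items():
    print(name, stats)
```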

And now we can see not only the counts of websites using each framework, but also the median CPU times.

image

From a quick glance at the table above, the CPU eval costs for React seem to be higher than the others. But to be sure, let’s add a few more percentiles to the query. Below I’ve added the 75th and 95th percentile CPU timings to the query:

SELECT jsframework,
       count(*) freq,
       APPROX_QUANTILES(compile, 100)[SAFE_ORDINAL(50)] compile,  
       APPROX_QUANTILES(cpuEvaluateScript, 100)[SAFE_ORDINAL(50)] cpuEvaluateScript,
       APPROX_QUANTILES(cpuFunctionCall, 100)[SAFE_ORDINAL(50)] cpuFunctionCall,
       APPROX_QUANTILES(compile, 100)[SAFE_ORDINAL(75)] compile75,  
       APPROX_QUANTILES(cpuEvaluateScript, 100)[SAFE_ORDINAL(75)] cpuEvaluateScript75,
       APPROX_QUANTILES(cpuFunctionCall, 100)[SAFE_ORDINAL(75)] cpuFunctionCall75,
       APPROX_QUANTILES(compile, 100)[SAFE_ORDINAL(95)] compile95,  
       APPROX_QUANTILES(cpuEvaluateScript, 100)[SAFE_ORDINAL(95)] cpuEvaluateScript95,
       APPROX_QUANTILES(cpuFunctionCall, 100)[SAFE_ORDINAL(95)] cpuFunctionCall95
FROM (
  SELECT  REGEXP_REPLACE(
              JSON_EXTRACT(payload,"$._detected.JavaScript Frameworks"),
              r"([0-9.\"\s]+)", 
              "") jsframework,
        CAST(JSON_EXTRACT(payload, "$['_cpu.v8.compile']") as INT64) compile,
        CAST(JSON_EXTRACT(payload, "$['_cpu.FunctionCall']") as INT64) cpuFunctionCall,
        CAST(JSON_EXTRACT(payload, "$['_cpu.EvaluateScript']") as INT64) cpuEvaluateScript
  FROM `httparchive.pages.2018_04_01_desktop`
)
GROUP BY jsframework
ORDER BY freq DESC
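
One detail worth knowing when reading these columns: `APPROX_QUANTILES(x, 100)` returns 101 boundaries (minimum, 1st percentile, and so on up to the maximum), and `ORDINAL` indexing is 1-based while `OFFSET` is 0-based. The Python sketch below uses an exact nearest-rank calculation as a stand-in for BigQuery's approximate one, with hypothetical helper names, to show which boundary each index selects.

```python
def quantile_boundaries(values, buckets=100):
    """Exact stand-in for APPROX_QUANTILES: buckets + 1 boundaries, min to max."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return [ordered[round(i * last / buckets)] for i in range(buckets + 1)]


def safe_ordinal(array, position):
    """1-based lookup that returns None when out of range, like SAFE_ORDINAL."""
    return array[position - 1] if 1 <= position <= len(array) else None


def safe_offset(array, position):
    """0-based lookup that returns None when out of range, like SAFE_OFFSET."""
    return array[position] if 0 <= position < len(array) else None


# 101 evenly spaced timings, so boundary i is simply the value i.
timings = list(range(101))
boundaries = quantile_boundaries(timings)

print(len(boundaries))               # 101
print(safe_ordinal(boundaries, 50))  # 49
print(safe_offset(boundaries, 50))   # 50
```

So `[SAFE_ORDINAL(50)]` reads the boundary just below the median, and the same one-step shift applies to the 75 and 95 columns. Switching to `SAFE_OFFSET` would line the indexes up with the percentile names, if that matters for your comparison.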

Here’s a breakdown of the top 25 frameworks detected by Wappalyzer, along with the median, 75th percentile and 95th percentile CPU costs from the pages using them. I’ve used data bars within the table to show how the timings relate to each other.

A few observations from this table:

  • The CPU times for jQuery are consistent with the CPU times for sites not using a framework at all.
  • The CPU Function Call times for React and Angular are consistently higher than all the others.
  • The CPU Eval times for React applications appear to be consistently higher than the others.
  • The CPU Eval times for sites using React and jQuery together do not appear to be much higher than for sites using just React.
  • The wide range between the median and 75th percentile CPU timings is very interesting, and makes me wonder if examples of how to use these frameworks in more (or less) performant ways are buried in the details (hint: this can also be queried via the HTTP Archive by expanding on this example).

What About Mobile?
CPU times are even more of an issue on mobile because of the hardware differences. With Android devices we often see very significant differences in the user experience based on the type of device being used. For example, last year I was able to find a rather significant difference in the onLoad times for mobile devices in Akamai’s mPulse real user measurement data.

To query this data for mobile, simply replace desktop with mobile in all the queries above (i.e., httparchive.pages.2018_04_01_mobile). When we run the previous query against the mobile dataset, we can see similar stats, although the numbers are much higher (as expected). The HTTP Archive stats for mobile use a single emulated Android device running the Chrome browser. But based on the device fragmentation map above I would expect the results below to vary greatly by device type as well.

Based on this analysis it definitely seems that the choice of JavaScript framework will have a significant impact on CPU execution times, which will ultimately impact your end users’ experience. However, the rather large range of CPU times between the median and 75th percentile gives hope that some of these frameworks can be tuned further or that the way that they are being used can be adjusted to limit their impact on the end user.

There’s a lot more that we can dig into with the Wappalyzer data in the HTTP Archive. Feel free to share any insights you find here in the HTTP Archive discussion forums!


#2

I was wondering what Categories Wappalyzer has data for in the HTTP Archive, and how many datapoints there are for each.

Here’s a query that helps determine that. It works by extracting all of the name:value pairs in the _detected JSON object into a string, and then using regular expressions to remove everything except the category names. It also uses CROSS JOIN to unpack the arrays into rows so that they can be summarized:

SELECT REGEXP_EXTRACT(detected, r"(.*)\:") category, count(*) freq
FROM (
   SELECT SPLIT(
            REGEXP_REPLACE(
              JSON_EXTRACT(payload,"$._detected"), 
              r"([\{\}\"]+)", 
              ""), 
            ',') detected 
   FROM `httparchive.pages.2018_04_01_desktop`
   WHERE JSON_EXTRACT(payload,"$._detected") IS NOT NULL and JSON_EXTRACT(payload,"$._detected") <> "[]"
 ) AS detected_array
CROSS JOIN detected_array.detected
GROUP BY category
ORDER BY freq DESC

The output of the query looks like this, and there are 60 categories:

image

Full results below:

|category|freq|
|---|---|
||699620|
|Web Servers|362239|
|Analytics|345044|
|JavaScript Frameworks|342234|
|Programming Languages|249373|
|Font Scripts|205712|
|Widgets|197935|
|Web Frameworks|156818|
|CMS|145818|
|Advertising Networks|118573|
|Operating Systems|114498|
|Blogs|113221|
|CDN|74331|
|Tag Managers|57670|
|Cache Tools|49862|
|SEO|35097|
|Video Players|33615|
|Miscellaneous|32054|
|Ecommerce|23059|
|Web Server Extensions|20552|
|Maps|16959|
|Marketing Automation|15508|
|Live Chat|15478|
|Hosting Panels|13057|
|JavaScript Graphics|7842|
|Message Boards|6427|
|Editors|5618|
|Payment Processors|3681|
|Databases|2234|
|Photo Galleries|1792|
|Mobile Frameworks|1462|
|Captchas|926|
|Dev Tools|827|
|CMS:Sitefinity 3.7.2136.240|738|
|Wikis|681|
|Comment Systems|572|
|Landing Page Builders|571|
|Cryptominer|562|
|Rich Text Editors|559|
|Static Site Generator|369|
|Search Engines|263|
|Documentation Tools|167|
|LMS|115|
|Issue Trackers|96|
|CRM|48|
|Document Management Systems|36|
|Web Mail|13|
|Feed Readers|7|
|Printers|6|
|CMS:Sitefinity 3.7.2096.2|6|
|Database Managers|2|
|Cache Tools:Google PageSpeed Powered by ngx_pagespeed. Server administration KDTeam http|1|
|Web Server Extensions:Google PageSpeed Powered by ngx_pagespeed. Server administration KDTeam http|1|
|CMS:Sitefinity 3.7.2096.4|1|
|Build CI Systems|1|
|CMS:Sitefinity 3.6.1936.2|1|
|CMS:Sitefinity 3.7.2136.2|1|
|CMS:Sitefinity 3.7.2136.220|1|
|CMS:Sitefinity 3.7.2057.2|1|
|Media Servers|1|

#3

Is it possible to retrospectively run Wappalyzer on previous crawls?


#4

This is weird. There are only 462,646 pages in the table. What does it mean to have 699,620 nulls?

As for running Wappalyzer retrospectively on previous crawls: great question. The answer is “sort of”. Wappalyzer’s detections are run during the crawl in WebPageTest. We can’t crawl back in time, but we can use the HAR artifacts in BigQuery as a representation of the web page and run the detection logic against that. The HAR data only goes as far back as 2016 though. We’re tracking this work in https://github.com/HTTPArchive/bigquery/issues/19.

We just need to convert the Wappalyzer detection signals into boolean expressions in BigQuery, using the HAR and page metadata as input, similar to Paul’s manual detection of Magento. Then we need to figure out where to save the results. Ideally we’d make it consistent with the WPT-based JSON data in the HAR files, but that requires overwriting 50 million rows :stuck_out_tongue:. Maybe we’ll extract the results into a separate dataset?
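
As a rough illustration of what "detection signals as boolean expressions" could look like, here is a Python sketch that evaluates regex rules against HAR-like page data. The rule set, field names, and sample page are all made up for this example; they are not Wappalyzer's actual signatures and not the HAR schema used in BigQuery.

```python
import re

# Hypothetical rules: each app maps a signal type to a regex pattern.
RULES = {
    "Magento": {"url": r"/static/frontend/|/skin/frontend/", "cookie": r"^frontend$"},
    "jQuery": {"url": r"jquery[^/]*\.js"},
}


def detect(page, rules=RULES):
    """Return the sorted names of apps for which at least one signal matches."""
    found = set()
    for app, signals in rules.items():
        url_hit = "url" in signals and any(
            re.search(signals["url"], url) for url in page.get("urls", [])
        )
        cookie_hit = "cookie" in signals and any(
            re.search(signals["cookie"], name) for name in page.get("cookies", [])
        )
        if url_hit or cookie_hit:
            found.add(app)
    return sorted(found)


# Made-up page: request URLs and cookie names as they might be pulled from a HAR.
page = {
    "urls": [
        "https://example.com/static/frontend/theme/app.js",
        "https://example.com/js/jquery-3.2.1.min.js",
    ],
    "cookies": ["PHPSESSID"],
}
detected = detect(page)
print(detected)
```

In BigQuery each rule would become a boolean expression (for example a REGEXP_CONTAINS over request URLs), but the shape of the logic is the same: any matching signal marks the app as detected.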


#5

Ah! I see what happened there.

In my query I split on “,”, but some of the Wappalyzer values themselves contain comma-separated lists. Those got broken out into separate rows, and w/o a corresponding category name they were nullified by the regular expression. Since the query ignores the values and focuses on the categories, the category list is still correct, but the null count is meaningless.
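
The effect is easy to reproduce locally. The Python sketch below replays the query's steps (strip braces and quotes, split on commas, keep the text before the colon) on a trimmed copy of the sample `_detected` object from the first post, and then shows one possible alternative that parses the JSON and counts the keys instead. The function names are hypothetical.

```python
import json
import re
from collections import Counter

# Compact JSON so the split pieces carry no stray spaces.
detected_json = json.dumps(
    {
        "JavaScript Frameworks": "Moment.js,RequireJS,jQuery",
        "Web Servers": "Apache",
    },
    separators=(",", ":"),
)


def categories_by_split(text):
    """Replays the query: strip {, }, and quotes, split on ',', keep text before ':'."""
    stripped = re.sub(r'[{}"]+', "", text)
    result = []
    for piece in stripped.split(","):
        match = re.search(r"(.*):", piece)
        result.append(match.group(1) if match else None)
    return result


def categories_by_keys(text):
    """Alternative: parse the JSON and take the object keys directly."""
    return list(json.loads(text).keys())


naive = Counter(categories_by_split(detected_json))
robust = Counter(categories_by_keys(detected_json))
print(naive)
print(robust)
```

Each comma inside a value produces one extra piece with no colon, and therefore one null. A single page can contribute several of them, which is how the null count ends up larger than the number of pages in the table.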