Tracking JavaScript library versions in HTTP Archive

Question for @souders @igrigorik

I was thinking it would be super awesome to figure out which JS libraries are out there during HTTP Archive sweeps. For example, you could run this detection on each page:

var emberVersion = window.Ember && window.Ember.VERSION;
var jQueryVersion = window.jQuery && window.jQuery.fn && window.jQuery.fn.jquery;

Keeping a full list of all these puppies may be a bit of work, but catching the top 20 or so libraries is fairly simple and cheap.

In the age of asset bundling, simply looking at requests is very inaccurate.

Thoughts?

It’s a cool idea. My only worry would be the scalability: it’s easy to list a few libraries, but maintaining this list can be a rather tedious process. That’s not to say that we shouldn’t do it – for that, I’ll defer to @souders and @pmeenan.

I’d be more concerned about how we represent it in the database. Would there be 20 new columns for the libraries? Where would we draw the line?

At some point would we be intersecting with builtwith? http://trends.builtwith.com/

builtwith is kind of evil: you need to create an account to even see the full list of libraries.

“Log in or Sign up for Free to see double the amount of technologies in this list.”

Even if you doubled it you would not cover quite a few high profile libraries.

Why not simply create a new table for this?

httparchive:runs.latest_pages_javascript_libraries_detected
run_id: int
library_name: string
library_version: string

Or something along those lines.
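To make the table idea concrete, here is a minimal sketch of how one run's detections could be flattened into rows with those three columns. The helper name `toRows` and the sample run id are made up for illustration; nothing like this exists in HTTP Archive today as far as this thread goes.

```javascript
// Hypothetical helper: flatten one run's detections into rows shaped like
// the proposed table (run_id, library_name, library_version).
function toRows(runId, detections) {
  return Object.keys(detections)
    // Skip libraries that were checked but not found (null/undefined/empty).
    .filter(function (name) { return Boolean(detections[name]); })
    .map(function (name) {
      return {
        run_id: runId,
        library_name: name,
        library_version: String(detections[name])
      };
    });
}

// Sample input: run id 42 is an assumption, versions are illustrative.
var rows = toRows(42, { jquery: "2.1.1", ember: "1.6.0", backbone: null });
console.log(rows);
```

One row per detected library means adding a new library never changes the schema, which sidesteps the "20 new columns" worry.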

As Ember (and other libraries) move to modules, these kinds of analyses will get increasingly difficult. The Ember community is already disproportionately undercounted in request-based and globals-based analyses, and this divide is only going to increase over the next year.

My concern here is that many of the detection mechanisms currently used in bulk just check for requests to the Google CDN, a particular file name, and so on.

Even @igrigorik put this out there with https://www.igvita.com/2013/06/20/http-archive-bigquery-web-performance-answers/ which falls short once people start bundling and so on.

I wonder if it is worth keeping, as a “design constraint” for Ember and other libraries, a simple JS detection mechanism (e.g. leave window.Ember.VERSION around even if using modules).
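As a rough sketch of what that constraint could look like in a bundled build: the library keeps everything private, but deliberately leaves a tiny version stub on the global object. `MyFramework` and its version number are placeholders, not a real library, and this is only one way a project might choose to do it.

```javascript
// Hypothetical library name and version, for illustration only.
(function (root) {
  // Everything inside this closure is private to the bundle.
  var internals = { VERSION: "1.2.3" };

  // Deliberately leave a minimal stub on the global object so a crawler
  // can still read the version, even though nothing else is exposed.
  root.MyFramework = { VERSION: internals.VERSION };
})(typeof window !== "undefined" ? window : globalThis);

// Crawler-side check, same shape as the window.Ember.VERSION check.
var globalObject = typeof window !== "undefined" ? window : globalThis;
var detectedVersion = globalObject.MyFramework && globalObject.MyFramework.VERSION;
console.log(detectedVersion);
```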

Very curious to see what the “top 10k Alexa” sites would reveal with a simple JS sniffing analysis, and how far it would differ from the request sniffing mechanism. How I wish I had time for this. @igrigorik, do you have an intern who could look at this? :slight_smile:

The fact that people have been putting out the request-based data with such serious deficiencies makes me doubt that there is high interest in doing a better analysis.

The request-based analysis is simply wrong. It’s like trying to count America’s income distribution by looking at people leaving Whole Foods. Frameworks used via CDNs are far more likely to be smaller frameworks without ecosystem-wide interest in build tools, which makes this analysis completely skewed.

Yes, I am frustrated that people are giving it credence, and yes, I am frustrated that smart people like @igrigorik are putting the data out there as an “interesting data point”.

I would love to see serious interest in doing this right, and would be quite interested in helping to figure out what we could do to make this analysis more broadly accurate.

In the absence of other methods, even a really poor one becomes an “interesting” crutch… Request-based analysis is far from perfect, but I don’t think it’s entirely without merit either – it’s a lower bound. That said, I do agree that we can do better by doing some basic content analysis.

Any tips for what we could/should look for in JS files to (more reliably) fingerprint use of various libraries?

I chose my analogy carefully. Respectfully, I disagree that it provides particularly useful information.

It may provide a “lower bound”, but people are using this data for analyses, and virtually all of those analyses are simply inappropriate.

I want to be clear: I am not “unskewing” and saying that Ember is more popular than Angular or something like that. But the kinds of analyses people are doing (year-over-year growth, for instance) require fairly accurate, fine-grained data, and the kind of data we’re seeing here is far more noise than signal.

Today, most libraries export a global variable you can hunt for, but as ES6 transpilers become more popular and stable, we’re going to see a significant uptake in “bundled” builds that completely eliminate all globals.

That’s what @sam was talking about when he said that maybe we can all agree to still do something that could be analyzed even in bundled builds.

Pardon my ignorance, what’s the global? Assuming I can run a string search / regex on the JS payloads, what should I be looking for?

You can run something like this with phantomjs. It is a 5-minute attempt at the problem, but it shows the general pattern:

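// Usage: phantomjs detect.js <url>
var page = require('webpage').create();
var system = require('system');

// system.args[0] is the script name; the URL is the first real argument.
var url = system.args[1];
console.log("Testing: " + url);

page.open(url, function(status){
  console.log("Status : " + status);

  // Nothing to inspect if the page failed to load.
  if (status !== "success") {
    phantom.exit(1);
    return;
  }

  // Each entry is an expression that yields a version when the
  // corresponding library is exposed as a global on the page.
  var tests = [ "Ember.VERSION",
                "jQuery.fn.jquery",
                "angular.version.full",
                "dojo.version.toString()",
                "React.version",
                "Backbone.VERSION",
                "_.VERSION",
                "MooTools.version"];

  tests.forEach(function(test) {
    // page.evaluate runs inside the page's own context. A missing global
    // throws, which is treated as "not detected".
    var result = page.evaluate(function(test){
      try {
        return eval(test);
      } catch(e){
        return null;
      }
    }, test);

    if(result){
      console.log("Detected " + test + " version: " + result);
    }
  });
  phantom.exit();
});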

Some sample results:



sam@ubuntu js_detect % phantomjs detect.js http://bbs.boingboing.net
Testing: http://bbs.boingboing.net
Status : success
Detected Ember.VERSION version: 1.6.0-beta.3+pre.2e4d67f6
Detected jQuery.fn.jquery version: 2.1.1
Detected _.VERSION version: 1.3.0

sam@ubuntu js_detect % phantomjs detect.js http://www.eatout.co.uk  
Testing: http://www.eatout.co.uk
Status : success
Detected MooTools.version version: 1.2.0


sam@ubuntu js_detect % phantomjs detect.js http://nameokay.com
Testing: http://nameokay.com
Status : success
Detected jQuery.fn.jquery version: 1.8.3
Detected angular.version.full version: 1.0.3

sam@ubuntu js_detect % phantomjs detect.js http://stackoverflow.com
Testing: http://stackoverflow.com
Status : success
Detected jQuery.fn.jquery version: 1.7.1

cc @wycats (wow Stack Overflow is on jQuery 1.7.1)

You could easily automate running the above script on, say, the top 10k Alexa URLs to get a somewhat better picture.

You would still need to allow for tricky loaders and such, but it would be “ball park”.
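If someone does automate this, the per-URL output needs to be turned into data. Here is a minimal sketch that parses the "Detected … version: …" lines from the sample results above; the helper name `parseDetections` is made up, and it assumes the output format stays exactly as shown.

```javascript
// Hypothetical helper: parse the stdout of the detection script into records.
function parseDetections(output) {
  // Matches lines such as "Detected jQuery.fn.jquery version: 1.8.3".
  var pattern = /^Detected (\S+) version: (.+)$/;
  var results = [];
  output.split("\n").forEach(function (line) {
    var match = pattern.exec(line.trim());
    if (match) {
      results.push({ test: match[1], version: match[2] });
    }
  });
  return results;
}

// Output copied from one of the sample runs in this thread.
var sample = [
  "Testing: http://nameokay.com",
  "Status : success",
  "Detected jQuery.fn.jquery version: 1.8.3",
  "Detected angular.version.full version: 1.0.3"
].join("\n");

console.log(parseDetections(sample));
```

The "Testing:" and "Status :" lines are ignored because they do not match the pattern.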

Nice! I like that approach.

In parallel… I’ve imported HTML, JS, and CSS response bodies for a few crawls (Aug 1 and Aug 15, 2014) into BigQuery, which means we can now actually peek inside the files themselves. Any tips for what I could or should look for in these files? Unfortunately I can’t eval them, like you’re doing above, but we can run arbitrary analysis on the content itself.

Fingerprinting is going to be super hard. However, if all the data is there, we could feed it into phantom after the fact and then ask it.

Hard, but perhaps not impossible? Seems like we should be able to write some queries to detect it… Running things through phantom would require exporting all the data and executing it, which is a big undertaking – one off, sure, but hard to scale to run on every crawl.
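For what a text-based query might look for, here is a rough sketch in JavaScript of two regex fingerprints run against a response body. The patterns are assumptions based on how unminified or lightly minified builds of these libraries typically announce their version (a banner comment for jQuery, a `Backbone.VERSION = '…'` assignment for Backbone). Minifiers and bundlers often strip or rename exactly these markers, so treat this as illustrative rather than reliable.

```javascript
// Assumed patterns; real-world bundles frequently strip or rename these.
var FINGERPRINTS = [
  // e.g. "/*! jQuery v1.8.3 ..." or "jQuery JavaScript Library v1.7.1"
  { name: "jquery", pattern: /jQuery (?:JavaScript Library )?v(\d+\.\d+(?:\.\d+)?)/ },
  // e.g. "Backbone.VERSION = '1.1.2';"
  { name: "backbone", pattern: /Backbone\.VERSION\s*=\s*['"]([^'"]+)['"]/ }
];

// Hypothetical helper: return every library whose fingerprint matches the body.
function fingerprintBody(body) {
  var found = [];
  FINGERPRINTS.forEach(function (fp) {
    var match = fp.pattern.exec(body);
    if (match) {
      found.push({ name: fp.name, version: match[1] });
    }
  });
  return found;
}

// Made-up body that concatenates two libraries, as a bundler might.
var body = "/*! jQuery v1.8.3 jquery.com | jquery.org/license */\n" +
           "(function(){ /* ... */ })();\n" +
           "Backbone.VERSION = '1.1.2';";

console.log(fingerprintBody(body));
```

The same regular expressions could in principle be ported to a BigQuery regex match over the imported bodies, but that port is not shown here.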

I don’t think text analysis is workable. httparchive is already running all the JS, so all we really want is to inject a bit of JS to run after all the data is collected. That seems pretty low impact on the actual test run times.

@stevesoudersorg I believe we can add custom metrics. WDYT of adding something to run a version check for popular libraries at the end of each run?

The always amazing @pmeenan has built the ability to run some JS at the end of each run. This would be possible but it would be a lot of work to then store the data esp. going forward. As was pointed out in this thread, do we stop at 5 frameworks? 10? 100?

There is a great litmus test we could use: https://cdnjs.com/ is a good list of “common” JS libraries.

Ideally, the detection script would live on GitHub. I would be very happy to kick it off with 20 or so detections, and then the community can add more as they see fit. The cost of detecting a lib is so cheap that it does not matter if you do 20 vs 1000.

@stevesoudersorg @pmeenan have a crazy proposal, hear me out…

  1. Have a public repo (or sub dir) that contains custom “user scripts” that run at the end of each page load. Said scripts are automatically synced before each crawl - i.e. there is a simple way to add new scripts and a process for getting them deployed.
  2. Each script is allowed to emit a single key, and an arbitrary string value, which is automatically imported into our results (this logic is already in place for BigQuery, since I generate the schema on each import)

Assuming above is present, we could implement the framework functionality as a single script that runs at the end of each page load: the script enumerates some list of frameworks (submit a pull request to update the list in the future), and emits a “framework-name: version” tuple, and all the tuples are then joined into a single string - e.g. “jquery: 3.0.0, angularjs: 1.3.1, …”. This allows us to export an arbitrary number of variables while keeping the number of high-level keys reasonably small. The values in the comma separated string are trivial to analyze with BigQuery - SPLIT() on comma and you’re off to the races.
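A minimal sketch of such a script, under some assumptions: the detector list, the key names, and the helpers `detectFrameworks` and `parseFrameworks` are all hypothetical, and how the returned string actually gets handed to the crawler is left out. It takes the window object as a parameter so it can be tried against a fake page.

```javascript
// Hypothetical detector list; a pull request would add entries here.
var DETECTORS = {
  jquery: function (w) { return w.jQuery && w.jQuery.fn && w.jQuery.fn.jquery; },
  ember: function (w) { return w.Ember && w.Ember.VERSION; },
  angularjs: function (w) { return w.angular && w.angular.version && w.angular.version.full; },
  backbone: function (w) { return w.Backbone && w.Backbone.VERSION; }
};

// Build the single string value, e.g. "jquery: 2.1.1, angularjs: 1.0.3".
function detectFrameworks(w) {
  var found = [];
  Object.keys(DETECTORS).forEach(function (name) {
    var version = null;
    try {
      version = DETECTORS[name](w);
    } catch (e) {
      version = null; // a page that breaks a detector counts as "not detected"
    }
    if (typeof version === "string" && version) {
      found.push(name + ": " + version);
    }
  });
  return found.join(", ");
}

// Analysis side: the JavaScript equivalent of splitting on the comma.
function parseFrameworks(value) {
  var out = {};
  if (!value) return out;
  value.split(", ").forEach(function (pair) {
    var idx = pair.indexOf(": ");
    out[pair.slice(0, idx)] = pair.slice(idx + 2);
  });
  return out;
}

// Fake window standing in for a real page.
var fakeWindow = {
  jQuery: { fn: { jquery: "2.1.1" } },
  angular: { version: { full: "1.0.3" } }
};
var metric = detectFrameworks(fakeWindow);
console.log(metric); // "jquery: 2.1.1, angularjs: 1.0.3"
console.log(parseFrameworks(metric));
```

This assumes version strings never contain ", " themselves; if that turned out to be false, a different separator or JSON encoding would be safer.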

Crazytalk?

That would work. I’m open for pull requests.
