It’s a cool idea. My only worry would be the scalability: it’s easy to list a few libraries, but maintaining this list can be a rather tedious process. That’s not to say that we shouldn’t do it – for that, I’ll defer to @souders and @pmeenan.
As Ember (and other libraries) move to modules, these kinds of analyses will get increasingly difficult. The Ember community is already disproportionately undercounted in request-based and globals-based analyses, and this divide is only going to increase over the next year.
I wonder if it is worth keeping, as a “design constraint” for Ember and other libraries, a simple JS detection mechanism (e.g. leave window.Ember.VERSION around even when using modules).
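To make that concrete, here is a minimal sketch of what such a marker could look like. This is not Ember's actual build output: the helper names (`leaveVersionMarker`, `readVersionMarker`), the `GlobalLike` type, and the version string are all made up for illustration, and a plain object stands in for `window` so the snippet runs outside a browser.

```typescript
// Stand-in for the browser's global object (window). Hypothetical type name.
type GlobalLike = Record<string, unknown>;

// Leaves a minimal, frozen marker such as window.Ember.VERSION behind,
// even when the library itself is consumed as a module.
function leaveVersionMarker(globalObj: GlobalLike, name: string, version: string): void {
  const existing = globalObj[name];
  if (typeof existing === "object" && existing !== null) {
    return; // something already lives under this name; do not clobber it
  }
  globalObj[name] = Object.freeze({ VERSION: version });
}

// What a crawler-side sniffing script could do to read the marker back.
function readVersionMarker(globalObj: GlobalLike, name: string): string | null {
  const candidate = globalObj[name];
  if (typeof candidate !== "object" || candidate === null) {
    return null;
  }
  const version = (candidate as { VERSION?: unknown }).VERSION;
  return typeof version === "string" ? version : null;
}

// In a page this would be called with `window`; "1.7.0" is a placeholder.
const fakeWindow: GlobalLike = {};
leaveVersionMarker(fakeWindow, "Ember", "1.7.0");
console.log(readVersionMarker(fakeWindow, "Ember"));
```

The marker costs a few bytes and does not expose the library's API, so a bundled build could keep it without bringing back the full global.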
Very curious to see what the “top 10k Alexa” sites would reveal with a simple JS sniffing analysis, and how far it would differ from the request sniffing mechanism. How I wish I had time for this. @igrigorik, do you have an intern who could look at this?
The fact that people have been putting out the request-based data with such serious deficiencies makes me doubt that there is high interest in doing a better analysis.
The request-based analysis is simply wrong. It’s like trying to count America’s income distribution by looking at people leaving Whole Foods. Frameworks used via CDNs are far more likely to be smaller frameworks without ecosystem-wide interest in build tools, which makes this analysis completely skewed.
Yes, I am frustrated that people are giving it credence, and yes, I am frustrated that smart people like @igrigorik are putting the data out there as an “interesting data point”.
I would love to see serious interest in doing this right, and would be quite interested in helping to figure out what we could do to make this analysis more broadly accurate.
In the absence of other methods, even a really poor one becomes an “interesting” crutch… Request-based analysis is far from perfect, but I don’t think it’s entirely without merit either – it’s a lower bound. That said, I do agree that we can do better by doing some basic content analysis.
Any tips for what we could/should look for in JS files to (more reliably) fingerprint use of various libraries?
I chose my analogy carefully. Respectfully, I disagree that it provides particularly useful information.
It may provide a “lower bound”, but people are using this data for analyses, and virtually all of those analyses are simply inappropriate.
I want to be clear: I am not “unskewing” and saying that Ember is more popular than Angular or something like that. But the kinds of analyses people are doing (year-over-year growth, for instance) require fairly accurate, fine-grained data, and the kind of data we’re seeing here is far more noise than signal.
Today, most libraries export a global variable you can hunt for, but as ES6 transpilers become more popular and stable, we’re going to see a significant uptake in “bundled” builds that completely eliminate all globals.
That’s what @sam was talking about when he said that maybe we can all agree to still do something that could be analyzed even in bundled builds.
In parallel… I’ve imported HTML, JS, and CSS response bodies for a few crawls (Aug 1 and Aug 15, 2014) into BigQuery, which means we can now actually peek inside of the files themselves. Any tips for what I could or should look for in these files? Unfortunately I can’t eval them, like you’re doing above, but we can run arbitrary analysis on the content itself.
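One candidate signal in the bodies is the license banner that many distributed builds carry, which usually includes a version. The sketch below is only an illustration of that idea on a single body string: the rule list, the `BannerRule` type, and `fingerprintBody` are assumptions, the two patterns cover banner formats as they commonly appear in jQuery and AngularJS builds, and the same regexes would still need to be translated into whatever the query engine supports.

```typescript
interface BannerRule {
  library: string;
  pattern: RegExp; // first capture group is the version string
}

// Illustrative rules only; real coverage would need many more patterns.
const bannerRules: BannerRule[] = [
  { library: "jquery", pattern: /jQuery (?:JavaScript Library )?v(\d+\.\d+(?:\.\d+)?)/ },
  { library: "angularjs", pattern: /AngularJS v(\d+\.\d+\.\d+)/ },
];

// Scans one response body and returns library -> version for every banner found.
function fingerprintBody(body: string): Record<string, string> {
  const found: Record<string, string> = {};
  for (const rule of bannerRules) {
    const match = rule.pattern.exec(body);
    if (match !== null && match[1] !== undefined) {
      found[rule.library] = match[1];
    }
  }
  return found;
}

const sampleBody = "/*! jQuery v1.11.1 | (c) 2005, 2014 jQuery Foundation, Inc. | jquery.org/license */\n(function(){})();";
console.log(fingerprintBody(sampleBody));
```

The obvious weakness is that minifiers and bundlers often strip or merge these comments, so banner matching would undercount exactly the bundled builds discussed above.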
Hard, but perhaps not impossible? Seems like we should be able to write some queries to detect it… Running things through phantom would require exporting all the data and executing it, which is a big undertaking – one off, sure, but hard to scale to run on every crawl.
I don’t think text analysis is workable. httparchive is already running all the JS, so all we really want is to inject a bit of JS to run after all the data is collected. That seems pretty low impact on the actual test run times.
The always amazing @pmeenan has built the ability to run some JS at the end of each run. This would be possible but it would be a lot of work to then store the data esp. going forward. As was pointed out in this thread, do we stop at 5 frameworks? 10? 100?
There is a great litmus test we could use: https://cdnjs.com/ is a good list of “common” JS libraries.
Ideally the detection script would live on GitHub. I would be very happy to kick it off with 20 or so detections, and then the community can add more as they see fit. The cost of detecting a lib is so cheap that it does not matter if you do 20 vs 1000.
Have a public repo (or sub dir) that contains custom “user scripts” that run at the end of each page load. Said scripts are automatically synced before each crawl - i.e. there is a simple way to add new scripts and a process for getting them deployed.
Each script is allowed to emit a single key and an arbitrary string value, which is automatically imported into our results (this logic is already in place for BigQuery, since I generate the schema on each import).
Assuming the above is in place, we could implement the framework functionality as a single script that runs at the end of each page load: the script enumerates some list of frameworks (submit a pull request to update the list in the future) and emits a “framework-name: version” tuple for each one it finds, and all the tuples are then joined into a single string - e.g. “jquery: 3.0.0, angularjs: 1.3.1, …”. This allows us to export an arbitrary number of variables while keeping the number of high-level keys reasonably small. The values in the comma-separated string are trivial to analyze with BigQuery - SPLIT() on comma and you’re off to the races.
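A rough sketch of what that single script could look like, plus the matching parse step on the analysis side. The detector list is just a starter set that reads each library's own version property (`jQuery.fn.jquery`, `angular.version.full`, `Ember.VERSION`, `Backbone.VERSION`); the names `detectors`, `emitFrameworks`, and `parseFrameworks` are hypothetical, and the globals object is passed in as a parameter so the code can be exercised without a real page.

```typescript
// Stand-in for the page's global object (window). Hypothetical type name.
type PageGlobals = Record<string, unknown>;

interface Detector {
  name: string;
  versionPath: string[]; // property names leading from the global object to a version string
}

// Starter list; new entries would arrive via pull request.
const detectors: Detector[] = [
  { name: "jquery", versionPath: ["jQuery", "fn", "jquery"] },
  { name: "angularjs", versionPath: ["angular", "version", "full"] },
  { name: "ember", versionPath: ["Ember", "VERSION"] },
  { name: "backbone", versionPath: ["Backbone", "VERSION"] },
];

// Walks the property path and returns the value only if it is a string.
function readPath(root: PageGlobals, path: string[]): string | null {
  let current: unknown = root;
  for (const key of path) {
    const walkable = typeof current === "object" || typeof current === "function";
    if (!walkable || current === null) {
      return null;
    }
    current = (current as Record<string, unknown>)[key];
  }
  return typeof current === "string" ? current : null;
}

// Produces the single string value, e.g. "jquery: 3.0.0, angularjs: 1.3.1".
function emitFrameworks(globals: PageGlobals): string {
  const tuples: string[] = [];
  for (const detector of detectors) {
    const version = readPath(globals, detector.versionPath);
    if (version !== null) {
      tuples.push(`${detector.name}: ${version}`);
    }
  }
  return tuples.join(", ");
}

// Mirrors the split-on-comma step that the analysis side would perform.
function parseFrameworks(value: string): Record<string, string> {
  const result: Record<string, string> = {};
  for (const tuple of value.split(",")) {
    const separator = tuple.indexOf(":");
    if (separator === -1) {
      continue; // skips empty or malformed entries
    }
    result[tuple.slice(0, separator).trim()] = tuple.slice(separator + 1).trim();
  }
  return result;
}
```

In a page, `emitFrameworks(window)` would be the value the script emits. One thing to settle up front is that the format relies on names and versions never containing a comma or a colon in the name part.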