Counting WordPress sites since 2016 using Wappalyzer detection

One goal of the GitHub issue “Integrate Wappalyzer platform detection” (#19) is to find a way to use BigQuery as the engine that drives Wappalyzer’s detections. In that thread, I shared an enormous 83 TB query that:

  • scans all response bodies and response headers of root documents since we started collecting them in 2016 and
  • extracts the pages detected to contain signs of WordPress (or at least the signs of something that depends on WordPress, like a plugin).

The lynchpin of the query is that I converted Wappalyzer’s JSON-based rules into a table on BigQuery: https://bigquery.cloud.google.com/table/httparchive:wappalyzer.apps

Here are the Wappalyzer rules for WordPress:

    "WordPress": {
      "cats": [
        1,
        11
      ],
      "js": {
        "wp_username": ""
      },
      "html": [
        "<link rel=[\"']stylesheet[\"'] [^>]+wp-(?:content|includes)",
        "<link[^>]+s\\d+\\.wp\\.com"
      ],
      "icon": "WordPress.svg",
      "implies": [
        "PHP",
        "MySQL"
      ],
      "meta": {
        "generator": "WordPress( [\\d.]+)?\\;version:\\1"
      },
      "script": "/wp-includes/",
      "website": "http://wordpress.org"
    }

And their representation on BigQuery:


  {
    "meta": [
      {
        "value": "WordPress( [\\d.]+)?",
        "name": "generator"
      }
    ],
    "scripts": null,
    "headers": [],
    "html": [
      "<link rel=[\"']stylesheet[\"'] [^>]+wp-(?:content|includes)",
      "<link[^>]+s\\d+\\.wp\\.com"
    ],
    "categories": [
      "CMS",
      "Blogs"
    ],
    "name": "WordPress",
    "script": [
      "/wp-includes/"
    ],
    "url": null,
    "implies": [
      "PHP"
    ]
  }
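As a rough illustration of the conversion described above, here is a minimal Python sketch that flattens a Wappalyzer-style rule into a row shaped like the BigQuery record. The helper names (`as_list`, `strip_tags`, `to_row`) and the two-entry category lookup are assumptions made for this example, not the script actually used to build the table. It also keeps every `implies` entry and does not try to reproduce every column shown above.

```python
import json

# Hypothetical lookup covering only the two category IDs that appear above.
CATEGORY_NAMES = {1: "CMS", 11: "Blogs"}

# A trimmed copy of the WordPress rule from this post.
RULE_JSON = r'''
{
  "cats": [1, 11],
  "html": [
    "<link rel=[\"']stylesheet[\"'] [^>]+wp-(?:content|includes)",
    "<link[^>]+s\\d+\\.wp\\.com"
  ],
  "implies": ["PHP", "MySQL"],
  "meta": {"generator": "WordPress( [\\d.]+)?\\;version:\\1"},
  "script": "/wp-includes/"
}
'''


def as_list(value):
    """Rule fields may hold one string or a list; normalize to a list."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def strip_tags(pattern):
    """Keep the regex and drop a trailing '\\;version:...' style tag."""
    return pattern.split("\\;", 1)[0]


def to_row(name, rule):
    """Flatten one rule into a dict with list-valued, table-friendly fields."""
    return {
        "name": name,
        "categories": [CATEGORY_NAMES[c] for c in rule.get("cats", [])],
        "html": [strip_tags(p) for p in as_list(rule.get("html"))],
        "script": [strip_tags(p) for p in as_list(rule.get("script"))],
        "meta": [
            {"name": key, "value": strip_tags(value)}
            for key, value in rule.get("meta", {}).items()
        ],
        "implies": as_list(rule.get("implies")),
    }


row = to_row("WordPress", json.loads(RULE_JSON))
print(json.dumps(row, indent=2))
```

The main point is that single-string fields such as `script` become arrays and the version tag is split off the `generator` pattern, which matches the shape of the record above.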

So, for example, we can run REGEXP_CONTAINS queries using the html regular expressions against two-year-old response bodies.
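To make the matching concrete, here is a small local sketch in Python that applies the two html patterns to sample markup. Python's `re` module stands in for BigQuery's regex engine here, on the assumption that these two patterns behave the same in both. The sample strings and the `matches_any` helper are made up for illustration.

```python
import re

# The two "html" patterns from the WordPress rule.
HTML_PATTERNS = [
    r"""<link rel=["']stylesheet["'] [^>]+wp-(?:content|includes)""",
    r"<link[^>]+s\d+\.wp\.com",
]


def matches_any(body, patterns=HTML_PATTERNS):
    """Local stand-in for OR-ing REGEXP_CONTAINS(body, pattern) over patterns."""
    return any(re.search(pattern, body) for pattern in patterns)


# Hypothetical response-body fragments.
theme_css = (
    '<link rel="stylesheet" id="x-css" '
    'href="https://example.com/wp-content/themes/x/style.css">'
)
wpcom_asset = '<link rel="dns-prefetch" href="//s0.wp.com">'
plain_site = '<link rel="stylesheet" href="/static/site.css">'

for label, body in [
    ("theme_css", theme_css),
    ("wpcom_asset", wpcom_asset),
    ("plain_site", plain_site),
]:
    print(label, matches_any(body))
```

The first two fragments match and the third does not. In the real query the same check runs once per response body, which is why scanning every crawl since 2016 adds up to so many terabytes.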

In the most recent desktop crawl, the query detected WordPress directly or indirectly on 199,022 URLs. Of those, 103,328 URLs were direct detections.

Plotting the detections over time for desktop/mobile:

[Chart: WordPress detections over time, desktop and mobile]

It’s a bit wonky. Making sense of every dip and spike requires knowing all of the pipeline bugs over the years. But the 199,022 figure seems reasonable? It’s been in that neighborhood for about a year.

Here’s the same thing but only the direct detections:

[Chart: direct WordPress detections over time, desktop and mobile]

A few interesting observations:

  • 1/1/2016 - 5/1/2017: There’s a steady decline for both desktop and mobile. Hard to tell what the cause is.
  • 7/1/2016: Mobile is late to the party.
  • 11/1/2016: Big drop, but recovered. Side effect of a pipeline error?
  • 5/1/2017: Influx of positive detections. Did our crawl artifacts expand coverage to allow for more detections?
  • 5/1/2017 - present: Even accounting for the big jumps, there still seems to be a decline. And the gap between desktop and mobile is widening.
  • 7/15/2017: Another big spike, but mostly contained to direct detections on desktop. Also on this date, implied mobile detections dropped.
  • 2/1/2018: Spike in mobile detections. This looks like a correction of the drop noted above.
  • 3/15/2018: Huge drop, but recovered. Side effect of a pipeline error?

In @paulcalvano’s awesome analysis of the BigQuery metadata, he noted a few anomalies that correlate with some of these. For example, the 7/15/2017 drop in mobile detections could be explained by missing data in the android_request_bodies tables.

Next I’ll be analyzing performance data about the WordPress sites and comparing that to the baseline.

Depending on the exact URLs you are hitting, there may be a few HTTP response headers that indicate WordPress is being used.

X-Pingback : An example looks like: X-Pingback: https://matt.blog/xmlrpc.php

Link : This one is from Jetpack, and looks like: Link: <https://wp.me/p4-23J>; rel=shortlink

There is another one specific to WordPress.com, X-hacker. You can see all three of these in action by looking at the response headers for https://matt.blog/2018/02/11/palm-trees-and-sunset/

Based on your current response body searches, I don’t know that these headers would catch any additional sites. I’ll leave them here for easy reference if they are helpful in any way.
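For reference, the three header signals above could be checked with something like the following Python sketch. The function name, the signal labels, and the exact matching rules (for example, requiring the X-Pingback value to end in `/xmlrpc.php`) are assumptions for illustration, not Wappalyzer rules. The X-hacker value in the sample is a placeholder.

```python
def wordpress_header_signals(headers):
    """Return labels for WordPress-related signals found in response headers.

    `headers` is a plain dict of header name to value; names are compared
    case-insensitively.
    """
    normalized = {name.lower(): value for name, value in headers.items()}
    signals = []

    # X-Pingback usually points at the site's xmlrpc.php endpoint.
    if normalized.get("x-pingback", "").endswith("/xmlrpc.php"):
        signals.append("x-pingback")

    # Jetpack adds a wp.me shortlink in the Link header.
    link = normalized.get("link", "")
    if "wp.me/" in link and "rel=shortlink" in link:
        signals.append("link-shortlink")

    # X-hacker is specific to WordPress.com; its presence is the signal.
    if "x-hacker" in normalized:
        signals.append("x-hacker")

    return signals


sample_headers = {
    "X-Pingback": "https://matt.blog/xmlrpc.php",
    "Link": "<https://wp.me/p4-23J>; rel=shortlink",
    "X-hacker": "placeholder value",
}
print(wordpress_header_signals(sample_headers))
```

A check like this would only add coverage for pages whose bodies do not already match the html or script patterns.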

Thanks @josephscott! You may want to consider submitting a PR to Wappalyzer to add these signals to their WordPress detection. We’d benefit from the improvements downstream.