State of the WordPress ecosystem


#1

The topic of WordPress performance has come up on this forum in the past, but we didn’t really have a good answer for it at the time. We’re looking to integrate Wappalyzer into our data processing pipeline, but some engineering work is still required to make that a reality.

We can still emulate Wappalyzer’s WordPress detection directly in BigQuery. For example:

SELECT
  DISTINCT page
FROM
  `httparchive.har.2017_10_15_chrome_requests_bodies`
WHERE
  REGEXP_CONTAINS(body, "<link rel=[\"']stylesheet[\"'] [^>]+wp-(?:content|includes)")

:warning: This query consumes 76% of your free monthly quota! Tip: go to bit.ly/ha50 for an additional 10TB free while supplies last.

This query takes one of the signals in Wappalyzer (a stylesheet link with wp-content or wp-includes in the resource URL) and finds all pages on which that link is used. There are ~90k pages (18%) matching this pattern. It’s not perfect and other signals are needed to definitively determine if a site uses WordPress, but it’s a good start considering BuiltWith estimates the top 500K to be somewhere between 20-30% WordPress. For example, https://techcrunch.com/ is a known WordPress site tracked by HTTP Archive, but it’s not in the list of 90k despite having links containing the detected keywords, because none of them are rel="stylesheet".
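
To get a feel for what this signal catches and misses, here is a minimal Python sketch that applies the same regular expression to a few response bodies. The sample HTML and the `looks_like_wordpress` helper are made up for illustration; only the pattern itself comes from the query above.

```python
import re

# Same pattern as the BigQuery query above, written as a Python raw string.
WP_STYLESHEET = re.compile(
    r"""<link rel=["']stylesheet["'] [^>]+wp-(?:content|includes)"""
)

def looks_like_wordpress(body: str) -> bool:
    """Return True if the response body contains the stylesheet signal."""
    return WP_STYLESHEET.search(body) is not None

# Hypothetical response bodies, for illustration only.
samples = {
    "stylesheet link": '<link rel="stylesheet" id="theme-css" '
                       'href="https://example.com/wp-content/themes/t/style.css">',
    "non-stylesheet link": '<link rel="icon" '
                           'href="https://example.com/wp-content/uploads/favicon.png">',
    "href before rel": '<link href="https://example.com/wp-includes/css/a.css" '
                       'rel="stylesheet">',
}

results = {name: looks_like_wordpress(body) for name, body in samples.items()}
for name, matched in results.items():
    print(f"{name}: {matched}")
```

Only the first sample matches. The second is the TechCrunch-style miss (the keyword is there, but not on a stylesheet link), and the third shows that the pattern also depends on rel being the first attribute.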

Don’t run the query above. Instead, you can go directly to the results that have been saved to a scratchspace table: https://bigquery.cloud.google.com/table/httparchive:scratchspace.wordpress?tab=preview

You can join this table with the har/runs datasets to find out more about WordPress performance. For example:

SELECT
  url IN (SELECT url FROM `httparchive.scratchspace.wordpress`) AS wordpress,
  ROUND(APPROX_QUANTILES(bytesTotal, 1001)[OFFSET(501)] / 1024) AS totalKB
FROM
  `httparchive.runs.2017_10_15_pages`
GROUP BY
  wordpress

We get 1590 KB for wordpress=false and 1897 KB for wordpress=true. So this simple example demonstrates that WordPress pages tend to be heavier than non-WordPress pages. This also raises more questions, like what makes them heavier (scripts, images, videos, etc.) and what needs to be done to fix it (minification, caching, service workers, etc.).
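
For anyone who wants to check the logic on a small extract outside BigQuery, the same "flag by membership, then take the median per group" step might look like this in Python. The URLs and byte counts are invented, and this uses an exact median rather than the approximate quantiles in the query.

```python
from statistics import median

# Hypothetical stand-ins for the scratchspace table and the pages table.
wordpress_urls = {"https://a.example/", "https://b.example/"}
pages = [
    ("https://a.example/", 2_000_000),
    ("https://b.example/", 1_800_000),
    ("https://c.example/", 1_500_000),
    ("https://d.example/", 1_700_000),
    ("https://e.example/", 900_000),
]

def median_kb_by_group(pages, wordpress_urls):
    """Median page weight in KB, keyed by whether the URL is in the WordPress set."""
    groups = {True: [], False: []}
    for url, bytes_total in pages:
        groups[url in wordpress_urls].append(bytes_total)
    return {
        flag: round(median(values) / 1024)
        for flag, values in groups.items()
        if values
    }

medians = median_kb_by_group(pages, wordpress_urls)
print(medians)
```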


I’d really love the help of the HTTP Archive community to dig into this data more and find other interesting conclusions about the WordPress ecosystem. @amedina and I will be travelling to Nashville in a couple of weeks to share these findings at WordCamp US.

I’ll be updating this thread with more analyses as they happen. Feel free to reply if you’ve found anything interesting!

Some areas for exploration should you need any ideas:

  • Do images tend to be less optimized?
  • Is there a greater reliance on third parties? What is the relative effect?
  • Do WordPress pages tend to have more JS vulnerabilities? (see the new vulnerability audits in the Lighthouse results)
  • Do WordPress pages tend to have more a11y audit failures? (again, see Lighthouse data)
  • Join with the origins in the Chrome UX Report to find out relative real user performance.
  • Are WordPress websites more or less likely to do “coinjacking” (background currency mining)?

We could ask these questions (and many many more) of any group of websites. Some WordPress-specific areas to explore include detected plugins and themes and how they influence performance, WordPress hosting services, and enterprise/VIP sites compared to other WordPress sites.


#2

Here are the Lighthouse a11y scores for WordPress vs all other pages.

SELECT
  url IN (SELECT url FROM `httparchive.scratchspace.wordpress`) AS is_wordpress,
  JSON_EXTRACT_SCALAR(report, '$.reportCategories[2].score') AS a11y_score,
  COUNT(0) AS volume
FROM
  `httparchive.lighthouse.2017_10_15_mobile`
WHERE
  report IS NOT NULL
GROUP BY
  is_wordpress,
  a11y_score
ORDER BY
  is_wordpress,
  a11y_score

Also remember that Lighthouse results are only available for mobile.

[Image: Lighthouse a11y score distribution, WordPress vs. all other pages]

A11y scores for WordPress pages tend to be more tightly packed in the 85-90 range, while non-WordPress pages are more spread out across the higher and lower scores. Maybe this could be explained by WordPress pages being built on top of a core template with many of the a11y best practices built in? Someone more experienced with standard deviations, help me explain this better :stuck_out_tongue:. The median score isn’t a very helpful metric because there are so few distinct scores (only 15). FWIW each page type has a median score of 85.71.

Another way of looking at the data is to set some threshold for a good score and measure the percent of scores that are better than it. For a score threshold of 90:

    WordPress: 29%
Not WordPress: 33%

But because of the big WordPress mode at 85.71, lowering the threshold to 85 shows WordPress being the winner:

    WordPress: 72%
Not WordPress: 69%

The thresholds are arbitrary, but this data shows that WordPress pages are less likely to get the really good or perfect a11y scores.
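
The way the "winner" flips with the threshold can be reproduced with a tiny Python sketch. The two score lists below are invented to mimic the shapes described (one clustered at 85.71, one more spread out), and the threshold is treated as inclusive, which is an assumption.

```python
def pct_at_or_above(scores, threshold):
    """Percent of scores that meet or exceed the threshold."""
    return 100 * sum(score >= threshold for score in scores) / len(scores)

# Hypothetical score lists, not real Lighthouse data.
clustered = [78.57, 85.71, 85.71, 85.71, 85.71, 85.71, 85.71, 92.86, 92.86, 100.0]
spread = [50.0, 64.29, 71.43, 78.57, 85.71, 85.71, 92.86, 92.86, 100.0, 100.0]

for threshold in (90, 85):
    print(
        threshold,
        pct_at_or_above(clustered, threshold),
        pct_at_or_above(spread, threshold),
    )
```

With these made-up lists the spread-out group leads at a threshold of 90 and the clustered group leads at 85, which is the same pattern as in the real numbers above.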


#3

Here’s a look at image usage on WordPress compared to the rest of the web.

#standardSQL
SELECT
  url IN (SELECT url FROM `httparchive.scratchspace.wordpress`) AS is_wordpress,
  APPROX_QUANTILES(CAST(JSON_EXTRACT(payload, '$._image_savings') AS INT64), 1001)[OFFSET(501)] AS median_img_savings
FROM
  `httparchive.har.2017_10_15_chrome_pages`
WHERE
  payload IS NOT NULL
GROUP BY
  is_wordpress
ORDER BY
  is_wordpress

This query compares the median number of image bytes that could be saved for WordPress sites vs others. Tweaking it slightly, we can get this as a percent of all image bytes:

#standardSQL
SELECT
  url IN (SELECT url FROM `httparchive.scratchspace.wordpress`) AS is_wordpress,
  SUM(CAST(JSON_EXTRACT(payload, '$._image_savings') AS INT64)) / SUM(CAST(JSON_EXTRACT(payload, '$._image_total') AS INT64)) AS pct_img_savings
FROM 
  `httparchive.har.2017_10_15_chrome_pages`
WHERE
  payload IS NOT NULL
GROUP BY
  is_wordpress
ORDER BY
  is_wordpress

The results are relatively positive for WordPress.

                 Wasted Image KB    Wasted Image %
Not WordPress    63.57              26.03%
WordPress        57.83              19.98%

While images on WordPress could save a median of ~58 KB with image optimization, this is still marginally better than non-WordPress sites. Even more significant is the % of wasted bytes. ~1/4 of image bytes on non-WordPress sites are wasteful whereas ~1/5 of image bytes on WordPress are wasteful. Still not great, but better by comparison.
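
The two queries measure different things, which a small worked example makes easier to see. This is only a sketch with invented (savings, total) byte counts per page: the first metric is a per-page median, while the second is a ratio of sums, so heavy pages weigh more in it.

```python
from statistics import mean, median

# Hypothetical (image_savings, image_total) byte counts for three pages.
pages = [
    (10_000, 100_000),
    (50_000, 200_000),
    (400_000, 1_000_000),
]

# Metric 1: median wasted image bytes per page, converted to KB.
median_savings_kb = median(savings for savings, _ in pages) / 1024

# Metric 2: share of all image bytes that are wasted (ratio of sums).
pct_savings = sum(s for s, _ in pages) / sum(t for _, t in pages)

# For contrast: the average of the per-page ratios, which the query does NOT compute.
mean_page_ratio = mean(s / t for s, t in pages)

print(round(median_savings_kb, 2), round(pct_savings, 4), round(mean_page_ratio, 4))
```

Here the ratio of sums is noticeably higher than the average per-page ratio because the one heavy page dominates the totals.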

@amedina I wonder if this has to do with the availability of tools/plugins in the WordPress ecosystem that optimize images effortlessly, as opposed to other web development pipelines that require manual image optimization or installation of specialized tools. The next question is how can we lower these numbers? Here are some possible scenarios:

  • WordPress developers are not aware of the issue and the availability of tools to optimize images.
  • The tools are not maximally efficient/sophisticated, for example in how they optimize file formats and quality settings.

One thing to note is that WPT calculates image savings for JPEGs by comparing them to optimized versions at 85% quality and similarly comparing PNGs against optimized versions. It also checks GIF savings when converted to PNG. This doesn’t currently support optimizations with WebP.

Another aspect of images that we can analyze is the relative proportion of the different image formats. The requests tables include a mimeType field in which images are prefixed by image/. So for example, to get the image format, we just take the suffix of those mimeTypes:

SELECT
  REGEXP_EXTRACT(LOWER(mimeType), 'image/([\\w-]*)') AS type
FROM
  `httparchive.runs.2017_10_15_requests`
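
As a quick sanity check on what that pattern returns, here is a rough Python equivalent of the extraction. The `image_format` helper and the sample values are illustrative, and re.ASCII is used on the assumption that it keeps \w close to how BigQuery's regex engine treats it.

```python
import re

IMAGE_TYPE = re.compile(r"image/([\w-]*)", re.ASCII)

def image_format(mime_type):
    """Return the part after image/ (lowercased), or None for non-images."""
    match = IMAGE_TYPE.search(mime_type.lower())
    return match.group(1) if match else None

# Hypothetical mimeType values.
examples = ["image/jpeg", "Image/PNG", "image/svg+xml", "image/x-icon", "text/html"]
formats = {mime: image_format(mime) for mime in examples}
for mime, fmt in formats.items():
    print(f"{mime} -> {fmt}")
```

Note that image/svg+xml comes out as svg because + is outside the character class, and that non-image types give None, so they would need to be filtered out before counting.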

Getting the proportions for each image format grouped by (Not) WordPress was a multi-step process with a couple of intermediary tables, so it’s not as straightforward to show as a single query. This was necessary because the requests tables are ~50GB each, and I needed to join the requests tables in the runs and har datasets in order to get both the mimeType and page URL.

Type    Not WordPress    WordPress
png     63.49%           67.82%
gif     29.83%           23.99%
svg     5.92%            7.49%
jpeg    0.47%            0.42%
bmp     0.19%            0.16%
webp    0.08%            0.11%

There are some interesting observations we can make. For example, GIF usage is lower on WordPress websites while PNG and SVG usage is higher. There is also a very slight bump in WebP usage, but we’re talking 3/100 of a percent so not very significant. The high-level meta observation is that image format usage on WordPress is generally the same as Not WordPress; there are some differences but nothing shocking.


#4

I wonder if this is related to the use of Photon on both WordPress.com and other WordPress sites through Jetpack+Photon.


#5

Could be! I think there’s a correlation between the level of image optimization and the availability and ease of use of optimization tools. A simple WordPress plugin would be a huge advantage.


#6

I just tested with regular WordPress: Uploaded a 2.1 MB JPG. The images I see in Chrome are WebP, and resized to fit my screen. Simply uploading and using an image in WordPress allows for simple optimizations:


#7

Wait it’s WebP? So how could it also be saved at JPEG quality 96? So weird that it’s still got the .jpg extension.

I’m skeptical that this is the default behavior, because we still see ~4x more JPEGs than WebPs even when HTTP Archive is running WPT in Chrome. Maybe this is new behavior and most of the WordPress sites in HTTP Archive are running older software?


#8

Good point on the JPG compression: I think when I right-clicked and downloaded, I must have gotten a different file, since Macs do not support WebP.

It does appear that there is some algorithm for deciding whether the image should be WebP vs. JPG.


#9

The test site - dougsillars.com - is on WordPress.com, which uses Photon for images. One of the features of Photon is that it can automatically convert JPEGs to WebP. From the docs on the quality setting:

Note that if the requesting web browser supports the WebP image format, then PNG and JPEG images will automatically be converted to the WebP image format by the server.

Note that the default quality setting for JPEGs is 89%, PNGs 80%, and WebP images is 80%.

You can see this in action with WebPageTest - https://www.webpagetest.org/result/171128_NF_619995643693d3b1b5911a9c312fa3bd/1/details/#waterfall_view_step1 - request #21 has a response header that includes content-type: image/webp.

For sites hosted on WordPress.com this all happens automatically. For WordPress sites hosted elsewhere, Photon can still be used via the Jetpack plugin.


#10

Hey, @rviscomi! :raised_hands:

:100: This is an incredibly exciting HTTP Archive project. I have a few more suggestions that we could test across different WordPress-based websites to find more insights that can help make the web better and could be a good thing for open web telemetry.

  • WordPress REST API: compare WordPress sites that use it with WordPress sites that are not exploring the REST API yet (not sure about the query for this one, but I expect compelling insights there)
  • How do WordPress sites with different plugins stack up against each other (or against non-WordPress sites)? E.g., a premium plugin like Gravity Forms compared to the free plugin Contact Form 7 for handling form submission data, and then compared with non-WP sites
  • WordPress vs. Non-WordPress — Responsive Images comparison (WordPress takes care of making the images responsive and that’s huge — not many custom sites care about solutions like this)
  • WordPress vs. Non-WordPress — Number of responsive sites? Media queries being used
  • WordPress vs. Non-WordPress — Meta-data richness for SEO purposes? & a11y!
  • WordPress vs. Non-WordPress — Average page size on site
  • WordPress vs. Non-WordPress — Word count and links per page analysis again for SEO meta-data
  • WordPress vs. Non-WordPress — Critical Path Analysis, above the fold content and first paint handling
  • WordPress vs. Non-WordPress — SSL usage

I look forward to playing around with this project with more ideas like the ones listed above. :rocket:
Cheers!


#11

Thanks @mrahmadawais, those are great suggestions!

I started looking into responsive images. The Properly Size Images Lighthouse audit is a great fit for this - it measures the number of wasted KB when oversized images are served on mobile.

Here’s the query:

SELECT
  url IN (SELECT url FROM `httparchive.scratchspace.wordpress`) AS is_wordpress,
  APPROX_QUANTILES(CAST(JSON_EXTRACT(report, '$.audits.uses-responsive-images.extendedInfo.value.wastedKb') AS INT64), 1001)[OFFSET(501)] AS resp_img_wasted_kb
FROM
  `httparchive.lighthouse.2017_10_15_mobile`
WHERE
  report IS NOT NULL
GROUP BY
  is_wordpress
ORDER BY
  is_wordpress

Surprisingly, WordPress pages have a median of 49 KB of wasted image bytes, compared to just 27 KB for non-WordPress pages. :thinking:


#12

I’m late to the party :slight_smile: . Some great analysis here!

I decided to try and tackle some questions related to plugin usage.

  • How many plugins are there?
  • What are the most popular WordPress plugins?
  • How many HTTP requests are generally loaded along with the most popular plugins?

So here’s a query that runs against the HTTP Archive requests table and uses a regular expression to extract the path segment after /plugins/ for request URLs that contain both wp-content and plugins in their path. In most cases, this should be the plugin name.

-- Standard SQL, will process 6.23GB of data
SELECT
  REGEXP_EXTRACT(LOWER(url), r"\/plugins\/([^\/]+)\/") AS plugin,
  COUNT(*) AS requests,
  COUNT(DISTINCT pageid) AS pages
FROM
  `httparchive.runs.2017_11_01_requests`
WHERE
  LOWER(url) LIKE "%/wp-content/%plugins/%"
GROUP BY
  plugin
-- REGEXP_EXTRACT returns NULL (not the string "null") when nothing matches
HAVING
  plugin IS NOT NULL
ORDER BY
  pages DESC

My first two questions can be answered by just looking at the results:

There are 25,672 rows, which means that there are that many plugins installed on sites! And paginating through the results, I can see that the top 536 plugins are installed on at least 100 pages. The top 50 plugins are installed on at least 1,000 pages. In other words the long tail is very very long.
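
To make the extraction and counting step concrete, here is a minimal Python sketch of the same idea run against a handful of made-up (pageid, url) pairs. The `plugin_stats` helper and the sample URLs are hypothetical; the pattern mirrors the one in the query, and the path filter is only a rough equivalent of the LIKE clause.

```python
import re
from collections import defaultdict

# Same extraction pattern as the query, minus the redundant slash escapes.
PLUGIN = re.compile(r"/plugins/([^/]+)/")
# Rough equivalent of LIKE "%/wp-content/%plugins/%".
WP_PLUGIN_PATH = re.compile(r"/wp-content/.*plugins/", re.DOTALL)

def plugin_stats(requests):
    """Count requests and distinct pages per plugin.

    requests: iterable of (pageid, url) pairs.
    Returns a list of (plugin, n_requests, n_pages), most pages first.
    """
    n_requests = defaultdict(int)
    page_ids = defaultdict(set)
    for pageid, url in requests:
        url = url.lower()
        if not WP_PLUGIN_PATH.search(url):
            continue
        match = PLUGIN.search(url)
        if match is None:  # e.g. a file sitting directly under /plugins/
            continue
        plugin = match.group(1)
        n_requests[plugin] += 1
        page_ids[plugin].add(pageid)
    rows = [(p, n_requests[p], len(page_ids[p])) for p in n_requests]
    return sorted(rows, key=lambda row: row[2], reverse=True)

# Hypothetical request rows.
sample_requests = [
    (1, "https://a.example/wp-content/plugins/contact-form-7/includes/js/scripts.js"),
    (1, "https://a.example/wp-content/plugins/contact-form-7/includes/css/styles.css"),
    (2, "https://b.example/wp-content/plugins/Contact-Form-7/includes/js/scripts.js"),
    (2, "https://b.example/wp-content/plugins/jetpack/css/jetpack.css"),
    (3, "https://c.example/static/plugins/slider/slider.js"),
    (3, "https://c.example/wp-content/plugins/readme.txt"),
]

stats = plugin_stats(sample_requests)
for plugin, requests, pages in stats:
    print(plugin, requests, pages)
```

The last two rows are dropped: one is not under wp-content, and the other has no directory after /plugins/, so there is no plugin name to extract.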

As for the most popular plugins: contact-form-7, jetpack, js_composer are by far the most popular.

In order to answer that last question, I added a clause to limit the results and downloaded the CSV file. Then I was able to graph both the number of pages and requests for the top plugins. The way I chose to represent it in the graph below was to use a bar chart to represent the number of pages a plugin is installed on, and a data point layered on top of the bar to indicate how many requests this consisted of across the entire population.

Note that the Y axis for the requests is 2x that of the pages. So if the data point is at the same level as the top of the bar, then that’s 2 requests per page where the plugin is installed. If it’s in the middle of the bar, then it’s 1 request per page.

It looks like the most popular plugins require more HTTP requests on pages where they are installed.


#13

That’s strange. It means the small size images are not optimized enough. If only we could separate the research for different versions of WordPress. But many sites remove the version’s public footprint. :thinking:


#14

In my experience there’s a 90% chance to find the WP version in the feed page of sites running WP

https://domain.com/feed/

I am pretty sure that would cover enough sites to check if there are any correlations with the WP version…
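
A rough sketch of how that lookup could work once a feed has been fetched, assuming the feed carries a generator element of the form shown in the sample below. The `wp_version_from_feed` helper and the sample feed are hypothetical, and sites that strip the generator tag would simply return nothing.

```python
import re

# Assumed generator format; sites can remove or alter this element.
GENERATOR = re.compile(
    r"<generator>https?://wordpress\.org/\?v=([\d.]+)</generator>"
)

def wp_version_from_feed(feed_xml):
    """Return the WordPress version advertised in a feed, or None."""
    match = GENERATOR.search(feed_xml)
    return match.group(1) if match else None

# Hypothetical feed bodies.
sample_feed = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
  <title>Example</title>
  <generator>https://wordpress.org/?v=4.9</generator>
</channel>
</rss>"""
stripped_feed = "<rss><channel><title>Example</title></channel></rss>"

version = wp_version_from_feed(sample_feed)
missing = wp_version_from_feed(stripped_feed)
print(version, missing)
```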


#15

Yoast is one of the most popular plugins but is a false negative in that query. FWIW it can be detected by looking for “Yoast SEO” in the response bodies. I actually met Joost (his company is named after his name’s phonetic spelling) at WordCamp this weekend and ran the query with him and found ~30k pages. He verified that 30k out of 90k WordPress sites sounds about right (1/3 of market share).


#16

Interesting! The only problem is that HTTP Archive only saves the requests/responses for resources included on the initial load of the home page. So if /feed/ is not normally downloaded we wouldn’t have any visibility into it. I wouldn’t expect to see it in the results :worried:


#17

@rviscomi what would you want the plugin to do?


#18

Nothing new; Photon seems to be doing a good job. I might also throw in WebP support if possible.