The Government Digital Service (GDS) created and maintains a service called the GOV.UK Design System. It is a set of styles, components, and patterns that make UK Government services look like GOV.UK. The code for these styles, components, and patterns lives in the GOV.UK Frontend repository and is fully open source, free for anyone to examine and use.
As a department we encourage all government departments to use it, but we can’t enforce it. Since every department is essentially its own separate entity, we have very little visibility of who is actually using it and where (unless they approach us for support or leave us feedback). This creates problems when we are asked to provide adoption numbers across central government services and, more recently, local councils too.
Speaking to Rick, he suggested three approaches: adding detection code to Wappalyzer, running a one-time custom metric, and examining static HTML response bodies for markup patterns that match the Design System’s code base.
Opening this thread to continue the discussion and analysis.
Examining HTML response bodies sounds like a good way to start. However, given the size of that table, I’d suggest creating a smaller subset containing just the GOV.UK sites of interest. That way you can search for adoption across UK Government services first.
A quick query shows 2,151 sites with the hostname matching “%.gov.uk”. Are there any other domain names you’d like to see included in an excerpt?
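For reference, here’s a sketch of the kind of query behind that count. The table name assumes the January 2020 mobile crawl; swap the LIKE pattern to look for other domains (e.g. %.service.gov.uk).

```sql
-- Count distinct .gov.uk pages in the Jan 2020 mobile summary table.
-- Table name and LIKE pattern are assumptions; adjust as needed.
SELECT
  COUNT(DISTINCT url) AS gov_uk_sites
FROM
  `httparchive.summary_pages.2020_01_01_mobile`
WHERE
  NET.HOST(url) LIKE '%.gov.uk'
```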
Hi @paulcalvano,
Thanks for the reply. Yes, a smaller subset would be a great idea to start with, as that is where we are most likely to see adoption. The *.gov.uk domain would be the only one to include in the excerpt. One complication is that services tend to sit on the *.service.gov.uk domain, and our service manual asks teams to stop search engines crawling those domains, but I’d guess there isn’t any way around this?
I’m currently looking into the HTML patterns that will be common to any service using the Design System, which we can hang further searches off.
Great. I’ve created a smaller (1.5GB) table that we can use to keep the query costs down. The table httparchive.scratchspace.gov_uk_response_bodies_2020_01_mobile contains the response bodies for 40,183 resources across 2,367 .gov.uk sites.
Unfortunately I did not find any service.go.uk domains in there, so these are likely not included in the HTTP Archive.
Thanks Paul! That data set looks much more reasonable than the 2.5TB I saw listed!
So I’ve now raised a PR with Wappalyzer. In it, we hang detection off a <body> element carrying the govuk-template__body class, and we also look for an anchor with govuk-link as a class. Both are fundamental to the styling, and our guidance pushes teams who need to make custom modifications to use BEM, so in theory these classes should exist somewhere in any codebase built on the Design System.
We could get even more granular by looking for the specific font names we use, since they are very distinctive, but I’d hope the above is enough.
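If we ever did need that, a rough sketch against the scratchspace table might look like the following. It assumes the typeface name (GDS Transport) appears verbatim in the captured HTML or inlined CSS, which won’t always hold.

```sql
-- Rough sketch: pages whose captured responses mention the GOV.UK
-- typeface by name. Assumes the font name appears as a literal string.
SELECT DISTINCT
  page
FROM
  `httparchive.scratchspace.gov_uk_response_bodies_2020_01_mobile`
WHERE
  REGEXP_CONTAINS(body, r'GDS Transport')
```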
RE: Service domains. I’m not sure if you made a typo in the query? You have service.go.uk listed above rather than service.gov.uk. I’ve just run a quick query using the SQL you listed in your first reply and found 63 sites (there are around 240 in total according to our internal team), so many are likely being excluded.
Here’s an example query using @paulcalvano’s scratchspace table that matches sites having govuk-template__body or govuk-link in their HTML.
```sql
SELECT
  url
FROM
  `httparchive.scratchspace.gov_uk_response_bodies_2020_01_mobile`
WHERE
  page = url AND
  REGEXP_CONTAINS(body, '(govuk-template__body|govuk-link)')
```
The regexp is deliberately simple and doesn’t check which tags the patterns appear on, but the strings are distinctive enough that it shouldn’t matter much.
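If you did want to be stricter, here’s a sketch that anchors each class to the tag it should appear on. The exact attribute layout is an assumption about how the classes are typically emitted, so it may miss markup that orders attributes differently.

```sql
-- Stricter sketch: require each class on its expected tag.
-- Assumes the class appears inside a double-quoted class attribute.
SELECT
  url
FROM
  `httparchive.scratchspace.gov_uk_response_bodies_2020_01_mobile`
WHERE
  page = url AND
  (REGEXP_CONTAINS(body, r'<body[^>]*class="[^"]*govuk-template__body') OR
   REGEXP_CONTAINS(body, r'<a[^>]*class="[^"]*govuk-link'))
```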
Thank you @paulcalvano / @rviscomi. This is really useful! From this example we can now expand out to cover other areas we want to track, like accessibility (ARIA attributes, etc.) and our legacy codebases.
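For example, a first rough sketch of the accessibility angle, counting pages whose own HTML contains any ARIA attribute (the regex is just an assumption about a sensible starting pattern):

```sql
-- Rough sketch: pages whose own HTML uses at least one ARIA attribute.
SELECT
  COUNT(DISTINCT url) AS pages_with_aria
FROM
  `httparchive.scratchspace.gov_uk_response_bodies_2020_01_mobile`
WHERE
  page = url AND
  REGEXP_CONTAINS(body, r'aria-[a-z]+\s*=')
```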
Is creating smaller data sets from the response bodies something I can do myself (e.g. when the next crawl data is released), or is it only possible as a maintainer?
Your personal BigQuery account comes with its own storage space, so you can create your own tables and work on subsets of the public dataset as needed. From the BigQuery UI you can configure any query to write its results to a destination table of your choosing, similar to how Paul created the scratchspace table.
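For example, here’s a DDL sketch for building your own subset when a new crawl lands. The destination project/dataset names are placeholders, and note that the SELECT scans the full response_bodies table once, which is billed at its full size.

```sql
-- Placeholder destination; substitute your own project and dataset.
-- The source table name assumes a February 2020 mobile crawl exists.
CREATE TABLE `your_project.scratch.gov_uk_response_bodies_2020_02_mobile` AS
SELECT
  page, url, body
FROM
  `httparchive.response_bodies.2020_02_01_mobile`
WHERE
  NET.HOST(page) LIKE '%.gov.uk';
```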
Yes, the summary_requests tables include a firstHtml field which would be set to true for the canonical URL.
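A sketch of how that could be used, joining the request and page summaries on pageid (table names assume the January 2020 mobile crawl):

```sql
-- Sketch: the canonical HTML response for each .gov.uk page,
-- via the firstHtml flag (assumed BOOLEAN here).
SELECT
  p.url AS page,
  r.url AS first_html_url
FROM
  `httparchive.summary_requests.2020_01_01_mobile` r
JOIN
  `httparchive.summary_pages.2020_01_01_mobile` p
ON
  r.pageid = p.pageid
WHERE
  r.firstHtml AND
  NET.HOST(p.url) LIKE '%.gov.uk'
```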
We got the initial URLs from the Chrome UX Report, which represents the websites that real Chrome users visit, so I wouldn’t expect to see too many HTML redirects. @Nooshu do you know if some of these sites are redirecting because our tests are unauthenticated and/or run from the US? Do they always redirect, or only under certain conditions?
A lot of these will redirect in all conditions. I only know of a single (and very recent) service with any sort of geo-blocking, so it won’t be that. And in terms of authentication, most services don’t require any form of user login, though HMRC (the tax department) requires one for some of its services. Some examples of redirects: