Hi there,
I’m working with a tiny NGO, the Green Web Foundation, and I’m trying to do some analysis to work out roughly how much of the web runs on green power versus fossil fuels. I tried this a year ago, before I started working directly with the Green Web Foundation, and now that I’m looking at doing some new analysis there are a few questions I couldn’t work out myself.
I hope it’s okay to ask here.
Generating the same list of URLs used for the Chrome UX report
I can see that the HTTP Archive uses something like 4-5 million URLs for this report here:
https://httparchive.org/reports/state-of-the-web#numUrls
This seems to be different from the list of origins I get if I run a basic query against the Chrome UX Report, like so:
-- list every origin in the May 2019 CrUX table
SELECT
  origin
FROM
  `chrome-ux-report.all.201905`
I’d like to be able to build a dataset based on the same URLs, so it’s possible to do some kind of comparable analysis in future.
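To make it concrete, this is roughly the kind of query I think I need, assuming the summary_pages tables are where the crawl’s URL list lives (I may well have the table or column names wrong):

-- the URLs crawled for the May 2019 desktop run;
-- I assume a matching _mobile table exists for the mobile crawl
SELECT
  url
FROM
  `httparchive.summary_pages.2019_05_01_desktop`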
For our Green Web Checks, we look up the domain against infrastructure records to show not just how a site is powered, but also to infer some information about who is doing the hosting.
However, I’m still pretty new to BigQuery, and I can’t seem to copy something like the most recent dataset into my own BigQuery project.
Every time I try to copy one of the tables into a new project to generate a dataset from it, I keep getting an error.
Is there a guide for setting this up that I can follow, to sanity-check what I’m doing wrong?
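For what it’s worth, this is the kind of thing I’ve been attempting, written out here as standard SQL rather than a table copy, with placeholder names for my own project and dataset:

-- materialise just the origins I need into my own project,
-- rather than copying the whole public table
-- (my-green-web-project and crux_analysis are made-up names,
--  and the dataset needs to exist before this runs)
CREATE TABLE `my-green-web-project.crux_analysis.origins_201905` AS
SELECT DISTINCT
  origin
FROM
  `chrome-ux-report.all.201905`;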
Checking these against the Green Web Check API
Earlier this year, we open-sourced the underlying code that the Green Web Foundation’s green web check software uses, and we now publish open datasets based on the data we’ve generated.
I’m wondering if the code might be runnable as a module in Wappalyzer (as I know it’s already used to run checks), but I couldn’t find information about how Wappalyzer is set up for the monthly analysis runs.
If Wappalyzer has access to the origin (i.e. the domain being checked), and the list of domains a page refers to for extra HTTP requests, it should be possible to generate stats like:
- the average percentage of requests going to fossil fuel powered infra, or
- the percentage of a page’s weight served from renewably powered infra.
We’d ideally be able to see how this changes over time, as more companies get better at running greener infrastructure.
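To make that concrete, here’s a rough sketch of the sort of query I have in mind for the page weight stat, assuming we load our open dataset of green domains into a table of our own (the green_domains table and project names are made up, and I may have the summary_requests columns wrong):

-- share of each page's bytes served from domains we know to run on green power
-- green_domains is a hypothetical table loaded from our open dataset,
-- with one row per green-hosted domain
SELECT
  pages.url,
  SAFE_DIVIDE(
    SUM(IF(green.domain IS NOT NULL, req.respSize, 0)),
    SUM(req.respSize)) AS green_bytes_ratio
FROM
  `httparchive.summary_requests.2019_05_01_desktop` AS req
JOIN
  `httparchive.summary_pages.2019_05_01_desktop` AS pages
ON
  req.pageid = pages.pageid
LEFT JOIN
  `my-green-web-project.greencheck.green_domains` AS green
ON
  NET.REG_DOMAIN(req.url) = green.domain
GROUP BY
  pages.url

Run over successive monthly tables, something like that would give us the over-time view.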
If I wanted to create a dataset like this, where would I look to see how the HTTP Archive is set up to do these Wappalyzer ‘runs’?
I’ve looked through the Wappalyzer code, but I can’t quite visualise the pipeline yet, and it would be nice to understand how I’d build something that could be run alongside it, to make complementary analysis possible.
Because the code is open source (mainly Apache 2 or MIT) and we release open data, we wouldn’t even need to hit the Green Web API to generate the stats about renewable energy: we could create some kind of prepackaged npm module containing all the data and code as a self-contained package, to avoid needing to make network calls.
Anyway - that’s me, and my two questions:
- How can I get the full list of URLs needed to run an analysis over the same domains used in the HTTP Archive’s regular report?
- Where would I find information about how Wappalyzer is set up, if I wanted to contribute some kind of green web checker to it?
Thanks,
Chris