The HTTP Archive dataset is over 200 TB altogether, and it’s growing at a rate of about 5 TB per month.
Sometimes you need to query a huge part of it. For example, I’ve been running queries that apply the Wappalyzer technology detection logic over all of the HTML response bodies going back to January 2016, when we first started collecting them. These detections will allow us to retroactively track the adoption of technologies like WordPress, the entire e-commerce category, and so on. As you can imagine, these queries get quite expensive, and you may find yourself needing to run something similar in the future. So here’s a brief description of what I’ve been working on and how I run really big queries on BigQuery.
To start, I created a big scratchspace table with every HTML response’s headers and body, organized by url, date, and client. You can see it here: https://bigquery.cloud.google.com/table/httparchive:scratchspace.wappalyzer_prepared
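If you want to poke at that table, here’s a minimal sketch of a query against it. The columns shown (date, client, url, headers, body) are based on the description above, so treat the exact schema as an assumption:

#standardSQL
# Minimal sketch: peek at a few rows of the prepared table.
# Column names are assumed from the description above.
# Note: even with LIMIT, BigQuery bills for all bytes in the selected columns.
SELECT date, client, url, headers, body
FROM `httparchive.scratchspace.wappalyzer_prepared`
LIMIT 10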
Next, I imported the Wappalyzer detection rules into BigQuery. These are maintained as JSON, so I wrote a JS script to convert the parts I care about into CSV format and import that into a BigQuery table. This table originally contained repeated fields (arrays), so I flattened it into https://bigquery.cloud.google.com/table/httparchive:wappalyzer.flattened.
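For a rough idea of what that conversion involves, here’s a sketch in Node.js. The apps.json layout (an apps object keyed by app name, with cats, html, script, headers, and meta rules) and the output columns are assumptions modeled on the flattened table used below, not the exact script I ran; it also skips the intermediate nested table and emits one row per pattern, which is roughly the flattened shape.

// sketch.js — rough sketch of converting Wappalyzer rules to CSV.
// The apps.json structure and the output columns are assumptions; the real
// script and the exact schema of wappalyzer.flattened may differ.
const fs = require('fs');
const {apps, categories} = JSON.parse(fs.readFileSync('apps.json', 'utf8'));

const rows = [];
for (const [app, rules] of Object.entries(apps)) {
  const catEntry = categories[(rules.cats || [])[0]];
  const category = catEntry ? catEntry.name : '';
  // Patterns may be a single string or an array; normalize to arrays.
  for (const html of [].concat(rules.html || [])) rows.push([category, app, html, '', '', '', '', '']);
  for (const script of [].concat(rules.script || [])) rows.push([category, app, '', script, '', '', '', '']);
  for (const [name, value] of Object.entries(rules.headers || {})) rows.push([category, app, '', '', name, value, '', '']);
  for (const [name, value] of Object.entries(rules.meta || {})) rows.push([category, app, '', '', '', '', name, value]);
}

const header = ['category', 'app', 'html', 'script', 'header_name', 'header_value', 'meta_name', 'meta_value'];
const escape = v => `"${String(v).replace(/"/g, '""')}"`;
console.log([header, ...rows].map(r => r.map(escape).join(',')).join('\n'));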
Now we’re ready to join the two tables and detect all the things. Or are we? BigQuery has per-query resource limits, so a single query can’t easily handle 50 billion joined rows with a lot of regular expression matching.
The solution I took, with help from @fhoffa, was to turn the joined wappalyzer.flattened table into a subquery with LIMIT and OFFSET clauses. We found that BigQuery could handle 10 flattened detections at a time, so we broke the monolithic query up into 175 smaller queries.
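You can sanity-check the batch math with a quick count over the flattened table (a minimal sketch, using the batch size of 10 we settled on):

#standardSQL
# How many flattened detections are there, and how many batches of 10 does that imply?
SELECT
  COUNT(*) AS detections,
  CAST(CEIL(COUNT(*) / 10) AS INT64) AS batches
FROM `httparchive.wappalyzer.flattened`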
I used a parameterized query to make the OFFSET variable:
#standardSQL
# WARNING: Running this query costs $400. You will regret this!
CREATE TEMPORARY FUNCTION getHeader(headers STRING, name STRING) AS (
  REGEXP_EXTRACT(
    headers,
    CONCAT('{"name":"', name, '[^}]*'))
);
CREATE TEMPORARY FUNCTION hasHeader(headers STRING, name STRING, value STRING) AS (
  getHeader(headers, name) IS NOT NULL AND (
    value = '' OR
    REGEXP_CONTAINS(JSON_EXTRACT(getHeader(headers, name), '$.value'), value))
);
SELECT
  date,
  client,
  url,
  category,
  app,
  IFNULL(REGEXP_EXTRACT(REGEXP_EXTRACT(body, CONCAT('(?i)<(meta[^>]+name=[\'"]?', meta.name, '[^>]*)')), meta.value), '') AS info
FROM
  `httparchive.scratchspace.wappalyzer_prepared`,
  (SELECT * FROM `httparchive.wappalyzer.flattened` ORDER BY category, app, html, script, header.value, header.name, meta.value, meta.name LIMIT 10 OFFSET @offset)
WHERE
  (html IS NOT NULL AND html NOT LIKE '%?!%' AND REGEXP_CONTAINS(body, html)) OR
  (script IS NOT NULL AND REGEXP_CONTAINS(body, CONCAT('(?i)<script[^>]+src=[\'"]?[^\'"]*', script))) OR
  (meta.name IS NOT NULL AND REGEXP_CONTAINS(REGEXP_EXTRACT(body, CONCAT('(?i)<(meta[^>]+name=[\'"]?', meta.name, '[^>]*)')), meta.value)) OR
  (header.name IS NOT NULL AND hasHeader(headers, header.name, header.value))
GROUP BY
  date,
  client,
  url,
  category,
  app,
  body,
  meta.name,
  meta.value
This query isn’t runnable on its own because it has the @offset parameter. To fill in the template, we have to run the query using the bq command line interface rather than the web UI. For example, here’s how to run it for the first batch of 10 detections (offset=0):
cat wappalyzer.sql | bq --quiet query --append_table --batch --parameter=offset:INT64:0 --destination_table httparchive:wappalyzer.detected
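Before committing to a $400 run, it’s worth estimating the damage. This is a sketch that assumes bq’s --dry_run flag plays nicely with the parameterized query file; it reports how many bytes one batch would process without actually running it:

cat wappalyzer.sql | bq query --dry_run --parameter=offset:INT64:0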
It’s important to include the --batch flag to ensure that the queries are enqueued and run when resources become available. So I wrote a for loop from 0 to 1740 in increments of 10 and added a command like the one above to a script file, with the loop iterator injected into the offset parameter. If you’re interested, I wrote it in JS in the browser console and copied it to a text file:
// Build the list of offsets: 0, 10, 20, ..., 1740
a = [];
for (let i = 0; i < 1744; i += 10) a.push(i);
// Generate one bq command per offset and copy the whole script to the clipboard
copy(a.map(i => `cat wappalyzer.sql | bq --quiet query --append_table --batch --parameter=offset:INT64:${i} --destination_table httparchive:wappalyzer.detected`).join(' &\n'));
The funny join at the end appends & to each command so the shell doesn’t wait for the previous query to complete before issuing the next one.
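For illustration, the first couple of lines of the generated script look like this (every command except the last one gets the trailing &):

cat wappalyzer.sql | bq --quiet query --append_table --batch --parameter=offset:INT64:0 --destination_table httparchive:wappalyzer.detected &
cat wappalyzer.sql | bq --quiet query --append_table --batch --parameter=offset:INT64:10 --destination_table httparchive:wappalyzer.detected &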
The result is a whole lot of batched queries consuming a whole lot of resources, but the good news is that they’re not timing out!
If you’re interested in more bq command line shenanigans, I wrote a blog post describing how I took the monolithic output appended to the wappalyzer.detected table and fragmented it into YYYY_MM_DD_client tables for consistency with the rest of the HTTP Archive tables. You can read more about it at Fragmenting a monolithic BigQuery table with the bq CLI.