HTTP Archive project vs. state-backed disinformation operations

I am working on a git repo, a group effort to monitor the attack vectors described in reporting on the billion-dollar disinformation campaign to reelect the president in 2020… a quick intro:

Presiding over this effort is Brad Parscale, a 6-foot-8 Viking of a man with a shaved head and a triangular beard. As the digital director of Trump’s 2016 campaign, Parscale didn’t become a household name like Steve Bannon and Kellyanne Conway. But he played a crucial role in delivering Trump to the Oval Office, and his efforts will shape this year’s election.

Parscale has indicated that he plans to open up a new front in this war: local news. Last year, he said the campaign intends to train “swarms of surrogates” to undermine negative coverage from local TV stations and newspapers. Polls have long found that Americans across the political spectrum trust local news more than national media. If the campaign has its way, that trust will be eroded by November.

Running parallel to this effort, some conservatives have been experimenting with a scheme to exploit the credibility of local journalism. Over the past few years, hundreds of websites with innocuous-sounding names like the Arizona Monitor and The Kalamazoo Times have begun popping up. At first glance, they look like regular publications, complete with community notices and coverage of schools. But look closer and you’ll find that there are often no mastheads, few if any bylines, and no addresses for local offices. Many of them are organs of Republican lobbying groups; others belong to a mysterious company called Locality Labs, which is run by a conservative activist in Illinois. Readers are given no indication that these sites have political agendas—which is precisely what makes them valuable.

When Twitter employees later reviewed the activity surrounding Kentucky’s election, they concluded that the bots were largely based in America, a sign that political operatives here were learning to mimic [foreign tactics].

Their shit looks really real, until you start looking at all the articles at once:

So far we have found over 700 domains (with more on the way) in sites.csv, plus an interactive heat map of where they purport to report from.

I was made aware of the remarkable work you guys and gals did rooting out hidden crypto-currency miners, and I was hoping to get a hand or two with this group effort to give democracy a fighting chance.

With some divine guidance I was able to upload sites.csv to httparchive.scratchspace.massmove:

SELECT * FROM `httparchive.scratchspace.massmove` LIMIT 1000


And have run some initial queries, like this one to find whether there are any common third-party hosts among the known sites:

SELECT req_host, COUNT(0) AS freq
FROM `httparchive.summary_requests.2020_02_01_mobile`
JOIN (
    SELECT pageid, url AS page
    FROM `httparchive.summary_pages.2020_02_01_mobile`
)
USING (pageid)
WHERE NET.REG_DOMAIN(page) IN (
    SELECT domain
    FROM `httparchive.scratchspace.massmove`
)
GROUP BY req_host
ORDER BY freq DESC


So we took the most popular host, flipped the query around, and looked for any website that made a request to that host:

SELECT DISTINCT page
FROM `httparchive.summary_requests.2020_02_01_mobile`
JOIN (
    SELECT pageid, url AS page
    FROM `httparchive.summary_pages.2020_02_01_mobile`
)
USING (pageid)
WHERE req_host = '…'  -- the common host found above


There are 21 results: the 20 known sites plus one more, which innocently, but still interestingly enough, just hot-links an image from the network.

The dataset can do other interesting things, like give a rough fingerprint of web technologies used to build the sites:

SELECT category, app, COUNT(0) AS freq
FROM `httparchive.technologies.2020_02_01_mobile`
WHERE NET.REG_DOMAIN(url) IN (
    SELECT domain
    FROM `httparchive.scratchspace.massmove`
)
GROUP BY category, app
ORDER BY freq DESC


The results show all 20 sites using nginx, Facebook (like button probably), jQuery, GTM, etc. So maybe this info could be used to look for other similarly-built sites:

Row category app freq
1 Widgets Facebook 20
2 Tag Managers Google Tag Manager 20
3 Reverse Proxy Nginx 20
4 Web Servers Nginx 20
5 JavaScript Libraries jQuery 20
6 Analytics New Relic 20
7 Analytics Google Analytics 20
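
Since all 20 sites share the same stack, one way to hunt for look-alikes would be to count how many of those fingerprint technologies each crawled site uses and keep only the sites matching all of them. A rough, untested sketch (the app names are taken from the results table above; a stack this common will likely produce a lot of false positives):

```sql
-- Count fingerprint technologies per site and keep exact matches.
-- Note: Nginx appears under two categories above, hence DISTINCT app.
SELECT
  url,
  COUNT(DISTINCT app) AS matched_apps
FROM `httparchive.technologies.2020_02_01_mobile`
WHERE app IN ('Facebook', 'Google Tag Manager', 'Nginx',
              'jQuery', 'New Relic', 'Google Analytics')
GROUP BY url
HAVING matched_apps = 6
```

Any hits would need a second pass to weed out the many legitimate sites built the same way.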

Things get interesting when we plug in Google Analytics tags scraped from the domains. WARNING: the query consumes 10 TB ($50 @ $5/TB) for a given month, so only run it if you have cost controls set up:

SELECT page, REGEXP_EXTRACT(body, '(UA-114372942-|UA-114396355-|UA-147159596-|UA-147358532-|UA-147552306-|UA-147966219-|UA-147973896-|UA-147983590-|UA-148428291-|UA-149669420-|UA-151957030-|UA-15309596-|UA-474105-|UA-58698159-|UA-75903094-|UA-89264302-)') AS ga
FROM httparchive.response_bodies.2020_02_01_mobile
WHERE page = url
AND REGEXP_CONTAINS(body, '(UA-114372942-|UA-114396355-|UA-147159596-|UA-147358532-|UA-147552306-|UA-147966219-|UA-147973896-|UA-147983590-|UA-148428291-|UA-149669420-|UA-151957030-|UA-15309596-|UA-474105-|UA-58698159-|UA-75903094-|UA-89264302-)')
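
A cheaper alternative might be worth trying first (an assumption on my part, not something verified here): the GA property ID also shows up as the tid= parameter in the analytics beacon URLs, so the much smaller summary_requests table may surface the same sites without scanning 10 TB of response bodies:

```sql
-- Sketch: look for GA beacon requests carrying one of the known property IDs.
-- This only catches tags that actually fired during the crawl, so it can miss
-- tags present in the HTML but blocked or inactive.
SELECT DISTINCT pages.url AS page
FROM `httparchive.summary_requests.2020_02_01_mobile` AS requests
JOIN `httparchive.summary_pages.2020_02_01_mobile` AS pages
USING (pageid)
WHERE requests.url LIKE '%tid=UA-114372942-%'
   OR requests.url LIKE '%tid=UA-474105-%'
```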

QUERY RESULTS: the matches go back to 2014.

Initially there was a glitch with the regex: the trailing dash was missing, so UA-474105 correctly matched UA-474105-7, but it also matched UA-474105[9]-4 and all sorts of random and unrelated stuff popped up!

That is as far as we have gotten, and I am only beginning to cut my teeth on the web transparency data in httparchive…

Please shout if you have any ideas how the HTTP Archive project and the publicly available data in the httparchive repository on Google BigQuery can help us find what else these domains are and were connected to! Query cost is of no concern in light of this information’s want for freedom.


Thanks @mcoder. Everyone else, for context: I was offering some help getting started with the dataset and added the domain list to the scratchspace table.

I’d be super curious to hear if anyone else can think of other insights we could glean from the dataset.


There are a few other places I’d look but in a quick initial glance it doesn’t look like they add much:

Page-level information about the hosting:

base_page_dns_server: "",
base_page_ip_ptr: "",

The site is running on an EC2 instance (no CDN). It might be worthwhile to see if any others are served from the same server, but from the certificate data (below) it seems unlikely.

From the request data (for the base page request):

ip_addr: "",

securityDetails: {
    certificateId: 0,
    protocol: "TLS 1.2",
    keyExchange: "ECDHE_RSA",
    keyExchangeGroup: "P-384",
    cipher: "AES_256_GCM",
    subjectName: "",
    sanList: [],
    issuer: "Sectigo RSA Domain Validation Secure Server CA",
    validFrom: 1567641600,
    validTo: 1599263999,
    certificateTransparencyCompliance: "compliant",
    signedCertificateTimestampList: [
        {
            status: "Verified",
            origin: "Embedded in certificate",
            logDescription: "Google 'Argon2020' log",
            logId: "B21E05CC8BA2CD8A204E8766F92BB98A2520676BDAFA70E7B249532DEF8B905E",
            timestamp: 1567713665536,
            hashAlgorithm: "SHA-256",
            signatureAlgorithm: "ECDSA",
            signatureData: "304402202E791348B7185AF24D9780631FCE5E2BBF3F41A8DF1D3DCBF5F860E86CD70AA40220586E29B1C5BFFA4C955294A3BD082B134A7D664BAE58A1B4463ACE9A6C28A643"
        },
        {
            status: "Verified",
            origin: "Embedded in certificate",
            logDescription: "Cloudflare 'Nimbus2020' Log",
            logId: "5EA773F9DF56C0E7B536487DD049E0327A919A0C84A112128418759681714558",
            timestamp: 1567713665566,
            hashAlgorithm: "SHA-256",
            signatureAlgorithm: "ECDSA",
            signatureData: "304402202A819B20AD20D3F0510FD6A77B9C998551357AB7E971A27A4D07462CD41EC59B02204F59BF123DCA8BBA4259A1E64A8DBB986701D49D55933C13133734BE04BC65D6"
        }
    ]
}

There are no other domains in the cert san list so it is probably the only domain from that server. Unfortunately there’s nothing particularly interesting with the certificate details. Maybe if the certificates for the other pages were also generated by Sectigo (aka Comodo) around the same time it might be worth scanning for others also issued around then.
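
If that scan ever seems worth doing, here is a rough sketch of what it might look like. Two assumptions that would need verifying: that the certificate details are exposed under `$._securityDetails` in the requests payload, and that 1567641600 (the validFrom above) is a reasonable anchor for the issuance window:

```sql
-- Sketch: base pages with a Sectigo-issued cert whose validity window starts
-- within a week of the known site's certificate.
SELECT
  page,
  JSON_EXTRACT_SCALAR(payload, '$._securityDetails.issuer') AS issuer,
  CAST(JSON_EXTRACT_SCALAR(payload, '$._securityDetails.validFrom') AS INT64) AS valid_from
FROM `httparchive.requests.2020_02_01_mobile`
WHERE page = url
  AND JSON_EXTRACT_SCALAR(payload, '$._securityDetails.issuer') LIKE 'Sectigo%'
  AND CAST(JSON_EXTRACT_SCALAR(payload, '$._securityDetails.validFrom') AS INT64)
      BETWEEN 1567641600 - 604800 AND 1567641600 + 604800  -- ±1 week
```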

The tech stack also doesn’t look particularly unique and will probably generate way more false positives than it’s worth.

The common s3 bucket used for the images looks like the best link to me.


No, thank you @rviscomi!

135 domains are served from that IP address:
LocalJournals/sites.csv#L351-L485, so I think you may be onto something!

Can we query for all certificates or base pages with the discovered IPs, in case we missed any with the conventional methods?

Yep. Rick is much better with BigQuery than me but the requests dataset has both the IP address and certificate details so it should be a relatively easy json_extract and where clause.

It’s worth noting that that IP address is what was being used yesterday; I don’t know if it was the same during the crawl (or if it changed through the month, if it isn’t reserved). It’s unusual for a web-accessible web server to rotate IP addresses frequently though (especially multiple servers), so it’s likely safe to use.


This query looks for any pages served from an IP of a website flagged in the massmove table:

SELECT
    page,
    JSON_EXTRACT_SCALAR(payload, '$._ip_addr') AS ip_addr,
    NET.REG_DOMAIN(page) IN (SELECT domain FROM `httparchive.scratchspace.massmove`) AS flagged
FROM `httparchive.requests.2020_02_01_mobile`
WHERE
    JSON_EXTRACT_SCALAR(payload, '$._ip_addr') IN (
        SELECT JSON_EXTRACT_SCALAR(payload, '$._ip_addr')
        FROM `httparchive.requests.2020_02_01_mobile`
        WHERE
            NET.REG_DOMAIN(page) IN (SELECT domain FROM `httparchive.scratchspace.massmove`) AND
            JSON_EXTRACT_SCALAR(payload, '$._final_base_page') = 'true'
    ) AND
    JSON_EXTRACT_SCALAR(payload, '$._final_base_page') = 'true'
ORDER BY
    flagged DESC

There are 651 results; however, all of the discovered websites share the same IP address. I think the only connection we can draw is that these sites happen to be on the same shared hosting server, not that they were created by the same entity.


Nice, thanks for that. Can you run another search for these tags: UA-147094394- and UA-63225229-? Just a month or two should be enough to start with!

How would we go about querying for the Facebook Pixel ID 485774048928360 and Quantserv tracking cookie p-tBWRHfpb70G7L.gif?

Here’s a query that checks for the GA tags, Facebook pixel, and Quantserv ID all in one:

SELECT
    page,
    REGEXP_EXTRACT(body, '(UA-147094394-|UA-63225229-|485774048928360|p-tBWRHfpb70G7L)') AS match
FROM `httparchive.response_bodies.2020_02_01_mobile`
WHERE page = url
AND REGEXP_CONTAINS(body, '(UA-147094394-|UA-63225229-|485774048928360|p-tBWRHfpb70G7L)')

Only the Quantserv ID matched:

All of these URLs are known domains in the scratchspace.massmove table.

Did anyone click on these links? :eyes:

I’m not clicking it! :smile:

@rviscomi - can you run UA-60996284- for us? We are on a new deep dive:

SELECT page, REGEXP_EXTRACT(body, '(UA-60996284-)') AS ga
FROM httparchive.response_bodies.2020_03_01_mobile
WHERE page = url
AND REGEXP_CONTAINS(body, '(UA-60996284-)')

Very interesting, thanks! Any chance you can run it historically to see if the tag was previously used elsewhere?

I ran the query against the December 2019 dataset and the only result was

For context… These state/domestic sites remind me of the IRA ‘local news site’ social media accounts, and also the network out of India… similarly modeled influence operations that take advantage of citizens’ trust in local news.


A network of 265+ fake local news sites in more than 65 countries operating out of India.