HTTP Archive project vs. state-backed disinformation operations

I am working on a git repo and a group effort to monitor the attack vectors mentioned in the billion-dollar disinformation campaign to reelect the president in 2020: https://github.com/MassMove/AttackVectors… a quick intro:

Presiding over this effort is Brad Parscale, a 6-foot-8 Viking of a man with a shaved head and a triangular beard. As the digital director of Trump’s 2016 campaign, Parscale didn’t become a household name like Steve Bannon and Kellyanne Conway. But he played a crucial role in delivering Trump to the Oval Office, and his efforts will shape this year’s election.

Parscale has indicated that he plans to open up a new front in this war: local news. Last year, he said the campaign intends to train “swarms of surrogates” to undermine negative coverage from local TV stations and newspapers. Polls have long found that Americans across the political spectrum trust local news more than national media. If the campaign has its way, that trust will be eroded by November.

Running parallel to this effort, some conservatives have been experimenting with a scheme to exploit the credibility of local journalism. Over the past few years, hundreds of websites with innocuous-sounding names like the Arizona Monitor and The Kalamazoo Times have begun popping up. At first glance, they look like regular publications, complete with community notices and coverage of schools. But look closer and you’ll find that there are often no mastheads, few if any bylines, and no addresses for local offices. Many of them are organs of Republican lobbying groups; others belong to a mysterious company called Locality Labs, which is run by a conservative activist in Illinois. Readers are given no indication that these sites have political agendas, which is precisely what makes them valuable.

When Twitter employees later reviewed the activity surrounding Kentucky’s election, they concluded that the bots were largely based in America, a sign that political operatives here were learning to mimic [foreign tactics].

Their shit looks really real (https://kalamazootimes.com) until you start looking at all the articles at once: https://kalamazootimes.com/stories/tag/126-politics

So far we have found over 700 domains, with more on the way, in sites.csv, along with an interactive heat map of where they purport to report from.

I was made aware of the remarkable work you guys and gals did rooting out hidden cryptocurrency miners: https://discuss.httparchive.org/t/the-performance-impact-of-cryptocurrency-mining-on-the-web/1126, and was hoping to get a hand or two with this group effort to give democracy a fighting chance.

With some divine guidance I was able to upload sites.csv to httparchive.scratchspace.massmove:

SELECT * FROM `httparchive.scratchspace.massmove` LIMIT 1000

QUERY RESULTS: Table I

And have run some initial queries, like this one to find whether there are any common third-party hosts among the known sites:

SELECT APPROX_TOP_COUNT(NET.HOST(url), 20) AS req_host
FROM httparchive.summary_requests.2020_02_01_mobile  
JOIN (
    SELECT pageid, url AS page
    FROM httparchive.summary_pages.2020_02_01_mobile
)
USING (pageid)
WHERE NET.REG_DOMAIN(page)
IN (
    SELECT DISTINCT domain
    FROM httparchive.scratchspace.massmove
)

QUERY RESULTS: Table II
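For anyone exploring the domain list offline, the host-frequency idea can be sketched in plain Python. The request_urls sample below is made up for illustration; in BigQuery, NET.HOST(url) extracts roughly the same hostname component that urlparse gives you:

```python
from collections import Counter
from urllib.parse import urlparse

# Made-up sample of request URLs, standing in for the crawl data.
request_urls = [
    "https://jnswire.s3.amazonaws.com/jns-media/98/f7/discrimination_16.jpg",
    "https://jnswire.s3.amazonaws.com/jns-media/aa/bb/other.jpg",
    "https://www.google-analytics.com/analytics.js",
]

# Count requests per third-party host, analogous to APPROX_TOP_COUNT.
host_counts = Counter(urlparse(u).hostname for u in request_urls)
print(host_counts.most_common(2))
```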

The most popular host is jnswire.s3.amazonaws.com. So we flipped the query around and looked for any website that made a request to that host:

SELECT DISTINCT page
FROM httparchive.summary_requests.2020_02_01_mobile  
JOIN (
    SELECT pageid, url AS page
    FROM httparchive.summary_pages.2020_02_01_mobile
)
USING (pageid)
WHERE STARTS_WITH(url, 'https://jnswire.s3.amazonaws.com')

QUERY RESULTS: Table III

There are 21 results: the 20 known sites plus rgs-istilah-hukum.blogspot.com. Innocently, but still interestingly enough, that blog just hot-links this image from the network: https://jnswire.s3.amazonaws.com/jns-media/98/f7/176642/discrimination_16.jpg, which also appears at https://madisonrecord.com/stories/511475189-woman-alleges-race-was-factor-in-termination-from-cooper-b-line and https://legalnewsline.com/stories/511427360-eeoc-south-carolina-child-development-center-fired-employee-over-drug-prescription

The dataset can do other interesting things, like give a rough fingerprint of web technologies used to build the sites:

SELECT category, app, COUNT(0) AS freq
FROM httparchive.technologies.2020_02_01_mobile
WHERE NET.REG_DOMAIN(url) IN (
    SELECT DISTINCT domain
    FROM httparchive.scratchspace.massmove
)
GROUP BY category, app ORDER BY freq DESC

QUERY RESULTS: Table IV

The results show all 20 sites using Nginx, Facebook (the like button, probably), jQuery, Google Tag Manager, etc. So maybe this info could be used to look for other similarly built sites:

Row | category | app | freq
1 | Widgets | Facebook | 20
2 | Tag Managers | Google Tag Manager | 20
3 | Reverse Proxy | Nginx | 20
4 | Web Servers | Nginx | 20
5 | JavaScript Libraries | jQuery | 20
6 | Analytics | New Relic | 20
7 | Analytics | Google Analytics | 20
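One rough way to pursue the similarly-built-sites idea offline is to treat each site's detected technologies as a set and score candidates by overlap. Everything below is a made-up sketch; the candidate stack is hypothetical, not drawn from the dataset:

```python
# Technologies all 20 known sites share, per the table above.
known_stack = {"Facebook", "Google Tag Manager", "Nginx", "jQuery",
               "New Relic", "Google Analytics"}

def jaccard(a, b):
    # Ratio of shared technologies to all technologies seen on either site.
    return len(a & b) / len(a | b)

# Hypothetical tech stack detected on some candidate site.
candidate = {"Nginx", "jQuery", "Google Analytics", "WordPress"}
print(round(jaccard(known_stack, candidate), 2))
```

As the later replies note, this stack is common enough that exact matches would need to be combined with other signals to avoid drowning in false positives.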

Things get interesting when we plug in Google Analytics tags scraped from the domains. WARNING: the query consumes about 10 TB ($50 at $5/TB) for a given month, so only run it if you have cost controls set up:

SELECT page, REGEXP_EXTRACT(body, '(UA-114372942-|UA-114396355-|UA-147159596-|UA-147358532-|UA-147552306-|UA-147966219-|UA-147973896-|UA-147983590-|UA-148428291-|UA-149669420-|UA-151957030-|UA-15309596-|UA-474105-|UA-58698159-|UA-75903094-|UA-89264302-)') AS ga
FROM httparchive.response_bodies.2020_02_01_mobile
WHERE page = url
AND REGEXP_CONTAINS(body, '(UA-114372942-|UA-114396355-|UA-147159596-|UA-147358532-|UA-147552306-|UA-147966219-|UA-147973896-|UA-147983590-|UA-148428291-|UA-149669420-|UA-151957030-|UA-15309596-|UA-474105-|UA-58698159-|UA-75903094-|UA-89264302-)')

QUERY RESULTS: Table V: legalnewsline.com and madisonrecord.com go back to 2014.

Initially there was a glitch with the regex: the trailing dash was missing, so madisonrecord.com matched correctly via UA-474105-7, but krasivye-pozdravlenija.ru also matched via UA-474105[9]-4 (note the extra digit), and all sorts of random, unrelated stuff popped up!
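The effect of the missing dash is easy to reproduce in Python with a couple of sample strings (made up here to mirror the two IDs above):

```python
import re

# The buggy pattern lacked the trailing dash, so the account prefix
# UA-474105 also matched inside the longer, unrelated ID UA-4741059-4.
loose = re.compile(r"UA-474105")
strict = re.compile(r"UA-474105-")

assert loose.search("UA-474105-7")        # intended match
assert loose.search("UA-4741059-4")       # false positive
assert strict.search("UA-474105-7")       # still matches the real tag
assert not strict.search("UA-4741059-4")  # false positive eliminated
print("trailing dash fixes the glitch")
```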

That is as far as we have gotten, and I am only beginning to cut my teeth on the web transparency data in httparchive…

Please shout if you have any ideas about how the HTTP Archive project and the publicly available data in the httparchive repository on Google BigQuery can help us find what else these domains are and were connected to! Query cost is of no concern in light of this information’s want to be free.

Thanks @mcoder. Everyone else, for context: I was offering some help getting started with the dataset and added the domain list to the scratchspace table.

I’d be super curious to hear if anyone else can think of other insights we could glean from the dataset.

There are a few other places I’d look, but from a quick initial glance it doesn’t look like they add much:

Page-level information about the hosting:

base_page_dns_server: "ns-1483.awsdns-57.org",
base_page_ip_ptr: "ec2-3-218-216-245.compute-1.amazonaws.com",

The site is running on an EC2 instance (no CDN). It might be worthwhile to see if any others are served from the same server, but from the certificate data (below) it is unlikely.

From the request data (for the base page request):

ip_addr: "3.218.216.245",

securityDetails: {
    certificateId: 0,
    protocol: "TLS 1.2",
    keyExchange: "ECDHE_RSA",
    validTo: 1599263999,
    certificateTransparencyCompliance: "compliant",
    sanList: [
        "kalamazootimes.com",
        "www.kalamazootimes.com"
    ],
    subjectName: "kalamazootimes.com",
    keyExchangeGroup: "P-384",
    validFrom: 1567641600,
    signedCertificateTimestampList: [
    {
        status: "Verified",
        origin: "Embedded in certificate",
        logDescription: "Google 'Argon2020' log",
        signatureData: "304402202E791348B7185AF24D9780631FCE5E2BBF3F41A8DF1D3DCBF5F860E86CD70AA40220586E29B1C5BFFA4C955294A3BD082B134A7D664BAE58A1B4463ACE9A6C28A643",
        timestamp: 1567713665536,
        hashAlgorithm: "SHA-256",
        logId: "B21E05CC8BA2CD8A204E8766F92BB98A2520676BDAFA70E7B249532DEF8B905E",
        signatureAlgorithm: "ECDSA"
    },
    {
        status: "Verified",
        origin: "Embedded in certificate",
        logDescription: "Cloudflare 'Nimbus2020' Log",
        signatureData: "304402202A819B20AD20D3F0510FD6A77B9C998551357AB7E971A27A4D07462CD41EC59B02204F59BF123DCA8BBA4259A1E64A8DBB986701D49D55933C13133734BE04BC65D6",
        timestamp: 1567713665566,
        hashAlgorithm: "SHA-256",
        logId: "5EA773F9DF56C0E7B536487DD049E0327A919A0C84A112128418759681714558",
        signatureAlgorithm: "ECDSA"
    }
    ],
    cipher: "AES_256_GCM",
    issuer: "Sectigo RSA Domain Validation Secure Server CA"
},

There are no other domains in the cert’s SAN list, so it is probably the only domain served from that server. Unfortunately there’s nothing particularly interesting in the certificate details. If the certificates for the other pages were also issued by Sectigo (formerly Comodo) around the same time, it might be worth scanning for others issued in that window.
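One possible way to chase the certificate angle: crt.sh’s public certificate-transparency search can be queried over HTTP, and its JSON output mode returns machine-readable records that could be filtered by issuer and issuance date client-side. The sketch below only builds the query URL; the function name is mine, and the response handling is left out:

```python
from urllib.parse import urlencode

def crtsh_query_url(domain):
    # crt.sh accepts %.domain to match a domain and its subdomains;
    # output=json switches the response to JSON certificate records.
    return "https://crt.sh/?" + urlencode({"q": "%." + domain, "output": "json"})

print(crtsh_query_url("kalamazootimes.com"))
```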

The tech stack also doesn’t look particularly unique and will probably generate way more false positives than it’s worth.

The common s3 bucket used for the images looks like the best link to me.

No, thank you @rviscomi!

135 domains are served from that IP address:
LocalJournals/sites.csv#L351-L485, so I think you may be onto something!

Can we query for all certificates or base_page_ fields with the discovered IPs, in case we missed any with the conventional methods: https://github.com/MassMove/AttackVectors/pull/3?

Yep. Rick is much better with BigQuery than I am, but the requests dataset has both the IP address and the certificate details, so it should be a relatively easy JSON_EXTRACT and WHERE clause.

It’s worth noting that that IP address is what was being used yesterday; I don’t know if it was the same during the crawl (or if it changed through the month, if it isn’t reserved). It’s unusual for a web-accessible server to rotate IP addresses frequently though (especially across multiple servers), so it’s likely safe to use.

This query looks for any pages served from an IP of a website flagged in the massmove table:

SELECT
  page,
  JSON_EXTRACT_SCALAR(payload, '$._ip_addr') AS ip_addr,
  NET.REG_DOMAIN(page) IN (SELECT domain FROM `httparchive.scratchspace.massmove`) AS flagged
FROM
  `httparchive.requests.2020_02_01_mobile`
WHERE
  JSON_EXTRACT_SCALAR(payload, '$._ip_addr') IN (
    SELECT
      JSON_EXTRACT_SCALAR(payload, '$._ip_addr')
    FROM
      `httparchive.requests.2020_02_01_mobile`
    WHERE
      NET.REG_DOMAIN(page) IN (SELECT domain FROM `httparchive.scratchspace.massmove`) AND
      JSON_EXTRACT_SCALAR(payload, '$._final_base_page') = 'true') AND
  JSON_EXTRACT_SCALAR(payload, '$._final_base_page') = 'true'
ORDER BY
  flagged DESC

There are 651 results; however, all of the newly discovered websites have the same IP address, "74.125.142.155", which was matched from https://chicagocitywire.com/. I think the only connection we can draw is that these sites happen to be on the same shared hosting server, not that they were created by the same entity.
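To separate shared hosting from common ownership when pivoting on IPs, one rough offline sketch is to group pages by IP and treat IPs serving many unrelated domains with suspicion. The rows below are made up for illustration, shaped like the query results above:

```python
from collections import defaultdict

# Made-up (page, ip) rows, standing in for the query output.
rows = [
    ("https://kalamazootimes.com/", "3.218.216.245"),
    ("https://chicagocitywire.com/", "74.125.142.155"),
    ("https://unrelated-blog.example/", "74.125.142.155"),
]

pages_by_ip = defaultdict(set)
for page, ip in rows:
    pages_by_ip[ip].add(page)

# An IP serving many unrelated domains looks like shared hosting,
# which weakens it as evidence of common ownership.
for ip, pages in sorted(pages_by_ip.items()):
    print(ip, "shared" if len(pages) > 1 else "dedicated")
```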

Nice, thanks for that. Can you run another search for these tags: UA-147094394- and UA-63225229-? Just a month or two should be enough to start with!

How would we go about querying for the Facebook pixel ID 485774048928360 and the Quantserve tracking pixel p-tBWRHfpb70G7L.gif?

Here’s a query that checks for the GA tags, Facebook pixel ID, and Quantserve ID all in one:

SELECT
  url,
  REGEXP_EXTRACT(body, '(UA-147094394-|UA-63225229-|485774048928360|p-tBWRHfpb70G7L)') AS match
FROM
  httparchive.response_bodies.2020_02_01_mobile
WHERE
  REGEXP_CONTAINS(body, '(UA-147094394-|UA-63225229-|485774048928360|p-tBWRHfpb70G7L)')

Only the Quantserve ID matched:

All of these URLs are known domains in the scratchspace.massmove table.

Did anyone click on these links? :eyes:

I’m not clicking it! :smile: