HTTP Archive project vs. state-backed disinformation operations

I am working on a git repo and a group effort to monitor the attack vectors mentioned in the billion-dollar disinformation campaign to reelect the president in 2020: https://github.com/MassMove/AttackVectors… a quick intro:

Presiding over this effort is Brad Parscale, a 6-foot-8 Viking of a man with a shaved head and a triangular beard. As the digital director of Trump’s 2016 campaign, Parscale didn’t become a household name like Steve Bannon and Kellyanne Conway. But he played a crucial role in delivering Trump to the Oval Office, and his efforts will shape this year’s election.

Parscale has indicated that he plans to open up a new front in this war: local news. Last year, he said the campaign intends to train “swarms of surrogates” to undermine negative coverage from local TV stations and newspapers. Polls have long found that Americans across the political spectrum trust local news more than national media. If the campaign has its way, that trust will be eroded by November.

Running parallel to this effort, some conservatives have been experimenting with a scheme to exploit the credibility of local journalism. Over the past few years, hundreds of websites with innocuous-sounding names like the Arizona Monitor and The Kalamazoo Times have begun popping up. At first glance, they look like regular publications, complete with community notices and coverage of schools. But look closer and you’ll find that there are often no mastheads, few if any bylines, and no addresses for local offices. Many of them are organs of Republican lobbying groups; others belong to a mysterious company called Locality Labs, which is run by a conservative activist in Illinois. Readers are given no indication that these sites have political agendas, which is precisely what makes them valuable.

When Twitter employees later reviewed the activity surrounding Kentucky’s election, they concluded that the bots were largely based in America, a sign that political operatives here were learning to mimic [foreign tactics].

Their shit looks really real (https://kalamazootimes.com) until you start looking at all the articles at once: https://kalamazootimes.com/stories/tag/126-politics

So far we have found over 700 domains, with more on the way, in sites.csv, along with an interactive heat map of where they purport to report from.

I was made aware of the remarkable work you guys and gals did rooting out hidden cryptocurrency miners: https://discuss.httparchive.org/t/the-performance-impact-of-cryptocurrency-mining-on-the-web/1126, and was hoping to get a hand or two with this group effort to give democracy a fighting chance.

With some divine guidance I was able to upload sites.csv to httparchive.scratchspace.massmove:

SELECT * FROM `httparchive.scratchspace.massmove` LIMIT 1000

QUERY RESULTS: Table I

And have run some initial queries, like this one to find whether there are any common third-party hosts among the known sites:

SELECT APPROX_TOP_COUNT(NET.HOST(url), 20) AS req_host
FROM httparchive.summary_requests.2020_02_01_mobile  
JOIN (
    SELECT pageid, url AS page
    FROM httparchive.summary_pages.2020_02_01_mobile
)
USING (pageid)
WHERE NET.REG_DOMAIN(page)
IN (
    SELECT DISTINCT domain
    FROM httparchive.scratchspace.massmove
)

QUERY RESULTS: Table II
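For anyone exploring the domain list offline, the host-frequency idea can be sketched in plain Python. The request_urls sample below is made up for illustration; in BigQuery, NET.HOST(url) extracts roughly the same hostname component that urlparse gives you:

```python
from collections import Counter
from urllib.parse import urlparse

# Made-up sample of request URLs, standing in for the crawl data.
request_urls = [
    "https://jnswire.s3.amazonaws.com/jns-media/98/f7/discrimination_16.jpg",
    "https://jnswire.s3.amazonaws.com/jns-media/aa/bb/other.jpg",
    "https://www.google-analytics.com/analytics.js",
]

# Count requests per third-party host, analogous to APPROX_TOP_COUNT.
host_counts = Counter(urlparse(u).hostname for u in request_urls)
print(host_counts.most_common(2))
```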

The most popular host is jnswire.s3.amazonaws.com. So we flipped the query around and looked for any website that made a request to that host:

SELECT DISTINCT page
FROM httparchive.summary_requests.2020_02_01_mobile  
JOIN (
    SELECT pageid, url AS page
    FROM httparchive.summary_pages.2020_02_01_mobile
)
USING (pageid)
WHERE STARTS_WITH(url, 'https://jnswire.s3.amazonaws.com')

QUERY RESULTS: Table III

There are 21 results: the 20 known sites plus rgs-istilah-hukum.blogspot.com. Innocently, but still interestingly enough, that blog just hot-links this image from the network: https://jnswire.s3.amazonaws.com/jns-media/98/f7/176642/discrimination_16.jpg, which also appears at https://madisonrecord.com/stories/511475189-woman-alleges-race-was-factor-in-termination-from-cooper-b-line and https://legalnewsline.com/stories/511427360-eeoc-south-carolina-child-development-center-fired-employee-over-drug-prescription

The dataset can do other interesting things, like give a rough fingerprint of web technologies used to build the sites:

SELECT category, app, COUNT(0) AS freq
FROM httparchive.technologies.2020_02_01_mobile
WHERE NET.REG_DOMAIN(url) IN (
    SELECT DISTINCT domain
    FROM httparchive.scratchspace.massmove
)
GROUP BY category, app ORDER BY freq DESC

QUERY RESULTS: Table IV

The results show all 20 sites using Nginx, Facebook (the like button, probably), jQuery, Google Tag Manager, etc. So maybe this info could be used to look for other similarly built sites:

Row | category | app | freq
1 | Widgets | Facebook | 20
2 | Tag Managers | Google Tag Manager | 20
3 | Reverse Proxy | Nginx | 20
4 | Web Servers | Nginx | 20
5 | JavaScript Libraries | jQuery | 20
6 | Analytics | New Relic | 20
7 | Analytics | Google Analytics | 20
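One rough way to pursue the similarly-built-sites idea offline is to treat each site's detected technologies as a set and score candidates by overlap. Everything below is a made-up sketch; the candidate stack is hypothetical, not drawn from the dataset:

```python
# Technologies all 20 known sites share, per the table above.
known_stack = {"Facebook", "Google Tag Manager", "Nginx", "jQuery",
               "New Relic", "Google Analytics"}

def jaccard(a, b):
    # Ratio of shared technologies to all technologies seen on either site.
    return len(a & b) / len(a | b)

# Hypothetical tech stack detected on some candidate site.
candidate = {"Nginx", "jQuery", "Google Analytics", "WordPress"}
print(round(jaccard(known_stack, candidate), 2))
```

As the later replies note, this stack is common enough that exact matches would need to be combined with other signals to avoid drowning in false positives.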

Things get interesting when we plug in Google Analytics tags scraped from the domains. WARNING: the query consumes about 10 TB ($50 at $5/TB) for a given month, so only run it if you have cost controls set up:

SELECT page, REGEXP_EXTRACT(body, '(UA-114372942-|UA-114396355-|UA-147159596-|UA-147358532-|UA-147552306-|UA-147966219-|UA-147973896-|UA-147983590-|UA-148428291-|UA-149669420-|UA-151957030-|UA-15309596-|UA-474105-|UA-58698159-|UA-75903094-|UA-89264302-)') AS ga
FROM httparchive.response_bodies.2020_02_01_mobile
WHERE page = url
AND REGEXP_CONTAINS(body, '(UA-114372942-|UA-114396355-|UA-147159596-|UA-147358532-|UA-147552306-|UA-147966219-|UA-147973896-|UA-147983590-|UA-148428291-|UA-149669420-|UA-151957030-|UA-15309596-|UA-474105-|UA-58698159-|UA-75903094-|UA-89264302-)')

QUERY RESULTS: Table V: legalnewsline.com and madisonrecord.com go back to 2014.

Initially there was a glitch with the regex: the trailing dash was missing, so madisonrecord.com matched correctly via UA-474105-7, but krasivye-pozdravlenija.ru also matched via UA-474105[9]-4 (note the extra digit), and all sorts of random, unrelated stuff popped up!
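The effect of the missing dash is easy to reproduce in Python with a couple of sample strings (made up here to mirror the two IDs above):

```python
import re

# The buggy pattern lacked the trailing dash, so the account prefix
# UA-474105 also matched inside the longer, unrelated ID UA-4741059-4.
loose = re.compile(r"UA-474105")
strict = re.compile(r"UA-474105-")

assert loose.search("UA-474105-7")        # intended match
assert loose.search("UA-4741059-4")       # false positive
assert strict.search("UA-474105-7")       # still matches the real tag
assert not strict.search("UA-4741059-4")  # false positive eliminated
print("trailing dash fixes the glitch")
```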

That is as far as we have gotten, and I am only beginning to cut my teeth on the web transparency data in httparchive…

Please shout if you have any ideas about how the HTTP Archive project and the publicly available data in the httparchive repository on Google BigQuery can help us find what else these domains are and were connected to! Query cost is of no concern in light of this information’s want to be free.

Thanks @mcoder. Everyone else, for context: I was offering some help getting started with the dataset and added the domain list to the scratchspace table.

I’d be super curious to hear if anyone else can think of other insights we could glean from the dataset.

There are a few other places I’d look, but from a quick initial glance it doesn’t look like they add much:

Page-level information about the hosting:

base_page_dns_server: "ns-1483.awsdns-57.org",
base_page_ip_ptr: "ec2-3-218-216-245.compute-1.amazonaws.com",

The site is running on an EC2 instance (no CDN). It might be worthwhile to see if any others are served from the same server, but from the certificate data (below) it is unlikely.

From the request data (for the base page request):

ip_addr: "3.218.216.245",

securityDetails: {
    certificateId: 0,
    protocol: "TLS 1.2",
    keyExchange: "ECDHE_RSA",
    validTo: 1599263999,
    certificateTransparencyCompliance: "compliant",
    sanList: [
        "kalamazootimes.com",
        "www.kalamazootimes.com"
    ],
    subjectName: "kalamazootimes.com",
    keyExchangeGroup: "P-384",
    validFrom: 1567641600,
    signedCertificateTimestampList: [
    {
        status: "Verified",
        origin: "Embedded in certificate",
        logDescription: "Google 'Argon2020' log",
        signatureData: "304402202E791348B7185AF24D9780631FCE5E2BBF3F41A8DF1D3DCBF5F860E86CD70AA40220586E29B1C5BFFA4C955294A3BD082B134A7D664BAE58A1B4463ACE9A6C28A643",
        timestamp: 1567713665536,
        hashAlgorithm: "SHA-256",
        logId: "B21E05CC8BA2CD8A204E8766F92BB98A2520676BDAFA70E7B249532DEF8B905E",
        signatureAlgorithm: "ECDSA"
    },
    {
        status: "Verified",
        origin: "Embedded in certificate",
        logDescription: "Cloudflare 'Nimbus2020' Log",
        signatureData: "304402202A819B20AD20D3F0510FD6A77B9C998551357AB7E971A27A4D07462CD41EC59B02204F59BF123DCA8BBA4259A1E64A8DBB986701D49D55933C13133734BE04BC65D6",
        timestamp: 1567713665566,
        hashAlgorithm: "SHA-256",
        logId: "5EA773F9DF56C0E7B536487DD049E0327A919A0C84A112128418759681714558",
        signatureAlgorithm: "ECDSA"
    }
    ],
    cipher: "AES_256_GCM",
    issuer: "Sectigo RSA Domain Validation Secure Server CA"
},

There are no other domains in the cert’s SAN list, so it is probably the only domain served from that server. Unfortunately there’s nothing particularly interesting in the certificate details. If the certificates for the other pages were also issued by Sectigo (formerly Comodo) around the same time, it might be worth scanning for others issued in that window.
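One possible way to chase the certificate angle: crt.sh’s public certificate-transparency search can be queried over HTTP, and its JSON output mode returns machine-readable records that could be filtered by issuer and issuance date client-side. The sketch below only builds the query URL; the function name is mine, and the response handling is left out:

```python
from urllib.parse import urlencode

def crtsh_query_url(domain):
    # crt.sh accepts %.domain to match a domain and its subdomains;
    # output=json switches the response to JSON certificate records.
    return "https://crt.sh/?" + urlencode({"q": "%." + domain, "output": "json"})

print(crtsh_query_url("kalamazootimes.com"))
```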

The tech stack also doesn’t look particularly unique and will probably generate way more false positives than it’s worth.

The common s3 bucket used for the images looks like the best link to me.

No, thank you @rviscomi!

135 domains are served from that IP address:
LocalJournals/sites.csv#L351-L485, so I think you may be onto something!

Can we query for all certificates or base_page_ fields with the discovered IPs, in case we missed any with the conventional methods: https://github.com/MassMove/AttackVectors/pull/3?

Yep. Rick is much better with BigQuery than I am, but the requests dataset has both the IP address and the certificate details, so it should be a relatively easy JSON_EXTRACT and WHERE clause.

It’s worth noting that that IP address is what was being used yesterday; I don’t know if it was the same during the crawl (or if it changed through the month, if it isn’t reserved). It’s unusual for a web-accessible server to rotate IP addresses frequently though (especially across multiple servers), so it’s likely safe to use.

This query looks for any pages served from an IP of a website flagged in the massmove table:

SELECT
  page,
  JSON_EXTRACT_SCALAR(payload, '$._ip_addr') AS ip_addr,
  NET.REG_DOMAIN(page) IN (SELECT domain FROM `httparchive.scratchspace.massmove`) AS flagged
FROM
  `httparchive.requests.2020_02_01_mobile`
WHERE
  JSON_EXTRACT_SCALAR(payload, '$._ip_addr') IN (
    SELECT
      JSON_EXTRACT_SCALAR(payload, '$._ip_addr')
    FROM
      `httparchive.requests.2020_02_01_mobile`
    WHERE
      NET.REG_DOMAIN(page) IN (SELECT domain FROM `httparchive.scratchspace.massmove`) AND
      JSON_EXTRACT_SCALAR(payload, '$._final_base_page') = 'true') AND
  JSON_EXTRACT_SCALAR(payload, '$._final_base_page') = 'true'
ORDER BY
  flagged DESC

There are 651 results; however, all of the newly discovered websites have the same IP address, "74.125.142.155", which was matched from https://chicagocitywire.com/. I think the only connection we can draw is that these sites happen to be on the same shared hosting server, not that they were created by the same entity.
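To separate shared hosting from common ownership when pivoting on IPs, one rough offline sketch is to group pages by IP and treat IPs serving many unrelated domains with suspicion. The rows below are made up for illustration, shaped like the query results above:

```python
from collections import defaultdict

# Made-up (page, ip) rows, standing in for the query output.
rows = [
    ("https://kalamazootimes.com/", "3.218.216.245"),
    ("https://chicagocitywire.com/", "74.125.142.155"),
    ("https://unrelated-blog.example/", "74.125.142.155"),
]

pages_by_ip = defaultdict(set)
for page, ip in rows:
    pages_by_ip[ip].add(page)

# An IP serving many unrelated domains looks like shared hosting,
# which weakens it as evidence of common ownership.
for ip, pages in sorted(pages_by_ip.items()):
    print(ip, "shared" if len(pages) > 1 else "dedicated")
```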

Nice, thanks for that. Can you run another search for these tags: UA-147094394- and UA-63225229-? Just a month or two should be enough to start with!

How would we go about querying for the Facebook pixel ID 485774048928360 and the Quantserve tracking pixel p-tBWRHfpb70G7L.gif?

Here’s a query that checks for the GA tags, Facebook pixel ID, and Quantserve ID all in one:

SELECT
  url,
  REGEXP_EXTRACT(body, '(UA-147094394-|UA-63225229-|485774048928360|p-tBWRHfpb70G7L)') AS match
FROM
  httparchive.response_bodies.2020_02_01_mobile
WHERE
  REGEXP_CONTAINS(body, '(UA-147094394-|UA-63225229-|485774048928360|p-tBWRHfpb70G7L)')

Only the Quantserve ID matched:

All of these URLs are known domains in the scratchspace.massmove table.

Did anyone click on these links? :eyes:

I’m not clicking it! :smile: