I am working on a git repo, a group effort to monitor the attack vectors mentioned in the billion-dollar disinformation campaign to reelect the president in 2020: https://github.com/MassMove/AttackVectors. A quick intro:
Presiding over this effort is Brad Parscale, a 6-foot-8 Viking of a man with a shaved head and a triangular beard. As the digital director of Trump’s 2016 campaign, Parscale didn’t become a household name like Steve Bannon and Kellyanne Conway. But he played a crucial role in delivering Trump to the Oval Office—and his efforts will shape this year’s election.
Parscale has indicated that he plans to open up a new front in this war: local news. Last year, he said the campaign intends to train “swarms of surrogates” to undermine negative coverage from local TV stations and newspapers. Polls have long found that Americans across the political spectrum trust local news more than national media. If the campaign has its way, that trust will be eroded by November.
Running parallel to this effort, some conservatives have been experimenting with a scheme to exploit the credibility of local journalism. Over the past few years, hundreds of websites with innocuous-sounding names like the Arizona Monitor and The Kalamazoo Times have begun popping up. At first glance, they look like regular publications, complete with community notices and coverage of schools. But look closer and you’ll find that there are often no mastheads, few if any bylines, and no addresses for local offices. Many of them are organs of Republican lobbying groups; others belong to a mysterious company called Locality Labs, which is run by a conservative activist in Illinois. Readers are given no indication that these sites have political agendas—which is precisely what makes them valuable.
When Twitter employees later reviewed the activity surrounding Kentucky’s election, they concluded that the bots were largely based in America—a sign that political operatives here were learning to mimic [foreign tactics].
Their shit looks really real: https://kalamazootimes.com, at least until you start looking at all the articles at once: https://kalamazootimes.com/stories/tag/126-politics
So far we have found over 700 domains, with more on the way, listed in sites.csv, along with an interactive heat map of where they purport to report from.
I was made aware of the remarkable work you guys and gals did rooting out hidden cryptocurrency miners (https://discuss.httparchive.org/t/the-performance-impact-of-cryptocurrency-mining-on-the-web/1126) and was hoping to get a hand or two with this group effort to give democracy a fighting chance.
With some divine guidance I was able to upload sites.csv to httparchive.scratchspace.massmove:
SELECT * FROM `httparchive.scratchspace.massmove` LIMIT 1000
I have run some initial queries, like this one to see whether there are any common third-party hosts among the known sites:
-- Top 20 request hosts across the known sites' pages.
SELECT APPROX_TOP_COUNT(NET.HOST(url), 20) AS req_host
FROM `httparchive.summary_requests.2020_02_01_mobile`
JOIN (
  SELECT pageid, url AS page
  FROM `httparchive.summary_pages.2020_02_01_mobile`
)
USING (pageid)
WHERE NET.REG_DOMAIN(page) IN (
  SELECT DISTINCT domain
  FROM `httparchive.scratchspace.massmove`
)
The most popular host is jnswire.s3.amazonaws.com. So we flipped the query around and looked for any website that made a request to that host:
-- Every page that requested anything from the jnswire bucket.
SELECT DISTINCT page
FROM `httparchive.summary_requests.2020_02_01_mobile`
JOIN (
  SELECT pageid, url AS page
  FROM `httparchive.summary_pages.2020_02_01_mobile`
)
USING (pageid)
WHERE STARTS_WITH(url, 'https://jnswire.s3.amazonaws.com')
There are 21 results: the 20 known sites plus rgs-istilah-hukum.blogspot.com. That blog innocently, but still interestingly, just hot-links this image from the network: https://jnswire.s3.amazonaws.com/jns-media/98/f7/176642/discrimination_16.jpg, which appears in “Woman alleges race was factor in termination from Cooper B-Line” on the Madison - St. Clair Record and “EEOC: South Carolina child development center fired employee over drug prescription” on Legal Newsline.
The dataset can do other interesting things, like give a rough fingerprint of web technologies used to build the sites:
-- Technology fingerprint of the known sites (Wappalyzer detections).
SELECT category, app, COUNT(0) AS freq
FROM `httparchive.technologies.2020_02_01_mobile`
WHERE NET.REG_DOMAIN(url) IN (
  SELECT DISTINCT domain
  FROM `httparchive.scratchspace.massmove`
)
GROUP BY category, app
ORDER BY freq DESC
The results show all 20 sites using Nginx, Facebook (the like button, probably), jQuery, Google Tag Manager, etc. So maybe this info could be used to look for other similarly built sites; see the sketch after the table:
Row | category | app | freq
--- | --- | --- | ---
1 | Widgets | Facebook | 20
2 | Tag Managers | Google Tag Manager | 20
3 | Reverse Proxy | Nginx | 20
4 | Web Servers | Nginx | 20
5 | JavaScript Libraries | jQuery | 20
6 | Analytics | New Relic | 20
7 | Analytics | Google Analytics | 20
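As a rough, untested sketch of that idea: the fingerprint below is just the co-occurrence of four of the technologies from the table, so expect plenty of innocent lookalikes; it narrows the haystack rather than naming culprits:

-- Sketch: pages whose detected technology stack includes all four of the
-- technologies shared by the known sites. This fingerprint is weak on its
-- own (lots of ordinary sites use this stack), so treat matches as leads.
SELECT url
FROM `httparchive.technologies.2020_02_01_mobile`
WHERE app IN ('Nginx', 'Google Tag Manager', 'New Relic', 'jQuery')
GROUP BY url
HAVING COUNT(DISTINCT app) = 4

Any hits would still need manual vetting against the other signals (no mastheads, few bylines, shared assets).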
Things get interesting when we plug in Google Analytics tags scraped from the domains. WARNING: the query consumes 10 TB ($50 @ $5/TB) for a given month, so only run it if you have cost controls set up (e.g. a maximum-bytes-billed limit in the BigQuery query settings):
-- Find any page whose main HTML contains one of the network's GA tags.
SELECT page, REGEXP_EXTRACT(body, '(UA-114372942-|UA-114396355-|UA-147159596-|UA-147358532-|UA-147552306-|UA-147966219-|UA-147973896-|UA-147983590-|UA-148428291-|UA-149669420-|UA-151957030-|UA-15309596-|UA-474105-|UA-58698159-|UA-75903094-|UA-89264302-)') AS ga
FROM `httparchive.response_bodies.2020_02_01_mobile`
WHERE page = url  -- main HTML document only
AND REGEXP_CONTAINS(body, '(UA-114372942-|UA-114396355-|UA-147159596-|UA-147358532-|UA-147552306-|UA-147966219-|UA-147973896-|UA-147983590-|UA-148428291-|UA-149669420-|UA-151957030-|UA-15309596-|UA-474105-|UA-58698159-|UA-75903094-|UA-89264302-)')
Query results: legalnewsline.com and madisonrecord.com go back to 2014.
Initially there was a glitch with the regex: the trailing dash was missing, so madisonrecord.com correctly matched with UA-474105-7, but krasivye-pozdravlenija.ru also matched with UA-4741059-4 (note the extra digit), and all sorts of random, unrelated stuff popped up!
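Incidentally, the tag list itself could probably be pulled straight out of BigQuery rather than scraped separately. Here is a rough sketch, assuming the tags appear verbatim in the main HTML; the same 10 TB cost warning applies, and REGEXP_EXTRACT only grabs the first tag per page:

-- Sketch: extract each known site's GA property ID (without the trailing
-- view number) to feed back into the search above.
SELECT DISTINCT
  NET.REG_DOMAIN(page) AS domain,
  REGEXP_EXTRACT(body, r'(UA-\d+)-\d+') AS ga_property
FROM `httparchive.response_bodies.2020_02_01_mobile`
WHERE page = url  -- main HTML document only
AND NET.REG_DOMAIN(page) IN (
  SELECT DISTINCT domain
  FROM `httparchive.scratchspace.massmove`
)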
That is as far as we have gotten, and I am only beginning to cut my teeth on the web transparency data in HTTP Archive…
Please shout if you have any ideas about how the HTTP Archive project and the publicly available data in the httparchive repository on Google BigQuery can help us find what else these domains are and were connected to! Query cost is of no concern in light of this information’s want for freedom.
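One idea in that spirit, offered as an untested sketch: BigQuery table wildcards can sweep every crawl at once, which would put dates on when these connections first appeared. UA-474105- is just the example tag from above, and this scans every response_bodies table in the dataset, so the bill will be enormous even by the standards of this thread:

-- Sketch: search every crawl for one of the network's GA tags, to see when
-- and where it first appears. _TABLE_SUFFIX is the crawl table name,
-- e.g. '2020_02_01_mobile'.
SELECT _TABLE_SUFFIX AS crawl, page
FROM `httparchive.response_bodies.*`
WHERE page = url
AND REGEXP_CONTAINS(body, r'UA-474105-')
ORDER BY crawl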