I am working on a git repo and a group effort to monitor the attack vectors mentioned in the billion-dollar disinformation campaign to reelect the president in 2020: https://github.com/MassMove/AttackVectors. A quick intro:
Presiding over this effort is Brad Parscale, a 6-foot-8 Viking of a man with a shaved head and a triangular beard. As the digital director of Trump’s 2016 campaign, Parscale didn’t become a household name like Steve Bannon and Kellyanne Conway. But he played a crucial role in delivering Trump to the Oval Office—and his efforts will shape this year’s election.
Parscale has indicated that he plans to open up a new front in this war: local news. Last year, he said the campaign intends to train “swarms of surrogates” to undermine negative coverage from local TV stations and newspapers. Polls have long found that Americans across the political spectrum trust local news more than national media. If the campaign has its way, that trust will be eroded by November.
Running parallel to this effort, some conservatives have been experimenting with a scheme to exploit the credibility of local journalism. Over the past few years, hundreds of websites with innocuous-sounding names like the Arizona Monitor and The Kalamazoo Times have begun popping up. At first glance, they look like regular publications, complete with community notices and coverage of schools. But look closer and you’ll find that there are often no mastheads, few if any bylines, and no addresses for local offices. Many of them are organs of Republican lobbying groups; others belong to a mysterious company called Locality Labs, which is run by a conservative activist in Illinois. Readers are given no indication that these sites have political agendas—which is precisely what makes them valuable.
When Twitter employees later reviewed the activity surrounding Kentucky’s election, they concluded that the bots were largely based in America—a sign that political operatives here were learning to mimic [foreign tactics].
Their sites look startlingly real (https://kalamazootimes.com), until you start looking at all the articles at once: https://kalamazootimes.com/stories/tag/126-politics
I was made aware of the remarkable work you guys and gals did rooting out hidden cryptocurrency miners: https://discuss.httparchive.org/t/the-performance-impact-of-cryptocurrency-mining-on-the-web/1126. I was hoping to get a hand or two with this group effort to give democracy a fighting chance.
With some divine guidance I was able to upload sites.csv to httparchive.scratchspace.massmove:
SELECT * FROM `httparchive.scratchspace.massmove` LIMIT 1000
And have run some initial queries, like this one to see whether there are any common third-party hosts among the known sites:
SELECT
  APPROX_TOP_COUNT(NET.HOST(url), 20) AS req_host
FROM
  httparchive.summary_requests.2020_02_01_mobile
JOIN (
  SELECT pageid, url AS page
  FROM httparchive.summary_pages.2020_02_01_mobile
) USING (pageid)
WHERE
  NET.REG_DOMAIN(page) IN (
    SELECT DISTINCT domain
    FROM httparchive.scratchspace.massmove
  )
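The idea behind that query can be sketched locally in Python. This is just an illustration with invented toy data, mirroring what APPROX_TOP_COUNT(NET.HOST(url), 20) does over the real summary_requests/summary_pages join:

```python
from collections import Counter
from urllib.parse import urlparse

# Toy request log: (page, requested resource URL) pairs standing in for
# summary_requests joined to summary_pages. URLs are invented for illustration.
requests = [
    ("https://kalamazootimes.com/", "https://jnswire.s3.amazonaws.com/a.jpg"),
    ("https://kalamazootimes.com/", "https://www.googletagmanager.com/gtm.js"),
    ("https://arizonamonitor.com/", "https://jnswire.s3.amazonaws.com/b.jpg"),
    ("https://arizonamonitor.com/", "https://code.jquery.com/jquery.min.js"),
]

# Count how often each request host appears across the known pages,
# analogous to APPROX_TOP_COUNT(NET.HOST(url), 20) in the BigQuery query.
host_counts = Counter(urlparse(url).netloc for _, url in requests)
print(host_counts.most_common(3))
```

With this toy data, jnswire.s3.amazonaws.com comes out on top, just as it does against the real dataset.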
The most popular host is jnswire.s3.amazonaws.com. So we flipped the query around and looked for any website that made a request to that host:
SELECT DISTINCT
  page
FROM
  httparchive.summary_requests.2020_02_01_mobile
JOIN (
  SELECT pageid, url AS page
  FROM httparchive.summary_pages.2020_02_01_mobile
) USING (pageid)
WHERE
  STARTS_WITH(url, 'https://jnswire.s3.amazonaws.com')
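The flipped query is an inverted index: instead of page → hosts, it asks host → pages. A minimal local sketch of that inversion, again with invented toy data (the real lookup happens in BigQuery):

```python
from collections import defaultdict
from urllib.parse import urlparse

# Toy request log: (page, requested resource URL). Invented for illustration.
requests = [
    ("https://kalamazootimes.com/", "https://jnswire.s3.amazonaws.com/a.jpg"),
    ("https://arizonamonitor.com/", "https://jnswire.s3.amazonaws.com/b.jpg"),
    ("https://rgs-istilah-hukum.blogspot.com/", "https://jnswire.s3.amazonaws.com/c.jpg"),
    ("https://example.org/", "https://cdn.example.net/app.js"),
]

# Invert page -> resource into host -> pages.
pages_by_host = defaultdict(set)
for page, url in requests:
    pages_by_host[urlparse(url).netloc].add(page)

# Any host requested by more than one page is a potential linkage signal.
shared = {host: pages for host, pages in pages_by_host.items() if len(pages) > 1}
print(shared)
```

In the toy data only the jnswire host is shared, which is exactly the kind of signal that surfaced the blogspot site in the real results.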
There are 21 results: the 20 known sites plus rgs-istilah-hukum.blogspot.com. That blog innocently, but still interestingly, hot-links this image from the network: https://jnswire.s3.amazonaws.com/jns-media/98/f7/176642/discrimination_16.jpg which also appears on https://madisonrecord.com/stories/511475189-woman-alleges-race-was-factor-in-termination-from-cooper-b-line and https://legalnewsline.com/stories/511427360-eeoc-south-carolina-child-development-center-fired-employee-over-drug-prescription
The dataset can do other interesting things, like give a rough fingerprint of web technologies used to build the sites:
SELECT
  category,
  app,
  COUNT(0) AS freq
FROM
  httparchive.technologies.2020_02_01_mobile
WHERE
  NET.REG_DOMAIN(url) IN (
    SELECT DISTINCT domain
    FROM httparchive.scratchspace.massmove
  )
GROUP BY
  category, app
ORDER BY
  freq DESC
The results show all 20 sites using nginx, Facebook (probably the Like button), jQuery, Google Tag Manager, etc. So maybe this info could be used to look for other similarly built sites:
| category | app | freq |
| --- | --- | --- |
| Tag Managers | Google Tag Manager | 20 |
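One way such a fingerprint could be put to work (a sketch of the idea, not something we have built yet): treat each site's set of detected technologies as a feature set and score candidate sites against the known profile with Jaccard similarity. The candidate site names and tech sets below are invented for illustration:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Technology profile shared by the 20 known sites (from the query results).
known_profile = {"Nginx", "Facebook", "jQuery", "Google Tag Manager"}

# Hypothetical candidate sites with their detected technologies.
candidates = {
    "suspicious-local-news.example": {"Nginx", "Facebook", "jQuery", "Google Tag Manager"},
    "ordinary-blog.example": {"Apache", "WordPress", "jQuery"},
}

scores = {site: jaccard(techs, known_profile) for site, techs in candidates.items()}
print(scores)
```

A site matching the full profile scores 1.0; the fingerprint is weak evidence on its own (lots of sites use nginx and jQuery), but it could narrow down candidates pulled from the technologies table.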
Things get interesting when we plug in Google Analytics tags scraped from the domains. WARNING: the query consumes 10 TB ($50 @ $5/TB) for a given month, so only run it if you have cost controls set up:
SELECT
  page,
  REGEXP_EXTRACT(body, '(UA-114372942-|UA-114396355-|UA-147159596-|UA-147358532-|UA-147552306-|UA-147966219-|UA-147973896-|UA-147983590-|UA-148428291-|UA-149669420-|UA-151957030-|UA-15309596-|UA-474105-|UA-58698159-|UA-75903094-|UA-89264302-)') AS ga
FROM
  httparchive.response_bodies.2020_02_01_mobile
WHERE
  page = url
  AND REGEXP_CONTAINS(body, '(UA-114372942-|UA-114396355-|UA-147159596-|UA-147358532-|UA-147552306-|UA-147966219-|UA-147973896-|UA-147983590-|UA-148428291-|UA-149669420-|UA-151957030-|UA-15309596-|UA-474105-|UA-58698159-|UA-75903094-|UA-89264302-)')
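The $50 figure in the warning above is straightforward to sanity-check: BigQuery's on-demand pricing at the time was $5 per TB scanned, and a dry run reports bytes processed before you commit. A small sketch (the 10 TB figure is the one quoted above; BigQuery's pricing docs count 1 TB as 2**40 bytes):

```python
# BigQuery on-demand pricing assumed here: $5 per TB scanned, 1 TB = 2**40 bytes.
PRICE_PER_TB_USD = 5.0

def query_cost_usd(bytes_processed: int) -> float:
    """Estimate on-demand query cost from bytes scanned
    (e.g. a dry run's reported bytes processed)."""
    return bytes_processed / 2**40 * PRICE_PER_TB_USD

# The response_bodies scan quoted above: roughly 10 TB for one month.
print(query_cost_usd(10 * 2**40))
```

Running a dry run first (or setting a maximum-bytes-billed cost control) is the safe way to avoid surprise bills on the response_bodies tables.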
Initially there was a glitch with the regex: the trailing dash was missing, so madisonrecord.com correctly came back with UA-474105-7, but krasivye-pozdravlenija.ru also turned up with UA-474105-4, and all sorts of random, unrelated stuff popped up!
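The effect of that trailing dash is easy to reproduce with Python's re module (a standalone illustration of one way such false positives arise; the account number UA-4741059 below is invented):

```python
import re

# A page body containing an unrelated, longer GA account number.
body = "ga('create', 'UA-4741059-1', 'auto');"

# Without the trailing dash, 'UA-474105' matches inside the longer,
# unrelated account number UA-4741059: a false positive.
loose = re.search(r"UA-474105", body)

# With the trailing dash, the account number must end exactly there,
# so the unrelated ID no longer matches.
anchored = re.search(r"UA-474105-", body)

print(loose, anchored)
```

The anchored pattern still matches genuine hits like UA-474105-7, since the dash before the property suffix is part of every real tracking ID.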
That is as far as we have gotten, and I am only beginning to cut my teeth on the web transparency data in httparchive…
Please shout if you have any ideas how the HTTP Archive project and the publicly available data in the httparchive repository on Google BigQuery can help us find what else these domains are and were connected to! Query cost is of no concern in light of this information’s want for freedom.