When Threat Hunting Fails: Identifying Malvertising Domains Using Lexical Clustering, presented at FloCon 2018

by Dhia Mahjoub, David Rodriguez, Matthew Foley

Summary: From Java drive-bys to Adobe Flash exploits, low- and mid-tier ad networks have traditionally been targeted and popularized as the distribution point for malicious campaigns. The ad network infrastructure enables a variety of distribution methods, especially if an attacker understands how to game the ad exchange. Further, malvertising groups have begun to evolve toward more ambitious campaigns, serving ad impressions under the guise of fake software updates and tech support scams. Defending against and harvesting these fake update and tech support scams is complicated, however, by the fingerprinting and anti-bot technologies of the poorly vetted ad networks that act as middlemen and that the scams hide behind. The actors launching these attacks are also vigilant, using freshly registered domains and migrating between hosting infrastructures. The question then becomes: which, if any, of the traditional threat hunting methods can be effective against this new breed of malvertising?

In this talk, we introduce a real-time streaming pipeline built on Kafka to stem the initial attack that is observable in DNS logs. The pipeline applies a scalable clustering technique known as locality sensitive hashing (LSH) to hostnames to identify permutations of words and characters from “software”, “update”, “tech”, “support”, and more. We then discuss a novel belief propagation algorithm over a client-hostname bipartite graph that surfaces the related file hosts that lie behind malicious advertisements. Finally, we will disclose the anatomy of a malicious advertising campaign and uncover how the file hosts are often reused in malvertising campaigns.

Attendees will learn: Attendees will become acquainted with the current malvertising threat landscape: ad networks, exchanges, exploits, and popular infection points. The audience will gain a greater understanding of the need for unsupervised lexical clustering, given the weaknesses of traditional methods of lexical and semantic analysis, and of how these methods can be applied to threat hunting. Finally, we'll show how to leverage commodity hardware and open source technologies to uncover more threats and their related infrastructures. This talk will demonstrate how to automate data analysis to identify evolving threats where traditional hand-crafted threat research methods may fail or prove inefficient.
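To make the lexical clustering idea in the summary more concrete, the following is a minimal sketch of MinHash-based locality sensitive hashing over hostname character trigrams. It is not the speakers' pipeline: the abstract does not specify the LSH family, the tokenization, or any parameters, so the choice of MinHash with banding, the constants `NUM_HASHES` and `BANDS`, all function names, and the example hostnames are assumptions made purely for illustration.

```python
import hashlib
import re
from collections import defaultdict
from itertools import combinations

# Assumed parameters (not from the talk): 32 MinHash values, 16 bands of 2 rows.
NUM_HASHES = 32
BANDS = 16
ROWS = NUM_HASHES // BANDS


def shingles(hostname, n=3):
    """Return the set of character n-grams of a hostname.

    Dots and hyphens are dropped first, so rearranged separators do not
    change the token set. The TLD characters remain part of the string.
    """
    s = re.sub(r"[^a-z0-9]", "", hostname.lower())
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.md5(f"{seed}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash(tokens):
    """MinHash signature: the minimum hash value per seed."""
    return tuple(min(_hash(seed, t) for t in tokens)
                 for seed in range(NUM_HASHES))


def candidate_pairs(hostnames):
    """Hostname pairs that share at least one LSH band bucket."""
    buckets = defaultdict(set)
    for host in hostnames:
        sig = minhash(shingles(host))
        for b in range(BANDS):
            band = sig[b * ROWS:(b + 1) * ROWS]
            buckets[(b, band)].add(host)
    pairs = set()
    for members in buckets.values():
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


def jaccard(a, b):
    """Exact Jaccard similarity of two hostnames' shingle sets."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    # Hypothetical hostnames, invented for this example.
    hosts = [
        "tech-support-update.com",
        "techsupport-update.com",
        "software-update-support.net",
        "weather.example.org",
    ]
    for a, b in sorted(candidate_pairs(hosts)):
        print(a, b, round(jaccard(a, b), 2))
```

Banding means only hostnames that collide in at least one bucket are compared, which avoids an all-pairs comparison over a DNS stream. Candidate pairs would still need verification, for example with the exact `jaccard` check, because banding can produce false positives.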