Monitoring Massive Network Traffic Using Bayesian Inference, presented at FloCon 2019

by David Rodriguez

Summary: Monitoring network logs, from DNS requests to TCP connections, is challenging because these logs are both large and noisy, which hinders efforts to identify malicious traffic. In a sizable network, for example, it is common to see thousands of requests made to one destination, with a frequency that is cyclical at one time and sporadic at another. This erratic behavior in network connections causes most unsupervised and supervised statistical modeling to fail. In this talk we discuss methods for performing large-scale Bayesian inference on DNS logs aggregated into count data, representing the number of requests from tens of millions of stub IPs made to hundreds of millions of domains. We describe novel mixtures of common discrete distributions, or hidden Markov processes, that model some of the most sporadic network traffic volumes to domain names. For example, we discuss how the zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) distributions, and their more generalized forms, provide parameters we can use to differentiate traffic volumes associated with day-to-day threats, such as spam and malvertising, from those associated with widespread threats arising from botnets. Using Apache Spark and Stripe’s newly released Rainier, a Bayesian inference library for the JVM, we run tens of thousands of simulations per domain to fit the underlying distribution of requests, then repeat this for millions of domains. We profile the performance by fitting a variety of mixtures of distributions to different sporadic traffic volumes. Because the simulations run often, we then show how to efficiently trend the parameter estimates using exponential moving averages to model day/night and weekday/weekend traffic distributions.
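As an illustration of the trending step, the following is a minimal sketch of an exponential moving average kept separately for each time bucket (for example weekday versus weekend). It is not taken from the talk: the function names, the bucket labels, the smoothing factor `alpha`, and the sample values are all assumptions chosen for the example.

```python
def ema_update(previous, estimate, alpha=0.2):
    """Blend a new parameter estimate into the running average.

    alpha is an assumed smoothing factor in (0, 1]; larger values
    weight recent estimates more heavily.
    """
    if previous is None:
        return estimate  # first observation seeds the average
    return alpha * estimate + (1.0 - alpha) * previous


def trend_by_bucket(estimates, alpha=0.2):
    """Keep one moving average per bucket.

    estimates: iterable of (bucket, value) pairs in time order, where
    bucket is a hypothetical label such as "weekday" or "weekend".
    """
    trends = {}
    for bucket, value in estimates:
        trends[bucket] = ema_update(trends.get(bucket), value, alpha)
    return trends


if __name__ == "__main__":
    # Made-up daily estimates of a Poisson rate for a single domain.
    daily = [("weekday", 10.0), ("weekday", 20.0), ("weekend", 2.0)]
    print(trend_by_bucket(daily, alpha=0.5))
```

Keeping one average per bucket means a quiet weekend does not drag down the weekday baseline, and each update needs only the previous value rather than the full history.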
With hundreds of thousands of simulated and archived traffic patterns associated with benign and malicious network traffic, we show how to reduce false alarms to effectively monitor evolving online threats and masquerading malicious traffic.

In this session, you’ll learn:
• The latest advances in Bayesian inference on the JVM using Stripe’s open-sourced Rainier project
• To scale Bayesian inference to internet-scale datasets using Apache Spark
• To build time-dependent risk and severity metrics identifying network anomalies associated with pernicious threats like spam, malvertising, and botnets

The session also covers mathematical concepts to:
• Model sporadic network traffic using discrete probability distributions
• Build Hidden Markov Models (HMMs) capturing idle/active states of network traffic
• Use Markov chain Monte Carlo (MCMC) methods
• Handle outliers, false alarms, and time-dependent trends
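To make the zero-inflated Poisson and MCMC ideas concrete, here is a small self-contained sketch that fits a ZIP model to simulated request counts with a random-walk Metropolis sampler. It uses only the Python standard library and does not reproduce Rainier's API or the speaker's models. The priors (flat on the idle probability `pi`, a weak exponential prior on the rate `lam`), the proposal step sizes, and the simulated data are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def zip_log_pmf(k, pi, lam):
    """Log-probability of count k under a zero-inflated Poisson.

    With probability pi the source is idle (a structural zero);
    otherwise the count is drawn from Poisson(lam).
    """
    if k == 0:
        return math.log(pi + (1.0 - pi) * math.exp(-lam))
    return math.log(1.0 - pi) + k * math.log(lam) - lam - math.lgamma(k + 1)


def simulate_zip(n, pi, lam, seed=1):
    """Draw n synthetic counts (Knuth's method for the Poisson part)."""
    rng = random.Random(seed)

    def poisson():
        limit, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= rng.random()
            if p <= limit:
                return k
            k += 1

    return [0 if rng.random() < pi else poisson() for _ in range(n)]


def metropolis_zip(counts, n_iter=4000, burn_in=1000, seed=0):
    """Random-walk Metropolis over (pi, lam); returns post-burn-in samples."""
    rng = random.Random(seed)
    hist = Counter(counts)  # distinct count values with multiplicities
    positives = [k for k in counts if k > 0]
    pi = 0.5
    lam = sum(positives) / len(positives) if positives else 1.0

    def log_post(pi, lam):
        if not (0.0 < pi < 1.0) or lam <= 0.0:
            return -math.inf  # outside the support: always rejected
        log_prior = -0.01 * lam  # assumed weak prior on lam, flat on pi
        return log_prior + sum(n * zip_log_pmf(k, pi, lam) for k, n in hist.items())

    current = log_post(pi, lam)
    samples = []
    for i in range(n_iter):
        # Symmetric Gaussian proposals; step sizes are hand-picked assumptions.
        pi_new = pi + rng.gauss(0.0, 0.03)
        lam_new = lam + rng.gauss(0.0, 0.2)
        proposed = log_post(pi_new, lam_new)
        if rng.random() < math.exp(min(0.0, proposed - current)):
            pi, lam, current = pi_new, lam_new, proposed
        if i >= burn_in:
            samples.append((pi, lam))
    return samples


if __name__ == "__main__":
    counts = simulate_zip(500, pi=0.6, lam=5.0)
    samples = metropolis_zip(counts)
    pi_hat = sum(s[0] for s in samples) / len(samples)
    lam_hat = sum(s[1] for s in samples) / len(samples)
    print(f"posterior mean pi={pi_hat:.3f}, lam={lam_hat:.3f}")
```

In this sketch, `pi` plays the role of the idle fraction and `lam` the request rate while active, which is the kind of parameter pair the abstract describes using to tell traffic patterns apart. At the scale described in the talk, a fit like this would be repeated per domain on a distributed engine rather than run in a single process.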