Securing the Spark Fire Hose, presented at LASCON 2016

by Jack Mannino

Summary: Apache Spark is an awesome cluster computing framework used in big data analytics for stream and batch processing. Spark is used for machine learning and predictive analytics on large, streaming data sets from a variety of sources. It is often deployed with a distributed messaging system like Kafka and a high-throughput NoSQL database like Cassandra, and distributed across a cluster of resources with Mesos. As you would imagine, each of these components can hold or process critical data at any given time, and each plays a unique role in keeping our data rolling smoothly through the pipeline. We want to make sure that data remains safe at all times, jobs finish in a timely manner, and things remain stable when s**t hits the fan.
This presentation will focus on securing Spark through code and configuration, as well as its integrations with commonly implemented technologies. We will take a look at the impact of various attacks against Spark and how to limit exposure. We will examine how to avoid common developer-induced issues, such as consuming untrusted serialized objects and the joys of misusing closures. Data protection at rest, in memory, and over the network will be explored to give developers insight into how their data is protected throughout its lifetime. At the end of this presentation, you’ll want to use Spark, but you’ll also hopefully make better security decisions out of the gate.