Lunchtime Table Talk: Data Science Behind the Scenes, Part 2 - "Tidy" Data for Network Traffic Analysis, presented at FloCon 2019

by Andrew Fast

Summary: Data science is rapidly becoming an integral part of the network security industry. Although widespread applications of data science in network security are relatively recent, data science has roots going back decades. Unfortunately, this maturity presents an obstacle for those who are new to the field and seeking to learn. Furthermore, most presentations (whether spoken or written) tend to focus only on the final model and performance results, pushing to the background many of the critical intermediate steps required for success.

The goal of these “Behind the Scenes” lunchtime talks is to help bridge the gap between network analysts and data scientists by providing an overview of some of the foundational, but often unseen, steps that lead to a successful data science result. These talks are meant to be accessible to those desiring to learn more about data science and are intended to benefit network analysts and data scientists alike.

Intended Audience: Anyone who does, leads, or manages data science projects and wants to go behind the models to learn strategies for increasing data science success.

Behind the Scenes, Part 2: “Tidy” Data for Network Traffic Analysis

A critical component for having success with data science is transforming “messy” data into a format suitable for input into data science and machine learning algorithms. Hadley Wickham, one of the premier contributors to the R ecosystem, named the ideal end result “tidy” data. Data scientists estimate 80% of a data science project is spent tidying data. Despite the effort required, tidying data is typically viewed as peripheral to the more exciting algorithms used to get the results. We go behind the scenes to explore what “tidy” looks like for three types of data encountered in network security use cases (tabular, time series, and graph data) and highlight how to transform one data type to another.
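
As a rough illustration of moving between the three data types named above, the following Python sketch starts from a small table of flow records and derives a time series and a graph edge list from it. It is not taken from the talk: the record fields (`ts`, `src`, `dst`, `bytes`), the sample values, the 60-second bucket width, and the helper names `to_time_series` and `to_edge_list` are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical flow records in "tidy" tabular form: one row per flow,
# one column per variable. Field names and values are illustrative only.
flows = [
    {"ts": 0,  "src": "10.0.0.1", "dst": "10.0.0.2", "bytes": 500},
    {"ts": 30, "src": "10.0.0.1", "dst": "10.0.0.3", "bytes": 200},
    {"ts": 65, "src": "10.0.0.2", "dst": "10.0.0.3", "bytes": 300},
    {"ts": 90, "src": "10.0.0.1", "dst": "10.0.0.2", "bytes": 100},
]


def to_time_series(rows, bucket_seconds=60):
    """Tabular -> time series: total bytes per fixed-width time bucket."""
    totals = defaultdict(int)
    for row in rows:
        # Round each timestamp down to the start of its bucket.
        bucket_start = (row["ts"] // bucket_seconds) * bucket_seconds
        totals[bucket_start] += row["bytes"]
    return sorted(totals.items())


def to_edge_list(rows):
    """Tabular -> graph: one weighted, directed edge per (src, dst) pair."""
    weights = defaultdict(int)
    for row in rows:
        weights[(row["src"], row["dst"])] += row["bytes"]
    return [(src, dst, w) for (src, dst), w in sorted(weights.items())]


time_series = to_time_series(flows)  # list of (bucket_start, total_bytes)
edge_list = to_edge_list(flows)      # list of (src, dst, total_bytes)
```

Both derived forms stay tidy in the same sense as the input: each time bucket and each edge is one observation with its own row. Each also discards something the table kept, since the time series drops the endpoints and the edge list drops the timing, which is one reason to retain the tabular form as the starting point.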