Presentation title: Data Hacking Mad Scientist Style, presented at BSidesDFW 2013

by Brian Wylie

Summary: This presentation will tap the ‘Mad Scientist’ within us all by demonstrating the practical use of statistics, linear algebra, and machine learning to help understand and organize large datasets when doing security work. Centered on real data, with examples given in Python using publicly available modules, the presentation focuses on the practical usage of the analytic techniques rather than their formal mathematical underpinnings. Use cases covered in the presentation will include:
• Organizing large log files
• Simple classification of PE Files
• Browser identification using http logs
• Using PCA to visualize high-dimensional data
The majority of the examples will utilize ‘scikit-learn’, a popular open-source (BSD-licensed) Python module with a wide range of machine learning algorithms. The talk will discuss several algorithms in the toolkit and demonstrate their use on security data; we’ll also cover a popular technique called ‘Banded MinHash’ for Locality Sensitive Hashing (LSH) of sparse log file data.
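To give a flavor of the ‘Banded MinHash’ idea, here is a minimal sketch using only the Python standard library. It is not the code from the talk: the helper names, the 3-token shingles, the 32 hash functions, the 8 bands, and the MD5-based hashing are all illustrative assumptions, and the log lines are made up. Each line is reduced to a short MinHash signature, the signature is cut into bands, and any two lines that agree on a whole band become a candidate pair for closer comparison.

```python
import hashlib
from collections import defaultdict


def shingles(line, k=3):
    """Return the set of k-token shingles of a whitespace-tokenized line."""
    tokens = line.split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def seeded_hash(seed, item):
    """Hypothetical hash family: a 64-bit integer from MD5 of 'seed:item'."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(items, num_hashes=32):
    """One minimum per seeded hash function; similar sets share many minima."""
    return tuple(min(seeded_hash(seed, item) for item in items)
                 for seed in range(num_hashes))


def lsh_buckets(signatures, bands=8):
    """Cut each signature into bands; keys with an identical band share a bucket."""
    buckets = defaultdict(set)
    for key, sig in signatures.items():
        rows = len(sig) // bands
        for b in range(bands):
            band = sig[b * rows:(b + 1) * rows]
            buckets[(b, band)].add(key)
    return buckets


def candidate_pairs(buckets):
    """Every pair of keys that landed together in at least one bucket."""
    pairs = set()
    for members in buckets.values():
        ordered = sorted(members)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs


# Made-up log lines, for illustration only.
logs = {
    "line1": "sshd Failed password for root from 10.0.0.5 port 22 ssh2",
    "line2": "sshd Failed password for root from 10.0.0.9 port 22 ssh2",
    "line3": "cron session opened for user backup by uid 0",
}
signatures = {key: minhash_signature(shingles(text)) for key, text in logs.items()}
print(candidate_pairs(lsh_buckets(signatures)))
```

More bands with fewer rows each make the scheme more permissive (more candidate pairs, more false positives); fewer bands with more rows each make it stricter. Candidate pairs are only suggestions, so a real pipeline would usually confirm them with an exact similarity measure before grouping log lines.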