Detecting Threats, Not Sandboxes: Characterizing Network Environments to Improve Malware Classification, presented at FloCon 2017

by David McGrew and Blake Anderson

Summary: Applying supervised machine learning to network data features is increasingly common; it is well suited to tasks such as detecting malicious flows and identifying applications. In these applications, it is essential to avoid biases that arise because different training datasets are obtained in different network environments. Unfortunately, it is not straightforward to understand how these environments introduce biases, and many previous studies have not attempted to do so. In this work, we focus on the important case of training data obtained from malware sandboxes and its use in detecting malware communications on enterprise networks. We present techniques to identify data features derived from the TCP/IP, TLS, DNS, and HTTP protocols that are artifacts of network environments, and show data features that are invariant across those environments.
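One simple way to surface environment-artifact candidates, in the spirit of the analysis described above, is to compare how often each feature appears in flows captured from two environments: features whose prevalence differs sharply are likely environment artifacts rather than invariants. The following is a minimal illustrative sketch (the function name, flow representation, and threshold are assumptions, not from the talk):

```python
# Flag candidate environment-artifact features by comparing feature
# prevalence between flows from two environments. Each flow is modeled
# as a set of feature names (e.g., HTTP header names observed).

def flag_environment_artifacts(env_a_flows, env_b_flows, threshold=0.5):
    """Return features whose prevalence differs between the two
    environments by more than `threshold` (a fraction in [0, 1])."""
    def prevalence(flows):
        counts = {}
        for flow in flows:
            for feat in flow:
                counts[feat] = counts.get(feat, 0) + 1
        return {f: c / len(flows) for f, c in counts.items()}

    pa, pb = prevalence(env_a_flows), prevalence(env_b_flows)
    all_feats = set(pa) | set(pb)
    return {f for f in all_feats if abs(pa.get(f, 0) - pb.get(f, 0)) > threshold}
```

For example, if every enterprise flow carries a "via" header while no sandbox flow does, "via" would be flagged, whereas a header present in both environments at similar rates would not.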
HTTP headers provide a good example: the user-agent is often, but not always, invariant. The Via header, on the other hand, indicates that a flow has passed through a proxy; it therefore reflects the network environment rather than the application's type or intent. In our datasets, nearly 100% of the enterprise HTTP flows contained the "via" header, but it was uncommon in the malware sandbox dataset. A naïve application of machine learning would exploit this fact to achieve low error in cross-validation tests, but it would fail to capture the concept of maliciousness, and its efficacy on real network traffic would suffer.

A similar situation holds for TLS, which contains a complex set of data features. Most Windows sandboxes run Windows XP to maximize the probability that a submitted malware sample executes. TLS flows that rely on the underlying operating system's TLS library would therefore use an outdated version of SChannel. In the cases where malware samples use SChannel, offering obsolete TLS ciphersuites is not an inherent feature of the malware, but rather a feature of the sandbox environment. Understanding and accounting for these biases is necessary to create machine learning models that accurately distinguish malicious traffic from benign enterprise traffic, rather than simply learning to classify different network environments.

In addition to highlighting these pitfalls, we offer solutions to the problems and demonstrate their results. By understanding the target network environment and creating training datasets composed of synthetic samples, we can systematically avoid sandbox bias. For example, when monitoring a network that uses a web proxy and where Windows 10 is the most prevalent operating system, we create synthetic HTTP flows by modifying the existing malware HTTP flows to include the appropriate "via" header.
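The synthetic-HTTP-flow idea described above can be sketched as follows. Given header fields extracted from a sandbox flow, we emulate a proxied enterprise environment by adding a Via header; the function name and proxy string here are illustrative assumptions, not details from the talk:

```python
# Create a synthetic "proxied" HTTP flow from sandbox-derived headers
# by adding a Via header, mimicking traffic that traversed an
# enterprise web proxy. The proxy identifier is a placeholder.

def synthesize_proxied_flow(headers, proxy_id="1.1 corp-proxy"):
    """Return a copy of the header dict with a Via header added;
    the original sandbox flow is left unmodified."""
    synthetic = dict(headers)
    synthetic.setdefault("Via", proxy_id)
    return synthetic

sandbox_flow = {
    "Host": "example.com",
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
}
enterprise_like = synthesize_proxied_flow(sandbox_flow)
```

The invariant features (such as the user-agent) are carried over unchanged, so only the environment-dependent part of the flow is altered.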
Similarly, we modify the TLS ciphersuite offer vector and extensions to resemble those of the appropriate SChannel version. Finally, we combine the synthetic malware dataset with baseline benign data collected from the enterprise network to build robust machine learning classifiers that can be deployed on that network.
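The TLS retargeting step can be sketched in the same style. Here a sandbox flow's ciphersuite offer vector is replaced with one resembling the SChannel version of the target operating system; the code points below are an illustrative subset, not an authoritative SChannel offer list:

```python
# Retarget a sandbox flow's TLS ClientHello features to resemble the
# SChannel version expected on the monitored network. Ciphersuite
# code points are illustrative subsets only.

SCHANNEL_OFFERS = {
    # Windows 10 SChannel: modern ECDHE/AES-GCM suites (subset)
    "win10": [0xC02C, 0xC02B, 0xC030, 0xC02F],
    # Windows XP SChannel: legacy RSA/RC4/3DES suites (subset)
    "winxp": [0x0004, 0x0005, 0x000A],
}

def retarget_tls_offer(flow_features, target_os="win10"):
    """Return a copy of the flow's feature dict with the ciphersuite
    offer vector replaced by the target OS's SChannel-style offer."""
    synthetic = dict(flow_features)
    synthetic["ciphersuites"] = list(SCHANNEL_OFFERS[target_os])
    return synthetic
```

As with the HTTP example, only the environment-dependent feature is rewritten; the rest of the flow's features pass through untouched.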