Enabling Cross-Organizational Threat Sharing through Dynamic, Flexible Transform presented at FIRST 2014

by Chris Strasburg, Andrew Hoying, Daniel Harkness, Scott Pinkerton

Summary: The objective of the Cyber Fed Model (CFM) project is to facilitate the sharing of actionable and relevant cyber threat data between organizations in near real time. One challenge in scaling data sharing is making the upload process as streamlined as possible while ensuring that data is delivered to recipients in a format they can easily integrate into existing cyber processes.
This is not a new challenge; many SIEM vendors and open source tools have addressed the problem of ingesting and producing data in multiple formats. These approaches tend to fall into two broad categories:
• Use pre/post processing of data to extract fields and allow users to define new fields on the fly (e.g. Splunk)
• Develop an extensive library of event parsers that preprocess all events and extract known fields from them (e.g. ArcSight)
Generally, however, these solutions are specific to the vendor or tool in question. Complicating the scenario is the existence of multiple competing standards for data sharing formats. Even when a local tool is capable of producing data in a number of formats, sharing that data with other organizations still requires a transformation for the recipient to parse and act on it.
To address this problem, the CFM project is working to develop a flexible transform capability that will both apply defined parsers to known, well-defined formats and allow data submitters to provide a description of their own data format. Three levels of translation are used to provide on-the-fly transformation to an intermediate representation: syntactic, schematic, and semantic. Uploaders describe the format of the data they provide, and the data is parsed accordingly; downloaders describe the format they want, and the data is produced in that form.
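The three translation levels can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the CFM implementation: the description layout (`syntax`, `schema`, `semantics`), the field names, and every function name below are hypothetical, and a real description language would need to cover far more than delimited text.

```python
import csv
import io

# Hypothetical submitter-provided format description (all keys are illustrative).
DESCRIPTION = {
    "syntax": {"type": "csv", "delimiter": "|"},
    "schema": {"src": "source_ip", "ts": "timestamp", "sev": "severity"},
    "semantics": {"severity": {"1": "low", "2": "medium", "3": "high"}},
}


def parse_syntax(raw, syntax):
    """Syntactic level: split raw text into records of named string fields."""
    if syntax["type"] != "csv":
        raise ValueError("unsupported syntax: " + syntax["type"])
    reader = csv.DictReader(io.StringIO(raw), delimiter=syntax["delimiter"])
    return [dict(row) for row in reader]


def map_schema(record, schema):
    """Schematic level: rename submitter fields to intermediate field names."""
    return {schema.get(name, name): value for name, value in record.items()}


def normalize_semantics(record, semantics):
    """Semantic level: map submitter-specific values onto a shared vocabulary."""
    out = dict(record)
    for field, mapping in semantics.items():
        if field in out:
            # Unknown values pass through unchanged rather than being guessed.
            out[field] = mapping.get(out[field], out[field])
    return out


def to_intermediate(raw, description):
    """Apply the three levels in order to reach the intermediate representation."""
    records = parse_syntax(raw, description["syntax"])
    return [
        normalize_semantics(
            map_schema(record, description["schema"]), description["semantics"]
        )
        for record in records
    ]


RAW = "src|ts|sev\n192.0.2.10|2014-06-23T12:00:00Z|3\n"
RESULT = to_intermediate(RAW, DESCRIPTION)
```

Because the parsing behavior lives in the description rather than in code, a submitter with a different delimiter or different field names would supply a different description instead of requiring a new parser.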
Developing this capability requires addressing a number of questions, for example:
• How do we normalize the semantics between the data produced by various sources for sharing? For example, different tools may interpret the same data differently, yielding similar fields with different meanings.
• How do we maintain relationship data between submitted indicators during transformation?
• How do we remain flexible enough to represent new proposed formats without requiring code changes or manual parser generation?
• How do we handle cases where the output format cannot represent all the information provided by the original sender?
• Which aspects of this transformation should be handled centrally (e.g. within the service) as opposed to at the client?
In this talk, we will present our design in detail, and discuss how our approach addresses the questions above.
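As one illustration of the question about output formats that cannot represent everything the sender provided, the sketch below projects an intermediate record onto the fields an output format supports and reports what was left out. It is only one possible policy and is not taken from the CFM design; the function name `render`, the field names, and the choice to return the dropped field names are all assumptions made for illustration.

```python
def render(record, output_fields):
    """Project an intermediate record onto the fields an output format supports.

    Returns the rendered record and the sorted names of the fields that could
    not be represented, so the loss is reported rather than silent.
    """
    rendered = {name: value for name, value in record.items() if name in output_fields}
    dropped = sorted(name for name in record if name not in output_fields)
    return rendered, dropped


# Hypothetical intermediate record that carries a relationship to another indicator.
record = {
    "source_ip": "192.0.2.10",
    "severity": "high",
    "related_to": "indicator-17",
}

# Hypothetical output format that has no way to express relationships.
rendered, dropped = render(record, {"source_ip", "severity"})
```

Here `dropped` contains `related_to`, which makes the lost relationship visible to the recipient or to the service instead of discarding it without notice.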