Acquisition and Analysis of Large Scale Network Data, presented at AusCERT 2007

by John McHugh, Ron McLeod

Summary:
Introduction:
Detecting malicious activity in network traffic is greatly complicated by the large amounts of noise, junk, and other questionable traffic that can serve as cover for it. With the advent of low-cost mass storage devices and inexpensive computer memory, it has become possible to collect and analyze large amounts of network data covering periods of weeks, months, or even years. This tutorial will present techniques for collecting and analyzing such data, both from network flow data, which can be obtained from many routers or derived from packet header data, and directly from packet data such as that collected by tcpdump, Ethereal, and Network Observer. This version of the course will contain examples from publicly available packet data such as the Dartmouth CRAWDAD wireless data repository and will deal with issues such as the acquisition of data in IP-unstable environments such as those involving DHCP.
Because of the quantity of data involved, we develop techniques, based on filtering of the recorded data stream, for identifying groups of source or destination addresses of interest and extracting the raw data associated with them. The address groups can be represented as sets or multisets (bags) and used to refine the analysis. For example, the set of addresses within a local network that appear as source addresses for outgoing traffic in a given time interval approximates the currently active population of the local network. This set can be used to partition incoming traffic into that which might be legitimate and that which is probably not, since it is not addressed to active systems. Further analysis of the questionable traffic develops smaller partitions that can be identified as scanners, DDoS backscatter, etc., based on flag combinations and packet statistics. Traffic to and from external hosts whose addresses appear as sources in both partitions can be examined for evidence that the active hosts they contacted have been compromised.
The analysis can also be used to characterize normal traffic for a customer network and to serve as a basis for identifying anomalous traffic that may warrant further examination.
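The active-set partitioning described above can be illustrated with a minimal Python sketch. It is not the SiLK implementation: the `Flow` record, the `LOCAL_NET` address block, and all function names are assumptions made for illustration, and a plain Python set and `Counter` stand in for the set and bag structures.

```python
import ipaddress
from collections import Counter
from typing import NamedTuple

# Assumed local address block for this example only.
LOCAL_NET = ipaddress.ip_network("10.0.0.0/24")


class Flow(NamedTuple):
    """Hypothetical minimal flow record; real flow records carry more fields."""
    src: str
    dst: str
    start: int  # start time in seconds


def is_local(addr):
    """True if addr lies inside the assumed local network."""
    return ipaddress.ip_address(addr) in LOCAL_NET


def active_set(flows, t0, t1):
    """Local addresses seen as sources of outgoing traffic in [t0, t1)."""
    return {f.src for f in flows
            if t0 <= f.start < t1 and is_local(f.src) and not is_local(f.dst)}


def partition_incoming(flows, active):
    """Split incoming flows by whether the destination is in the active set."""
    plausible, questionable = [], []
    for f in flows:
        if is_local(f.dst) and not is_local(f.src):
            (plausible if f.dst in active else questionable).append(f)
    return plausible, questionable


def source_bag(flows):
    """Bag (multiset) of source addresses: address -> number of flows."""
    return Counter(f.src for f in flows)


# Made-up sample data using documentation address ranges.
flows = [
    Flow("10.0.0.5", "198.51.100.7", 100),   # outgoing, so 10.0.0.5 is active
    Flow("198.51.100.7", "10.0.0.5", 101),   # incoming to an active host
    Flow("203.0.113.9", "10.0.0.5", 102),    # incoming to an active host
    Flow("203.0.113.9", "10.0.0.77", 103),   # incoming to an inactive address
    Flow("203.0.113.9", "10.0.0.78", 104),   # incoming to an inactive address
]

active = active_set(flows, 0, 3600)
plausible, questionable = partition_incoming(flows, active)
suspects = source_bag(questionable)

# External sources that appear in both partitions deserve a closer look.
in_both = set(source_bag(plausible)) & set(suspects)

print(sorted(active))
print(suspects.most_common(3))
print(sorted(in_both))
```

In this sketch the bag of questionable sources gives a crude ranking of candidate scanners, and `in_both` lists the external hosts whose traffic to active systems might then be examined more closely. Separating scanners from backscatter would additionally require the flag and packet statistics mentioned above, which this simplified record omits.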
Prerequisites:
General familiarity with IP network protocols. Elementary familiarity with simple statistical measures.
Textbooks, etc.:
At the present time, there is no suitable textbook. Participants will receive copies of the SiLKtools analysis handbook, supplemented by reprints of selected publications from the technical literature. The tools themselves are freely available and run on a variety of Unix-based systems, including OS X. The group at Dalhousie is supplementing the SiLK toolset from CERT; these additional tools are available as well and will be included in the discussion.
I Introduction (45 Minutes)
A Review of IP packet structures
B Network data collection tools
1 Cisco NetFlow
2 tcpdump / fprobe / etc.
C A quick tour of "interesting data"
II Data Collection (45 Minutes)
A NetFlow and similar abstractions
B Packet data
C DHCP and dynamic addressing
III The SiLKtools Analysis Suite (90 Minutes)
A Data fields and features
B Selecting data for analysis
1 Selecting raw data records
a rwfilter, a flow selector
b Converting packet data to approximate flows
2 Building sets of IP addresses - rwset, rwbag, buildset, etc.
3 Manipulating sets and bags - bag to set, set union, set intersection, bag addition, etc.
4 Partitioning raw data with sets
C Elementary analysis
1 Network structure - subnet analysis of IP sets
2 Feature extraction - rwcut for raw data
3 Ordering data by time or features - rwsort
4 Flow Counting - rwaddrcount
5 Top N IPs for some N and some feature - rwunique
IV Advanced Analysis (90 Minutes)
A Finding Connections
1 Bloom filters and other sparse relationships
2 Eliminating Non-connections
3 Consolidating unidirectional flows
4 Matching bidirectional components
B Looking for scanners
1 High density scanners - the obvious cases, long term trends
2 Worm residue and related noise
3 Low rate and distributed scans
C Clustering extracted features - rolling your own tools
V Case studies (60 Minutes)
A Worms and worm outbreaks (a recent case, if possible)
B Estimating DDoS attack severity
C A collection of strange individual host behaviors
D Analysis of emergent internet behaviors
E Enterprise level analysis, a case study
VI General Questions and Discussion (30 Minutes)