Clean data profiling presented at Virus Bulletin 2008

by Catherine Robinson (Symantec)

Tags: Security


Summary: The volume of malicious software being created at present is so high that it has triggered discussion in the AV industry
as to whether a blacklisting model will remain feasible in the future. In this context, clean data sets are becoming increasingly
important, and so is the need to classify them.
In this paper, we discuss problems and solutions related to gathering and profiling large clean data sets. We provide
guidelines for gathering clean files, keeping them uncompromised, and determining their level of trust and intrinsic
quality (usefulness).
We present a systematic approach to profiling files and managing the metadata in a clean set. Considering the nature
of the data that needs to be extracted, we group the profiling metadata into two categories: lower-level and higher-level
information. The lower-level data is extracted automatically, directly from files, and contains information that helps
to locate files and determine their type. Higher-level metadata consists of information that allows files to be
categorised. We present the possible sources of this information, which could be obtained automatically or through manual
annotation. We also attempt to define a naming convention for identifying software and standardising the type of data
that can be queried.
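The lower-level metadata described above (file location, identity and type) could plausibly be gathered along the following lines; this is an illustrative sketch only, not the paper's implementation, and the function name and magic-byte table are assumptions:

```python
import hashlib
from pathlib import Path

# Hypothetical magic-byte table for coarse file-type detection;
# a real profiler would use a far richer signature set (e.g. libmagic).
MAGIC_BYTES = {
    b"MZ": "PE executable",
    b"\x7fELF": "ELF executable",
    b"PK\x03\x04": "ZIP archive",
}

def profile_file(path):
    """Extract lower-level metadata from a single file (illustrative)."""
    data = Path(path).read_bytes()
    file_type = "unknown"
    for magic, name in MAGIC_BYTES.items():
        if data.startswith(magic):
            file_type = name
            break
    return {
        "path": str(path),                           # helps locate the file
        "size": len(data),                           # size in bytes
        "sha256": hashlib.sha256(data).hexdigest(),  # stable identity
        "type": file_type,                           # coarse file-type label
    }
```

A record of this shape can be produced fully automatically, which is what distinguishes it from the higher-level, categorisation-oriented metadata discussed next.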
Finally, we take a look at existing clean data sets, profiled and unprofiled, and their shortcomings for this particular
purpose.