Detecting spam pictures using statistical features, presented at Virus Bulletin 2007

by Sandor Antal (Virusbuster)

Tags: Security


Summary: The problem we want to solve is detecting spam messages whose essential information is carried in an
attached picture. Spammers now usually vary these pictures randomly (e.g. by adding small dots or lines), so the
images in two instances of the same spam differ. Their aim is to keep the spam pictures from being detected by
hash-based methods. Our goal was to eliminate the problems caused by this trick and to develop a fast method that is
less sensitive to small differences between pictures than hash-based methods are.
The methods we have developed and use are:
- to calculate statistical parameters of the image file (size, average, standard deviation etc.) without rendering
  the image;
- to smooth the image using different image filtering (IF) methods (for example Gaussian blur or various types of
  granulation filters) in order to remove disturbances such as random dots;
- to calculate global parameters of the image (e.g. brightness, contrast);
- to use these parameters in a hash function which gives similar hash values for similar pictures, so that if the
  hash values of two pictures differ only slightly, the pictures are the same or almost the same;
- to consider these parameters as spam/ham features and use the Bayesian method.
This means that it is enough to train the filter on only a few spam instances (maybe only one) and, unless the
pictures are varied significantly, the filter can detect the modified variations as well.
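
The pipeline described in the summary can be illustrated with a minimal Python sketch. It is not the authors'
implementation: the function names, the 3x3 mean filter (used here as a simple stand-in for Gaussian blur), the
bucket width of 8 gray levels and the distance measure are all assumptions chosen for illustration. The image is
assumed to be already available as a grid of grayscale values, and the Bayesian step is not covered.

```python
import math

def byte_stats(data: bytes) -> tuple[int, float, float]:
    """Size, mean and standard deviation of the raw file bytes (no rendering)."""
    n = len(data)
    if n == 0:
        return 0, 0.0, 0.0
    mean = sum(data) / n
    variance = sum((b - mean) ** 2 for b in data) / n
    return n, mean, math.sqrt(variance)

def box_blur(pixels: list[list[float]]) -> list[list[float]]:
    """3x3 mean filter; an assumed, simpler substitute for Gaussian blur."""
    h, w = len(pixels), len(pixels[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Neighbourhood clipped at the image border.
            window = [pixels[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(sum(window) / len(window))
        out.append(row)
    return out

def global_params(pixels: list[list[float]]) -> tuple[float, float]:
    """Brightness as the mean gray level, contrast as its standard deviation."""
    flat = [v for row in pixels for v in row]
    mean = sum(flat) / len(flat)
    std = math.sqrt(sum((v - mean) ** 2 for v in flat) / len(flat))
    return mean, std

def fuzzy_signature(pixels: list[list[float]], step: float = 8.0) -> tuple[int, ...]:
    """Smooth, measure, then quantize each parameter into coarse buckets."""
    return tuple(int(v // step) for v in global_params(box_blur(pixels)))

def signature_distance(a: tuple[int, ...], b: tuple[int, ...]) -> int:
    """Sum of bucket differences; a small value suggests near-identical pictures."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

Signatures are compared by distance rather than by equality because a parameter that sits close to a bucket
boundary can fall into the neighbouring bucket after a very small change to the picture.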