Exploiting spammers' tactics of obfuscations for better corporate level spam filtering presented at Virus Bulletin 2006

by Vipul Sharma (Proofpoint Inc.)

Tags: Security

URL : http://www.virusbtn.com/conference/vb2006/abstracts/SharmaLewis.xml

Summary : "Spam filters that rely on machine learning often use the content of emails to generate
features for the classification model. One well-known trick for fooling such spam filters
is to introduce random text or noise into the email text - for example, 'Viagra' is spelled
as '/|@gr@' and 'mortgage' as 'm_o_r_t_g-a-g-e'.
Obfuscation is cumbersome to handle because there are endless
ways to obfuscate a given word, so the feature space of the spam classification
model has to be updated frequently with all such words that are seen in spam emails.
This also introduces a lag in countering spam, since the feature space is updated only after such words
are seen in spam emails.
There are at least two possible methods to counter the text obfuscation problem. The first
method is to de-obfuscate the spam message as a preprocessing step of classification.
Previous research has shown that de-obfuscating spam emails gives the best classification
accuracy, but it also suffers from performance-related issues. These drawbacks are especially
damaging to enterprise-class spam solutions, where the number of emails is extremely large,
on the order of tens of millions per day. Any such slow and computationally expensive preprocessing
technique will increase the email delivery time and hardware requirements.
This not only makes the solutions more expensive for end users but also creates
severe performance issues for service providers.
Taking the above constraints into consideration, another technique to counter
obfuscation is to identify the obfuscated words in an email and use them as an indicator of spam.
Previous research reports a success rate of 75% in catching spam emails using such a technique.
To quantify the trade-off between the performance and the classification
accuracy of the technique, we compared several classification algorithms on it.
We report the empirical comparison of various multivariate classification techniques
(e.g. random forests, Bayesian classification, C4.5, etc.) for obfuscation detection.
Our study also shows that by restricting obfuscation detection
to certain 'frequently obfuscated words' and using preprocessing techniques such as discretization
for feature generation, the detection accuracy can be increased to around 96% while
keeping the computational and timing cost to a minimum.
We also report a significant average increase of 0.2% in enterprise-level spam
filtering effectiveness due to auxiliary classification models such as the obfuscation detector."
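
The second approach in the summary (flagging obfuscated words rather than de-obfuscating the whole message) can be illustrated with a minimal Python sketch. This is not the authors' method. The word list, the substitution table, the separator set, and the bucket boundaries are all assumptions made for illustration, and the function names are hypothetical. The sketch checks each token against a small list of 'frequently obfuscated words' and discretizes the resulting count into a categorical feature.

```python
import re

# Hypothetical list of frequently obfuscated words (an assumption, not from the talk).
TARGET_WORDS = {"viagra", "mortgage"}

# Assumed character substitutions; a realistic table would be much larger.
SUBSTITUTIONS = str.maketrans(
    {"@": "a", "1": "i", "|": "i", "0": "o", "$": "s", "3": "e"}
)

# Characters assumed to be inserted between letters as noise.
SEPARATORS = re.compile(r"[._\-*]+")

# Ordinary sentence punctuation that should not count as obfuscation.
EDGE_PUNCTUATION = ".,;:!?\"'()"


def normalize(token):
    """Lowercase, undo the assumed substitutions, and drop separator characters."""
    token = token.lower().strip(EDGE_PUNCTUATION).translate(SUBSTITUTIONS)
    return SEPARATORS.sub("", token)


def is_obfuscated(token):
    """True if the token matches a target word only after normalization."""
    plain = token.lower().strip(EDGE_PUNCTUATION)
    return plain not in TARGET_WORDS and normalize(token) in TARGET_WORDS


def obfuscation_bucket(text):
    """Discretize the number of obfuscated tokens into a categorical feature."""
    count = sum(is_obfuscated(tok) for tok in text.split())
    if count == 0:
        return "none"
    if count == 1:
        return "one"
    return "many"
```

A bucket such as "many" could then be supplied as one input feature to any of the classifiers the summary compares. Restricting the check to a short target list keeps the per-message cost roughly proportional to the number of tokens, which is the kind of trade-off the summary describes.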