Pseudo-words for spam detection in an unmodified Naive Bayesian Text Classifier presented at Virus Bulletin 2005

by John Graham-cumming (Popfile project),

Tags: Security


Summary : The POPFile program has proved highly accurate at spam detection with a low false positive rate, yet uses an unmodified Naïve Bayesian Text Classifier with no ‘magic’ values or tweaks. Initially, POPFile performed poorly against spam, but a library of email parsing code and a set of pseudo-words (non-words fed into the classifier that indicate particular email features – e.g. obfuscation of a spammy word, such as Viagra) have brought POPFile to over 99.8% accuracy. This paper will detail every POPFile pseudo-word, how they are created from spam and ham messages, and give empirical data on their importance when scored against a large corpus of spam and ham messages.