Neural Networks For Spam Detection

Description

The idea is to use a neural network to classify emails as spam (unsolicited messages) or ham (wanted, personal messages). The whole process consists of the following steps:

Building a word list
Creating the neural network
Training the network
Testing the network
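
The four steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the original implementation: the tokenizer, the toy messages, the word list size, the learning rate, and the choice of a single sigmoid output neuron with no hidden layer (the smallest possible network) are all made up for this example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lower-case a message and split it into words (assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_word_list(messages, size):
    """Step 1: keep the `size` words that occur in the most messages."""
    counts = Counter()
    for text in messages:
        counts.update(set(tokenize(text)))  # count each word once per message
    # Sort by descending count, then alphabetically, so the result is stable.
    return sorted(counts, key=lambda w: (-counts[w], w))[:size]

def encode(text, word_list):
    """One input per listed word: 1.0 if the word is present, else 0.0."""
    words = set(tokenize(text))
    return [1.0 if word in words else 0.0 for word in word_list]

class TinyNetwork:
    """Step 2: one input neuron per word, one sigmoid output neuron."""

    def __init__(self, n_inputs):
        self.weights = [0.0] * n_inputs
        self.bias = 0.0

    def predict(self, inputs):
        total = self.bias + sum(w * x for w, x in zip(self.weights, inputs))
        return 1.0 / (1.0 + math.exp(-total))

    def train(self, samples, epochs=500, rate=0.5):
        """Step 3: per-sample gradient descent (target 1.0 means spam)."""
        for _ in range(epochs):
            for inputs, target in samples:
                error = target - self.predict(inputs)
                self.bias += rate * error
                for i, x in enumerate(inputs):
                    self.weights[i] += rate * error * x

# Toy training data, invented for this sketch.
spam = ["Cheap pills, buy now", "Buy cheap watches now", "Win money now"]
ham = ["Lunch at noon tomorrow?", "Meeting notes attached", "See you at the meeting"]

word_list = build_word_list(spam + ham, size=20)
samples = [(encode(t, word_list), 1.0) for t in spam]
samples += [(encode(t, word_list), 0.0) for t in ham]

network = TinyNetwork(len(word_list))
network.train(samples)

# Step 4: test on messages the network has not seen during training.
spam_score = network.predict(encode("Buy cheap pills", word_list))
ham_score = network.predict(encode("Notes from the meeting", word_list))
print(f"spam-like message: {spam_score:.2f}, ham-like message: {ham_score:.2f}")
```

Words that are not in the word list (such as "from" in the second test message) simply do not activate any input neuron, which is why the size and quality of the word list matter.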
Detailed instructions
Improvements for an automated email-gate

Use a dynamic network with a fixed number of neurons. Every time a message comes in, its words are added to the toplist. A page replacement algorithm then evicts input neurons that were rarely activated in the past and replaces them with words that were found more frequently than the word of the evicted neuron. This way the network adapts not only by learning new word patterns in emails, but also by learning new words. This is important because while anti-spam developers try to improve their detection tools, spammers try to circumvent those measures, and this results in significantly different types of messages.

Observations
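
The dynamic word list proposed under "Improvements for an automated email-gate" could look roughly like the sketch below. The text does not say which page replacement algorithm is meant, so a least-frequently-used policy is assumed here; the class name `DynamicWordList`, its methods, and the example words are hypothetical.

```python
from collections import Counter

class DynamicWordList:
    """Fixed number of input slots; rarely activated words get evicted.

    Assumption: least-frequently-used eviction. The source only says
    "a page replacement algorithm".
    """

    def __init__(self, n_slots):
        self.n_slots = n_slots
        self.slots = []              # slots[i] = word wired to input neuron i
        self.activations = {}        # slotted word -> times seen so far
        self.candidates = Counter()  # unslotted word -> times seen so far

    def observe(self, words):
        """Count the words of one message; return the indices of replaced slots."""
        replaced = []
        for word in sorted(set(words)):  # sorted to keep results deterministic
            if word in self.activations:
                self.activations[word] += 1
            elif len(self.slots) < self.n_slots:
                self.slots.append(word)  # a free input neuron is still available
                self.activations[word] = 1
            else:
                self.candidates[word] += 1
                # Find the input neuron whose word was activated least often.
                index = min(range(self.n_slots),
                            key=lambda i: self.activations[self.slots[i]])
                victim = self.slots[index]
                if self.candidates[word] > self.activations[victim]:
                    # Swap: the victim becomes a candidate, the new word takes over.
                    self.candidates[victim] = self.activations.pop(victim)
                    self.activations[word] = self.candidates.pop(word)
                    self.slots[index] = word
                    replaced.append(index)
        return replaced

word_list = DynamicWordList(n_slots=2)
word_list.observe(["meeting", "notes"])     # fills both slots
word_list.observe(["meeting"])              # "meeting" is now seen twice
first = word_list.observe(["pills"])        # seen once: not frequent enough yet
second = word_list.observe(["pills"])       # seen twice: evicts "notes"
print(word_list.slots, first, second)
```

The returned indices tell the caller which input neurons changed meaning. A network using this list would presumably have to reset or retrain the weights attached to those inputs, since the old weights were learned for a different word.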
Results

Error while training
Error of all trained mails
The first 60 mails are all spam; the rest are ham.

Error of all patterns in the test set
Again, the first 60 mails are all spam; the rest are ham.

Accuracy

When working in sensitive fields like this, the danger of falsely marking a ham message as spam should not be underestimated. To mitigate this, a correcting factor of 0.9 was introduced when calculating the spamicity of a message. A threshold of 0.5 was then used to decide between ham and spam.

Ham factor                          0.9
Spam factor                         0.1
Threshold                           0.5
False negatives (of 60 spam)        6      10.00%
False positives (of 150 ham)        2       1.33%
Accuracy on spam (true positives)   90.00%
Accuracy on ham (true negatives)    98.67%

These figures show that this neural network correctly classified 98.67% of all ham messages as ham (2 false positives) and 90% of the spam messages as spam (6 false negatives). In other words, you would have received 90% less spam, and only 1.33% of wanted messages would have landed in the junk mail folder.

Resources

You can download the

Neural Network Frameworks
Bayesian Filtering
Spam Databases
Related Projects and Research
Other ideas
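
As a cross-check of the Accuracy section, the reported percentages can be recomputed from the raw counts (60 spam and 150 ham messages, 6 false negatives, 2 false positives). The sketch below also shows the threshold decision. How the correcting factor enters the spamicity is not specified in the text, so the code starts from final spamicity values; the function names, the sample scores, and the choice of `>=` at the threshold are assumptions.

```python
def count_errors(spamicities, is_spam, threshold=0.5):
    """Apply the threshold and count misclassified messages.

    A message is treated as spam when its spamicity reaches the threshold
    (whether the boundary itself counts as spam is an assumption).
    """
    false_negatives = sum(
        1 for s, spam in zip(spamicities, is_spam) if spam and s < threshold)
    false_positives = sum(
        1 for s, spam in zip(spamicities, is_spam) if not spam and s >= threshold)
    return false_negatives, false_positives

def rates(n_spam, n_ham, false_negatives, false_positives):
    """Turn raw counts into the rates used in the Accuracy section."""
    return {
        "false_negative_rate": false_negatives / n_spam,
        "false_positive_rate": false_positives / n_ham,
        "spam_detected": (n_spam - false_negatives) / n_spam,
        "ham_kept": (n_ham - false_positives) / n_ham,
    }

# Invented scores: two spam messages followed by two ham messages.
toy_errors = count_errors([0.9, 0.4, 0.2, 0.6], [True, True, False, False])

# Counts taken from the Accuracy section.
reported = rates(n_spam=60, n_ham=150, false_negatives=6, false_positives=2)
for name, value in reported.items():
    print(f"{name}: {value:.2%}")
```

With the reported counts, this gives a false negative rate of 10.00%, a false positive rate of 1.33%, 90.00% of spam detected, and 98.67% of ham kept.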
Copyright and License
© Copyright 2004 - 2006 Nicola Fankhauser. All Rights Reserved.