Neural Networks For Spam Detection

Description

The idea is to use a neural network to classify emails as either spam (unsolicited messages) or ham (wanted, personal messages).

The following steps characterize the whole process:

Building a word list

  • Get ~100 spam messages.
  • Build a list of the words occurring in all 100 messages, together with the number of times each word appears, and sort it by that count.
  • Choose a handy set of words from the top n of the word list; eliminate domain names etc.
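
The word-list step above can be sketched in Python as follows. This is only an illustration, not the generate_combined script from the package; the function name `build_word_list`, the tokenizing regular expression and the `exclude` parameter are assumptions made for this example.

```python
import re
from collections import Counter

def build_word_list(messages, top_n=50, exclude=()):
    """Count words over all messages and return the top_n most frequent ones."""
    counts = Counter()
    for text in messages:
        # Lower-case the text and keep runs of letters and apostrophes as words.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    # Drop unwanted entries by hand (e.g. domain names), as described above.
    for word in exclude:
        counts.pop(word, None)
    return [word for word, _ in counts.most_common(top_n)]

# Tiny made-up corpus, standing in for the ~100 spam messages.
spam = ["Buy cheap pills now", "cheap pills cheap offer", "offer ends now"]
print(build_word_list(spam, top_n=3, exclude=["ends"]))
```

In practice the manual selection step still matters: the most frequent words are often header fragments or very common words that carry little information.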

Creating the neural network

  • Use the top n word list as input neurons
  • Add one hidden layer with as many neurons as input neurons
  • Create two output neurons (one for spam and one for ham)
  • Connect every input neuron with every hidden neuron, and every hidden neuron with the two output neurons
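
The topology described above can be written down in a few lines of Python. This is a minimal sketch and not the network file produced by generate_network; the random initial weights, the sigmoid activation and all function names are assumptions, since the text only fixes the layer sizes and the full connectivity.

```python
import math
import random

def sigmoid(x):
    # Numerically safe logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def make_network(n_inputs, seed=0):
    """One weight per connection: input -> hidden and hidden -> output."""
    rng = random.Random(seed)
    n_hidden = n_inputs   # as many hidden neurons as input neurons
    n_outputs = 2         # one output for spam, one for ham
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                for _ in range(n_hidden)]
    w_output = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
                for _ in range(n_outputs)]
    return w_hidden, w_output

def activate(network, inputs):
    """Propagate word counts through the network, return (spam, ham) outputs."""
    w_hidden, w_output = network
    hidden = [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in w_hidden]
    return [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_output]

network = make_network(n_inputs=4)
connections = sum(len(row) for layer in network for row in layer)
print(connections)                         # 4*4 input-hidden plus 4*2 hidden-output
print(activate(network, [2, 0, 1, 0]))     # untrained, so the values are arbitrary
```

With n input words the network has n*n + 2*n connections, so the size of the word list directly determines how expensive training becomes.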

Train the network

  • Create a training set: take the spam emails that were used to build the word list and add ~100 normal, personal ham mails.
  • In each message, count the occurrences of every word in the word list and set the corresponding input neuron to this value.
  • Activate the network and adjust the weights of all connections using an algorithm like backpropagation, according to the difference between the desired and the actual output neuron values.
  • Repeat these steps until all messages in the training set have been used for training.
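
The training loop can be sketched in plain Python as shown here. In the project itself the training is done inside JavaNNS, so this is only an illustration of the idea. Assumptions made for the sketch: sigmoid units, an extra bias weight per neuron, online backpropagation with a fixed learning rate, several passes (`epochs`) over the training set, and the names `to_pattern`, `forward` and `train`.

```python
import math
import random
import re

def sigmoid(x):
    # Numerically safe logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def to_pattern(text, words):
    """Input vector: how often each word of the word list occurs in text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [float(tokens.count(w)) for w in words]

def forward(net, x):
    w_hidden, w_output = net
    xb = list(x) + [1.0]   # constant bias input (an assumption of this sketch)
    hb = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hidden] + [1.0]
    out = [sigmoid(sum(w * h for w, h in zip(row, hb))) for row in w_output]
    return xb, hb, out

def train(patterns, n_inputs, epochs=500, rate=0.5, seed=0):
    """patterns: list of (input_vector, (spam_target, ham_target))."""
    rng = random.Random(seed)
    n_hidden = n_inputs
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                for _ in range(n_hidden)]
    w_output = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                for _ in range(2)]
    net = (w_hidden, w_output)
    for _ in range(epochs):
        for x, target in patterns:
            xb, hb, out = forward(net, x)
            # Output deltas: (desired - actual) times the sigmoid derivative.
            d_out = [(t - o) * o * (1.0 - o) for t, o in zip(target, out)]
            # Hidden deltas: output deltas propagated back through w_output.
            d_hid = [hb[j] * (1.0 - hb[j])
                     * sum(d_out[k] * w_output[k][j] for k in range(2))
                     for j in range(n_hidden)]
            for k in range(2):
                for j in range(n_hidden + 1):
                    w_output[k][j] += rate * d_out[k] * hb[j]
            for j in range(n_hidden):
                for i in range(n_inputs + 1):
                    w_hidden[j][i] += rate * d_hid[j] * xb[i]
    return net

# Made-up miniature training set; targets are (spam, ham).
words = ["cheap", "pills", "meeting"]
spam = ["cheap pills cheap", "pills cheap pills"]
ham = ["meeting tomorrow meeting", "see you at the meeting"]
patterns = [(to_pattern(m, words), (1.0, 0.0)) for m in spam]
patterns += [(to_pattern(m, words), (0.0, 1.0)) for m in ham]
net = train(patterns, n_inputs=len(words))
spam_out, ham_out = forward(net, to_pattern("cheap pills", words))[2]
print(spam_out > ham_out)
```

The two output neurons are trained towards (1, 0) for spam and (0, 1) for ham, so a message can later be judged by comparing the two output values.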

Test the network

  • Take another 100 spam emails and 100 normal, personal ham mails, as when building the training set.
  • Check how well the trained network performs on these previously unseen messages.

Detailed instructions

  1. Get a Linux box. All the scripts I wrote are for the Bash shell. Be warned: use small sets of emails first, as my scripts are not optimized for speed.
  2. All email messages have to be in maildir format. If you have mails in mailbox (mbox) format (like I did), you can convert them with the shell script 2md. Typically, you'll want three types of emails:
    1. Ham mails (belong to the training set)
    2. Spam mails (belong to the training set as well)
    3. A test set with mixed mails, to see what the network is able to do.
  3. Use generate_combined to create a word list from known spam messages. Redirect the output of this script to a file (e.g. words).
  4. Use generate_network with your word list 'words' to generate the JavaNNS network. Again, redirect its output to a file (e.g. called network).
  5. For each of the three sets (ham, spam and test mails), use generate_patterns to generate the patterns (training data for JavaNNS). Redirect its output to a file (e.g. called patterns{ham,spam,test}).
  6. Now open the network in JavaNNS and train it with the patterns for ham and spam.
  7. Time to test your network and see how well it performs: open the test patterns and disable training when you run them. Refer to the JavaNNS documentation for information on how to use JavaNNS.
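
If the 2md script from step 2 is not at hand, the mbox-to-maildir conversion can also be done with Python's standard `mailbox` module. This is a stand-in sketch, not a port of 2md; the function name `mbox_to_maildir` and its arguments are chosen for this example.

```python
import mailbox

def mbox_to_maildir(mbox_path, maildir_path):
    """Copy every message from an mbox file into a maildir; return the count."""
    source = mailbox.mbox(mbox_path, create=False)       # fail if the mbox is missing
    target = mailbox.Maildir(maildir_path, create=True)  # create the maildir if needed
    count = 0
    try:
        for message in source:
            target.add(message)
            count += 1
        target.flush()
    finally:
        source.close()
        target.close()
    return count
```

A call such as `mbox_to_maildir("spam.mbox", "spam_maildir")` (both paths hypothetical) would then produce one of the three mail sets listed in step 2.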

Improvements for an automated email-gate

Use a dynamic network with a fixed number of neurons. Every time a message comes in, its words are added to the top list. Using a page-replacement algorithm, input neurons that were rarely activated in the past are evicted and replaced by words that were found more frequently than the word of the evicted neuron.

This way the network adapts not only by learning new word patterns in emails, but also by learning new words. This is important because while anti-spam developers try to improve their detection tools, spammers try to circumvent those measures, which results in significantly different types of messages.
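
One possible reading of this replacement scheme is a least-frequently-used policy over a fixed number of input slots, sketched here. The class `DynamicWordList`, its counters and the exact eviction rule are assumptions for illustration; the idea above does not prescribe a particular page-replacement algorithm.

```python
from collections import Counter

class DynamicWordList:
    """Fixed number of input slots; rarely seen words get evicted."""

    def __init__(self, size):
        self.size = size
        self.slots = {}               # word -> how often its input neuron was activated
        self.candidates = Counter()   # words seen so far that have no neuron yet

    def observe(self, words):
        for word in words:
            if word in self.slots:
                self.slots[word] += 1
            elif len(self.slots) < self.size:
                self.slots[word] = 1
            else:
                self.candidates[word] += 1
                # Least activated input neuron is the eviction candidate.
                victim = min(self.slots, key=self.slots.get)
                if self.candidates[word] > self.slots[victim]:
                    # Swap roles: the victim becomes a candidate again.
                    self.candidates[victim] = self.slots.pop(victim)
                    self.slots[word] = self.candidates.pop(word)

wordlist = DynamicWordList(size=2)
wordlist.observe(["offer", "pills"])
wordlist.observe(["mortgage", "mortgage"])
print(sorted(wordlist.slots))
```

The sketch only manages the word list. In a real network, the weights of a reused input neuron would also have to be reset or retrained, which is not shown here.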

Observations

  • Success rate depends heavily on the kind of spam and ham email you get. If your normal emails are mostly written in one language (e.g. German or French) and you only get spam mails in another language (e.g. English), this scheme works very well.
  • Base64-encoded messages are difficult to classify (but Base64 is mostly used by spammers to obfuscate content).
  • How the word list should be processed before it is used as input to the neural network needs further research. We simply eliminated very frequent English words (e.g. you, this). A good approach would be to compare the generated word list against a statistical analysis of the English language and eliminate the words that are frequent in English.
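
The frequency-based filtering suggested in the last point could look roughly like this. The function name, the `cutoff` parameter and especially the frequency values are placeholders invented for the example; a real run would need measured frequencies from an English corpus.

```python
def drop_common_words(word_counts, language_frequency, cutoff=0.005):
    """Keep only words whose relative frequency in general English is below cutoff."""
    return {word: count for word, count in word_counts.items()
            if language_frequency.get(word, 0.0) < cutoff}

# Placeholder relative frequencies, made up for illustration only.
english = {"you": 0.01, "this": 0.008, "the": 0.05}
# Made-up word counts, standing in for the generated word list.
spam_counts = {"you": 120, "this": 80, "mortgage": 40, "free": 35}
print(drop_common_words(spam_counts, english))
```

Words that do not appear in the frequency table at all are kept, on the assumption that they are rare in ordinary English.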

Results

Error while training

http://variant.ch/papers/NNSpam/small_training_error.png

Error of all trained mails

http://variant.ch/papers/NNSpam/small_training.png

The first 60 mails are all spam mails; the rest are normal.

Error of all patterns in the test set

http://variant.ch/papers/NNSpam/small_test.png

Again, the first 60 mails are all spam mails; the rest are normal.

Accuracy

When working in sensitive fields like this, the danger of falsely marking a ham message as spam should not be underestimated. To mitigate this, a correcting factor of 0.9 was introduced when calculating the spamicity of a message. A threshold of 0.5 was then used to decide between ham and spam.

 Ham factor     0.9
 Spam factor    0.1
 Threshold      0.5

 False negatives (of 60 spam)   6       10.00%
 False positives (of 150 ham)   2       1.33%

 Accuracy on spam (true positives)      90.00%
 Accuracy on ham (true negatives)       98.67%

These figures show that with this neural network 98.67% of all ham messages were correctly classified as ham (2 false positives). 90% of the spam messages were correctly classified as spam (6 false negatives). This means that you would have received 90% less spam, and only 1.33% of the wanted messages would have landed in the junk mail folder.
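
The percentages follow directly from the counts in the table, as this short check shows. The function name `rates` and the dictionary keys are chosen for this example; the overall figure is derived here and is not part of the original table.

```python
def rates(false_negatives, n_spam, false_positives, n_ham):
    """Turn the error counts of a test run into rates (values between 0 and 1)."""
    return {
        "spam_missed": false_negatives / n_spam,
        "spam_caught": (n_spam - false_negatives) / n_spam,
        "ham_lost": false_positives / n_ham,
        "ham_kept": (n_ham - false_positives) / n_ham,
        "overall": (n_spam - false_negatives + n_ham - false_positives)
                   / (n_spam + n_ham),
    }

# Counts from the table above: 6 of 60 spam missed, 2 of 150 ham misfiled.
for name, value in rates(6, 60, 2, 150).items():
    print(f"{name}: {value * 100:.2f}%")
```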

Resources

You can download the package with all needed files to run NNSpam.

Neural Network Frameworks

  • Java Neural Network Simulator JavaNNS
  • Java Object Oriented Neural Engine Joone

Bayesian Filtering

  • An evaluation of Naive Bayesian anti-spam filtering, Ion Androutsopoulos et al.
  • Naive Bayesian Learning, Charles Elkan
  • Naive Bayes Classifiers poster: http://www.coli.uni-sb.de/~crocker/Teaching/Connectionist/lecture10_4up.pdf
  • Spam-filtering techniques using Bayesian filters: http://www.paulgraham.com/spam.html

Spam Databases

  • The Great Spam Archive
  • Database of known spam: http://www.spamarchive.org/
  • UCI Machine Learning Repository

Related Projects and Research

  • A Hybrid Neural Network for Automated Classification, Samea A. Wood and Tamás D. Gedeon
  • Spamfilter v1.0, Bob Boyer and William Kerney

Other ideas

  • Use of the gzip algorithm to remove redundancy: http://www.kuro5hin.org/story/2003/1/25/224415/367
  • General definition of spam: http://www.ai.mit.edu/~jrennie/spamconference/

Copyright and License

  • Copyright 2002, 2003 by Christian Eichenberger, Nicola Fankhauser
  • All source code, files and this document are released under the GNU General Public License (GPL).

Last edited on 14.10.2003 14:47.



© Copyright 2004 - 2006 Nicola Fankhauser. All Rights Reserved.