February 18, 2003 12:00 AM

Bayesian Spam Filters

Windows IT Pro
InstantDoc ID #38059

One of the most promising antidotes to spam is so-called Bayesian filtering, which calculates the probability that a given message is spam, based on analysis of messages previously identified as being spam or not being spam. The Bayesian approach demands less maintenance than keyword-based spam filters that require constant updating of word and phrase lists.

Much of the buzz around this technique started with Paul Graham's August 2002 article "A Plan for Spam" (see the first URL below). Although there is some debate about whether Graham's approach is precisely Bayesian, organizations have been exploring Bayesian methods and applying them to spam for several years. Microsoft Research's antispam effort, spearheaded by a group of Bayesian researchers, began in 1997 and has resulted in a patent. If you want to keep up with spam-fighting techniques, some understanding of the Bayesian technique is in order.

The Bayes in Bayesian was an 18th-century British clergyman and amateur mathematician, Thomas Bayes, who suggested in a posthumously published paper that the probability of some event occurring in the future is related to the proportion of times that event occurred in the past under the same circumstances. Later, mathematicians refined Bayes's ideas and, in the 20th century, built a formal system of classification and decision-making and began applying it to many tasks in science and engineering. (I first encountered Bayesian inference in the context of economics.) A key element of the Bayesian approach is that it depends on having some prior information about the problem at hand.

To some extent, the Bayesian approach models our everyday experience of using probability to try to determine the possible outcome of an action and make decisions. The Bayesian interpretation of probability is different from the coin-flipping experiments that most of us did in school (and which, I'm convinced, are largely an effort to convince students of the futility of gambling). Life isn't a series of random experiments from which we calculate frequency distributions. We must make decisions taking into account the likelihood of different consequences arising from those decisions and whether those consequences are good or bad.

In the case of spam, Bayesian inference suggests that if a new message contains text that appeared often in spam in the past but rarely in legitimate messages, then the new message is likely to be spam. The formal methods of calculating such a probability can also take into account the fact that a single false positive--a legitimate message quarantined as spam--is far more costly than many false negatives (spam messages left untouched in your Inbox).
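To make the idea concrete, here is a minimal Python sketch of that kind of calculation. It is an illustration under stated assumptions, not Graham's method or any product's actual algorithm: the function names, the add-one smoothing, the equal prior weighting of spam and legitimate mail, and the 0.9 threshold are all choices made for this example. A high threshold is one simple way to reflect the greater cost of a false positive.

```python
from collections import Counter

# Assumed cutoff for this sketch: only quarantine when the evidence is strong,
# because a false positive costs more than a false negative.
SPAM_THRESHOLD = 0.9


def train(messages):
    """Count how many messages each lowercase word appears in."""
    counts = Counter()
    for text in messages:
        counts.update(set(text.lower().split()))
    return counts


def token_spam_probability(token, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate P(spam | token) with Bayes' rule, assuming equal priors."""
    # Add-one smoothing keeps an unseen token from forcing a result of 0 or 1.
    p_token_given_spam = (spam_counts[token] + 1) / (n_spam + 2)
    p_token_given_ham = (ham_counts[token] + 1) / (n_ham + 2)
    return p_token_given_spam / (p_token_given_spam + p_token_given_ham)


def message_spam_probability(text, spam_counts, ham_counts, n_spam, n_ham):
    """Combine per-token probabilities, treating tokens as independent."""
    spam_product = 1.0
    ham_product = 1.0
    for token in set(text.lower().split()):
        p = token_spam_probability(token, spam_counts, ham_counts, n_spam, n_ham)
        spam_product *= p
        ham_product *= 1.0 - p
    return spam_product / (spam_product + ham_product)


def is_spam(text, spam_counts, ham_counts, n_spam, n_ham):
    """Flag a message only when its spam probability clears the threshold."""
    probability = message_spam_probability(
        text, spam_counts, ham_counts, n_spam, n_ham
    )
    return probability > SPAM_THRESHOLD
```

To try it, call train() once on a list of known spam and once on a list of known legitimate messages, then pass both counters and the two list lengths to is_spam(). Lowering SPAM_THRESHOLD catches more spam at the price of more false positives.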

Graham's method analyzes not just the message body but also the message header, which might contain information about the sender's mail server, foreign character sets, and attachments. He claims that his filter catches 99.5 percent of spam with fewer than one false positive for every 1000 messages received.

Graham presented an update at an antispam conference at MIT last month. He has expanded his list of "tokens"--telltale words and phrases to look for in incoming mail--to about 187,000 items. And his method can now handle a word differently depending on whether it appears in the subject, in a URL, or in an address field.
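One simple way to treat a word differently depending on where it appears is to tag each token with the field it came from, so that "free" in the subject and "free" in the body are counted as separate tokens. The Python sketch below shows the idea; the tokenize and message_tokens functions, the field*word naming, and the character set in the pattern are assumptions made for illustration and are not taken from Graham's filter.

```python
import re

# Assumed token pattern for this sketch: letters, digits, and a few symbols.
TOKEN_PATTERN = re.compile(r"[a-z0-9$'!-]+")


def tokenize(field_name, text):
    """Return lowercase tokens, prefixed with the field they came from."""
    words = TOKEN_PATTERN.findall(text.lower())
    if field_name == "body":
        # Body tokens stay unmarked; other fields get a "field*" prefix.
        return words
    return [f"{field_name}*{word}" for word in words]


def message_tokens(subject, body):
    """Collect tokens from the subject and body as distinct vocabularies."""
    return tokenize("subject", subject) + tokenize("body", body)


print(message_tokens("Free offer", "free lunch today"))
# prints ['subject*free', 'subject*offer', 'free', 'lunch', 'today']
```

Because the tagged and untagged forms are different strings, a statistical filter that counts tokens will learn a separate spam probability for each without any other change.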

Others following Graham's lead are experimenting with variations that calculate the "spamminess" of messages differently. The open-source SpamBayes effort has produced an Outlook add-in (see the second URL below). Another free Outlook spam filter using a Bayesian technique is Spammunition, currently in beta. Spam Bully provides a commercial solution.

John Graham-Cumming, the author of POPFile, another open-source project (this one a mail proxy server using a Bayesian filter), reported to the MIT conference that, however well statistical filters might work, parsing email messages so that such filters can analyze them will continue to be a hard job. Technically savvy spammers constantly devise new ways to make their messages easy for a user to read but difficult for a program to analyze.
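As an illustration of the parsing problem, consider an HTML message in which invisible markup splits a word: the reader sees "Free", but a naive tokenizer sees two meaningless fragments. This trick is an assumed example, not one cited from the conference. The Python sketch below, with a hypothetical visible_text function, recovers roughly what a mail client would display. Regular expressions handle only the simplest cases, and real messages need far more robust parsing, which is the difficulty described above.

```python
import html
import re

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)  # HTML comments, even multi-line
TAG = re.compile(r"<[^>]+>")                    # any remaining markup tag


def visible_text(raw_html):
    """Approximate the text a reader sees in a simple HTML message."""
    without_comments = COMMENT.sub("", raw_html)
    without_tags = TAG.sub("", without_comments)
    # Decode entities last so that "&lt;b&gt;" survives as literal text.
    return html.unescape(without_tags)


print(visible_text("Fr<!-- x -->ee <b>mo</b>ney &amp; more"))
# prints Free money & more
```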

"A Plan for Spam" http://www.paulgraham.com/spam.html

SpamBayes http://spambayes.sourceforge.net

Spammunition http://www.upserve.com/spammunition/default.asp

Spam Bully http://spambully.com

POPFile http://popfile.sourceforge.net

Windows is a trademark of the Microsoft group of companies. Windows IT Pro is used by Penton Media Inc. under license from owner.