All Things Email

Spam Filtering Using Statistical Data Compression Models

by Andrej Bratko, Bogdan Filipic, Gordon V. Cormack, Thomas R. Lynam, Blaz Zupan

2006
Language: English

Note: Journal of Machine Learning Research.

Abstract

Spam filtering poses a special problem in text categorization, of which the defining characteristic is that filters face an active adversary, which constantly attempts to evade filtering. Since spam evolves continuously and most practical applications are based on online user feedback, the task calls for fast, incremental and robust learning algorithms. In this paper, we investigate a novel approach to spam filtering based on adaptive statistical data compression models. The nature of these models allows them to be employed as probabilistic text classifiers based on character-level or binary sequences. By modeling messages as sequences, tokenization and other error-prone preprocessing steps are omitted altogether, resulting in a method that is very robust. The models are also fast to construct and incrementally updateable. We evaluate the filtering performance of two different compression algorithms: dynamic Markov compression and prediction by partial matching. The results of our extensive empirical evaluation indicate that compression models outperform currently established spam filters, as well as a number of methods proposed in previous studies.
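
The idea of using a compression model as a classifier can be illustrated with a small sketch. The code below is not the paper's method: it replaces dynamic Markov compression and prediction by partial matching with a much simpler fixed-order character model using add-one smoothing. The names CharModel, code_length and classify, the context order of 2, and the alphabet size of 256 are all assumptions made for illustration. A message is assigned to the class whose model would encode it in fewer bits, and a model is updated incrementally by feeding it one more message.

```python
import math
from collections import defaultdict


class CharModel:
    """Fixed-order character model with add-one smoothing.

    A simplified stand-in for an adaptive compression model;
    it is NOT an implementation of PPM or DMC.
    """

    def __init__(self, order=2, alphabet_size=256):
        self.order = order                  # context length (assumed value)
        self.alphabet_size = alphabet_size  # assumed symbol count for smoothing
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def update(self, text):
        """Incrementally add one message to the model's statistics."""
        padded = " " * self.order + text
        for i in range(self.order, len(padded)):
            ctx = padded[i - self.order:i]
            self.counts[ctx][padded[i]] += 1
            self.totals[ctx] += 1

    def code_length(self, text):
        """Approximate number of bits needed to encode text under the model."""
        padded = " " * self.order + text
        bits = 0.0
        for i in range(self.order, len(padded)):
            ctx = padded[i - self.order:i]
            seen = self.counts.get(ctx, {}).get(padded[i], 0)
            p = (seen + 1) / (self.totals.get(ctx, 0) + self.alphabet_size)
            bits -= math.log2(p)
        return bits


def classify(message, spam_model, ham_model):
    """Pick the class whose model gives the shorter code length."""
    if spam_model.code_length(message) < ham_model.code_length(message):
        return "spam"
    return "ham"


# Toy training data, invented for this example only.
spam_model = CharModel()
ham_model = CharModel()
for msg in ["buy cheap pills now", "cheap pills buy now online"]:
    spam_model.update(msg)
for msg in ["meeting at noon tomorrow", "see you at the meeting"]:
    ham_model.update(msg)

print(classify("cheap pills", spam_model, ham_model))
print(classify("meeting tomorrow", spam_model, ham_model))
```

Because the model works directly on characters, no tokenizer is involved, which mirrors the robustness argument in the abstract. Learning from user feedback amounts to calling update on the relevant model with the newly labeled message.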

Creative Commons. Some Rights Reserved.
Copyright © 2004 Jochen Topf
Unless otherwise noted the contents on this site are licensed under the
Creative Commons Attribution-ShareAlike License.