TREC 2005 Spam Track Overview
by Gordon V. Cormack, Thomas R. Lynam
2005
Language: English
Note: Proc. TREC 2005 - the Fourteenth Text REtrieval Conference, Gaithersburg, 2005.
Abstract
TREC's Spam Track introduces a standard testing framework that presents a chronological sequence of email messages, one at a time, to a spam filter for classification. The filter yields a binary judgement (spam or ham, i.e. non-spam), which is compared to a human-adjudicated gold standard. The filter also yields a spamminess score, intended to reflect the likelihood that the classified message is spam, which is the subject of post-hoc ROC (Receiver Operating Characteristic) analysis. The gold standard for each message is communicated to the filter immediately following classification. Eight test corpora, each consisting of email messages plus gold standard judgements, were used to evaluate 53 subject filters. Five of the corpora (the public corpora) were distributed to participants, who ran their filters on the corpora using a track-supplied toolkit implementing the framework. Three of the corpora (the private corpora) were not distributed to participants; rather, participants submitted filter implementations that were run, using the toolkit, on the private data. Twelve groups participated in the track, submitting 44 filters for evaluation. The other nine subject filters were variants of popular open-source implementations adapted for use in the toolkit in consultation with their authors.
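
The evaluation loop described in the abstract can be illustrated with a minimal Python sketch. This is not the track-supplied toolkit or its interface: the names KeywordFilter, run_online, misclassification_rates and roc_auc are hypothetical, the token-counting filter is a toy stand-in for a real spam filter, and the area under the ROC curve is used here as one common summary of ROC analysis (the abstract does not say which statistic is reported).

```python
from dataclasses import dataclass, field


@dataclass
class KeywordFilter:
    """Toy filter (hypothetical): scores tokens by how often they were seen in spam vs. ham."""
    spam_counts: dict = field(default_factory=dict)
    ham_counts: dict = field(default_factory=dict)

    def classify(self, text):
        # Spamminess score: spam occurrences minus ham occurrences of each token.
        score = 0.0
        for tok in text.lower().split():
            score += self.spam_counts.get(tok, 0) - self.ham_counts.get(tok, 0)
        return ("spam" if score > 0 else "ham"), score

    def train(self, text, gold):
        counts = self.spam_counts if gold == "spam" else self.ham_counts
        for tok in text.lower().split():
            counts[tok] = counts.get(tok, 0) + 1


def run_online(filter_, messages):
    """Present (text, gold) pairs in chronological order, one at a time."""
    results = []
    for text, gold in messages:
        judgement, score = filter_.classify(text)
        results.append((gold, judgement, score))
        # The gold standard is revealed only after the message is classified.
        filter_.train(text, gold)
    return results


def misclassification_rates(results):
    """Return (ham misclassification rate, spam misclassification rate)."""
    ham = [r for r in results if r[0] == "ham"]
    spam = [r for r in results if r[0] == "spam"]
    ham_rate = sum(1 for r in ham if r[1] == "spam") / len(ham)
    spam_rate = sum(1 for r in spam if r[1] == "ham") / len(spam)
    return ham_rate, spam_rate


def roc_auc(results):
    """Fraction of (spam, ham) pairs where spam scores higher; ties count half."""
    spam_scores = [s for g, _, s in results if g == "spam"]
    ham_scores = [s for g, _, s in results if g == "ham"]
    wins = 0.0
    for s in spam_scores:
        for h in ham_scores:
            wins += 1.0 if s > h else 0.5 if s == h else 0.0
    return wins / (len(spam_scores) * len(ham_scores))


# Made-up messages for illustration only, in chronological order.
messages = [
    ("buy cheap pills", "spam"),
    ("meeting at noon", "ham"),
    ("cheap pills now", "spam"),
    ("lunch meeting at noon", "ham"),
]
results = run_online(KeywordFilter(), messages)
print(misclassification_rates(results), roc_auc(results))
```

Because the filter learns only from messages it has already classified, the first spam message in this toy sequence is necessarily judged with no prior evidence. Sequential evaluation of this kind measures a filter while it is still learning, which a single train/test split would not show.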
