Trends in Spam Products and Methods
by Geoff Hulton, Anthony Penta, Gopalakrishnan Seshadrinathan, Manav Mishra
Conference on Email and Anti-Spam, 2004-07-30
Language: English
Note: Published at CEAS 2004.
Abstract
In this paper we analyze a very large junk e-mail corpus generated by a hundred thousand volunteer users of the Hotmail e-mail service. We describe how the corpus is being collected and then discuss how both the products advertised by spam and the specific exploits used to avoid spam filters have changed over time.

Every day we randomly select one message from the mail stream of each Hotmail volunteer and ask that user to classify it for us. Thanks to these users, we have received tens of thousands of hand-classified messages per day, every day for the past year; our database currently contains over ten million classified messages. In this paper we further analyze two samples of the spam from this data, one from early 2003 and one from early 2004. We categorized the spam by the type of product it sells and by the types of exploits it uses to avoid spam filters.

We are aware of very few other large-scale studies of spam. One is the FTC report on false claims in spam [1]. Our study differs in three ways: it uses data sets created by randomly sampling the entire mail stream, rather than relying on users to report e-mail that offended them; it reports changes in spam data over time; and it reports on more categories of spammer exploits. Another relevant large-scale study is our analysis of the geographic origins of spam.
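The daily sampling step described above (one randomly chosen message per volunteer per day) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sample_one_per_user`, the `(user_id, message_id)` input format, and the uniform choice over each user's messages for the day are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_one_per_user(messages, rng=None):
    """Pick one message uniformly at random for each user.

    `messages` is an iterable of (user_id, message_id) pairs covering one
    day of mail. Both the input format and this function are hypothetical.
    """
    rng = rng or random.Random()

    # Group the day's message ids by the volunteer who received them.
    by_user = defaultdict(list)
    for user_id, message_id in messages:
        by_user[user_id].append(message_id)

    # Choose one message per user; users with no mail that day are absent.
    return {user: rng.choice(ids) for user, ids in by_user.items()}


if __name__ == "__main__":
    day = [("alice", "m1"), ("alice", "m2"), ("bob", "m3")]
    print(sample_one_per_user(day))
```

Because the choice is made over each user's whole mail stream rather than over messages users chose to report, the labels that come back cover legitimate mail as well as spam, which is the property that distinguishes this corpus from complaint-driven data sets.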
