Learning to Extract Signature and Reply Lines from Email
by Vitor R. Carvalho, William W. Cohen
Conference on Email and Anti-Spam,
2004-07-30
Language:
English
Note: Published at CEAS 2004.
Abstract
We describe methods for automatically identifying signature blocks and reply lines in plain-text email messages. This analysis has many potential applications, such as preprocessing email for text-to-speech systems; anonymization of email corpora; improving automatic content-based mail classifiers; and email threading. Our method is based on applying machine learning methods to a sequential representation of an email message, in which each email is represented as a sequence of lines, and each line is represented as a set of features. We compare several state-of-the-art sequential and non-sequential machine learning algorithms on different feature sets, and present experimental results showing that the presence of a signature block in a message can be detected with accuracy higher than 97%; that signature block lines can be identified with accuracy higher than 99%; and that signature block and reply lines can be simultaneously identified with accuracy of higher than 98%.
