An email segmentation system:
- reference implementation of the ECIR 2018 paper
- annotated datasets
- newly collected ASF email corpus, annotated by email zones only
- selection of Enron corpus, annotated by email zones only
- selection of Enron corpus, detailed annotation (including names, aliases, metadata)
- annotated using Enno; utility classes for reading the format are included here
Repke, Tim and Krestel, Ralf. Bringing Back Structure to Free Text Email Conversations with Recurrent Neural Networks. ECIR 2018.
Email communication plays an integral part of everybody’s life nowadays. Especially for business emails, extracting and analysing these communication networks can reveal interesting patterns of processes and decision making within a company. Fraud detection is another application area where precise detection of communication networks is essential. In this paper we present an approach based on recurrent neural networks to untangle email threads originating from forward and reply behaviour. We further classify parts of emails into 2 or 5 zones to capture not only header and body information but also greetings and signatures. We show that our deep learning approach outperforms state-of-the-art systems based on traditional machine learning and hand-crafted rules. Besides using the well-known Enron email corpus for our experiments, we additionally created a new annotated email benchmark corpus from Apache mailing lists.
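To make the relationship between the 2-zone and 5-zone schemes concrete, here is a minimal sketch of how per-line zone labels could be collapsed and grouped into blocks. It is not the code from this repository: the label names (in particular `signoff` as the fifth zone), the `FIVE_TO_TWO` mapping, and the helper functions are assumptions made for illustration, and the labels are written by hand rather than predicted by a model.

```python
from itertools import groupby

# Hypothetical label names; the label set used by the paper's code may differ.
FIVE_TO_TWO = {
    "header": "header",
    "greeting": "body",
    "body": "body",
    "signoff": "body",
    "signature": "body",
}


def to_two_zones(labels):
    """Collapse fine-grained (5-zone) line labels into header/body labels."""
    return [FIVE_TO_TWO[label] for label in labels]


def group_blocks(lines, labels):
    """Group consecutive lines that share the same 2-zone label."""
    pairs = zip(to_two_zones(labels), lines)
    return [
        (zone, [line for _, line in group])
        for zone, group in groupby(pairs, key=lambda pair: pair[0])
    ]


# Hand-labelled toy email: a reply followed by the quoted original message.
lines = [
    "Hi Bob,",
    "the report is attached.",
    "Best,",
    "Alice",
    "-----Original Message-----",
    "From: Bob",
    "Can you send the report?",
]
labels = [
    "greeting", "body", "signoff", "signature",
    "header", "header", "body",
]

blocks = group_blocks(lines, labels)
for zone, block_lines in blocks:
    print(zone, block_lines)
```

Each `header` block marks the start of a quoted message, so the alternating header/body blocks give a rough picture of how a thread can be split into its individual emails.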
- Original code for Jangada (Carvalho, 2004)
- More information and data for Jangada (600+ annotated emails in the 20 Newsgroups dataset)
- MinorThird Library used by Jangada
- 400 annotated emails by Lampert et al. (Enron data)
- Zebra System for email zoning
- Another implementation of Zebra
- Talon is an awesome universal tool for everything that has to do with email structure