This is the dataset used in the paper "Content Classification of Development Emails", also known as the Mucca paper. You can read the paper here.
The dataset itself is in the data/ directory, which contains three entries:
mucca-dataset.json
the full dataset in JSON format, containing the emails with references to the original mailing lists, the original raw text, and the classified sentences.
mucca-dataset-pretty.json
a pretty-printed copy of mucca-dataset.json
mucca-dataset.csv
a list of the classified sentences, with reference to the original email. Each line contains, in this order, the fields:
id
- a unique progressive id of the sentence
email_id
- the reference to the id of the email
classification
- the manual classification of the sentence. The possible values are: text, signature, junk, link, stacktrace, code, patch
sentence
- the text of the classified sentence
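As a rough illustration of the field order described above, the following Python sketch reads the CSV and counts sentences per classification. The comma delimiter, the UTF-8 encoding, and the possibility of a header row are assumptions, not facts stated here; the helper names read_sentences and count_by_classification are hypothetical.

```python
import csv
from collections import Counter

# Field order as listed above. The comma delimiter is an assumption.
FIELDS = ["id", "email_id", "classification", "sentence"]


def read_sentences(lines):
    """Yield one dict per classified sentence from an iterable of CSV lines."""
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        if [cell.strip() for cell in row] == FIELDS:
            continue  # skip a header row, if the file happens to have one
        if len(row) < len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} fields, got {len(row)}: {row!r}")
        # If a sentence contained unquoted commas, rejoin the extra cells.
        head, tail = row[:3], row[3:]
        yield dict(zip(FIELDS, head + [",".join(tail)]))


def count_by_classification(lines):
    """Return a Counter mapping each classification label to its frequency."""
    return Counter(record["classification"] for record in read_sentences(lines))


# Possible usage (path taken from the directory listing above):
#   with open("data/mucca-dataset.csv", newline="", encoding="utf-8") as f:
#       for label, n in count_by_classification(f).most_common():
#           print(label, n)
```

Both helpers accept any iterable of lines, so they work on an open file as well as on an in-memory sample.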
The Jupyter Notebook contains the operations that generate the CSV from the JSON, should you need to customize your dataset.
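Before customizing the export, it can help to look at how the JSON is organized. The sketch below does not assume any particular key names, since the schema is not described here; it only walks the decoded value and reports its shape. The function name describe and its parameters are hypothetical.

```python
import json


def describe(node, indent=0, depth=3):
    """Return lines summarizing the shape of a decoded JSON value.

    Only the first item of each array is inspected, and nesting is
    followed at most `depth` levels down.
    """
    pad = " " * indent
    if isinstance(node, dict):
        lines = [f"{pad}object, {len(node)} keys"]
        if depth > 0:
            for key, value in node.items():
                lines.append(f"{pad}  {key}:")
                lines.extend(describe(value, indent + 4, depth - 1))
        return lines
    if isinstance(node, list):
        lines = [f"{pad}array, {len(node)} items"]
        if node and depth > 0:
            lines.extend(describe(node[0], indent + 2, depth - 1))
        return lines
    # Scalars: report the Python type the JSON value decoded to.
    return [f"{pad}{type(node).__name__}"]


# Possible usage (path taken from the directory listing above):
#   with open("data/mucca-dataset.json", encoding="utf-8") as f:
#       print("\n".join(describe(json.load(f))))
```

Once the actual key names are known from this summary, the notebook's conversion step can be adapted to keep or drop fields as needed.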
If you use this dataset, please cite it with the following reference:
@inproceedings{Bacc2012a,
Author = {Bacchelli, Alberto and Dal Sasso, Tommaso and D'Ambros, Marco and Lanza, Michele},
Booktitle = {In Proceedings of ICSE 2012 (34th ACM/IEEE International Conference on Software Engineering)},
Keywords = {pub-iene, proj-sosya},
Pages = {375--385},
Title = {Content Classification of Development Emails},
Year = {2012}}