mucca-dataset

The dataset used in the paper "Content Classification of Development Emails", a.k.a. the Mucca paper. You can read the paper here.

The dataset itself is in the data/ directory, which contains three entries:

mucca-dataset.json

The full dataset in JSON format. It contains the emails, with references to the original mailing lists, the original raw text, and the classified sentences.

mucca-dataset-pretty.json

A pretty-printed copy of mucca-dataset.json.

mucca-dataset.csv

A list of the classified sentences, each with a reference to its original email. Each line contains the following fields, in this order:

id - a unique progressive id of the sentence
email_id - the reference to the id of the email
classification - the manual classification of the sentence. The possible values are: text, signature, junk, link, stacktrace, code, patch
sentence - the text of the classified sentence
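
As a rough illustration of the field order above, the sketch below reads rows in that layout and counts sentences per classification. It is a minimal sketch, not part of the dataset tooling: the function name read_sentences and the has_header flag are hypothetical, the sample rows are made up, and whether the real file starts with a header row is an assumption you should check first.

```python
import csv
import io
from collections import Counter

# Field order as documented above.
FIELDS = ["id", "email_id", "classification", "sentence"]

# Classification values listed above.
LABELS = {"text", "signature", "junk", "link", "stacktrace", "code", "patch"}


def read_sentences(handle, has_header=False):
    """Yield one dict per classified sentence from an open CSV file.

    `has_header` is a hypothetical switch: set it to True if the file
    turns out to start with a header row.
    """
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # skip the header row
    for row in reader:
        if not row:
            continue  # ignore blank lines
        record = dict(zip(FIELDS, row))
        if record["classification"] not in LABELS:
            raise ValueError(f"unexpected classification: {record['classification']!r}")
        yield record


# Made-up sample rows for illustration; NOT taken from the dataset.
sample = io.StringIO(
    '1,10,text,"Hi all, I found a bug."\n'
    "2,10,code,int x = 0;\n"
    "3,11,signature,-- Alice\n"
)

records = list(read_sentences(sample))
counts = Counter(r["classification"] for r in records)
print(counts)
```

To run this against the real file, replace the in-memory sample with an open file handle for data/mucca-dataset.csv, opened with newline="" as the csv module recommends.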

The Jupyter Notebook contains the operations that generate the CSV from the JSON, should you need to customize your dataset.
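
For orientation, a customized export could look roughly like the sketch below, which flattens emails into CSV rows and keeps only selected classifications. This is not the notebook's code. The JSON key names used here ("id", "sentences", "classification", "sentence") are hypothetical placeholders, as are the function name json_to_csv and its keep parameter; check the real structure in mucca-dataset-pretty.json and adjust before use.

```python
import csv
import io
import json

# HYPOTHETICAL JSON layout, for illustration only. The real key names in
# mucca-dataset.json may differ; the sample content is made up.
sample_json = """
[
  {"id": 10, "sentences": [
      {"classification": "text", "sentence": "Hi all, I found a bug."},
      {"classification": "code", "sentence": "int x = 0;"}
  ]}
]
"""


def json_to_csv(emails, out, keep=None):
    """Write one CSV row per sentence: id, email_id, classification, sentence.

    `keep` is an optional set of classifications to retain; None keeps all.
    Returns the number of rows written.
    """
    writer = csv.writer(out)
    next_id = 1  # progressive sentence id, renumbered after filtering
    for email in emails:
        for s in email["sentences"]:
            if keep is not None and s["classification"] not in keep:
                continue
            writer.writerow([next_id, email["id"], s["classification"], s["sentence"]])
            next_id += 1
    return next_id - 1


emails = json.loads(sample_json)
buffer = io.StringIO()
written = json_to_csv(emails, buffer, keep={"text"})
csv_text = buffer.getvalue()
print(csv_text)
```

Note that this sketch renumbers the sentence ids after filtering, so they will not match the ids in the shipped mucca-dataset.csv; carry the original ids through instead if you need to cross-reference the two files.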

Citing the dataset

If you use this dataset, please cite it with the following reference:

@inproceedings{Bacc2012a,
	Author = {Bacchelli, Alberto and Dal Sasso, Tommaso and D'Ambros, Marco and Lanza, Michele},
	Booktitle = {In Proceedings of ICSE 2012 (34th ACM/IEEE International Conference on Software Engineering)},
	Keywords = {pub-iene, proj-sosya},
	Pages = {375--385},
	Title = {Content Classification of Development Emails},
	Year = {2012}}
