Dainis Boumber (dboumber@uh.edu), Yifan Zhang (yzhang114@uh.edu), Arjun Mukherjee (arjun@cs.uh.edu)
University of Houston
Publication pending review.
We explore the use of Convolutional Neural Networks (CNNs) for multi-label Authorship Attribution (AA) problems and propose a CNN specifically designed for such tasks. By treating smaller documents as sentences and averaging the author probability distributions at sentence level for the longer documents, our design adapts to single-label datasets and various document sizes, retaining the capabilities of a traditional CNN. As a part of this work, we also create and make available to the public a multi-label Authorship Attribution dataset (MLPA-400) , consisting of 400 scientific publications by 20 authors from the field of Machine Learning. Experimental results demonstrate that our method outperforms several state-of-the-art models on the proposed task.
Prerequisits: scikit-learn, tensoflow v1.0, python3
Run python aa.py
for a sample set of experiments.
Any additional data is to be placed under datahelpers/data
(default, can be changed)
Machine Learning Papers' Authorship (MPLA-400) dataset contains 20 publications by each of the top-20 authors in ML, for the total of 400.
The data is located in ./ml_dataset
directory. You can also obtain it as a tarball from
https://drive.google.com/open?id=0B_LjdXSWGw1YR2dIek95bGFfZEE
Labels.csv
contains the ground truths in the following format: ,<author_1>,...<author_20>\n
is plain text and <author_n> is a digit 0 or 1 indicating whether this author is one of the co-authors. The first row is the header row.
See MLPA-400 for more details.