Commit

More details in README
Charles Marsh committed May 13, 2014
1 parent aa819a3 commit 17a3f93
Showing 2 changed files with 12 additions and 5 deletions.
13 changes: 10 additions & 3 deletions README.md
@@ -1,13 +1,14 @@

# MAD Topic Model

Wordless authorship recognition using topic models.
Wordless authorship recognition using topic models. MAD implements a multivalent supervised Latent Dirichlet Allocation algorithm that operates on vocabularies of n-gram stylistic and stylometric features, such as part-of-speech tags and syllable counts. MAD can be used both for author classification and for exploratory analysis, as the generated topic models reveal hidden structure among authors' writing styles.

## Dependencies

See requirements.txt for Python dependencies.

The following NLTK libraries are required:
In addition, the following NLTK libraries are required:

- cmudict
- punkt
- maxent_treebank_pos_tagger
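
If these resources are not yet present, a minimal way to fetch them (assuming NLTK itself is already installed via `requirements.txt`) is from a Python shell:

```python
# One-time download of the NLTK data packages listed above.
import nltk

for resource in ("cmudict", "punkt", "maxent_treebank_pos_tagger"):
    nltk.download(resource)
```
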
@@ -24,4 +25,10 @@ The `slda_input_files` folder contains files in a generalized format ready to be

## Models

The MAD model is implemeneted in models/slda. Our implementation Chong Wang's sLDA implementation at http://www.cs.cmu.edu/~chongw/slda/. The algorithm is mostly coded in slda.cpp, with help from opt.cpp and dirichlet.c for computing difficult gradients. settings.h contains a (for the time being) rather disorganized list of settings, allowing for different inference schemes (variational, stochastic, and the orginal fixed point/variational combination), regularization - L1 (not fully tested. uses http://www.chokkan.org/software/liblbfgs/), L2, and smoothing for the Dirichlet MLE and per-topic vocabulary distributions. settings.h also controls the number of iterations the algorithm should run. For some reason, the settings.h is having difficulty reading the settings.txt input file, so it is best to hard code the desired settings into settings.h. Another README is provided within models/slda explaining how to run the software from command line.
The MAD model is implemented in `models/slda`. Our implementation is based on that of [Chong Wang](http://www.cs.cmu.edu/~chongw/slda/). The algorithm is mostly contained in `slda.cpp`, with help from `opt.cpp` and `dirichlet.c` for computing difficult gradients.

`settings.h` contains a list of settings, allowing for different inference schemes (variational, stochastic, and the original fixed-point/variational combination), regularization (L1, which is not fully tested and uses [liblbfgs](http://www.chokkan.org/software/liblbfgs/), and L2), and smoothing for the Dirichlet MLE and per-topic vocabulary distributions. `settings.h` also controls the number of iterations the algorithm should run. At present, `settings.h` has difficulty reading the `settings.txt` input file, so it is best to hard-code the desired settings into `settings.h`. A supplemental README is provided within `models/slda` with information on how to run the software from the command line.

## Feature Extraction

Python modules are provided for extracting the necessary stylistic features from text, such as part-of-speech tags and syllable counts. The majority of the feature-extraction functions can be found in `features/analyzer.py`. In addition, a novel technique for extracting meter 8-grams is provided in `features/meter.py`. To allow for further analysis (after n-gram generation), `features/extract.py` provides functions for identifying text snippets within documents that match a given stylistic n-gram.
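
As a rough sketch of the kind of features these modules produce (the helpers below are hypothetical and not the actual `features/analyzer.py` API), POS tags and per-word syllable counts can be extracted with NLTK and composed into n-grams:

```python
# Illustrative sketch: extract POS-tag and syllables-per-word sequences
# from a text and form n-grams over them. The project's real extraction
# code lives in features/analyzer.py and may differ.
import nltk
from nltk.corpus import cmudict
from nltk.util import ngrams

PRONUNCIATIONS = cmudict.dict()  # word -> list of phone sequences

def syllables(word):
    """Syllable count via the CMU Pronouncing Dictionary (first pronunciation)."""
    phones = PRONUNCIATIONS.get(word.lower())
    if phones is None:
        return None  # per the writeup, absent words fall back to a nearest-key lookup
    return sum(phone[-1].isdigit() for phone in phones[0])

def stylometric_ngrams(text, n=2):
    tokens = nltk.word_tokenize(text)                  # needs the 'punkt' model
    pos_tags = [tag for _, tag in nltk.pos_tag(tokens)]
    syllable_seq = [syllables(tok) for tok in tokens]
    return list(ngrams(pos_tags, n)), list(ngrams(syllable_seq, n))

pos_bigrams, syllable_bigrams = stylometric_ngrams(
    "The quick brown fox jumps over the lazy dog.")
```
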
4 changes: 2 additions & 2 deletions report/writeup.tex
@@ -80,9 +80,9 @@ \section{Feature Extraction}
We incorporated six different stylometric features, each of which was composed into $n$-grams of varying sizes before being fed into the model:
\begin{enumerate}
\item Part-of-Speech (POS) tags (e.g., `Noun' for the word ``apple''). The Penn-Treebank tag set was used, and tagging was performed using a Maximum Entropy approach \citep{Ratnaparkhi}.
\item Etymological tags (e.g., `Old English' for the word ``great''). Etymological information was scraped from \textit{Webster's} Dictionary \citep{Dictionary}. As etymology is inherently root-based, words absent from the dataset were first stemmatized using the method of \citet{Porter} and lemmatized using the WordNet method of \citet{Fellbaum}. If either of these roots were present in the dictionary, their corresponding etymological tag was returned. Else, the entry with minimum Levenshtein distance \citep{Levenshtein} was used instead.
\item Etymological tags (e.g., `Old English' for the word ``great''), a relatively novel feature that captures the `formality' of the writing style. Etymological information was scraped from \textit{Webster's} Dictionary \citep{Dictionary}. As etymology is inherently root-based, words absent from the dataset were first stemmed using the method of \citet{Porter} and lemmatized using the WordNet method of \citet{Fellbaum}. If either of these roots was present in the dictionary, its corresponding etymological tag was returned. Otherwise, the entry with minimum Levenshtein distance \citep{Levenshtein} was used instead.
\item Syllables-per-word (i.e., `3' for ``continue''). Syllables were extracted from the CMU Pronouncing Dictionary \citep{Lenzo}. As with etymology, words absent from the dictionary were looked up by minimizing Levenshtein distance with the present keys.
\item Syllable counts, i.e., the total number of syllables between pieces of punctuation, along with the separating punctuation marks.
\item Syllable counts, i.e., the total number of syllables between pieces of punctuation.
\item Word counts, i.e., the total number of words between pieces of punctuation.
\end{enumerate}On top of these primitives, we also developed an algorithm to extract meter, which is outlined in Section~\ref{appendix:meter} of the Appendix. In total, this comprised six stylometric features. For each document, we extracted these features and generated the relevant $2$-, $3$-, and $4$-grams (apart from meter, for which only $8$-grams were produced, as described in the Appendix).
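
The dictionary lookup with fallback used above for etymology and syllables (try the word, then its stem and lemma, then the nearest key by edit distance) can be sketched as follows; this is an illustrative sketch, not the project's actual implementation, and the helper names are hypothetical:

```python
# Sketch of the out-of-dictionary fallback described in the writeup:
# try the word, its Porter stem, and its WordNet lemma; if none is a
# key, take the key with minimum Levenshtein (edit) distance.
# (The WordNet lemmatizer needs the NLTK 'wordnet' data.)
from nltk.stem import PorterStemmer, WordNetLemmatizer

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def lookup(word, dictionary):
    candidates = (word, PorterStemmer().stem(word), WordNetLemmatizer().lemmatize(word))
    for key in candidates:
        if key in dictionary:
            return dictionary[key]
    nearest = min(dictionary, key=lambda k: levenshtein(word, k))
    return dictionary[nearest]
```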
