# Generalized Language Model Toolkit
The software can be used to compute a Generalized Language Model (GLM), which is another way to estimate a language model. As shown in the publication cited below, Generalized Language Models can outperform Modified Kneser-Ney Smoothing by 10 to 25% in terms of perplexity.

    git clone git@github.com:renepickhardt/generalized-language-modeling-toolkit.git
    sudo chmod a+x mvn.sh
You will need to install Maven in order to build the project.

    sudo apt-get install maven2
You need to copy `config.sample.txt` to `config.txt` and read the instructions in `config.sample.txt`.

    cp config.sample.txt config.txt
    emacs config.txt
After you have set all your directories in `config.txt`, you can run the project:

    ./mvn.sh
Since Generalized Language Models can become very large, the software is written to use the hard disk. In this sense you can theoretically run the program with very little memory. Still, we recommend 16 GB of main memory for the large English Wikipedia data sets.
We tried to avoid frequent disk accesses. Still, the program will execute much faster if you store your data on a solid-state disk.
You need to have a file called `normalized.txt`, which serves as your input. This file should contain one sentence per line. You will learn language models based on this file.
Please refer to http://glm.rene-pickhardt.de/data to download preprocessed and formatted data sets.
If you wish to parse the data yourself (e.g. because you want to use a newer Wikipedia dump), refer to https://github.com/mkrnr/lexer-parser
You have to start with a file called `normalized.txt`, which has to be stored in your data directory (according to `config.txt`). `mvn.sh` will compile the program and start the flow of the following steps (which can be configured by switching the fields in `config.txt` from `true` to `false`):
- splitting `normalized.txt` into `training.txt` and `testing.txt` according to the datasplit parameters in `config.txt`
- building a word index `index.txt`; this index is used to split the language models into files of equal size
- creating absolute counts and continuation counts in the directories `absolute` and `continuation`
  - the various models are stored in folders like `11111`, meaning a regular 5-gram, or `11011`, meaning a skipped 5-gram at the third position
- creation of testing samples from `testing.txt`: `testing-samples-4.txt`, for example, contains about 100k sequences of 4 words to be tested
- calculating the D and N values for Modified Kneser-Ney Smoothing and making them persistent in the two `*.ser` files (for speeding up various tests)
- running the experiments by creating files like `mod-kneser-ney-complex-backoffToCont-3.txt`: depending on your configuration the files could be named with `simple` instead of `complex` (complex meaning GLM, simple meaning LM). By exchanging the `3` you can have different model lengths. These files contain the testing samples with the log of their probabilities.
- you have to manually calculate the entropy by running the python script; as an argument you might want to pass `mod*.txt`: in this way you can calculate the entropy for all files and experiments.

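
Two of the steps above lend themselves to a small illustration: how a folder name such as `11011` maps onto a word sequence, and how an entropy value follows from a list of log probabilities. The following is a minimal Python sketch, not the toolkit's own code. The function names `apply_pattern` and `entropy_bits`, the `*` wildcard, and the sample numbers are all hypothetical, and the base of the logarithms written by the toolkit is something you would need to check yourself.

```python
import math


def apply_pattern(words, pattern):
    """Apply a 0/1 pattern such as "11011" to a word sequence.

    A "1" keeps the word at that position, a "0" skips it. The "*"
    placeholder is an illustrative choice, not the toolkit's file format.
    """
    if len(words) != len(pattern):
        raise ValueError("pattern and sequence must have the same length")
    return [word if bit == "1" else "*" for word, bit in zip(words, pattern)]


def entropy_bits(log_probs, base):
    """Average negative log probability per sample, expressed in bits.

    `base` is the base of the logarithms in `log_probs` (an assumption
    the caller has to supply; it is not taken from the toolkit).
    """
    if not log_probs:
        raise ValueError("need at least one log probability")
    mean_log_prob = sum(log_probs) / len(log_probs)
    # log2(p) = log_base(p) * log2(base)
    return -mean_log_prob * math.log2(base)


sequence = "the quick brown fox jumps".split()
print(apply_pattern(sequence, "11111"))  # regular 5-gram, nothing skipped
print(apply_pattern(sequence, "11011"))  # third position skipped

# Made-up base-2 log probabilities of three testing samples.
sample_log_probs = [-1.0, -2.0, -3.0]
h = entropy_bits(sample_log_probs, base=2)
print("entropy (bits):", h)
print("perplexity:", 2 ** h)
```

Perplexity is reported here as 2 raised to the entropy in bits, which is the usual convention; the script shipped with the toolkit may report its results differently.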
If this software or data is of any help to your research, please cite the original publication, which is also in the home directory of [this git repository](https://github.com/renepickhardt/generalized-language-modeling-toolkit/raw/master/A Generalized Language Model as the Combination of Skipped n-grams and Modified Kneser-Ney Smoothing.pdf). You might want to use the following BibTeX entry:

    @inproceedings{Pickhardt:2014:GLM,
      author = {Pickhardt, Rene and Gottron, Thomas and Körner, Martin and Wagner, Paul Georg and Speicher, Till and Staab, Steffen},
      title = {A Generalized Language Model as the Combination of Skipped n-grams and Modified Kneser Ney Smoothing},
      year = {2014},
      booktitle = {ACL'14: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics},
    }
The Generalized Language Models evolved from Paul Georg Wagner's and Till Speicher's Young Scientists project called Typology, which I advised in 2012. The Typology project explored and evaluated an idea I had (inspired by the PhD thesis of Adam Schenker) of representing text as a graph in which the edges encode relationships (nowadays known as skipped bi-grams). The graph was used to produce an answer to the next-word prediction problem, applied to word suggestions in keyboards of modern smartphones. From the convincing results I developed the theory of Generalized Language Models. Most of the code was written by my student assistant Martin Körner, who also wrote his bachelor thesis about the implementation of a preliminary version of the Generalized Language Models. This thesis is a nice reference if you want to get an understanding of Modified Kneser-Ney Smoothing for standard language models. In terms of notation and building of Generalized Language Models it is outdated.
If you have questions, feel free to contact me via the issue tracker. You can find my mail address on my blog or in the paper.