A POS-tagging library with Viterbi, CYK and an SVO -> XSV translator (English to Yodish), made as part of my final exam for the Cognitive System course in the Department of Computer Science. This README is a rough translation of README_ita.md, made in nightly-build mode, so please excuse any typos. The entire project is commented in Italian, and so are the log and CLI messages, because I'm from Italy and I was a newbie. One day I will translate it, promised :D
Runs on Python 2. It has the following dependencies:
- nltk==3.2.4
- numpy==1.13.1
- scipy==0.19.1
- six==1.10.0
To run the exercises, run the following from a shell (optionally inside a virtualenv):
git clone https://github.com/made2591/cognitive-system-postagger.git
cd cognitive-system-postagger
pip install -r requirements.txt
python main.py
The main program will guide you: choose the exercise number and the configuration to run.
By default the program uses a logger, which shows the main steps of each procedure as it runs. To change the file locations, enable file logging or change the verbosity, please refer to the config file.
The files with the core algorithms are:
- viterbi.py
- cky.py
- xsv.py
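For readers who have not met Viterbi decoding before, here is a minimal sketch of the idea for a bigram HMM tagger. It is not the code in viterbi.py: the function name, the `unk_p` fallback for unseen words and the toy probabilities are all assumptions made for illustration.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, unk_p=1e-6):
    """Return the most probable tag sequence for `words` (toy bigram HMM)."""
    if not words:
        return []

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[i][t]: log-probability of the best tag path ending in t at word i
    best = [{t: log(start_p.get(t, 0)) + log(emit_p[t].get(words[0], unk_p))
             for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            emit = log(emit_p[t].get(words[i], unk_p))
            prev, score = max(
                ((p, best[i - 1][p] + log(trans_p[p].get(t, 0)) + emit)
                 for p in tags),
                key=lambda pair: pair[1])
            best[i][t] = score
            back[i][t] = prev

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))


# Hypothetical toy model, only to show the call.
tags = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.2, "VERB": 0.7},
    "VERB": {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1},
}
emit_p = {
    "DET": {"the": 0.9},
    "NOUN": {"dog": 0.5, "barks": 0.1},
    "VERB": {"barks": 0.6},
}
print(viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p))
# ['DET', 'NOUN', 'VERB']
```

Working in log space avoids underflow on long sentences, which is why the sketch adds logarithms instead of multiplying probabilities.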
In the delivery / report folders there is a copy of the delivery PDF (without the course slides) and the reports for the three exercises. The reports contain a detailed description of what has been implemented and the results obtained, plus some reflections on those results and some attempts to explain which of them were expected and which were not.
NOTE 1: Some of the config parameters need no explanation here, because the program asks for them at run time. Parameters that are not asked for at run time are listed in the config file and mainly concern the paths of the files used for dumping data structures and for logging.
NOTE 2: Viterbi training, with certain configurations, can take a very long time. In the dump folder there is a file named:
v.1.3.2.viterbi_training.lst
This file contains the training dump for Viterbi with configuration 1.3.2 (the run-time prompts and the reports explain what these numbers mean).
Viterbi runs with this configuration: unless explicitly requested, the training is not re-executed. Testing, instead, is performed at every run, using either the loaded training dump or a newly created one. Testing does not take long (over 400 sentences) and progress messages are printed while it runs: in any case it takes a couple of minutes.
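The "reuse the dump unless retraining is explicitly requested" behaviour can be sketched as follows. This is only an illustration under assumptions: `load_or_train`, its parameters and the use of `pickle` are hypothetical, and the actual on-disk format of the `.lst` dump may differ.

```python
import os
import pickle


def load_or_train(config, dump_dir, train_fn, force=False):
    """Reuse a saved training dump unless retraining is explicitly requested.

    `config` is a string such as "1.3.2"; `train_fn` is any callable that
    returns the trained structure (both are assumptions of this sketch).
    """
    # File name pattern taken from the dump shipped with the project.
    path = os.path.join(dump_dir, "v.%s.viterbi_training.lst" % config)
    if not force and os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)  # assumed format: a pickled object
    model = train_fn()
    with open(path, "wb") as fh:  # overwrites the old dump, if any
        pickle.dump(model, fh)
    return model
```

Calling `load_or_train("1.3.2", "dump", train_fn)` twice would train only the first time; passing `force=True` retrains and overwrites the dump.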
Training, however, takes a long time, especially the computation of the distribution of words that appear only once. In any case, you can always overwrite the training dump with a new one: for each run with configuration x.x.x, if training is explicitly requested, the program overwrites the old training dump (if any), so that it can be reloaded in the future, in a file named:
x.x.x.viterbi_training.lst
It is therefore possible to keep dumps for several Viterbi configurations in the same folder.
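As an illustration of the "words that appear only once" step mentioned above, the sketch below estimates a tag distribution from singleton words, a common way to guess the tag of words never seen in training. Whether the project uses the distribution in exactly this way is an assumption, and the function name and toy data are hypothetical.

```python
from collections import Counter


def singleton_tag_distribution(tagged_sentences):
    """Estimate P(tag | unknown word) from words seen exactly once."""
    word_counts = Counter(w for sent in tagged_sentences for w, _ in sent)
    tag_counts = Counter(t for sent in tagged_sentences
                         for w, t in sent if word_counts[w] == 1)
    total = float(sum(tag_counts.values()))
    if not total:
        return {}
    return {t: c / total for t, c in tag_counts.items()}


# Hypothetical toy training set of (word, tag) pairs.
train = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]
dist = singleton_tag_distribution(train)
# "the" occurs twice, so only NOUN and VERB contribute: 0.5 each
```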
NOTE 3: The CKY algorithm runs on the Viterbi output, assuming that a training dump for configuration 1.3.2 exists. This means that, in order to work properly, the algorithm looks for the file v.1.3.2.viterbi_training.lst in the dump folder mentioned above. The method used for CKY to extract the PCFG (probabilistic context-free grammar) in CNF (Chomsky Normal Form) from the training set of the second exercise takes only a few seconds, so no dump has been set up for the grammar. Instead, some test dump files have been prepared for CKY: testing CKY takes a lot of time, and with certain configurations it can take many days to complete the tests. For this reason, the method that calls the evalb evaluation program is able to recover the dumps of previously executed tests from the dump folder.
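To make the CKY step more concrete, here is a minimal sketch of probabilistic CKY over a grammar in Chomsky Normal Form. It is not the code in cky.py: the function name, the rule dictionaries and the toy grammar are assumptions. In the project the preterminal tags come from the Viterbi output, whereas this sketch simply uses lexical rules.

```python
import math


def pcky(words, lexical, binary, start="S"):
    """Return (bracketed tree, log-probability) of the best parse, or None.

    lexical: {(A, word): prob} for rules A -> word
    binary:  {(A, B, C): prob} for rules A -> B C
    """
    n = len(words)
    neg_inf = float("-inf")
    # score[i][j][A]: best log-probability of A spanning words[i:j]
    score = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    back = [[{} for _ in range(n + 1)] for _ in range(n + 1)]

    for i, w in enumerate(words):
        for (a, word), p in lexical.items():
            if word == w and math.log(p) > score[i][i + 1].get(a, neg_inf):
                score[i][i + 1][a] = math.log(p)
                back[i][i + 1][a] = w

    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point
                left, right = score[i][k], score[k][j]
                for (a, b, c), p in binary.items():
                    if b in left and c in right:
                        s = math.log(p) + left[b] + right[c]
                        if s > score[i][j].get(a, neg_inf):
                            score[i][j][a] = s
                            back[i][j][a] = (k, b, c)

    def build(i, j, a):
        if j - i == 1:  # width-1 cells hold only lexical rules in CNF
            return "(%s %s)" % (a, back[i][j][a])
        k, b, c = back[i][j][a]
        return "(%s %s %s)" % (a, build(i, k, b), build(k, j, c))

    if n == 0 or start not in score[0][n]:
        return None
    return build(0, n, start), score[0][n][start]


# Hypothetical toy grammar in CNF.
lexical = {("DT", "the"): 1.0, ("NN", "dog"): 0.5,
           ("NN", "cat"): 0.5, ("VBZ", "sees"): 1.0}
binary = {("S", "NP", "VP"): 1.0, ("NP", "DT", "NN"): 1.0,
          ("VP", "VBZ", "NP"): 1.0}
tree, logp = pcky("the dog sees the cat".split(), lexical, binary)
print(tree)
# (S (NP (DT the) (NN dog)) (VP (VBZ sees) (NP (DT the) (NN cat))))
```

The three nested loops over span, start and split point are what makes testing slow on long sentences, which is consistent with limiting the tests to short sentences and caching their results.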
More precisely, the root of the dump folder contains 4 files: they are the result of running CKY with and without Viterbi on 110 sentences, restricted to sentences of no more than 25 terms (about 55 sentences). gld and tst are, respectively, the gold and test files (generated by CKY), with and without Viterbi (no and with prefix).
NOTE 4: The corpora folder contains the corpora used and the changes implemented for specific executions (CKY + Viterbi).
If something goes wrong, please write to matteo.madeddu@gmail.com