This repository contains a project completed as part of the course Natural Language Processing - Advanced, Spring 2014. The course was taught by Dr. Dipti Misra Sharma, Dr. Ravi Jampani and Mr. Akula Arjun Reddy.
A detailed report is available here
##Requirements
- Python 2.6 or above
- GIZA++
- Language Model (IRSTLM)
##Problem
In this project, a phrase-based model is implemented. A phrase-based model is a simple model for machine translation that translates phrases, rather than single words, as its basic units. This requires a dictionary that maps phrases from one language to another. We first find the word alignments. Next, using the bi-text corpus, we train the model and calculate the translation probabilities. Along with the translation probabilities, we use a language model to reflect fluency in English.
The source folder contains the following modules:
###Main functions
- preprocess.py
This module takes as input the two bi-text corpora and the number of training sentences. It returns the training and testing datasets along with the sentence pairs.
Run the following command to create a random training set of numberOfSentencesForTraining sentences:
python preprocess.py sourceCorpus targetCorpus numberOfSentencesForTraining
It will generate four files:
trainingSource.txt trainingTarget.txt testingSource.txt testingTarget.txt
trainingSource.txt, trainingTarget.txt: contain the given number of training sentences
testingSource.txt, testingTarget.txt: contain 5 test sentences, which are used later
Next, run the word alignment tool GIZA++ to obtain the alignments.
To run GIZA++, do the following:
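The random split described above can be sketched as follows. This is a minimal illustration, not the code in preprocess.py: the function name `split_corpus`, the fixed seed, and the choice to keep the test pairs disjoint from the training pairs are all assumptions.

```python
import random

def split_corpus(source_lines, target_lines, n_train, n_test=5, seed=0):
    """Randomly pick disjoint training and testing sentence pairs.

    Hypothetical helper: the real preprocess.py may sample differently.
    """
    if len(source_lines) != len(target_lines):
        raise ValueError("source and target corpora must be line-aligned")
    if n_train + n_test > len(source_lines):
        raise ValueError("not enough sentence pairs in the corpus")
    rng = random.Random(seed)
    # Sample indices once so each source line stays paired with its target line.
    indices = rng.sample(range(len(source_lines)), n_train + n_test)
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    train = [(source_lines[i], target_lines[i]) for i in train_idx]
    test = [(source_lines[i], target_lines[i]) for i in test_idx]
    return train, test
```

Sampling indices rather than lines keeps the two corpora aligned, which the later GIZA++ step depends on.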
./plain2snt.out trainingSource.txt trainingTarget.txt
./GIZA++ -s trainingSource.vcb -t trainingTarget.vcb -c trainingSource_trainingTarget.snt
If the previous step gives an error, run:
./snt2cooc.out trainingSource.vcb trainingTarget.vcb trainingSource_trainingTarget.snt > cooc.cooc
./GIZA++ -s trainingSource.vcb -t trainingTarget.vcb -c trainingSource_trainingTarget.snt -CoocurrenceFile cooc.cooc
This will generate several files. The word alignments are in the A3 file. Repeat this step with trainingSource.txt and trainingTarget.txt swapped to get the alignment in the other direction. Let sourceAlignment.txt and targetAlignment.txt be the two resulting alignment files. Then we obtain the phrases as follows:
- phraseExtraction.py
This module reads the two files generated by GIZA++, containing the source-to-target and target-to-source alignments, and returns all possible phrases consistent with them. Run the following command to get the phrases:
python phraseExtraction.py sourceAlignment.txt targetAlignment.txt
The phrases are written to the file phrases.txt. Next, we calculate the translation probabilities.
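To make the extraction step concrete, here is a hedged sketch of reading an A3 file and extracting consistent phrase pairs. It assumes the usual three-line A3 record (a comment line, a plain sentence, and a sentence whose words are each followed by `({ positions })`), and it assumes the links from both directions have already been combined into one set, for example by intersection. The names `parse_a3` and `extract_phrases` are hypothetical, and the sketch does not extend phrases over unaligned words, so it may differ from what phraseExtraction.py does.

```python
import re

# One "word ({ positions })" entry on the third line of an A3 record.
ENTRY_RE = re.compile(r"(\S+) \(\{([\d ]*)\}\)")

def parse_a3(lines):
    """Yield (plain_words, annotated_words, links) for each 3-line record.

    links holds 0-based (annotated_index, plain_index) pairs.
    """
    lines = [line.rstrip("\n") for line in lines]
    for i in range(0, len(lines) - 2, 3):
        plain_words = lines[i + 1].split()
        entries = ENTRY_RE.findall(lines[i + 2])
        annotated_words, links = [], set()
        # entries[0] is the NULL token; its links are ignored in this sketch.
        for index, (word, positions) in enumerate(entries[1:]):
            annotated_words.append(word)
            for position in positions.split():
                links.add((index, int(position) - 1))
        yield plain_words, annotated_words, links

def extract_phrases(src_words, tgt_words, links, max_len=4):
    """Return the set of phrase pairs consistent with the alignment links."""
    phrases = set()
    n = len(src_words)
    for s_start in range(n):
        for s_end in range(s_start, min(s_start + max_len, n)):
            # Target positions linked to any word in the source span.
            tgt_pos = [t for (s, t) in links if s_start <= s <= s_end]
            if not tgt_pos:
                continue
            t_start, t_end = min(tgt_pos), max(tgt_pos)
            if t_end - t_start >= max_len:
                continue
            # Consistent: no word in the target span links outside the source span.
            if all(s_start <= s <= s_end
                   for (s, t) in links if t_start <= t <= t_end):
                phrases.add((" ".join(src_words[s_start:s_end + 1]),
                             " ".join(tgt_words[t_start:t_end + 1])))
    return phrases
```

The consistency check is what rules out a pair such as ("a", "y x") when "x" is also linked to a word outside the source span.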
- findTranslationProbability.py
After obtaining the consistent phrases from the phrase extraction algorithm, we next compute the translation probabilities. This is done by calculating the relative frequency of the target phrase for a given source phrase, in both directions.
Run the following command:
python findTranslationProbability.py phrases.txt
It will generate two files:
translationProbabilitySourceGivenTarget.txt
translationProbabilityTargetGivenSource.txt
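The relative-frequency estimate can be illustrated with a short sketch. It assumes the phrase pairs are available as a list of (source, target) tuples; the format of phrases.txt is not specified here, and the function name is hypothetical.

```python
from collections import Counter

def relative_frequencies(phrase_pairs):
    """Estimate p(target | source) = count(source, target) / count(source)."""
    pair_counts = Counter(phrase_pairs)
    source_counts = Counter(src for src, _ in phrase_pairs)
    return dict(((src, tgt), count / float(source_counts[src]))
                for (src, tgt), count in pair_counts.items())
```

The other direction, p(source | target), follows by swapping the two elements of each pair before calling the function.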
- languageModelInput.py
This module formats the input file for the language model by removing all special characters. Run it as follows:
python languageModelInput.py trainSource.txt trainS.txt
python languageModelInput.py trainTarget.txt trainT.txt
Compress each output file with gzip; the compressed files are the input to the language model (IRSTLM), which is run as follows:
./ngt -i="gunzip -c trainS.gz" -n=3 -o=train.www -b=yes
./tlm -tr=train.www -n=3 -lm=wb -o=trainS.lm
./ngt -i="gunzip -c trainT.gz" -n=3 -o=train.www -b=yes
./tlm -tr=train.www -n=3 -lm=wb -o=trainT.lm
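The formatting and compression that precede these commands could be sketched as below. This is written for Python 3 and rests on two assumptions: that "special characters" means ASCII punctuation, and that the cleaned text is gzip-compressed (suggested by the `gunzip -c` calls). `clean_line` and `write_gzip` are hypothetical names, not functions from languageModelInput.py.

```python
import gzip
import string

# Assumption: "special characters" are the ASCII punctuation characters.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean_line(line):
    """Replace punctuation with spaces and collapse repeated whitespace."""
    return " ".join(line.translate(_PUNCT_TO_SPACE).split())

def write_gzip(lines, path):
    """Write the cleaned lines to a gzip-compressed UTF-8 text file."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for line in lines:
            handle.write(clean_line(line) + "\n")
```

Only ASCII punctuation is touched, so characters of non-Latin scripts pass through unchanged.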
- finalScore.py
After the translation probabilities are obtained from the alignment matrix, this module combines them with the language model score and returns the final translation probabilities.
Run the following command for both directions:
python finalScore.py translationProbabilityTargetGivenSource.txt trainSource.lm finalTranslationProbabilityTargetGivenSource.txt
python finalScore.py translationProbabilitySourceGivenTarget.txt trainTarget.lm finalTranslationProbabilitySourceGivenTarget.txt
It writes the final translation probabilities to the output file given as the last argument.
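How the two scores are combined is not spelled out here. One common choice, shown purely as an assumption, is to add the log translation probability and a weighted log language model probability of the target phrase. The function `final_scores`, the `lm_prob` callable and the `lm_weight` parameter are all hypothetical.

```python
import math

def final_scores(translation_probs, lm_prob, lm_weight=1.0):
    """Combine translation and language model probabilities in log space.

    translation_probs: {(source_phrase, target_phrase): probability}
    lm_prob: callable returning a probability for a target phrase (assumed).
    """
    scores = {}
    for (src, tgt), p in translation_probs.items():
        # Assumption: a simple weighted sum of log probabilities.
        scores[(src, tgt)] = math.log(p) + lm_weight * math.log(lm_prob(tgt))
    return scores
```

Working in log space avoids underflow when many small probabilities are later multiplied during decoding.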
- stackDecoding.py
Once we obtain the final translation probabilities, we find the best phrase translation. This module gives the translation for a given sentence based on hypothesis recombination. Run the following command:
python stackDecoding.py finalTranslationProbabilityTargetGivenSource.txt testingTarget.txt
python stackDecoding.py finalTranslationProbabilitySourceGivenTarget.txt testingSource.txt
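As an illustration of stack decoding with hypothesis recombination, here is a simplified monotone sketch (no reordering). It is not the algorithm in stackDecoding.py: the phrase table layout, the recombination key (last target word) and the parameters `max_phrase_len` and `stack_size` are assumptions made for the example.

```python
def decode(source_words, phrase_table, max_phrase_len=3, stack_size=10):
    """Monotone stack decoding with recombination.

    phrase_table: {source_phrase: [(target_phrase, log_score), ...]}
    Returns the best translation, or None if the sentence cannot be covered.
    """
    n = len(source_words)
    # stacks[k] holds hypotheses covering the first k source words,
    # keyed by the last target word so equivalent hypotheses are recombined.
    stacks = [dict() for _ in range(n + 1)]
    stacks[0][()] = (0.0, [])
    for covered in range(n):
        # Pruning: expand only the best stack_size hypotheses of this stack.
        best = sorted(stacks[covered].values(),
                      key=lambda hyp: hyp[0], reverse=True)[:stack_size]
        for score, output in best:
            for end in range(covered + 1, min(covered + max_phrase_len, n) + 1):
                src = " ".join(source_words[covered:end])
                for tgt, log_score in phrase_table.get(src, []):
                    key = tuple(tgt.split()[-1:])
                    new_hyp = (score + log_score, output + [tgt])
                    old_hyp = stacks[end].get(key)
                    # Recombination: keep only the better of two hypotheses
                    # that share coverage and last target word.
                    if old_hyp is None or new_hyp[0] > old_hyp[0]:
                        stacks[end][key] = new_hyp
    if not stacks[n]:
        return None
    return " ".join(max(stacks[n].values(), key=lambda hyp: hyp[0])[1])
```

Recombination is safe here because two hypotheses with the same coverage and the same last target word can only be extended in the same ways, so the lower-scoring one can never win.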
###Helper Function:
- alignment.py
This is a helper function which generates the word alignment matrix for a pair of sentences.
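A word alignment matrix for one sentence pair can be pictured as a grid with one row per source word and one column per target word. The sketch below assumes the alignment is given as 0-based (source_index, target_index) links; the function name and input format are hypothetical and may not match alignment.py.

```python
def alignment_matrix(src_words, tgt_words, links):
    """Return a len(src) x len(tgt) 0/1 matrix marking aligned word pairs."""
    matrix = [[0] * len(tgt_words) for _ in src_words]
    for s, t in links:
        matrix[s][t] = 1
    return matrix
```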
###Error Analysis
The script errorAnalysis.py takes its input in a very specific format: the source sentence, the translated sentence and the actual (reference) translation, each on its own line. It writes the precision and recall for the input file to evalution.txt
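The exact definition of precision and recall used by errorAnalysis.py is not stated. A minimal sketch, assuming word-level overlap with clipped counts and an input file made of consecutive three-line groups (blank lines ignored), might look like this. Both function names are hypothetical.

```python
from collections import Counter

def read_triples(lines):
    """Group non-empty lines into (source, translated, reference) triples."""
    rows = [line.strip() for line in lines if line.strip()]
    return [tuple(rows[i:i + 3]) for i in range(0, len(rows) - 2, 3)]

def precision_recall(translated, reference):
    """Word-level precision and recall of a translation against a reference."""
    cand, ref = Counter(translated.split()), Counter(reference.split())
    # Counter intersection clips each word count to the smaller of the two.
    overlap = sum((cand & ref).values())
    precision = overlap / float(sum(cand.values())) if cand else 0.0
    recall = overlap / float(sum(ref.values())) if ref else 0.0
    return precision, recall
```

Precision is the fraction of translated words found in the reference; recall is the fraction of reference words recovered by the translation.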