For this step, you need to install the following packages:
- To install the MosesToolkit, follow the installation guide on the official website.
Steps:
- Train Moses to generate a phrase table. Follow the training steps 1-6 or run the bash script `phrase_table.sh`.
- Detokenize the phrase table. Moses replaces punctuation symbols with escape sequences such as `&apos;` for the apostrophe. These sequences cannot be phonetized with G2P tools.
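The detokenization step can be sketched as follows. This is a minimal illustration, not the project's own script: it assumes the usual Moses phrase table layout with fields separated by ` ||| `, and it uses Python's standard `html.unescape` to undo the escape sequences. The function name `detokenize_phrase_table_line` is hypothetical.

```python
import html


def detokenize_phrase_table_line(line: str) -> str:
    """Unescape Moses escape sequences in the source and target phrase.

    Assumption: fields are separated by ' ||| ' and the first two fields
    are the source and target phrases.
    """
    fields = line.rstrip("\n").split(" ||| ")
    # Only the two text fields are unescaped; scores, alignments and
    # counts are passed through unchanged.
    for i in range(min(2, len(fields))):
        fields[i] = html.unescape(fields[i])
    return " ||| ".join(fields)


if __name__ == "__main__":
    example = "don &apos;t ||| ne &quot; pas ||| 0.5 0.4 0.3 0.2"
    print(detokenize_phrase_table_line(example))
```

Splitting into fields before unescaping keeps an escaped pipe symbol inside a phrase from being mistaken for a field separator.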
Experiment settings:
- Preprocessing: We chose not to remove sentences with 100 or more tokens.
- Training: In our experiments, we generate phrases with up to 5 tokens, but only use phrases with up to 3 tokens in later steps.
- Remove phrases which contain non-alphabetical characters. These characters cannot be converted by G2P tools.
- Keep phrases with a number of tokens <= 3.
- Only retain phrases with inverse and direct translation probabilities >= 0.05.
- Keep only the top n=5 translations.
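The filtering settings above could be applied roughly as in the sketch below. Several details are assumptions rather than the project's confirmed behaviour: the score field is taken to follow the standard Moses order (inverse phrase probability first, direct phrase probability third), the alphabetic check is applied to the source phrase only, and the top translations are ranked by direct probability. The function name `filter_phrase_table` is hypothetical.

```python
from collections import defaultdict


def filter_phrase_table(lines, max_tokens=3, min_prob=0.05, top_n=5):
    """Return {source: [(target, direct_prob), ...]} after filtering."""
    table = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split(" ||| ")
        if len(fields) < 3:
            continue  # skip malformed lines
        source, target = fields[0].strip(), fields[1].strip()
        scores = [float(s) for s in fields[2].split()]
        if len(scores) < 3:
            continue
        # Assumed Moses score order: inverse phrase prob., inverse lexical
        # weight, direct phrase prob., direct lexical weight.
        inverse_prob, direct_prob = scores[0], scores[2]
        tokens = source.split()
        if not tokens or len(tokens) > max_tokens:
            continue
        # Assumption: only the source phrase is phonetized, so only it
        # needs to be purely alphabetical.
        if not all(tok.isalpha() for tok in tokens):
            continue
        if inverse_prob < min_prob or direct_prob < min_prob:
            continue
        table[source].append((target, direct_prob))
    # Keep the top_n translations per source phrase, best first.
    return {
        src: sorted(cands, key=lambda pair: pair[1], reverse=True)[:top_n]
        for src, cands in table.items()
    }
```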
For this step, you need to install the following packages:
Steps:
- Convert the source phrases to IPA strings. Phonetized tokens are not separated by whitespace.
- Remove suprasegmental symbols with the `IPAString` class from `ipapy`, since they cannot be featurized with `Wordkit`.
Note: G2P conversion to ARPABET symbols is implemented with g2p_en, but is not supported in PETS.
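To illustrate what removing suprasegmental symbols means in practice, here is a dependency-free sketch. It does not call `ipapy`; it filters a small, hand-picked set of stress, length and boundary marks, and that set is an assumption that may not match what `IPAString` treats as suprasegmental. The names `SUPRASEGMENTALS` and `strip_suprasegmentals` are hypothetical.

```python
# Assumed symbol set: primary/secondary stress, long/half-long marks,
# syllable break, minor/major group boundaries, linking mark.
SUPRASEGMENTALS = set("ˈˌːˑ.|‖‿")


def strip_suprasegmentals(ipa: str) -> str:
    """Drop suprasegmental marks, keeping consonants, vowels and diacritics."""
    return "".join(ch for ch in ipa if ch not in SUPRASEGMENTALS)


if __name__ == "__main__":
    # Phonetized tokens are concatenated without whitespace.
    tokens = ["ˈwɔːtə", "ˈbɒtl"]
    print(strip_suprasegmentals("".join(tokens)))
```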
For this step, you need to install the following packages:
Input: Transcription of the source sentence, the target sentence, and the machine translation output.
Steps:
- Using the phonetized phrase table from step 2, find phrases in the source sentence that do not have a translation in the output sentence. We call these phrases candidates.
- Search for phrases that are phonetically similar to the candidates with a modified Levenshtein distance.
- Calculate the cosine similarity between the candidates and the phonetically similar phrases with Patpho. `CVTransformer` and `ONCTransformer` are Patpho implementations in `Wordkit`.
  - Put all the phonetically similar phrases on a CV (consonant vowel) grid.
  - Vectorize the candidates and similar phrases.
  - Calculate the cosine similarity.
- Retrieve the translations of phrases with a similarity greater than the parameter `sim` from the phonetic table. We call these Phonetically Edited Translations (PETs).
- Using a naive pattern-matching alignment method, check if the PET appears in the target sentence and is aligned to the source candidate.
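The CV grid and cosine similarity steps can be pictured with the toy sketch below. It is not the `Wordkit` implementation: it assumes one character per phoneme, a small hand-picked vowel inventory, a repeating `CCCVV` slot template, and one-hot slot vectors instead of phonological feature vectors. All names (`VOWELS`, `TEMPLATE`, `place_on_grid`, `cosine_similarity`, `similar_phrases`) are hypothetical.

```python
import math

VOWELS = set("aeiouyɑæɐəɛɜɪɔʊʌ")  # toy inventory, an assumption
TEMPLATE = "CCCVV"  # assumed slot pattern, repeated as often as needed


def place_on_grid(phonemes: str) -> set:
    """Left-align phonemes on the grid; return a set of (slot, phoneme)."""
    placed = set()
    slot = 0
    for ph in phonemes:
        kind = "V" if ph in VOWELS else "C"
        # Advance to the next slot that accepts this phoneme type.
        while TEMPLATE[slot % len(TEMPLATE)] != kind:
            slot += 1
        placed.add((slot, ph))
        slot += 1
    return placed


def cosine_similarity(a: set, b: set) -> float:
    """Cosine of two one-hot vectors stored as sets of active dimensions."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def similar_phrases(candidate: str, phrases, sim: float = 0.7):
    """Return (phrase, score) pairs whose similarity exceeds sim."""
    cand = place_on_grid(candidate)
    scored = [(p, cosine_similarity(cand, place_on_grid(p))) for p in phrases]
    return [(p, s) for p, s in scored if s > sim]


if __name__ == "__main__":
    print(similar_phrases("kæt", ["kæts", "kæp", "dɔg"]))
```

Because each phoneme is tied to a grid slot, two phrases only score as similar when matching sounds occur in matching positions, which is the point of aligning on the grid before vectorizing.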
Parameters:
- `max_dist` = Threshold for the maximum number of edit operations (Levenshtein distance) between the source candidate and phrases.
- `costs` = Costs for the Levenshtein edit operations (deletion, insertion, substitution). In our experiments, we assign higher insertion costs only at the beginning and end of a string.
- `sim` = Threshold for the minimum cosine similarity score between the source candidate and phrases.
- `left` = Whether phonemes are left-aligned on the CV grid in `Patpho`.
- `n` = Maximum alignment distance between the source candidate and PETs.
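As a rough illustration of how the cost tuple and the distance threshold might interact, the sketch below implements a standard weighted Levenshtein distance. It is not the project's modified distance: insertion costs are uniform rather than raised only at the string edges, and dividing by the longer string's length is an assumption made because the threshold used in the experiments is fractional. The function names are hypothetical.

```python
def weighted_levenshtein(source: str, target: str, costs=(1, 2, 1)) -> float:
    """Edit distance with (deletion, insertion, substitution) costs."""
    del_cost, ins_cost, sub_cost = costs
    # prev[j] = cost of turning the empty prefix of source into target[:j]
    prev = [j * ins_cost for j in range(len(target) + 1)]
    for i, s in enumerate(source, start=1):
        curr = [i * del_cost]
        for j, t in enumerate(target, start=1):
            curr.append(min(
                prev[j] + del_cost,                       # delete s
                curr[j - 1] + ins_cost,                   # insert t
                prev[j - 1] + (0 if s == t else sub_cost),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def within_max_dist(source, target, max_dist=0.6, costs=(1, 2, 1)) -> bool:
    """Assumed normalization: distance divided by the longer string length."""
    longest = max(len(source), len(target))
    if longest == 0:
        return True
    return weighted_levenshtein(source, target, costs) / longest <= max_dist


if __name__ == "__main__":
    print(weighted_levenshtein("kæt", "kæts"), within_max_dist("kæt", "kɪt"))
```

With costs of `(1, 2, 1)`, a phrase that is longer than the candidate is penalized more than one that is shorter or differs by a substitution.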
Experiment settings:
- `max_dist` = 0.6
- `costs` = (1, 2, 1)
- `sim` = 0.7
- `left` = True
- `n` = 3