Phonetically Edited Translations (PETs)

Step 1: Create phrase table with Moses

For this step, you need to install the following packages:

To install the MosesToolkit, follow the installation guide on the official website.

Steps:

Train Moses to generate a phrase table. Follow the training steps 1-6 or run the bash script phrase_table.sh.
Detokenize the phrase table. Moses escapes punctuation symbols as special character sequences, such as &apos; for the apostrophe. G2P tools cannot phonetize these sequences.
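The unescaping part of this step can be sketched as follows. This is a minimal illustration, not the repository's script: the function names are hypothetical, and it assumes the standard Moses phrase-table layout with fields separated by " ||| ". The mapping covers the escape sequences produced by the Moses tokenizer.

```python
# Moses escape sequences and the characters they stand for.
# "&amp;" is handled last so that a literal "&amp;apos;" is not unescaped twice.
MOSES_ESCAPES = [
    ("&#124;", "|"),
    ("&lt;", "<"),
    ("&gt;", ">"),
    ("&apos;", "'"),
    ("&quot;", '"'),
    ("&#91;", "["),
    ("&#93;", "]"),
    ("&amp;", "&"),
]


def unescape_moses(text: str) -> str:
    """Replace Moses escape sequences with the original characters."""
    for escaped, char in MOSES_ESCAPES:
        text = text.replace(escaped, char)
    return text


def unescape_phrase_table_line(line: str) -> str:
    """Unescape only the source and target phrase fields of one line.

    Hypothetical helper: assumes fields are separated by " ||| ".
    """
    fields = line.rstrip("\n").split(" ||| ")
    fields[0] = unescape_moses(fields[0])
    fields[1] = unescape_moses(fields[1])
    return " ||| ".join(fields)
```

Only the two phrase fields are touched, so scores and alignment fields pass through unchanged.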

Experiment settings:

Preprocessing: We chose not to remove sentences with 100 or more tokens.
Training: In our experiments, we generate phrases with up to 5 tokens, but only use phrases with up to 3 tokens in later steps.

Step 2: Filter phrase table

Remove phrases which contain non-alphabetical characters. These characters cannot be converted by G2P tools.
Keep phrases with at most 3 tokens.
Only retain phrases whose inverse and direct translation probabilities are both >= 0.05.
Keep only the top n=5 translations.
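The four filters above can be sketched as a single pass over the phrase table. This is an illustration under stated assumptions, not the repository's code: the function names are hypothetical, the score field is assumed to follow the standard Moses order (inverse phrase probability, inverse lexical weight, direct phrase probability, direct lexical weight), the token limit is applied to the source phrase only, and the top n translations are ranked by direct probability.

```python
from collections import defaultdict


def is_alphabetic(phrase: str) -> bool:
    """True if the phrase is non-empty and every token is purely alphabetic."""
    tokens = phrase.split()
    return bool(tokens) and all(tok.isalpha() for tok in tokens)


def filter_phrase_table(lines, max_tokens=3, min_prob=0.05, top_n=5):
    """Return {source phrase: [up to top_n target phrases]}."""
    by_source = defaultdict(list)
    for line in lines:
        fields = [f.strip() for f in line.split("|||")]
        if len(fields) < 3:
            continue  # malformed line
        src, tgt = fields[0], fields[1]
        scores = [float(s) for s in fields[2].split()]
        if len(scores) < 4:
            continue
        # Assumed score order: inverse phrase prob. first, direct phrase prob. third.
        inverse_prob, direct_prob = scores[0], scores[2]
        if not (is_alphabetic(src) and is_alphabetic(tgt)):
            continue
        if len(src.split()) > max_tokens:
            continue
        if inverse_prob < min_prob or direct_prob < min_prob:
            continue
        by_source[src].append((direct_prob, tgt))

    filtered = {}
    for src, candidates in by_source.items():
        # Assumption: rank translations by direct probability, best first.
        candidates.sort(key=lambda c: c[0], reverse=True)
        filtered[src] = [tgt for _, tgt in candidates[:top_n]]
    return filtered
```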

Step 3: Phonetize phrase table

For this step, you need to install the following packages:

Epitran with lex_lookup.
ipapy

Steps:

Convert the source phrases to IPA strings. Phonetized tokens are not separated by whitespace.
Remove suprasegmental symbols using the IPAString class from ipapy, since they cannot be featurized with Wordkit.

Note: G2P conversion to ARPABET symbols is implemented with g2p_en, but is not supported in PETs.
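To illustrate what the cleanup does to a phonetized phrase, here is a dependency-free sketch. It is a stand-in for the ipapy-based removal, not a call into Epitran or ipapy: the function names are hypothetical, and the hand-picked set of marks (stress, length, syllable break, linking) is an assumption about which suprasegmentals occur in the data.

```python
# Assumed set of suprasegmental marks; extend it if other marks appear.
SUPRASEGMENTALS = {
    "\u02c8",  # primary stress
    "\u02cc",  # secondary stress
    "\u02d0",  # long
    "\u02d1",  # half-long
    ".",       # syllable break
    "\u203f",  # linking (absence of a break)
}


def strip_suprasegmentals(ipa: str) -> str:
    """Drop every character that belongs to the assumed suprasegmental set."""
    return "".join(ch for ch in ipa if ch not in SUPRASEGMENTALS)


def phonetize_phrase(tokens_ipa):
    """Join already-phonetized tokens without whitespace, then strip marks."""
    return strip_suprasegmentals("".join(tokens_ipa))
```

For example, `phonetize_phrase(["ˈɡʊd", "ˈmɔːrnɪŋ"])` yields one unbroken string with the stress and length marks removed.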

Step 4: Find Phonetically Edited Translations (PETs)

For this step, you need to install the following packages:

Input: Transcription of the source sentence, the target sentence, and the machine translation output.

Steps:

Using the phonetized phrase table from step 3, find phrases in the source sentence that do not have a translation in the output sentence. We call these phrases candidates.
Search for phrases that are phonetically similar to the candidates, using a modified Levenshtein distance.
Calculate cosine similarity between the candidates and the phonetically similar phrases with Patpho.
- CVTransformer and ONCTransformer are Patpho implementations in Wordkit.
- Put all the phonetically similar phrases on a CV (consonant vowel) grid.
- Vectorize the candidates and similar phrases.
- Calculate the cosine similarity.
Retrieve the translations of phrases with a similarity greater than the parameter sim from the phonetized phrase table. We call these Phonetically Edited Translations (PETs).
Using a naive pattern-matching alignment method, check if the PET appears in the target sentence and is aligned to the source candidate.
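The CV-grid similarity in the steps above can be illustrated with a simplified sketch. It is not Wordkit's CVTransformer: the vowel inventory, the grid template, and the function names are assumptions, each character is treated as one phoneme, and each slot is encoded as a one-hot symbol rather than as a phonological feature vector. With one-hot slots, the dot product reduces to counting slots that hold the same symbol.

```python
import math

# Assumed vowel inventory and grid template (both hypothetical).
VOWELS = set("aeiouɑæɐɒɔəɛɜɪʊʌøœyɨʉɯɤ")
GRID = "CCCVV" * 4


def to_grid(phonemes, grid=GRID):
    """Place phonemes left to right on the next free slot of matching type."""
    slots = [None] * len(grid)
    pos = 0
    for ph in phonemes:
        kind = "V" if ph in VOWELS else "C"
        while pos < len(grid) and grid[pos] != kind:
            pos += 1
        if pos == len(grid):
            raise ValueError("phrase does not fit on the grid")
        slots[pos] = ph
        pos += 1
    return slots


def cosine_similarity(a, b):
    """Cosine similarity of two phrases under one-hot slot encoding."""
    grid_a, grid_b = to_grid(a), to_grid(b)
    # Dot product: number of slots filled with the same symbol in both grids.
    dot = sum(1 for x, y in zip(grid_a, grid_b) if x is not None and x == y)
    filled_a = sum(x is not None for x in grid_a)
    filled_b = sum(y is not None for y in grid_b)
    norm = math.sqrt(filled_a * filled_b)
    return dot / norm if norm else 0.0
```

Because vowels and consonants land in their own slots, "æt" still lines up with the last two phonemes of "kæt" even though the strings differ in length. A feature-based encoding would additionally give partial credit to similar but non-identical phonemes.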

Parameters:

max_dist = Threshold for the maximum Levenshtein distance between a source candidate and a phrase.
costs = Costs for the Levenshtein edit operations (deletion, insertion, substitution). In our experiments, we assign higher insertion costs only at the beginning and end of a string.
sim = Threshold for the minimum cosine similarity score between a source candidate and a phrase.
left = Whether phrases are left-aligned on the CV grid in Patpho.
n = Maximum alignment distance between a source candidate and a PET.
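How max_dist and costs interact can be sketched with a weighted Levenshtein distance. This is a minimal illustration, not the repository's modified distance: the function names are hypothetical, the costs are applied uniformly (the position-dependent insertion costs are omitted), and normalizing by the length of the longer string is an assumption made so that a fractional threshold is meaningful.

```python
def weighted_levenshtein(source, target, costs=(1, 2, 1)):
    """Cost of editing source into target.

    costs = (deletion, insertion, substitution), applied uniformly.
    """
    del_cost, ins_cost, sub_cost = costs
    # prev[j]: cost of turning the processed prefix of source into target[:j].
    prev = [j * ins_cost for j in range(len(target) + 1)]
    for i, s in enumerate(source, start=1):
        curr = [i * del_cost]
        for j, t in enumerate(target, start=1):
            curr.append(min(
                prev[j] + del_cost,                         # delete s
                curr[j - 1] + ins_cost,                     # insert t
                prev[j - 1] + (0 if s == t else sub_cost),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def normalized_distance(source, target, costs=(1, 2, 1)):
    """Assumption: divide the weighted cost by the longer string's length."""
    longest = max(len(source), len(target))
    if longest == 0:
        return 0.0
    return weighted_levenshtein(source, target, costs) / longest


def is_phonetically_close(source, target, max_dist=0.6, costs=(1, 2, 1)):
    """True if the normalized distance does not exceed max_dist."""
    return normalized_distance(source, target, costs) <= max_dist
```

With costs of (1, 2, 1), adding a phoneme to the candidate is penalized twice as much as dropping or replacing one.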

Experiment settings:

max_dist = 0.6
costs = (1,2,1)
sim = 0.7
left = True
n = 3
