Simple Webspeak to English SMT model as a Discord bot.
The bot can be run with:
DISCORD_USER_IDS=<uid1,uid2,...> DISCORD_TOKEN=<token> python3 bot.py
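As a rough illustration of how the two environment variables in the command above could be consumed, here is a minimal sketch. The helper names `parse_user_ids` and `load_config` are hypothetical, and treating the IDs as integers is an assumption; the actual bot.py may read them differently.

```python
import os


def parse_user_ids(raw: str) -> list[int]:
    """Split a comma-separated string of user IDs, ignoring empty entries."""
    return [int(part) for part in raw.split(",") if part.strip()]


def load_config(environ=os.environ):
    """Return (token, user_ids) from the environment (hypothetical helper)."""
    token = environ.get("DISCORD_TOKEN")
    if not token:
        raise RuntimeError("DISCORD_TOKEN is not set")
    return token, parse_user_ids(environ.get("DISCORD_USER_IDS", ""))
```

Failing early on a missing token keeps configuration mistakes visible before the bot tries to connect.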
Piemanese is a form of webspeak spoken by my friend Pieman.
Some examples of Piemanese (First line Piemanese, English below):
i ges i cn liftu a beet ;-;
i guess i can lift a bit ;-;
i told u to pley it b4 >.<
i told you to play it before >.<
mai englando es too gud
my english is too good
Furthermore, some Piemanese words are ambiguous, and their meaning must be determined from context.
Example of an ambiguous case: wan
wan u come
when you come
nani u wan
what you want
In contrast to "regular" webspeak, Piemanese contains far more spelling perturbations, so a simple Levenshtein-distance-based spelling correction algorithm or replacement dictionary is insufficient to translate it back to regular English.
A more sophisticated approach is required, one that takes into account the following:
- How the spelling of a Piemanese word relates to its corresponding English word
- Context of the sentence
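To make the limitation of plain edit distance concrete, the sketch below implements a standard Levenshtein distance and applies it to words from the examples above. The candidate lists are chosen by hand for illustration and are not taken from the project.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insert, delete and substitute each cost 1."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


# Hand-picked candidates, for illustration only.
examples = {"wan": ["want", "when", "can"], "beet": ["bit", "best", "beat"]}
for piemanese, candidates in examples.items():
    for english in candidates:
        print(piemanese, english, levenshtein(piemanese, english))
```

Here "wan" is one edit away from both "want" and "can" but two edits away from "when", and "beet" is closer to "best" and "beat" than to the intended "bit", so edit distance alone cannot pick the right word.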
We approach this as a machine translation problem. In other words, we look to compute the following:

$$\hat{e} = \arg\max_{e \in E} P(e \mid p)$$

where $E$ is the set of all possible English sentences and $p$ is a Piemanese sentence.
By Bayes' theorem, we can rewrite this as:

$$\hat{e} = \arg\max_{e \in E} P(p \mid e) \, P(e)$$

We can then interpret the first term as a translation model and the second term as a language model.
- translation model $P(p \mid e)$: returns a high probability if $e$ is a good translation of $p$, low probability if it is not.
- language model $P(e)$: returns a high probability if $e$ is a well-formed English sentence, lower if it is not.
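The Bayes step drops a denominator that does not depend on the English sentence. A short derivation follows, writing $e$ for a candidate English sentence and $p$ for the Piemanese input (this notation is an assumption), and then moving to log space:

$$
\begin{aligned}
\arg\max_{e} P(e \mid p) &= \arg\max_{e} \frac{P(p \mid e)\,P(e)}{P(p)} \\
&= \arg\max_{e} P(p \mid e)\,P(e) \\
&= \arg\max_{e} \left[ \log P(p \mid e) + \log P(e) \right]
\end{aligned}
$$

The second line holds because $P(p)$ is constant across candidates, and the third because the logarithm is monotonically increasing. The additive form is what allows the two log scores to be summed during decoding.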
Then, we use a decoding algorithm (since it is too expensive to go through all possible English sentences) to combine the two models together.
Normally, a translation model would consist of a set of parameters that is trained using an optimization algorithm on a parallel corpus, but since there is no Piemanese-English parallel corpus, we can't actually train our model in the traditional sense. Instead, we use an algorithmic solution for the translation model:
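$$\log P(p \mid e) = -\big(\alpha \cdot \mathrm{PhonemeDistance}(p, e) + \beta \cdot \mathrm{GraphemeDistance}(p, e)\big)$$

where $p$ is a Piemanese word, $e$ is a candidate English word, $\alpha$ and $\beta$ are coefficients, PhonemeDistance is a phonetic-feature-weighted Levenshtein distance (Mortensen et al., 2016) between the pronunciations of $p$ and $e$, and GraphemeDistance is a grapheme-based Levenshtein distance between $p$ and $e$ that I defined here.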
Essentially, English words that are both phonetically and graphemically similar to the Piemanese word (smaller distance) receive higher probabilities than those that are not (greater distance).
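A minimal sketch of such a distance-based score is shown below, assuming the score is the negative weighted sum of the two distances. Everything here is a stand-in: `crude_sound_key` is a hypothetical, very rough substitute for a real pronunciation, plain Levenshtein replaces both the phonetic-feature-weighted distance and the custom grapheme distance, and the default weights are arbitrary.

```python
def levenshtein(a, b) -> int:
    """Classic edit distance with unit costs."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (x != y)))
        previous = current
    return previous[-1]


VOWELS = set("aeiou")


def crude_sound_key(word: str) -> str:
    """Hypothetical pronunciation stand-in: vowels become 'V', repeats collapse."""
    key = []
    for ch in word.lower():
        symbol = "V" if ch in VOWELS else ch
        if not key or key[-1] != symbol:
            key.append(symbol)
    return "".join(key)


def translation_log_score(piemanese: str, english: str,
                          alpha: float = 1.0, beta: float = 0.5) -> float:
    """Higher (closer to 0) means the two words sound and look more alike."""
    phoneme_distance = levenshtein(crude_sound_key(piemanese),
                                   crude_sound_key(english))
    grapheme_distance = levenshtein(piemanese, english)
    return -(alpha * phoneme_distance + beta * grapheme_distance)


for candidate in ["bit", "best", "beat"]:
    print(candidate, translation_log_score("beet", candidate))
```

With these stand-ins, "bit" scores higher than "best" for the input "beet", but "beat" scores higher still, which is one reason sentence context is also needed.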
To catch the exceptions, we also use a manually written Piemanese to English replacement dictionary before running it through the other components of the pipeline. This could also be viewed as an extension of the translation model.
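The replacement step can be sketched as a plain dictionary lookup applied to each token before anything else runs. The entries below are taken from the example sentences earlier in this README; the project's actual dictionary and the name `apply_replacements` are assumptions.

```python
# Hypothetical entries, based on the examples in this README.
REPLACEMENTS = {
    "nani": "what",
    "u": "you",
    "b4": "before",
}


def apply_replacements(tokens, replacements=REPLACEMENTS):
    """Swap tokens that have a fixed translation; leave the rest untouched."""
    return [replacements.get(token, token) for token in tokens]


print(apply_replacements("nani u wan".split()))
```

Tokens without an entry, such as the ambiguous "wan", pass through unchanged and are left for the later stages of the pipeline.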
We train a trigram language model with Laplace smoothing (using NLTK modules) on the TwitchChat corpus.
Since we expect this translation bot to be used in a casual Discord chat, the best representation of English is not formal English but the casual English seen in live chat.
The language model will determine the highest probability word by taking into account the context of the sentence (previous two words for a trigram model). This will help resolve ambiguous situations where a Piemanese word may have multiple valid English translations.
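To show what a trigram model with Laplace (add-one) smoothing computes, here is a dependency-free sketch. The project itself uses NLTK modules, so the class `LaplaceTrigramLM`, its methods, and the three-sentence toy corpus are illustrative assumptions rather than the actual implementation.

```python
import math
from collections import Counter


class LaplaceTrigramLM:
    """Toy trigram language model with add-one smoothing."""

    def __init__(self, sentences):
        self.trigram_counts = Counter()
        self.context_counts = Counter()
        self.vocab = set()
        for tokens in sentences:
            padded = ["<s>", "<s>"] + list(tokens) + ["</s>"]
            self.vocab.update(padded)
            for i in range(2, len(padded)):
                context = (padded[i - 2], padded[i - 1])
                self.trigram_counts[context + (padded[i],)] += 1
                self.context_counts[context] += 1

    def logprob(self, word, previous_words):
        """log P(word | last two previous words), padded at sentence start."""
        context = tuple((["<s>", "<s>"] + list(previous_words))[-2:])
        numerator = self.trigram_counts[context + (word,)] + 1
        denominator = self.context_counts[context] + len(self.vocab)
        return math.log(numerator / denominator)


# Toy corpus, for illustration only.
lm = LaplaceTrigramLM([
    ["when", "you", "come"],
    ["what", "you", "want"],
    ["i", "want", "it"],
])
print(lm.logprob("want", ["what", "you"]), lm.logprob("when", ["what", "you"]))
```

After "what you", the model prefers "want" over "when", while at the start of a sentence it prefers "when". This is the kind of contextual signal that separates the two readings of "wan".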
We use a greedy decoding algorithm. In our case, Piemanese is simple enough that the words are generally aligned one-to-one with regular English, so beam search decoding is not necessary.
For each word, we add the translation model log score to the language model log score for all English words given the Piemanese word, and pick the one with the highest combined log score as our best translation.
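The greedy procedure can be sketched as follows. To keep the example self-contained, both models are replaced by small lookup tables of made-up log scores (`TM_SCORES`, `LM_SCORES`, and the `UNSEEN` fallback are all hypothetical); in the real pipeline these values would come from the translation model and the trigram language model.

```python
# Hypothetical translation model log scores: Piemanese word -> {English: score}.
TM_SCORES = {
    "wan": {"when": -1.0, "want": -1.0},
    "u": {"you": -0.5},
    "nani": {"what": 0.0},
}

# Hypothetical language model log scores: (two-word context, word) -> score.
LM_SCORES = {
    (("<s>", "<s>"), "when"): -1.0,
    (("<s>", "<s>"), "want"): -3.0,
    (("what", "you"), "want"): -1.0,
    (("what", "you"), "when"): -3.0,
}
UNSEEN = -5.0  # made-up fallback score for unseen (context, word) pairs


def greedy_decode(tokens, tm=TM_SCORES, lm=LM_SCORES):
    """Translate left to right, committing to the best word at each step."""
    output = []
    for token in tokens:
        context = tuple((["<s>", "<s>"] + output)[-2:])
        # Tokens with no candidates are passed through unchanged.
        candidates = tm.get(token, {token: 0.0})
        best = max(candidates,
                   key=lambda e: candidates[e] + lm.get((context, e), UNSEEN))
        output.append(best)
    return output


print(" ".join(greedy_decode("wan u come".split())))
print(" ".join(greedy_decode("nani u wan".split())))
```

Because each choice is final, the decoder only ever scores one candidate list per input word, which is what keeps it cheap compared with beam search. The ambiguous "wan" resolves to "when" at the start of a sentence and to "want" after "what you", since the translation scores tie and the language model breaks the tie.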