Data Science Capstone Project - Text Prediction App
========================================================
author: Codrin Kruijne
date: 17/06/2018
autosize: true

Text prediction challenge
========================================================
* A simple text prediction web application was requested
* Text from Twitter, news sites and blogs was provided
* A language model was trained on observed word combinations (ngrams)
* The model was evaluated on a separate testing dataset
* A Shiny web app was created that takes text input, applies the model and suggests a probable next word
* Model performance was calculated; higher-order ngram models predict better but strain memory and speed
* The model could be improved with better preprocessing and more sophisticated handling of rare or unseen occurrences
* Managing memory use and processing time was a challenge with these large amounts of training data; more thorough code profiling is needed

Predicting the next word; creating a language model
========================================================
* Language modeling was done by analysing ngram frequency
* Ngrams are sequences of adjacent words
* The provided texts were preprocessed and arranged as ngrams
* Each ngram was split into its base word(s) and a following word
* For each ngram base found, the frequency of each following word was counted
* Given the base and following-word counts, Maximum Likelihood Estimates (MLE) for the following words were calculated
* Ordered lookup tables with base, following word and MLE were created for the Shiny app (see the sketch below)
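As a rough illustration (not the exact code used in the app), the sketch below uses data.table and a handful of toy trigrams to show how ngrams can be split into a base and a following word, counted, and turned into an MLE-ordered lookup table; the data and column names are hypothetical.

```{r, eval=FALSE}
library(data.table)

# Toy trigrams standing in for the ngrams extracted from the real corpus
trigrams <- data.table(ngram = c("at_the_end", "at_the_beach", "at_the_end",
                                 "in_the_end", "on_the_beach"))

# Split each ngram into its base (all but the last word) and the following word
trigrams[, c("base", "following") := {
  parts <- strsplit(ngram, "_", fixed = TRUE)
  list(sapply(parts, function(p) paste(head(p, -1), collapse = "_")),
       sapply(parts, function(p) tail(p, 1)))
}]

# Count each (base, following) pair and compute the Maximum Likelihood Estimate:
# MLE = count(base, following) / count(base)
lookup <- trigrams[, .(count = .N), by = .(base, following)]
lookup[, MLE := count / sum(count), by = base]

# Order by base and descending MLE so the app can take the top rows per base
setorder(lookup, base, -MLE)
```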

Language model creation: Preprocessing and Training
========================================================
* Preprocessing
+ To train the model, special cases in the raw text had to be handled first
+ Contracted words, abbreviations, numbers and some symbols were replaced with their written-out word forms
+ Dates, measurements, smileys, unicode characters and underscores were removed, as well as punctuation, Twitter abbreviations, superfluous whitespace, etc.
+ All text was then tokenized, first into sentences and then into separate words.
+ Bigrams, trigrams, fourgrams and fivegrams were extracted from these token sequences within sentences.
+ Ngrams observed only once were removed (pruning).
+ Basic add-one (Laplace) smoothing was applied to shift some probability mass towards less frequently observed cases (see the sketch below).
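The sketch below shows a minimal version of this pipeline, assuming the quanteda package and a tiny made-up text sample; the cleaning in the real model is more extensive, and the smoothing shown here is simplified to a global add-one just to illustrate the idea.

```{r, eval=FALSE}
library(quanteda)

# Tiny made-up sample standing in for the Twitter/news/blog texts
docs <- c("I went to the beach yesterday.",
          "The beach was quiet and warm.")

# Reshape to sentences so that ngrams never cross sentence boundaries
sents <- corpus_reshape(corpus(docs), to = "sentences")

# Tokenize and drop punctuation, numbers, symbols and URLs
toks <- tokens(sents, remove_punct = TRUE, remove_numbers = TRUE,
               remove_symbols = TRUE, remove_url = TRUE)
toks <- tokens_tolower(toks)

# Extract bigrams up to fivegrams
ngrams <- tokens_ngrams(toks, n = 2:5, concatenator = "_")

# Count the ngrams and prune those observed only once
counts <- colSums(dfm(ngrams))
counts <- counts[counts > 1]

# Add-one (Laplace) idea: add 1 to every count and the vocabulary size V to the
# denominator; in the actual model this is applied per ngram base
V <- length(counts)
smoothed <- (counts + 1) / (sum(counts) + V)
```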

Language model performance; Perplexity and Accuracy
========================================================
* Model performance
+ Perplexity expresses how well a model can predict a test set; lower is better (a small sketch of the calculation appears at the end of this slide)
```{r echo=FALSE, message=FALSE, warning=FALSE}
test_results <- readRDS("WordPredictor/data/model_perplexities.rds")
knitr::kable(test_results, caption = "Language model performance: Perplexity on test set")
```
+ Accuracy measurements for first, second and third word prediction are in progress, but the computation still takes too long
* Model improvement
To improve the effectiveness of the language model we could:
+ Improve preprocessing by handling more special cases, such as other languages and measurements
+ Use more sophisticated smoothing, such as Good-Turing or Kneser-Ney
+ Add a background language model to deal with words missing from the training text
+ Create separate genre models for different types of text: people probably write differently on Twitter than on blogs
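For reference, a minimal sketch of how perplexity can be computed once the model has assigned a probability to every word of a held-out test sequence; the probability vector below is purely hypothetical.

```{r, eval=FALSE}
# Hypothetical per-word probabilities assigned by the model to a test sequence
word_probs <- c(0.12, 0.05, 0.30, 0.08, 0.02)

# Perplexity is the exponential of the average negative log probability;
# a lower value means the test set surprises the model less
perplexity <- exp(-mean(log(word_probs)))
```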

Shiny web application and further development
========================================================
* How does it work?
+ The Shiny web app lets users enter some text
+ The input text is cleaned of contractions, numbers, abbreviations, etc. (in the same way as during model training and testing)
+ The app then parses the text and extracts the final words (at most four) as input
+ The app looks up this input in the lookup tables of the language models
+ If no result is found in a higher-order table (e.g. fivegram), it "backs off" to a lower-order model (fourgram, discounted) until a following word is found (see the sketch at the end of this slide).
+ The app returns a prediction: the following word with the highest likelihood (MLE) given the input text
Try out the app here: <https://codrin.shinyapps.io/WordPredictor/>!
* Further app development
To improve code efficiency we could:
+ Do more elaborate code profiling
+ Experiment with faster lookup strategies, e.g. by word instead of ngram.
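As a closing illustration, here is a minimal sketch of the back-off lookup described above, assuming a list of lookup tables (fivegram down to bigram) shaped like the one built earlier; the function name, list and columns are hypothetical, not the app's actual code.

```{r, eval=FALSE}
library(data.table)

# `lookup_tables` is assumed to hold data.tables ordered fivegram ... bigram,
# each with columns base, following and MLE
predict_next <- function(input_words, lookup_tables) {
  for (order in seq_along(lookup_tables)) {
    tbl    <- lookup_tables[[order]]
    n_base <- 5 - order                      # fivegram uses a 4-word base, etc.
    if (length(input_words) < n_base) next   # not enough input for this order
    base_key <- paste(tail(input_words, n_base), collapse = "_")
    hits <- tbl[base == base_key]
    if (nrow(hits) > 0) {
      return(hits[which.max(MLE), following])  # best candidate at this order
    }
    # otherwise "back off" to the next lower-order table
  }
  NA_character_                              # no prediction found in any table
}

# Usage sketch with already-cleaned input tokens
# predict_next(c("at", "the", "end", "of"), lookup_tables)
```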