Data Science Capstone Project - Text Prediction App
========================================================
author: Codrin Kruijne
date: 17/06/2018
autosize: true

Text prediction challenge
========================================================
* A simple text prediction web application was requested
* Text from Twitter, news sites and blogs was provided
* A language model was trained on observed word combinations (ngrams)
* The model was evaluated on a separate testing dataset
* A Shiny web app was created that takes text input, applies the model and suggests a probable next word
* Model performance was calculated; higher-order ngram models predict better but strain memory and speed
* The model could be improved with better preprocessing and more sophisticated handling of rare or unseen occurrences
* Managing memory use and processing time was a challenge with these large amounts of training data; more thorough code profiling is needed

Predicting the next word; creating a language model
========================================================
* Language modeling was done by analysing ngram frequency
* Ngrams are sequences of adjacent words
* The provided texts were preprocessed and arranged as ngrams
* Each ngram was split into its base word(s) and a following word
* For each ngram base found, the frequency of each following word was counted
* Given the base and following-word counts, Maximum Likelihood Estimates (MLE) for the following words were calculated
* Ordered lookup tables with base, following word and MLE were created for the Shiny app (see the sketch below)
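As a rough illustration (not the exact code used in the app), the sketch below uses data.table and a handful of toy trigrams to show how ngrams can be split into a base and a following word, counted, and turned into an MLE-ordered lookup table; the data and column names are hypothetical.

```{r, eval=FALSE}
library(data.table)

# Toy trigrams standing in for the ngrams extracted from the real corpus
trigrams <- data.table(ngram = c("at_the_end", "at_the_beach", "at_the_end",
                                 "in_the_end", "on_the_beach"))

# Split each ngram into its base (all but the last word) and the following word
trigrams[, c("base", "following") := {
  parts <- strsplit(ngram, "_", fixed = TRUE)
  list(sapply(parts, function(p) paste(head(p, -1), collapse = "_")),
       sapply(parts, function(p) tail(p, 1)))
}]

# Count each (base, following) pair and compute the Maximum Likelihood Estimate:
# MLE = count(base, following) / count(base)
lookup <- trigrams[, .(count = .N), by = .(base, following)]
lookup[, MLE := count / sum(count), by = base]

# Order by base and descending MLE so the app can take the top rows per base
setorder(lookup, base, -MLE)
```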

Language model creation: Preprocessing and Training
========================================================
* Preprocessing
+ To train the model, special cases in the raw text had to be handled first
+ Contracted words, abbreviations, numbers and some symbols were replaced with their written-out word forms
+ Dates, measurements, smileys, unicode characters and underscores were removed, as well as punctuation, Twitter abbreviations, superfluous whitespace, etc.
+ All text was then tokenized, first into sentences and then into separate words.
+ Bigrams, trigrams, fourgrams and fivegrams were extracted from these token sequences within sentences.
+ Ngrams observed only once were removed (pruning).
+ Basic add-one (Laplace) smoothing was applied to shift some probability mass towards less frequently observed cases (see the sketch below).
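The sketch below shows a minimal version of this pipeline, assuming the quanteda package and a tiny made-up text sample; the cleaning in the real model is more extensive, and the smoothing shown here is simplified to a global add-one just to illustrate the idea.

```{r, eval=FALSE}
library(quanteda)

# Tiny made-up sample standing in for the Twitter/news/blog texts
docs <- c("I went to the beach yesterday.",
          "The beach was quiet and warm.")

# Reshape to sentences so that ngrams never cross sentence boundaries
sents <- corpus_reshape(corpus(docs), to = "sentences")

# Tokenize and drop punctuation, numbers, symbols and URLs
toks <- tokens(sents, remove_punct = TRUE, remove_numbers = TRUE,
               remove_symbols = TRUE, remove_url = TRUE)
toks <- tokens_tolower(toks)

# Extract bigrams up to fivegrams
ngrams <- tokens_ngrams(toks, n = 2:5, concatenator = "_")

# Count the ngrams and prune those observed only once
counts <- colSums(dfm(ngrams))
counts <- counts[counts > 1]

# Add-one (Laplace) idea: add 1 to every count and the vocabulary size V to the
# denominator; in the actual model this is applied per ngram base
V <- length(counts)
smoothed <- (counts + 1) / (sum(counts) + V)
```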

Language model performance; Perplexity and Accuracy
========================================================
* Model performance
+ Perplexity expresses how well a model can predict a test set; lower is better (a small sketch of the calculation appears at the end of this slide)
```{r echo=FALSE, message=FALSE, warning=FALSE}
test_results <- readRDS("WordPredictor/data/model_perplexities.rds")
knitr::kable(test_results, caption = "Language model performance: Perplexity on test set")
```
+ Accuracy measurements for first, second and third word prediction are in progress, but the computation still takes too long
* Model improvement
To improve the effectiveness of the language model we could:
+ Improve preprocessing by handling more special cases, such as other languages and measurements
+ Use more sophisticated smoothing, such as Good-Turing or Kneser-Ney
+ Add a background language model to deal with words missing from the training text
+ Create separate genre models for different types of text: people probably write differently on Twitter than on blogs
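For reference, a minimal sketch of how perplexity can be computed once the model has assigned a probability to every word of a held-out test sequence; the probability vector below is purely hypothetical.

```{r, eval=FALSE}
# Hypothetical per-word probabilities assigned by the model to a test sequence
word_probs <- c(0.12, 0.05, 0.30, 0.08, 0.02)

# Perplexity is the exponential of the average negative log probability;
# a lower value means the test set surprises the model less
perplexity <- exp(-mean(log(word_probs)))
```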

Shiny web application and further development
========================================================
* How does it work?
+ The Shiny web app lets users enter some text
+ The input text is cleaned of contractions, numbers, abbreviations, etc. (in the same way as during model training and testing)
+ The app then parses the text and extracts the final words (at most four) as input
+ The app looks up this input in the lookup tables of the language models
+ If no result is found in a higher-order table (e.g. fivegram), it "backs off" to a lower-order model (fourgram, discounted) until a following word is found (see the sketch at the end of this slide).
+ The app returns a prediction: the following word with the highest likelihood (MLE) given the input text
Try out the app here: <https://codrin.shinyapps.io/WordPredictor/>!
* Further app development
To improve code efficiency we could:
+ Do more elaborate code profiling
+ Experiment with faster lookup strategies, e.g. by word instead of ngram.
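As a closing illustration, here is a minimal sketch of the back-off lookup described above, assuming a list of lookup tables (fivegram down to bigram) shaped like the one built earlier; the function name, list and columns are hypothetical, not the app's actual code.

```{r, eval=FALSE}
library(data.table)

# `lookup_tables` is assumed to hold data.tables ordered fivegram ... bigram,
# each with columns base, following and MLE
predict_next <- function(input_words, lookup_tables) {
  for (order in seq_along(lookup_tables)) {
    tbl    <- lookup_tables[[order]]
    n_base <- 5 - order                      # fivegram uses a 4-word base, etc.
    if (length(input_words) < n_base) next   # not enough input for this order
    base_key <- paste(tail(input_words, n_base), collapse = "_")
    hits <- tbl[base == base_key]
    if (nrow(hits) > 0) {
      return(hits[which.max(MLE), following])  # best candidate at this order
    }
    # otherwise "back off" to the next lower-order table
  }
  NA_character_                              # no prediction found in any table
}

# Usage sketch with already-cleaned input tokens
# predict_next(c("at", "the", "end", "of"), lookup_tables)
```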