"""
Author Information (16 Nov. 2020):
Supervisor: Prof. Dr. Goran Glovas
Author: Shenghan ZHANG
E-mail: shezhang@mail.uni-mannheim.de
"""
There are some general library requirements for the project and some which are specific to individual methods. The general requirements are as follows.
numpy
pandas
bs4
urllib
re
tqdm
pickle
gensim
nltk
sklearn
spellchecker
operator
tensorflow
keras
matplotlib
keras_self_attention
As the preprocessing shows, neither word2vec nor GloVe can convert emoji or emoticons into vectors. In sentiment analysis, however, emoji and emoticons are strong expressions of emotion. I therefore searched for authoritative explanations of emoji and emoticons and found 2 links that explain their exact meanings:
https://en.wikipedia.org/wiki/List_of_emoticons
https://unicode.org/emoji/charts/emoji-list.html
To crawl all the information provided by these 2 links, I built 2 py files, namely `web_crawl_emoji.py` and `web_crawl_emoticon.py`.
`web_crawl_emoji.py` is for the link https://unicode.org/emoji/charts/emoji-list.html, while `web_crawl_emoticon.py` is for https://en.wikipedia.org/wiki/List_of_emoticons.
Running `web_crawl_emoji.py` produces `emoji_meaning.csv`, a dataframe that lists the exact meaning of each emoji.
Running `web_crawl_emoticon.py` produces `emo_meaning.csv`, a dataframe that lists the exact meaning of each emoticon.
`web_crawl_merge.py` merges the dataframes built by `web_crawl_emoji.py` and `web_crawl_emoticon.py` and produces `emo_collection.csv`.
`emo_collection_glove.csv` is built by adapting the meaning text to the requirements of GloVe as closely as possible.
`emo_collection_word2vec.csv` is built by adapting the meaning text to the requirements of Word2Vec as closely as possible.
Since training vector embeddings is not the topic of this project, I simply use pretrained word2vec and GloVe resources to transform words into vectors.
For word2vec, I use the lexicon `GoogleNews-vectors-negative300.bin.gz` provided by "Google Archive" at the link below:
https://code.google.com/archive/p/word2vec/
Please download it and save it in the `DataSet` archive.
For GloVe, I use the lexicon `glove.840B.300d.txt` provided by Jeffrey Pennington, Richard Socher, and Christopher D. Manning at the link below:
https://nlp.stanford.edu/projects/glove/
Please download it and save it in the `DataSet` archive.
For ease of use, I built `process_glove_wordsonly.py`, which processes the lexicon `glove.840B.300d.txt`, gathers its distinct vocabulary, and saves it as a list in `glove.840B.300d_words.pkl`.
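The vocabulary extraction can be sketched roughly as follows. This is a minimal illustration, not the content of `process_glove_wordsonly.py`: the function name `extract_glove_words` and its parameters are assumptions. It relies only on the GloVe text format, where each line holds a token followed by its vector components.

```python
import pickle


def extract_glove_words(glove_path, out_path, dim=300):
    """Collect the distinct tokens of a GloVe text file and pickle them as a list.

    Hypothetical helper: name and signature are illustrative only.
    """
    words = []
    seen = set()
    with open(glove_path, encoding="utf-8") as fh:
        for line in fh:
            # Each line is "<token> v1 ... v_dim". Splitting from the right
            # keeps tokens that themselves contain spaces in one piece.
            parts = line.rstrip("\n").rsplit(" ", dim)
            if len(parts) != dim + 1:
                continue  # skip malformed lines
            word = parts[0]
            if word not in seen:
                seen.add(word)
                words.append(word)
    with open(out_path, "wb") as fh:
        pickle.dump(words, fh)
    return words
```

Keeping only the tokens makes later coverage checks cheap, because the full 300-dimensional vectors do not have to be loaded just to test whether a word is known.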
Here I explain the function of each code file, following the process of sentiment analysis prediction.
The code file `Preprocessing_TrainATest_TFIDF.ipynb` preprocesses the train data into TF-IDF vectors and applies the same preprocessing procedure, derived from the train data, to the test data. The procedure includes lowercasing the text, separating adhered combinations such as "you?" into "you ?", removing emoji/emoticons, removing stopwords, removing punctuation, correcting misspellings, and stemming.
All these procedures were judged by the number of distinct words in the vocabulary of all train texts. E.g. "crazzzzy" is the same word as "crazy", but without misspelling correction they are counted as 2 distinct words. So the smaller the distinct vocabulary of the train data, the better the preprocessing. This matters especially for TF-IDF, because the vocabulary size is also the dimension of each sentence vector, so a smaller vocabulary eases model training.
The original vocabulary has 33,024 distinct words; after preprocessing the number drops to 8,399, which is a large improvement. I also found that different coefficient sets affect the results. Therefore, I built 5 versions of the baseline models with the classifiers Logistic Regression and SVM. In the table below, co1, co2 and co3 are the coefficients for the 1st, 2nd and 3rd turn of text.
file name | data version | co1 | co2 | co3 |
---|---|---|---|---|
LogisticRegression_baselinemodel_overall.ipynb | overall | 1 | 1 | 1 |
LogisticRegression_baselinemodel_delete2ndturn.ipynb | no2 | 1 | 0 | 1 |
LogisticRegression_baselinemodel_double1st3rdturn.ipynb | double13 | 2 | 1 | 2 |
LogisticRegression_baselinemodel_3times3rdturn.ipynb | time3 | 2 | 1 | 3 |
LogisticRegression_baselinemodel_del2big3.ipynb | del2big3 | 1 | 0 | 2 |
SVM_baselinemodel_overall.ipynb | overall | 1 | 1 | 1 |
SVM_baselinemodel_delete2ndturn.ipynb | no2 | 1 | 0 | 1 |
SVM_baselinemodel_double1st3rdturn.ipynb | double13 | 2 | 1 | 2 |
SVM_baselinemodel_3times3rdturn.ipynb | time3 | 2 | 1 | 3 |
SVM_baselinemodel_del2big3.ipynb | del2big3 | 1 | 0 | 2 |
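The README does not state how the coefficients enter the vectors. One plausible reading, shown here purely as an assumption, is that each turn is vectorised separately and the row vector is the weighted sum `co1*v1 + co2*v2 + co3*v3`. The names `COEFFICIENT_SETS` and `combine_turns` are hypothetical; the values are copied from the table above.

```python
# Coefficient sets from the table above, keyed by data version.
COEFFICIENT_SETS = {
    "overall": (1, 1, 1),
    "no2": (1, 0, 1),
    "double13": (2, 1, 2),
    "time3": (2, 1, 3),
    "del2big3": (1, 0, 2),
}


def combine_turns(turn_vectors, coefficients):
    """Weighted sum of per-turn vectors (an assumed combination rule)."""
    if len(turn_vectors) != len(coefficients):
        raise ValueError("need one coefficient per turn")
    combined = [0.0] * len(turn_vectors[0])
    for vector, coefficient in zip(turn_vectors, coefficients):
        for i, value in enumerate(vector):
            combined[i] += coefficient * value
    return combined
```

Under this reading, a coefficient of 0 removes a turn entirely, which matches the "delete2ndturn" file names.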
The coefficient sets worked well on the baseline models, but performance only improved when the coefficient co2 was lowered. Since TF-IDF can't preserve the order of words in text, I carried the coefficient settings over into the formal model building. I built 4 files, `LogisticRegression_compareFocus_3turn_word2vec.ipynb`, `LogisticRegression_compareFocus_3turn_glove.ipynb`, `SVM_compareFocus_3turn_glove.ipynb` and `SVM_compareFocus_3turn_word2vec.ipynb`, to try different coefficient sets in formal model building with word2vec and GloVe word embeddings.
Because I'm using pretrained word embedding lexicons, I can't know in advance which word forms the imported lexicon can represent as vectors. That's why I built 6 different `.ipynb` files for preprocessing data. Their names are listed below:
Preprocessing_EmojiEmoticon_GloVe.ipynb
Preprocessing_EmojiEmoticon_word2vec.ipynb
Preprocessing_TrainData_GloVe.ipynb
Preprocessing_TrainData_Word2vec.ipynb
Preprocessing_TestData_GloVe.ipynb
Preprocessing_TestData_Word2vec.ipynb
As their names show, I preprocess separately for each word embedding, though the procedures inside the files are the same. I explain the procedures below:
Since the labels of the original data consist of 4 textual categories, namely "happy, sad, angry, others", I need to convert them into categorical numbers for classification. The mapping is `{"others":0,"happy":1,"sad":2,"angry":3}`, and the results are saved as `train_y.pkl` and `test_y.pkl` for the train dataset and test dataset respectively.
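A minimal sketch of this label conversion, using the mapping given above. The function name `encode_labels` and the optional `out_path` argument are illustrative assumptions, not names taken from the notebooks.

```python
import pickle

# Mapping stated in the README.
LABEL_TO_ID = {"others": 0, "happy": 1, "sad": 2, "angry": 3}


def encode_labels(labels, out_path=None):
    """Map textual labels to integers and optionally pickle the result."""
    ids = [LABEL_TO_ID[label] for label in labels]
    if out_path is not None:
        with open(out_path, "wb") as fh:
            pickle.dump(ids, fh)
    return ids
```

An unknown label raises a `KeyError`, which surfaces labelling mistakes in the data early instead of silently assigning a class.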
Though the train and test datasets provide 3 turns of text per row, I only need one text for each row. So in this part I combine the 3 turns' texts into one. Besides, since I need to extract emoji/emoticons later, and their positions in the text also matter, I still keep the 3 turns' texts.
In this part, I extract emoji/emoticons from each text based on `emo_collection_glove.csv` or `emo_collection_word2vec.csv`, depending on the word embedding in use. These 2 dataframes are themselves preprocessed for GloVe and Word2Vec, because I need to substitute each emoji/emoticon with its meaning and preprocess that meaning so it can be turned into vectors.
- e.g. "but I'm in love with it 😁😁" -> "but I'm in love with it " & [2,😁]
- explanation: 2 means this emoji/emoticon occurs 2 times in the text.
The extracted emoji/emoticons are later stored as `.pkl` files with names such as "extraction_emo_3_test_glove.pkl".
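The extraction step can be sketched as below. This is a simplified illustration under stated assumptions: `extract_emo` is a hypothetical name, `known_emo` stands in for the symbols listed in the emo collection files, and the sketch records only counts, not the positions that the notebooks also keep.

```python
def extract_emo(text, known_emo):
    """Remove known emoji/emoticons from text and count their occurrences.

    Returns (cleaned_text, [[count, symbol], ...]).
    """
    found = []
    # Try longer symbols first so a long emoticon is not consumed
    # piecewise by a shorter one that it contains.
    for symbol in sorted(known_emo, key=lambda s: (-len(s), s)):
        count = text.count(symbol)
        if count:
            found.append([count, symbol])
            text = text.replace(symbol, "")
    return text, found
```

For example, `extract_emo("but I'm in love with it 😁😁", {"😁"})` yields the cleaned text `"but I'm in love with it "` together with `[[2, "😁"]]`, matching the example above.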
In this part, I lowercase the words in the text.
- e.g. "Just for time pass" -> "just for time pass"
Some emoji/emoticons may not be covered by the crawled emoji/emoticon explanation dataframe, so I need to clean them out here. Also, words stuck together with other characters make it hard for the word embedding to match them exactly.
- e.g. "yess i am crazyyy😂😂" -> "yess i am crazyyy 😂 😂" -> "yess i am crazyyy "
- e.g. "when did i?" -> "when did i ?" (here "i?" should be separated into "i ?")
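The punctuation half of this step can be sketched with a regular expression. The character set and the name `separate_punctuation` are assumptions chosen for illustration; the notebooks may treat more characters, and the removal of uncovered emoji is not shown.

```python
import re

# Assumed set of punctuation marks to detach from neighbouring words.
_PUNCT = re.compile(r"""([!?.,;:'"()])""")


def separate_punctuation(text):
    """Put spaces around punctuation, then collapse repeated whitespace."""
    spaced = _PUNCT.sub(r" \1 ", text)
    return re.sub(r"\s+", " ", spaced).strip()
```

Note that the apostrophe is detached as well, which is why contractions such as "isn't" appear as "isn ' t" afterwards.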
Some words are hard for word embeddings to represent but necessary to keep, such as "don't" and "isn't", because for example "isn't happy" is the opposite of "is happy". Besides, the previous step separates punctuation from words with spaces, so "isn't" has become "isn ' t", which also needs correction.
- e.g. "he isn ' t happy" -> "he is not happy"
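A small sketch of how such negations might be restored. The irregular-form table and the name `expand_negations` are hypothetical and deliberately incomplete; they only illustrate the idea.

```python
import re

# Hypothetical, deliberately small table of irregular contractions.
_IRREGULAR = {
    "can ' t": "can not",
    "won ' t": "will not",
    "shan ' t": "shall not",
}


def expand_negations(text):
    """Turn space-separated "n ' t" contractions back into "<verb> not"."""
    for broken, full in _IRREGULAR.items():
        text = re.sub(r"\b" + re.escape(broken) + r"\b", full, text)
    # Regular case: "isn ' t" -> "is not", "don ' t" -> "do not".
    return re.sub(r"\b(\w+)n ' t\b", r"\1 not", text)
```

The irregular forms are handled first, because the generic rule would otherwise turn "won ' t" into "wo not".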
While checking the train data, I found many cases like these: "hmmm i can talk no w" or "this is wel l known". So I added a function to correct them.
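One way such a correction could work is to merge two neighbouring tokens when their concatenation is a known word. The heuristic below is an assumption for illustration, not the function used in the notebooks; `rejoin_split_words` and `vocabulary` are hypothetical names.

```python
def rejoin_split_words(text, vocabulary):
    """Merge adjacent fragments such as "no w" or "wel l" into known words."""
    tokens = text.split()
    fixed = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            current, following = tokens[i], tokens[i + 1]
            joined = current + following
            # A stray single letter (other than the words "a" and "i")
            # or an unknown fragment suggests a wrongly split word.
            stray = len(following) == 1 and following not in {"a", "i"}
            if (current not in vocabulary or stray) and joined in vocabulary:
                fixed.append(joined)
                i += 2
                continue
        fixed.append(tokens[i])
        i += 1
    return " ".join(fixed)
```

The merge is only applied when the joined form is in the vocabulary, so ordinary word pairs are left untouched.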
Punctuation can't be converted into vectors by word2vec and is mostly useless in sentiment analysis. Therefore, I just delete it.
Since stopwords carry meaning in a sentence with a certain word order, I only remove those stopwords that can't be converted into vectors due to the limits of the word embedding.
At this point, I have the first version of the data: data without misspelling correction or translation of shorthand words, and also without emoji/emoticon values. I call them "train0" and "test0" for word2vec, "train3" and "test3" for GloVe.
Using `pyspellchecker`, I can correct words such as "crazzzy" and "amazzzzing". Some words can't be translated even by the spellchecker, so I translate them by hand.
At this point, I have the second version of the data: data with misspelling correction and translation of shorthand words, but without emoji/emoticon values. I call them "train1" and "test1" for word2vec, "train4" and "test4" for GloVe.
All the processing above is guided by the coverage of the data under word2vec or GloVe. This information is in the "check data format" part. I first extract the distinct vocabulary of all words in the data, and then check what percentage of the vocabulary can be converted into vectors by word2vec or GloVe.
After all this processing, the coverage of the data based on word2vec is:
In Embedding Index we have 88.91% coverage of distinct vocabulary
And we have 99.52% coverage of all text
The number of words which are not covered in word2vec resource is: 1249
And based on GloVe, the final result is below:
In Embedding Index we have 95.11% coverage of distinct vocabulary
And we have 99.85% coverage of all text
The number of words which are not covered in word2vec resource is: 551
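The two coverage figures reported above can be computed along these lines. This is a sketch assuming whitespace tokenisation and that "coverage of all text" counts every token occurrence; `coverage` and `embedding_vocab` are illustrative names.

```python
from collections import Counter


def coverage(texts, embedding_vocab):
    """Return (vocabulary coverage, token coverage, uncovered words)."""
    counts = Counter(word for text in texts for word in text.split())
    if not counts:
        return 0.0, 0.0, []
    covered = {word for word in counts if word in embedding_vocab}
    # Share of distinct words that have a vector.
    vocab_coverage = len(covered) / len(counts)
    # Share of all token occurrences that have a vector.
    text_coverage = sum(counts[w] for w in covered) / sum(counts.values())
    uncovered = sorted(set(counts) - covered)
    return vocab_coverage, text_coverage, uncovered
```

Token coverage is usually much higher than vocabulary coverage because the uncovered words tend to be rare, which is consistent with the figures above.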
After preprocessing the emoji/emoticons, I built `VectorBuilding_emo_glove.ipynb` and `VectorBuilding_emo_word2vec.ipynb` to convert emoji/emoticons into vectors based on the pre-trained word embedding resources.
In `VectorBuild_word2vec.ipynb` and `VectorBuild_glove.ipynb`, I converted the preprocessed data into vectors for further classification.
While building the vectors, I calculate each emoji/emoticon vector as the average of the vectors of the words in its meaning. E.g. `[2,"grin smile"]` -> 2 copies of the average of the vectors of "grin" and "smile". I then add these emoji/emoticon vectors back where the emoji/emoticons originally were. This gives the 3rd version of the data, called "train2" and "test2" for word2vec, "train5" and "test5" for GloVe.
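The averaging can be sketched as follows. Plain Python lists stand in for the embedding arrays here, and `emo_vectors` and `embedding` are hypothetical names; skipping meaning words without a vector is an assumption.

```python
def emo_vectors(count, meaning, embedding):
    """Return `count` copies of the mean vector of the words in `meaning`.

    `embedding` maps a word to its vector (a list of floats).
    """
    vectors = [embedding[w] for w in meaning.split() if w in embedding]
    if not vectors:
        return []  # no word of the meaning has a vector
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return [list(mean) for _ in range(count)]
```

With a toy embedding `{"grin": [1.0, 0.0], "smile": [0.0, 1.0]}`, the entry `[2, "grin smile"]` becomes two copies of `[0.5, 0.5]`.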
data name | data version | word embedding | characteristics |
---|---|---|---|
train_df_tfidf.csv | _baseline | tf-idf | vectors of train data with misspelling correction and stemming, but without emoji/emoticon |
test_df_tfidf.csv | _baseline | tf-idf | vectors of test data with misspelling correction and stemming, but without emoji/emoticon |
vec_train_a_no_cor_word2vec.pkl | 0 | word2vec | vectors of train data without misspelling correction or emoji/emoticon |
vec_train_a_no_emo_word2vec.pkl | 1 | word2vec | vectors of train data with misspelling correction but without emoji/emoticon vectors |
vec_train_a_emo_word2vec.pkl | 2 | word2vec | vectors of train data with emoji/emoticon and misspelling correction |
vec_train_a_no_cor_glove.pkl | 3 | GloVe | vectors of train data without misspelling correction or emoji/emoticon |
vec_train_a_no_emo_glove.pkl | 4 | GloVe | vectors of train data with misspelling correction but without emoji/emoticon vectors |
vec_train_a_emo_glove.pkl | 5 | GloVe | vectors of train data with emoji/emoticon and misspelling correction |
vec_test_a_no_cor_word2vec.pkl | 0 | word2vec | vectors of test data without misspelling correction or emoji/emoticon |
vec_test_a_no_emo_word2vec.pkl | 1 | word2vec | vectors of test data with misspelling correction but without emoji/emoticon vectors |
vec_test_a_emo_word2vec.pkl | 2 | word2vec | vectors of test data with emoji/emoticon and misspelling correction |
vec_test_a_no_cor_glove.pkl | 3 | GloVe | vectors of test data without misspelling correction or emoji/emoticon |
vec_test_a_no_emo_glove.pkl | 4 | GloVe | vectors of test data with misspelling correction but without emoji/emoticon vectors |
vec_test_a_emo_glove.pkl | 5 | GloVe | vectors of test data with emoji/emoticon and misspelling correction |
According to reference research, I built 10 models for training:
Logistic Regression
SVM
Kernel SVM
CNN
LSTM
GRU
BiLSTM
BiGRU
BiLSTM_self-attention
BiGRU_self-attention
Since they are much easier to explain in code, please check them directly in my code files.
While running these models, I also save the models, their predictions and scores in the archives `DataSet` and `Prediction`. If you don't have much time to run the models, you can load them directly in the code files to check the results. I tried to save all models in `DataSet` as well, but due to the upload limit I could only save the deep learning models, which you can also use directly in the code files.
For evaluation, I calculate accuracy, precision, recall and F1 score to compare the results of different data versions under the same model. I also built functions for the ROC curve, AUC, P-R curve and average precision for detailed comparison.
For deep learning models such as CNN, LSTM and GRU, I also plot learning curves to show how accuracy and loss change in each epoch.
All these curve pictures can also be found in the archive `Pictures`.
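For reference, the four scores can be written out in plain Python as below. This is an illustrative sketch rather than the code in the notebooks, which may well rely on `sklearn`; the name `classification_scores` and the per-class (unaveraged) output are assumptions.

```python
def classification_scores(y_true, y_pred, labels=(0, 1, 2, 3)):
    """Return accuracy and a {label: (precision, recall, f1)} dictionary."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    per_class = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Scores are defined as 0.0 when the denominator is empty.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        per_class[label] = (precision, recall, f1)
    return accuracy, per_class
```

Reporting the scores per class makes it visible when a model does well on the frequent "others" class but poorly on the emotion classes.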
To find out which data version outperforms the others under the same model, and which model outperforms the other models, I built the code file `Evaluation.ipynb`.
In the 2nd part, "Evaluation on accuracy, precision, f1 score", I built a dataframe containing all evaluation metrics of the different data versions under the different models, saved as `df_evaluation.xlsx`. I saved the dataframe as xlsx because it makes it easy to produce charts comparing each metric. You can check them in the xlsx file.
In the 3rd part, "Evaluation with ROC curve, AUC on same model", I built functions for gathering tpr, fpr, auc, precision, recall, etc. Using these data, I built charts showing the performance of different data versions under the same model, and charts showing the performance of the same data under different models.
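A compact sketch of how fpr, tpr and AUC can be gathered for one class against the rest. It is a plain-Python illustration under stated assumptions, not the notebook code: `roc_points` and `auc` are hypothetical names, labels are 0/1 for the chosen class, and scores are the predicted probabilities for that class.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists for binary labels and prediction scores."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need both positive and negative samples")
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a tied score group.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            fpr.append(fp / negatives)
            tpr.append(tp / positives)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
        for i in range(1, len(fpr))
    )
```

For the four-class task, the same two functions would be applied once per class, treating that class as positive and the remaining three as negative.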
Here I didn't use the P-R curve to demonstrate the performance of each model, since I consider AUC better for illustrating how good a model or a data version is.