Sentiment analysis is an NLP (Natural Language Processing) task: determining whether the sentiment of a text is positive or negative. In this case, we use tweets to determine whether the sentiment is positive, negative, neutral, or irrelevant.
I built this application using several tools, libraries, and frameworks. In particular, I used Google Colab as the notebook environment for building the model:
- Tensorflow
- pandas
- seaborn
- matplotlib
- sklearn
- numpy
- zipfile
- html
- FastAPI
- Go to the web directory in a terminal
- Run the application with this command: `uvicorn app:app`
- Open http://127.0.0.1:8000/ in your browser
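For orientation, here is a minimal hypothetical sketch of what an `app.py` served by that command could look like; the route handler below is a placeholder, not the project's actual code:

```python
from fastapi import FastAPI
from fastapi.responses import HTMLResponse

app = FastAPI()  # `app:app` in the uvicorn command points at this object

@app.get("/", response_class=HTMLResponse)
def index():
    # Placeholder page; the real app renders the sentiment analysis UI.
    return "<h1>Twitter Sentiment Analysis</h1>"
```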
The dataset can be downloaded from Kaggle - Twitter Sentiment Analysis
There are 2 CSV files in this zip file:
- twitter_training.csv: for training the model
- twitter_validation.csv: for validation data
In this notebook I only use twitter_training.csv
1. Show the first five records in the dataset
From that output we can see that the dataset has no column names, so we add a name for each column. Naming the columns helps us explore the dataset more easily.
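A sketch of that loading step, assuming the four column names referenced later in this README:

```python
import pandas as pd

# The CSV has no header row, so pass explicit column names.
column_names = ["tweet_id", "entity", "sentiment", "tweet_content"]
df = pd.read_csv("twitter_training.csv", header=None, names=column_names)

df.head()  # show the first five records
```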
2. Check the shape of the dataset: there are 74681 rows and 4 columns
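This can be confirmed with:

```python
df.shape  # (74681, 4)
```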
3. Check for missing values in the dataset
We found 686 missing values in the tweet_content column, so we need to handle them. In this case I removed those rows, leaving 73995 rows and 4 columns.
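A sketch of the check and removal with pandas:

```python
df.isnull().sum()   # shows 686 missing values in tweet_content

df = df.dropna()    # drop rows containing missing values
df.shape            # (73995, 4)
```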
4. Drop unnecessary columns
I dropped the tweet_id and entity columns because we do not need them.
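For example:

```python
df = df.drop(columns=["tweet_id", "entity"])
```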
5. Check the labels
I checked the label counts to guard against imbalanced data, and the classes appear balanced.
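For example, with value_counts():

```python
df["sentiment"].value_counts()  # roughly equal counts per class => balanced
```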
1. One Hot Encoding
The first preprocessing step is one-hot encoding the label data, because the labels are categorical rather than numerical. I used pd.get_dummies() to one-hot encode the label (sentiment) column.
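A sketch of that encoding step:

```python
# One column per sentiment class, with 0/1 indicators.
labels = pd.get_dummies(df["sentiment"])
labels.head()
```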
2. Change columns into numpy arrays
To process the dataset we need to convert each column into a numpy array, which makes tokenizing easier.
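A sketch of the conversion; the astype call is a small assumption added to keep the labels numeric:

```python
# Plain arrays are what the tokenizer and model work with.
tweets = df["tweet_content"].to_numpy()
labels = labels.to_numpy().astype("float32")  # get_dummies may return booleans
```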
3. Split data
Then I split the dataset into train_tweet, test_tweet, train_label, and test_label with a train size of 80%, a test size of 20%, and random_state = 42, and checked the resulting shapes.
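A sketch of the split with scikit-learn, using the parameters above:

```python
from sklearn.model_selection import train_test_split

train_tweet, test_tweet, train_label, test_label = train_test_split(
    tweets, labels, train_size=0.8, test_size=0.2, random_state=42
)
print(train_tweet.shape, test_tweet.shape, train_label.shape, test_label.shape)
```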
4. Tokenizer
After that I used a Tokenizer with num_words = 10000 to tokenize train_tweet, replacing out-of-vocabulary words with the `<OOV>` token. After fitting the tokenizer on train_tweet, I used tokenizer.texts_to_sequences() to convert the texts in train_tweet and test_tweet into sequences.
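A sketch of the tokenization step; the exact `<OOV>` token string is an assumption based on the usual Keras convention:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=10000, oov_token="<OOV>")
tokenizer.fit_on_texts(train_tweet)

# Convert each tweet into a sequence of integer word indices.
train_sequences = tokenizer.texts_to_sequences(train_tweet)
test_sequences = tokenizer.texts_to_sequences(test_tweet)
```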
5. Add padding
To handle the varying length of the sequences, I applied pad_sequences to train_sequences and test_sequences with maxlen = 150, padding = 'post' so the padding values are appended at the end of each sequence, and truncating = 'post' so longer sequences are cropped from the end.
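A sketch of the padding step with those parameters:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Pad (and truncate) at the end of each sequence so all are length 150.
train_padded = pad_sequences(train_sequences, maxlen=150,
                             padding="post", truncating="post")
test_padded = pad_sequences(test_sequences, maxlen=150,
                            padding="post", truncating="post")
```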
I built the model with TensorFlow using an Embedding layer and Bidirectional LSTM layers. For the Embedding layer I used input_dim = 10000, output_dim = 16, and input_length = 150.
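A sketch of an architecture matching that description; only the Embedding parameters come from the text, so the LSTM unit counts, Dense layer, optimizer, and loss are assumptions:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # Embedding parameters are the ones stated above.
    tf.keras.layers.Embedding(input_dim=10000, output_dim=16, input_length=150),
    # Bidirectional LSTM stack; the unit counts here are assumptions.
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation="relu"),
    # 4 output classes: Positive, Negative, Neutral, Irrelevant.
    tf.keras.layers.Dense(4, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
```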
I used a callback that stopped training at 9 epochs, reaching 92% training accuracy and 84% validation accuracy, with a training loss of 0.19 and a validation loss of 0.59.
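A sketch of how such a callback might look; the accuracy threshold, epoch budget, and stopping criterion are assumptions, since only the final metrics and the 9-epoch stop are stated above:

```python
# Hypothetical stopping criterion: end training once accuracy passes 92%.
class StopAtAccuracy(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        if logs and logs.get("accuracy", 0.0) >= 0.92:
            self.model.stop_training = True

history = model.fit(
    train_padded, train_label,
    validation_data=(test_padded, test_label),
    epochs=30,  # upper bound; the callback stopped training at epoch 9
    callbacks=[StopAtAccuracy()],
)
```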