Predicting Sentiment on Social Media Data Using Supervised Approaches

Authors: Fadhil, M. Irfan Handarbeni, Mehrdad Darraji

The code is written in Jupyter Notebook and uses NLTK, Scikit-learn, and Keras.

Abstract

In the last couple of decades, the use of social media websites has increased tremendously. Millions of people use them to express their views and opinions on a wide array of topics. As a result, social media websites generate large volumes of data. In this paper, we discuss several sentiment analysis algorithms and compare their performance on Twitter data. We implemented sentiment analysis using Multinomial Naive Bayes, Support Vector Classification, and Convolutional Neural Networks. Data preprocessing is a crucial step in sentiment analysis, since selecting the right preprocessing methods can improve the performance of the classifier. For this reason, this paper also discusses several preprocessing methods that are commonly used in sentiment analysis and compares their effect on the performance of our models. The data preprocessing and the sentiment analysis algorithms are implemented with the NLTK, Scikit-learn, and Keras libraries. Experimental results show that the linear Support Vector Machine classifier achieves the highest predictive accuracy and outperforms the other models. The results also show that the choice of feature selection and representation during data preprocessing affects the performance of our sentiment analysis.

Overview

The main purpose of this research is to find out which algorithm produces the best result in sentiment analysis. As mentioned above, we compare a deep learning model against traditional machine learning algorithms, namely Naive Bayes and Support Vector Machine. For the deep learning model, we use a Convolutional Neural Network (CNN) architecture.

Data preprocessing is also a crucial step in sentiment analysis. We believe that selecting the right preprocessing methods produces a good data representation, which in turn can improve the performance of our sentiment analysis. We therefore also investigate the relationship between the data representation and the performance of our algorithms by comparing the effects of different preprocessing methods on the results. These are the preprocessing methods that we use:

Removal of noisy data (username, URLs, special characters)
Contraction mapping
Spelling correction
Removal of stop words
Lemmatization and word stemming
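
The first, second, and fourth steps in the list above can be sketched in a few lines of plain Python. This is a minimal illustration and not the code from the project notebooks: the `CONTRACTIONS` and `STOP_WORDS` tables and the `clean_tweet` function are hypothetical and deliberately tiny, and spelling correction and lemmatization are left out because they need external resources.

```python
import re

# Hypothetical, deliberately small lookup tables for illustration only.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "i'm": "i am", "it's": "it is"}
STOP_WORDS = {"a", "an", "the", "is", "am", "to", "it", "i"}


def clean_tweet(text, remove_stop_words=True):
    """Return a list of cleaned, lowercase tokens for one tweet."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # drop URLs
    text = re.sub(r"@\w+", " ", text)                    # drop usernames
    # Expand contractions before the apostrophes are stripped below.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"[^a-z\s]", " ", text)                # drop special characters and digits
    tokens = text.split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


print(clean_tweet("@bob I can't wait!! see http://t.co/xyz it's awesome :)"))
# prints ['cannot', 'wait', 'see', 'awesome']
```

Expanding contractions before removing punctuation keeps the negation in "can't" as the token "cannot", which would otherwise be lost.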

Exploratory Data Analysis

Here is a summary of what we found during the dataset exploration:

There are around 36,373 tweets that contain only links or mentions
The negative and positive classes contain roughly the same number of tweets
We explored the frequency of words in the negative and positive classes
- Words such as “bad,” “miss,” and “sad” appear more frequently in the negative class
- Words such as “thank,” “well,” and “awesome” appear more frequently in the positive class
We calculated the positivity and negativity of the 5,000 most frequent words
- Words such as “excited,” “amazing,” and “happy” are highly positive
- Words such as “sick,” “sucks,” “sad,” and “ugh” are highly negative

A more detailed exploratory data analysis is contained in Project_EDA.ipynb
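
One way to compute a per-word positivity score like the one described above is sketched below. The definition used here, the share of a word's occurrences that fall in positive tweets, is an assumption made for illustration; the notebook may define positivity differently. The function name `word_positivity` and the toy tweets are made up.

```python
from collections import Counter


def word_positivity(tweets, labels, top_n=5000):
    """Map each of the top_n most frequent words to the share of its
    occurrences that come from positive tweets (label 1)."""
    pos_counts, neg_counts = Counter(), Counter()
    for text, label in zip(tweets, labels):
        target = pos_counts if label == 1 else neg_counts
        target.update(text.lower().split())
    total = pos_counts + neg_counts
    return {word: pos_counts[word] / count for word, count in total.most_common(top_n)}


# Made-up toy data: 1 = positive, 0 = negative.
tweets = ["so happy today", "happy and excited", "feeling sick", "sick and sad", "happy but sick"]
labels = [1, 1, 0, 0, 1]

scores = word_positivity(tweets, labels)
print(scores["happy"], scores["sad"])  # prints 1.0 0.0
```

Under this definition, negativity is simply one minus positivity, so a word such as "sick" that appears once in a positive tweet and twice in negative tweets gets a positivity of one third.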

Dataset

Crawling the data through the Twitter API would take too much time, because Twitter limits the number of tweets returned by each API call. Collecting the data by web scraping would also take a while, since we would first have to learn the HTML structure of the pages. Given the limited time for this project assignment, we decided to use ready-made social media datasets available on the internet.

1. Sentiment140

Sentiment140 is a product created by Stanford University graduates. Their work allows one to discover the sentiment of brands, products, or topics on Twitter. The dataset contains about 1.6 million tweets, each recorded with six fields: the polarity, the id of the tweet, the date of the tweet, the query, the user who posted the tweet, and the text of the tweet. For this project, we split the data into 600 thousand tweets for training and 1 million tweets for testing.
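
The six-field record layout described above can be read with the standard `csv` module. The sketch below is illustrative only: the field names in `FIELDS` and the `read_tweets` helper are hypothetical, the sample rows are made up, and it assumes the usual Sentiment140 coding in which a polarity of 0 means negative and 4 means positive, with no header row in the file.

```python
import csv
import io

# Field order as described in the text; the names themselves are our own choice.
FIELDS = ["polarity", "id", "date", "query", "user", "text"]


def read_tweets(handle):
    """Yield (label, text) pairs, where label 1 = positive and 0 = negative."""
    for row in csv.DictReader(handle, fieldnames=FIELDS):
        polarity = int(row["polarity"])
        if polarity not in (0, 4):
            continue  # skip neutral or unexpected polarity values
        yield (1 if polarity == 4 else 0, row["text"])


# Made-up rows in the same layout, used instead of the real file.
sample = io.StringIO(
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","I miss my dog"\n'
    '"4","2","Mon Apr 06 22:20:00 PDT 2009","NO_QUERY","user_b","Thanks, this is awesome"\n'
)
pairs = list(read_tweets(sample))
print(pairs)
# prints [(0, 'I miss my dog'), (1, 'Thanks, this is awesome')]
```

To read the real file, the `io.StringIO` object would be replaced by an opened file handle; the path and the encoding to use depend on how the dataset was downloaded.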

2. Amazon Reviews for Sentiment Analysis

This dataset contains 4 million Amazon reviews, split into 3.6 million for training and 400 thousand for testing. Each record consists of the polarity and the text of a product review from the website. For this project, however, we used only 1.5 million reviews from the original training set and split them into 500 thousand reviews for training and 1 million reviews for testing.

Training

Before training, make sure that the folder ~/dataset/ exists and contains:

train_cleaned_reviews.csv
test_cleaned_reviews.csv
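
A quick way to confirm that both files are in place before starting a long training run is sketched below. The helper `missing_dataset_files` is hypothetical and not part of the project; only the folder and file names come from the instructions above.

```python
from pathlib import Path

REQUIRED_FILES = ("train_cleaned_reviews.csv", "test_cleaned_reviews.csv")


def missing_dataset_files(dataset_dir="~/dataset"):
    """Return the required file names that are absent from dataset_dir."""
    folder = Path(dataset_dir).expanduser()
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]


missing = missing_dataset_files()
if missing:
    print("Missing dataset files:", ", ".join(missing))
else:
    print("All dataset files found.")
```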

Result

The best model for Twitter sentiment analysis is an SVM classifier with a linear kernel, which reaches an accuracy of 82% and an AUC of 86%.
Proper data preprocessing and feature extraction are needed to produce an optimal and efficient model.
Trigram TF-IDF performs best as the feature extraction method.
We did not reach our target of more than 90% accuracy on the Twitter dataset. On the Amazon review dataset, however, we obtained more than 90% accuracy, since it contains longer texts and a richer vocabulary.
The result of our model is above the average of other sentiment analysis systems on Twitter data.
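
To make the trigram TF-IDF feature extraction mentioned above more concrete, here is a minimal sketch in plain Python. It is not the project's implementation, which relies on Scikit-learn. It assumes that "trigram" means word n-grams of length one to three, and it uses the textbook weighting of term frequency times log(N / document frequency), whereas Scikit-learn's `TfidfVectorizer` applies smoothing and normalization by default. The function names `ngrams` and `tfidf` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, max_n=3):
    """Return all word n-grams of length 1..max_n, joined with spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def tfidf(documents, max_n=3):
    """Return one {ngram: weight} dictionary per document."""
    grams = [Counter(ngrams(doc.lower().split(), max_n)) for doc in documents]
    # Document frequency: in how many documents each n-gram occurs.
    doc_freq = Counter(g for counts in grams for g in counts)
    n_docs = len(documents)
    vectors = []
    for counts in grams:
        total = sum(counts.values())
        vectors.append({g: (c / total) * math.log(n_docs / doc_freq[g])
                        for g, c in counts.items()})
    return vectors


docs = ["not good at all", "really good"]
for vec in tfidf(docs):
    # Show only the n-grams that distinguish a document from the others.
    print({g: round(w, 3) for g, w in vec.items() if w > 0})
```

In this toy example the unigram "good" occurs in both documents and receives a weight of zero, while the bigram "not good" keeps a positive weight. This is one plausible reason why longer n-grams help on short texts such as tweets.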
