The following project uses the Naive Bayes algorithm to classify tweets about US airline services as having positive, negative, or neutral sentiment. The project uses the following dataset, which can be downloaded from Kaggle. The Twitter data was scraped in February 2015, and contributors were first asked to classify tweets as positive, negative, or neutral, and then to categorize negative reasons (such as "late flight" or "rude service").

The raw Twitter data is first prepared by converting it to a corpus. This involves cleaning the data by removing special characters, common stopwords, and the names of the tagged airlines, in order to introduce as little bias into the model as possible. Once cleaned, the data is split into three sets: training, validation, and final test. The Naive Bayes model is trained on the training set, the validation set is then used to test and improve the model, and the improved model is finally evaluated on the final test set.
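As a rough sketch of this cleaning and splitting step (not the project's exact code), the pipeline might look like the following in Python. It assumes the Kaggle file `Tweets.csv` with its `text` and `airline_sentiment` columns; the stopword and airline-handle lists below are illustrative placeholders.

```python
import re
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative lists only; the original project's stopword list and airline names may differ.
STOPWORDS = {"the", "a", "an", "and", "or", "to", "is", "was", "i", "you", "it"}
AIRLINES = {"united", "usairways", "americanair", "southwestair", "jetblue", "virginamerica"}

def clean_tweet(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # drop special characters and digits
    tokens = [t for t in text.split()
              if t not in STOPWORDS and t not in AIRLINES]  # drop stopwords and airline names
    return " ".join(tokens)

df = pd.read_csv("Tweets.csv")  # Kaggle: Twitter US Airline Sentiment
df["clean_text"] = df["text"].apply(clean_tweet)

# Split into training (60%), validation (20%), and final test (20%) sets,
# stratified so each set keeps the same sentiment proportions.
train_df, temp_df = train_test_split(df, test_size=0.4, random_state=42,
                                     stratify=df["airline_sentiment"])
valid_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42,
                                     stratify=temp_df["airline_sentiment"])
```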
The machine learning project is split into two classification problems. The first involves training the Naive Bayes algorithm to classify a tweet's sentiment as positive, negative, or neutral. After training and improving the model, the final test results in 77.42% of tweets being classified correctly.
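A minimal sketch of this three-class model, assuming the `train_df`, `valid_df`, and `test_df` frames from the split above and using scikit-learn's `CountVectorizer` and `MultinomialNB` (the original project may use a different Naive Bayes implementation):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Bag-of-words counts feed the multinomial Naive Bayes model.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_df["clean_text"])
y_train = train_df["airline_sentiment"]  # positive / negative / neutral

model = MultinomialNB(alpha=1.0)  # Laplace smoothing
model.fit(X_train, y_train)

# Tune on the validation set (e.g. try different alpha values),
# then report accuracy on the held-out final test set.
X_valid = vectorizer.transform(valid_df["clean_text"])
print("validation accuracy:",
      accuracy_score(valid_df["airline_sentiment"], model.predict(X_valid)))

X_test = vectorizer.transform(test_df["clean_text"])
print("final test accuracy:",
      accuracy_score(test_df["airline_sentiment"], model.predict(X_test)))
```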
The second classification problem involves training the Naive Bayes algorithm to classify a tweet's sentiment as only positive or negative. Before the data is cleaned, tweets labeled as neutral are removed from the dataset. After training and improving the model, the final test results in 91.61% of tweets being classified correctly. The increase in accuracy is expected: neutral tweets introduce more grey area in sentiment, especially since the Naive Bayes model relies heavily on the frequency of words appearing within each sentiment class.
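In terms of the sketch above, the binary variant only changes the first step: neutral tweets are filtered out before cleaning and splitting, and the same pipeline is then rerun on the two remaining classes.

```python
# Binary version: drop neutral tweets before cleaning and splitting,
# then reuse the same cleaning, splitting, and training steps as above.
binary_df = df[df["airline_sentiment"].isin(["positive", "negative"])].copy()
binary_df["clean_text"] = binary_df["text"].apply(clean_tweet)
```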
This project provides a great opportunity to work with a simple yet effective machine learning model, as well as with textual data scraped from Twitter. Textual data often requires more preparation than the numerical datasets typically used in machine learning projects. However, because the model makes naive assumptions about the data, such as ignoring the context a word like "doesn't" provides, other approaches such as deep learning models may prove more effective. Nevertheless, Naive Bayes remains a simple and effective model for problems whose performance does not suffer much from its naive, word-count-based assumptions.
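For reference, the "naive" part is the conditional independence assumption: given a sentiment class, each word is treated as independent of every other word, so word order and context (for example, "doesn't" negating what follows) are ignored.

```latex
% Naive Bayes classifies a tweet with words w_1, ..., w_n by choosing the class c
% that maximizes the posterior, assuming the words are conditionally independent given c:
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```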