You can find the dataset here.
The goal is to classify tweets made by travelers in February 2015 as Neutral, Positive, or Negative.
I used a Random Forest classifier because the problem involved a relatively large dataset. Random Forests also tend to perform well when dealing with a large number of features.
However, is there a way to get better results?
I decided to use an ANN with:
- 2 hidden layers (adding a third did not have a significant effect in this case)
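As a rough sketch, such an architecture could look like the following in Keras; the layer sizes and the 1500-feature bag-of-words input dimension are assumptions for illustration, not the exact values used.

```python
# Minimal sketch of a 2-hidden-layer ANN for 3-class sentiment classification.
# NOTE: layer sizes and input_dim=1500 are illustrative assumptions.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(128, activation='relu', input_dim=1500))  # hidden layer 1
model.add(Dense(128, activation='relu'))                  # hidden layer 2
model.add(Dense(3, activation='softmax'))                 # Neutral / Positive / Negative
```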
Overall, the ANN resulted in better classifications with an accuracy ranging between
Finally, the results were not bad for the given dataset, which contained many ambiguous or abbreviated tweets that would be difficult for a machine to interpret.
The steps taken for the Random Forest classifier were as follows (a code sketch follows the list):
- Get the Dataset
- Pre-process the text
- Create the Bag of Words Model
- Label Encode and OneHot Encode the Dependent Variable
- Split the data into Test and Training sets
- Train the Random Forest Classifier
- Get the predicted values for the test set
- Compare the predicted and actual values and use a confusion matrix to calculate the accuracy of the model
- Accuracy = (number of correct predictions on the test data) / (total number of test samples)
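A minimal sketch of this pipeline in Python is shown below. It is not the exact code used: the file name 'Tweets.csv', the column names 'text' and 'airline_sentiment', and the hyperparameters (1500 bag-of-words features, 100 trees, 80/20 split) are assumptions, and the pre-processing is a standard stopword-removal-plus-stemming pass with NLTK.

```python
# Sketch of the Random Forest pipeline described above (assumed file/column names).
import re
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

nltk.download('stopwords')

dataset = pd.read_csv('Tweets.csv')                        # Get the dataset

# Pre-process the text: keep letters only, lowercase, remove stopwords, stem
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))
corpus = []
for tweet in dataset['text']:
    words = re.sub('[^a-zA-Z]', ' ', tweet).lower().split()
    words = [stemmer.stem(w) for w in words if w not in stop_words]
    corpus.append(' '.join(words))

# Create the Bag of Words model (cap the vocabulary size)
cv = CountVectorizer(max_features=1500)
X = cv.fit_transform(corpus).toarray()

# Label-encode the target (scikit-learn's Random Forest does not need the one-hot step)
le = LabelEncoder()
y = le.fit_transform(dataset['airline_sentiment'])

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train the Random Forest classifier
classifier = RandomForestClassifier(n_estimators=100, criterion='entropy', random_state=0)
classifier.fit(X_train, y_train)

# Predict on the test set and evaluate with a confusion matrix
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)                  # correct predictions / total test samples
print(cm)
print('Accuracy:', accuracy)
```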
The steps taken for the ANN were as follows (a code sketch follows the list):
- Get the Dataset
- Pre-process the text
- Create the Bag of Words Model
- Label Encode and OneHot Encode the Dependent Variable
- Split the data into Test and Training sets
- Add Layers to your ANN
- Compile the ANN
- Get the predicted values for the test set
- Compare the predicted and actual values and use a confusion matrix to calculate the accuracy of the model
- Accuracy = (number of correct predictions on the test data) / (total number of test samples)
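A corresponding sketch for the ANN is below. As with the Random Forest sketch, the file and column names, layer sizes, optimizer, batch size, and epoch count are illustrative assumptions; for brevity it vectorises the raw text directly, whereas in practice the same pre-processing step as above would be applied first.

```python
# Sketch of the ANN pipeline described above (assumed file/column names and hyperparameters).
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix, accuracy_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical

dataset = pd.read_csv('Tweets.csv')

# Bag of Words model (apply the same text pre-processing as in the Random Forest sketch first)
cv = CountVectorizer(max_features=1500)
X = cv.fit_transform(dataset['text']).toarray()

# Label-encode and then one-hot encode the dependent variable
le = LabelEncoder()
y_int = le.fit_transform(dataset['airline_sentiment'])     # 0, 1, 2
y = to_categorical(y_int)                                   # one-hot vectors

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Add layers to the ANN: 2 hidden layers, softmax output over the 3 classes
model = Sequential()
model.add(Dense(128, activation='relu', input_dim=X.shape[1]))
model.add(Dense(128, activation='relu'))
model.add(Dense(3, activation='softmax'))

# Compile and train the ANN
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=32, epochs=10, verbose=1)

# Predict on the test set and evaluate with a confusion matrix
y_pred = np.argmax(model.predict(X_test), axis=1)
y_true = np.argmax(y_test, axis=1)
cm = confusion_matrix(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)                   # correct predictions / total test samples
print(cm)
print('Accuracy:', accuracy)
```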