Supervised classification of textual reviews, based on their sentiment, into one of five polarities:
- Strong negative
- Weak negative
- Neutral
- Weak positive
- Strong positive

The pipeline has three stages:

- Text pre-processing: The raw review text was converted into a form suitable for feature extraction. The following steps were applied:
  - Case normalisation
  - Tokenisation
  - Lemmatisation
- Feature generation: Once the data was cleaned, relevant features were extracted from it, such as:
  - N-grams
  - Term frequency and inverse document frequency (TF-IDF)
- Model: Logistic regression is the classifier used to determine the polarity of a review.
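
The pre-processing and feature-generation steps listed above can be illustrated with a minimal sketch that uses only the Python standard library. It is not the project's actual code: the function names (`tokenize`, `ngrams`, `tfidf`), the token pattern, the bigram limit, and the unsmoothed `idf = log(N / df)` weighting are all assumptions made for illustration. Lemmatisation and the logistic regression classifier are left out, since they would normally come from a library.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Case-normalise the text and split it into word tokens (assumed pattern)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n_max=2):
    """Return all 1..n_max-grams as space-joined strings."""
    grams = []
    for n in range(1, n_max + 1):
        grams.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams


def tfidf(docs, n_max=2):
    """Map each document to {term: tf * idf}, with idf = log(N / df), unsmoothed."""
    term_counts = [Counter(ngrams(tokenize(doc), n_max)) for doc in docs]
    # Document frequency: number of documents in which each term appears.
    doc_freq = Counter(term for counts in term_counts for term in counts)
    n_docs = len(docs)
    vectors = []
    for counts in term_counts:
        total = sum(counts.values())
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical two-review corpus, for illustration only.
vectors = tfidf(["Good product", "Bad product"])
```

Each resulting dictionary is a sparse feature vector for one review. With this unsmoothed weighting, a term that occurs in every document (here `product`) gets weight 0, while terms specific to one review get a positive weight.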
Datasets:
- train_data.csv: The training set, consisting of 650,000 product reviews.
- train_label.csv: The sentiment labels for the training set. The labels (1, 2, 3, 4, 5) refer to the five polarity levels (strong negative, weak negative, neutral, weak positive, strong positive) respectively.
- test_data.csv: The test set, consisting of 50,000 product reviews.
- predicted_label.csv: The predicted sentiment labels for the test set.
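
As a small illustration of the label encoding described above, the sketch below maps the numeric labels to their polarity names. The file layout is not documented here, so the reader function is an assumption: the hypothetical `read_labels` helper takes the label from the last column of each row and skips any row whose last cell is not an integer, such as a header.

```python
import csv
import io

# Label encoding as described for train_label.csv.
POLARITY = {
    1: "strong negative",
    2: "weak negative",
    3: "neutral",
    4: "weak positive",
    5: "strong positive",
}


def read_labels(handle):
    """Read integer labels from a CSV handle (assumed: label in the last column)."""
    labels = []
    for row in csv.reader(handle):
        if row and row[-1].strip().isdigit():
            labels.append(int(row[-1]))
    return labels


# In-memory stand-in for a label file, for illustration only.
sample = io.StringIO("1\n5\n3\n")
names = [POLARITY[label] for label in read_labels(sample)]
print(names)  # ['strong negative', 'strong positive', 'neutral']
```

For a real file, the same function could be called on a handle opened with `open("train_label.csv", newline="")`, provided the file matches the assumed layout.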