A neural network model for sentiment analysis of movie reviews, trained on the IMDb dataset. The model is built with PyTorch and uses BERT as the feature extractor.
Note: This README gives an overview of the project. Open the notebook for the code and a fuller explanation of the results.
- The project needs a dataset of movie and TV show reviews. IMDb is a popular website for movies and TV shows, with a database of over 8 million titles, so a dataset drawn from it is a good choice for training and testing our neural network.
- Instead of using all IMDb reviews, we use a subset of 50,000 reviews of movies and TV shows. The subset is balanced, meaning it contains an equal number of positive and negative reviews, and it is available on Kaggle.
- Since the dataset is already balanced, we split it into a 70% training set, a 20% validation set, and a 10% test set. The training set is used to train the neural network, the validation set to tune the hyperparameters, and the test set to evaluate the final model.
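A 70/20/10 split like the one described above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's code: the function name `split_indices` and the fixed seed are assumptions made for the example.

```python
import random


def split_indices(n_samples, train_frac=0.7, val_frac=0.2, seed=42):
    """Shuffle the indices 0..n_samples-1 and cut them into train/val/test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed so the split is reproducible
    n_train = round(n_samples * train_frac)
    n_val = round(n_samples * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains (10% here)
    return train, val, test


train_idx, val_idx, test_idx = split_indices(50_000)
print(len(train_idx), len(val_idx), len(test_idx))  # 35000 10000 5000
```

Splitting indices rather than the reviews themselves keeps the three sets disjoint and lets the same split be reused for the raw and the preprocessed text.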
- Text preprocessing is a standard step in NLP tasks, so we apply the following steps to the data before it is used for classification:
- Remove punctuation.
- Remove stop words.
- Lowercase all characters.
- Lemmatization of words.
- The data preprocessing is done using the NLTK library.
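The four steps above could be combined roughly as follows. This is a hedged sketch rather than the notebook's implementation: `preprocess` and `load_nltk_resources` are hypothetical names, the text is lowercased first so that capitalized stop words are also caught, and whitespace tokenization is a simplifying assumption.

```python
import string


def preprocess(text, stop_words, lemmatize):
    """Lowercase, strip punctuation, drop stop words, then lemmatize each token."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [tok for tok in text.split() if tok not in stop_words]
    return " ".join(lemmatize(tok) for tok in tokens)


def load_nltk_resources():
    """Return NLTK's English stop words and a WordNet lemmatizer function.

    Imported lazily because the first call downloads the corpora.
    """
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    nltk.download("stopwords", quiet=True)
    nltk.download("wordnet", quiet=True)
    return set(stopwords.words("english")), WordNetLemmatizer().lemmatize


# Toy stand-ins so the example runs without downloading any NLTK data.
toy_stop_words = {"the", "they", "were"}
toy_lemmatize = lambda word: word.rstrip("s")
cleaned = preprocess("The movies, they were GREAT!", toy_stop_words, toy_lemmatize)
print(cleaned)  # movie great
```

With NLTK installed, `preprocess(review, *load_nltk_resources())` applies the real stop-word list and lemmatizer. Because punctuation is stripped before the stop-word check, contractions lose their apostrophe and will not match stop-word entries that contain one.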
- The project uses PyTorch to build the neural network. The neural network is a simple feedforward neural network with 5 layers.
- The network's input layer takes 768 inputs, corresponding to the vector from BERT's pooled output (the classification output).
- Our network consists of 4 hidden layers with 512, 256, 128, 64 units respectively.
- The hidden layers use the ReLU activation function.
- The output layer has a sigmoid activation function that maps the vector to a probability for classification.
- The network uses the Adam optimizer and the binary cross-entropy loss function.
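The architecture described above might look like this in PyTorch. It is a sketch under stated assumptions, not the notebook's exact code: the class name `SentimentClassifier`, the learning rate, and the random tensors standing in for BERT pooled outputs are all placeholders, and the dropout rate of 0.4 is taken from the regularization notes further down.

```python
import torch
from torch import nn


class SentimentClassifier(nn.Module):
    """Feedforward classifier: 768 -> 512 -> 256 -> 128 -> 64 -> 1."""

    def __init__(self, input_dim=768, hidden_dims=(512, 256, 128, 64), dropout=0.4):
        super().__init__()
        layers = []
        in_dim = input_dim
        for out_dim in hidden_dims:
            # Each hidden layer: linear transform, ReLU, then dropout.
            layers += [nn.Linear(in_dim, out_dim), nn.ReLU(), nn.Dropout(dropout)]
            in_dim = out_dim
        # Output layer: a single unit squashed to a probability by the sigmoid.
        layers += [nn.Linear(in_dim, 1), nn.Sigmoid()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x).squeeze(-1)


model = SentimentClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr is a placeholder
criterion = nn.BCELoss()  # binary cross-entropy on sigmoid probabilities

# Random stand-ins for a batch of 8 BERT pooled outputs and their labels.
features = torch.randn(8, 768)
labels = torch.randint(0, 2, (8,)).float()

# One training step.
optimizer.zero_grad()
probs = model(features)
loss = criterion(probs, labels)
loss.backward()
optimizer.step()
```

The four hidden layers plus the output layer give the five layers mentioned above. Because the model already ends in a sigmoid, `nn.BCELoss` is the matching loss; `nn.BCEWithLogitsLoss` would be the choice if the sigmoid were removed.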
- The model can be improved by using different hyperparameters and regularization techniques. The following techniques are used to improve the model:
- The following hyperparameters can be tuned:
- Learning Rate
- Batch Size
- Number of Epochs
- Number of Hidden Layers
- Number of Units in each Hidden Layer
- Activation Function
- Optimizer
- Loss Function
- We tune only the learning rate in this project, since we expect the other hyperparameters to have little to no effect on the model's performance.
- The model's performance for different learning rates can be found in the results folder.
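A learning-rate sweep of the kind described above can be sketched as a loop that trains one model per candidate value and compares validation loss. Everything specific here is an assumption for illustration: the candidate rates, the tiny stand-in model, the epoch count, and the random tensors used in place of real features.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Random placeholder data standing in for BERT features and labels.
x_train, y_train = torch.randn(256, 768), torch.randint(0, 2, (256,)).float()
x_val, y_val = torch.randn(64, 768), torch.randint(0, 2, (64,)).float()


def validation_loss_for(lr, epochs=3):
    """Train a small stand-in model with the given learning rate; return val loss."""
    model = nn.Sequential(nn.Linear(768, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.BCELoss()
    for _ in range(epochs):
        model.train()
        optimizer.zero_grad()
        loss = criterion(model(x_train).squeeze(-1), y_train)
        loss.backward()
        optimizer.step()
    model.eval()
    with torch.no_grad():
        return criterion(model(x_val).squeeze(-1), y_val).item()


candidate_lrs = [1e-2, 1e-3, 1e-4]  # hypothetical candidates
results = {lr: validation_loss_for(lr) for lr in candidate_lrs}
best_lr = min(results, key=results.get)
print(results, best_lr)
```

Selecting on the validation set rather than the test set keeps the final test accuracy an honest estimate.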
- Dropout is a regularization technique that randomly drops some of the neurons in the network during training. It is used to prevent overfitting.
- Dropout is applied to the hidden layers of the network. The dropout rate, which is the probability that a given neuron is dropped, can be specified when initializing the network. It is set to 0.4 in this project.
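To make the effect of the dropout rate concrete, the following small sketch applies PyTorch's `nn.Dropout` with a rate of 0.4 to a vector of ones, once in training mode and once in evaluation mode. The input vector is an arbitrary example.

```python
import torch
from torch import nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.4)
x = torch.ones(1000)

drop.train()
y_train = drop(x)  # roughly 40% of entries become 0; the rest are scaled by 1 / (1 - 0.4)

drop.eval()
y_eval = drop(x)  # dropout is disabled at evaluation time, so the input passes through

zero_fraction = (y_train == 0).float().mean().item()
print(f"fraction dropped in training mode: {zero_fraction:.2f}")
print("unchanged in eval mode:", torch.equal(y_eval, x))
```

The rescaling of the surviving activations keeps their expected value the same in both modes, which is why calling `model.eval()` before validation and testing matters.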
- The model classifies the reviews with 93% accuracy on the raw test data, whereas accuracy drops to 90% on the preprocessed data. This suggests that not all preprocessing steps are necessary for the model to perform well.
- The model's performance on the raw test set is as follows:
- Accuracy: 93%
- Precision: 90%
- Recall: 90%
- F1 Score: 90%
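For reference, the four metrics above can be computed from binary labels and predictions as in the sketch below. The helper `binary_metrics` and the toy label lists are illustrative only and are not the project's results.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels (1 = positive review)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels, not taken from the project.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

With a sigmoid output, predictions are typically obtained by thresholding the probability at 0.5, which is an assumption here rather than something the text above specifies.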
Note: See the notebook for more details on the results.