- Dataset and Project Outline
- Setup
- Presentation
- Project Report
- IMDB scraping
- PostgreSQL connection
- NLP Pipeline
- Logistic Regression
- Random Forest Model
- Naive Bayes Model
- SVM Model
- Cross Validation
- Best Model
- Collaborators
For our project we used sentiment analysis to classify reviews scraped from the IMDB website as either positive or negative.
We explored the IMDB Review Dataset, available from Kaggle.com.
The dataset provides 50,000 reviews, of which 25,000 are positive and 25,000 are negative.
- Our project uses NLP sentiment analysis to classify whether unlabelled reviews are positive or negative.
- First we will apply some pre-processing and carry out initial data exploration, determining the ratio of positive to negative reviews in a Jupyter notebook.
- We will import the scraped IMDB review data into a SQL database.
- We will use PySpark to create a natural language processing model, applying a tokenizer and removing stop words.
- We will train the model on the Kaggle dataset, then apply it to the unlabelled data we have scraped to test whether the model is accurate.
We used two CSV files in this project:
- IMDB Dataset.csv
- new_upcoming_dvd_reviews.csv
CSV files are placed in the Resources folder.
- For the project we used Google Colab to run PySpark, which isn't installed on our local machines.
- A downloaded copy of our Colab notebook, Review_classification.ipynb, can be found in the repository's main directory.
- We ran the following code at the start of our Colab file to install and configure Spark:
```python
import os

# Set the Spark version as an environment variable so the shell
# commands below can reference it
spark_version = "spark-3.2.0"
os.environ['SPARK_VERSION'] = spark_version

# Install Java and download Spark (older releases live on archive.apache.org)
!apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/$SPARK_VERSION/$SPARK_VERSION-bin-hadoop2.7.tgz
!tar xf $SPARK_VERSION-bin-hadoop2.7.tgz
!pip install -q findspark

# Point the environment at the Java and Spark installs
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = f"/content/{spark_version}-bin-hadoop2.7"

# Make Spark importable from the notebook
import findspark
findspark.init()
```
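With Spark installed, the notebook can then start a SparkSession. A minimal sketch; the app name here is arbitrary:

```python
# Create the SparkSession used throughout the notebook
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("ReviewClassification").getOrCreate()
```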
The project presentation can be found in the /Presentation directory:
- imdb.pdf
The project report can be found in the /report directory:
- Project-Report.docx
- Splinter and Beautiful Soup were used to scrape reviews for the latest movie releases and append them to our SQL database. We extracted the title of the movie, the URL, and the review from the scraped HTML.
- Two URL variables are defined to bookend each side of the URL for each specific release, giving us the full URL for the required page.
- A for loop constructs the URL and navigates to that page to scrape the review.
- The film title, URL, and review are added to a dictionary, which is then appended to the film reviews list.
- The film reviews list is converted to a Pandas DataFrame and exported to a CSV file called new_upcoming_dvd_reviews.csv (see the sketch below).
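The following is a minimal sketch of that loop. The URL bookends, the `film_ids` list, and the HTML selectors are placeholders; the real values depend on the pages being scraped.

```python
# Hypothetical sketch of the scraping loop; selectors and URL fragments
# below are placeholders, not the real values used in the project.
import pandas as pd
from splinter import Browser
from bs4 import BeautifulSoup

url_prefix = "https://www.imdb.com/title/"  # placeholder bookend variables
url_suffix = "/reviews"
film_ids = ["tt1234567", "tt7654321"]       # placeholder release identifiers

browser = Browser("chrome")
film_reviews = []
for film_id in film_ids:
    url = f"{url_prefix}{film_id}{url_suffix}"  # construct the full URL
    browser.visit(url)                          # navigate to the review page
    soup = BeautifulSoup(browser.html, "html.parser")
    film_reviews.append({
        "title": soup.find("h1").get_text(strip=True),                           # film title
        "url": url,                                                              # page URL
        "review": soup.find("div", class_="review-text").get_text(strip=True),   # placeholder selector
    })
browser.quit()

# Convert the list of dictionaries to a DataFrame and export it
pd.DataFrame(film_reviews).to_csv("Resources/new_upcoming_dvd_reviews.csv", index=False)
```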
- The database will be created using PostgreSQL and deployed with Heroku so that it lives in the cloud.
- This will allow us to access the database in Google Colab using SQLAlchemy and psycopg2, as sketched below.
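A minimal sketch of that connection; the connection string below is a placeholder, and the `reviews` table name is assumed:

```python
# Connect Colab to the Heroku-hosted PostgreSQL database and pull the reviews.
# The URI below is a placeholder, not real credentials.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:password@host:5432/dbname")
reviews_df = pd.read_sql("SELECT * FROM reviews", engine)  # table name assumed
```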
- We followed a basic pipeline, starting with tokenization of each review, which split the sentences into individual words (see the sketch after this list).
- Stop words, which hold no sentiment value, were removed.
- In future it may also be useful to remove punctuation and apply the .lower() function to reduce duplicate tokens.
- HashingTF was used to map individual words to an index.
- A disadvantage of HashingTF is that two different words can be mapped to the same index; if a word crucial to our sentiment analysis, such as “good” or “bad”, collided in this way, it could decrease the accuracy of the model.
- The output column contains either a 0 or a 1, where 0 is a positive review and 1 is a negative review.
- The data was split randomly into training and testing sets with a 70/30 split, in order to evaluate the performance of our machine learning models.
- The larger share was given to the training set, since more training data gives a better accuracy as calculated on the test set.
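A minimal sketch of this pipeline, assuming the labelled Kaggle data has been loaded into a Spark DataFrame `labeled_df` with a text column `review` and a numeric `label` column:

```python
# Feature pipeline: tokenize -> remove stop words -> HashingTF -> IDF
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF

tokenizer = Tokenizer(inputCol="review", outputCol="words")            # split reviews into words
remover = StopWordsRemover(inputCol="words", outputCol="filtered")     # drop stop words
hashing_tf = HashingTF(inputCol="filtered", outputCol="raw_features")  # hash each word to an index
idf = IDF(inputCol="raw_features", outputCol="features")               # down-weight very common terms

pipeline = Pipeline(stages=[tokenizer, remover, hashing_tf, idf])
fitted_pipeline = pipeline.fit(labeled_df)
prepared = fitted_pipeline.transform(labeled_df)

# 70/30 random train/test split, as described above
train_df, test_df = prepared.randomSplit([0.7, 0.3], seed=42)
```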
- The first model we trained was logistic regression.
- This model achieved an accuracy of 0.860, the second highest of the four models used (see the evaluation sketch below).
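A sketch of the train-and-score pattern we repeated for each model, using the `train_df`/`test_df` split from the pipeline sketch above:

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Fit logistic regression on the training split
lr = LogisticRegression(featuresCol="features", labelCol="label")
lr_model = lr.fit(train_df)

# Score accuracy on the held-out test split (reported: 0.860)
evaluator = MulticlassClassificationEvaluator(labelCol="label", metricName="accuracy")
print(evaluator.evaluate(lr_model.transform(test_df)))

# The other models follow the same pattern, swapping in
# RandomForestClassifier, NaiveBayes, or LinearSVC.
```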
- The second model we used was the random forest model.
- It achieved the lowest accuracy score of the four models, at 0.685.
- The third model we used was Naive Bayes, which achieved an accuracy of 0.844.
- The Naive Bayes model assumes that all predictors are independent, i.e. that the presence of one feature in a class doesn’t affect the presence of another; in our dataset, however, the words within a review were not independent.
- Although the predictors weren’t completely independent in our case, the model still produced a high accuracy.
- This may be because Naive Bayes tends to perform well once techniques like stop word removal and TF-IDF have been applied.
- We also used an SVM model, which performs well on high-dimensional data such as ours.
- The SVM model achieved the highest accuracy of the four models (0.877).
- This model works well when there is a clear margin of separation between classes, which in our data were the positive and negative reviews.
- The two best performing models, logistic regression and the SVM, were used for k-fold cross validation (see the sketch after this list).
- This was done to reduce the chance of overfitting.
- We used 5 folds and the MulticlassClassificationEvaluator.
- Following cross validation on the logistic regression model, the accuracy decreased from 0.860 to 0.836.
- Cross validation was also performed on the SVM model, as it was the model with the highest accuracy (0.877).
- After cross validation the accuracy slightly decreased, to 0.872.
- This could reflect a reduction of overfitting in the SVM model.
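A sketch of that cross validation, shown for the SVM (`LinearSVC`); the regularization grid is illustrative:

```python
from pyspark.ml.classification import LinearSVC
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

svm = LinearSVC(featuresCol="features", labelCol="label")
grid = ParamGridBuilder().addGrid(svm.regParam, [0.01, 0.1]).build()  # illustrative grid

# 5-fold cross validation scored on accuracy
cv = CrossValidator(estimator=svm,
                    estimatorParamMaps=grid,
                    evaluator=MulticlassClassificationEvaluator(
                        labelCol="label", metricName="accuracy"),
                    numFolds=5)
cv_model = cv.fit(train_df)  # cv_model.bestModel holds the best-scoring SVM
```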
- The SVM achieved the highest accuracy both before and after cross validation, so we selected it as our best model.
- This model was used to make predictions on the unlabelled scraped review data, as sketched below.
- The best model's accuracy was 0.872.
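A sketch of that final step, reusing `fitted_pipeline` and `cv_model` from the sketches above; `scraped_df` is an assumed Spark DataFrame of the scraped reviews with a `review` column:

```python
# Run the unlabelled scraped reviews through the same fitted feature
# pipeline, then predict with the cross-validated SVM
scraped_features = fitted_pipeline.transform(scraped_df)
predictions = cv_model.bestModel.transform(scraped_features)
predictions.select("review", "prediction").show(5)  # 0.0 = positive, 1.0 = negative per our labels
```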