The goal of this project was to help a client determine whether reviews from Vine members are biased toward favorable ratings. I picked a dataset from a provided list and used PySpark to perform the ETL process: extract the dataset, transform the data, connect to an AWS RDS instance, and load the data into a database that I accessed through pgAdmin.
Tools/Programs/Languages used:
- Google Colab
- Python / PySpark
- AWS RDS
- ETL operations
- SQL
- pgAdmin
- Excel
- I used PySpark to import data from an S3 bucket and read it as a DataFrame.
- Then, I cleaned the data and filtered it by the number of reviews and the stars given by Amazon Vine members and non-Vine members.
- This led me to create 3 different tables.
- Afterwards, I connected to an AWS RDS (cloud database) instance and added each "clean" DataFrame to its corresponding table.
- Once I confirmed the connection to the cloud database, I linked the AWS RDS instance to my database in pgAdmin and ran basic SQL queries to make sure the data had loaded correctly.
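The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook: the column names (`review_id`, `star_rating`, `vine`), the tab-separated file format, the table name `vine_table`, the port, and the JDBC driver version are all assumptions, and `jdbc_url` and `run_etl` are hypothetical helper names.

```python
def jdbc_url(host: str, port: int, database: str) -> str:
    """Build the PostgreSQL JDBC URL for an RDS endpoint."""
    return f"jdbc:postgresql://{host}:{port}/{database}"


def run_etl(source_url: str, rds_host: str, database: str,
            user: str, password: str) -> None:
    """Extract a review file, clean it, and load one table into RDS."""
    # Imported inside the function so the helper above works without Spark.
    from pyspark import SparkFiles
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = (
        SparkSession.builder.appName("vine-review-etl")
        # Pulls the PostgreSQL JDBC driver; the version here is an assumption.
        .config("spark.jars.packages", "org.postgresql:postgresql:42.2.16")
        .getOrCreate()
    )

    # Extract: download the file and read it as a DataFrame.
    # Assumes a tab-separated file with a header row.
    spark.sparkContext.addFile(source_url)
    file_name = source_url.rsplit("/", 1)[-1]
    raw_df = spark.read.csv(SparkFiles.get(file_name), sep="\t",
                            header=True, inferSchema=True)

    # Transform: drop incomplete and duplicate rows, keep the review columns.
    # Column names are assumed, not taken from the project.
    vine_df = (
        raw_df.dropna()
        .dropDuplicates(["review_id"])
        .select("review_id",
                col("star_rating").cast("int").alias("star_rating"),
                "vine")
    )

    # Load: append the cleaned rows to a hypothetical table in RDS.
    vine_df.write.jdbc(
        url=jdbc_url(rds_host, 5432, database),
        table="vine_table",
        mode="append",
        properties={"user": user, "password": password,
                    "driver": "org.postgresql.Driver"},
    )
```

Calling `run_etl` with the file URL, the RDS endpoint, and the database credentials would run all three stages for one table; the same write step would be repeated for each of the other cleaned DataFrames.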
How many Vine reviews and non-Vine reviews were there?
- Total Vine reviews = 613
- Total non-Vine reviews = 64,968
How many Vine reviews were 5 stars? How many non-Vine reviews were 5 stars?
- Total Vine 5-star reviews = 222
- Total non-Vine 5-star reviews = 30,543
What percentage of Vine reviews were 5 stars? What percentage of non-Vine reviews were 5 stars?
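- Percentage of Vine reviews that were 5 stars = 36.2% (222 of 613)
- Percentage of non-Vine reviews that were 5 stars = 47.0% (30,543 of 64,968)
- I would conclude that there is no positivity bias for reviews in the Vine program: only about a third of Vine reviews were 5 stars, a lower share than among non-Vine reviews.
- Vine reviews also make up only 0.7% of all 5-star reviews (222 of 30,765), with non-Vine reviews making up the other 99.3%, but that split reflects the small size of the program and not how favorably its members rate.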
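The percentages can be checked directly from the counts reported above. The short sketch below divides each group's 5-star count by that group's own total, and also computes the Vine share of all 5-star reviews for comparison; `five_star_share` is a hypothetical helper name.

```python
def five_star_share(five_star: int, total: int) -> float:
    """Percentage of a group's reviews that are 5 stars."""
    return 100 * five_star / total


# Counts taken from the results above.
vine_pct = five_star_share(222, 613)
non_vine_pct = five_star_share(30_543, 64_968)

# Vine share of all 5-star reviews, a different question from the two above.
vine_share_of_five_star = 100 * 222 / (222 + 30_543)

print(f"Vine reviews that are 5 stars: {vine_pct:.1f}%")
print(f"Non-Vine reviews that are 5 stars: {non_vine_pct:.1f}%")
print(f"Vine share of all 5-star reviews: {vine_share_of_five_star:.1f}%")
```

Comparing the first two figures is what addresses the bias question, since each is normalized by the size of its own group.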