
Rutgers-Data-Science-Bootcamp/Amazon_Vine_Analysis


Amazon_Vine_Analysis

Overview of the project (Big Data)

This project analyzes Amazon reviews written by members of the paid Amazon Vine program. The Amazon Vine program is a service that allows manufacturers and publishers to receive reviews for their products. Companies pay a small fee to Amazon and provide products to Amazon Vine members, who are then required to publish a review.

Approach

Use PySpark to perform the ETL process: extract the dataset, transform the data, connect to an AWS RDS instance, and load the transformed data into the database, which is managed through pgAdmin. Next, use PySpark to determine whether there is any bias toward favorable reviews from Vine members in the selected dataset. Then, write a summary of the analysis to submit to company stakeholders.
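The bias check described above comes down to counting, for Vine and non-Vine reviews separately, how many reviews exist and what share of them are 5-star. The following is a minimal plain-Python sketch of that counting logic, not the project's PySpark code. The column names `vine` (with values `Y`/`N`) and `star_rating` are assumed from the public Amazon review dataset schema, the helper name `five_star_share` is hypothetical, and the sample rows are made up for illustration.

```python
from collections import Counter


def five_star_share(rows):
    """Return {vine_flag: (total_reviews, five_star_reviews, percent_five_star)}.

    `rows` is an iterable of dicts with the assumed keys "vine" and "star_rating".
    """
    totals = Counter()
    five_star = Counter()
    for row in rows:
        flag = row["vine"]
        totals[flag] += 1
        if int(row["star_rating"]) == 5:
            five_star[flag] += 1
    return {
        flag: (totals[flag], five_star[flag], 100.0 * five_star[flag] / totals[flag])
        for flag in totals
    }


# Made-up sample rows, for illustration only (not the project's data).
sample = [
    {"vine": "Y", "star_rating": 5},
    {"vine": "Y", "star_rating": 4},
    {"vine": "N", "star_rating": 5},
    {"vine": "N", "star_rating": 3},
    {"vine": "N", "star_rating": 1},
    {"vine": "N", "star_rating": 5},
]

summary = five_star_share(sample)
for flag, (total, fives, pct) in sorted(summary.items()):
    print(f"vine={flag}: {fives}/{total} five-star reviews ({pct:.1f}%)")
```

In PySpark the same result would typically come from filtering or grouping the DataFrame on the Vine flag and counting rows; the sketch only shows the arithmetic behind the percentages reported in the results.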

Tools

Python (PySpark and Pandas), AWS RDS and S3 services, SQL and pgAdmin

Data Source

Amazon Review datasets

Results

Total number of reviews

Vine reviews and percentage of 5-star reviews (screenshot)

Non-Vine reviews and percentage of 5-star reviews (screenshot)

Summary

Our analysis showed that 57% of the 285 Vine reviews were 5-star reviews, whereas 46% of the 31545 non-Vine reviews were 5-star reviews. Comparing these percentages suggests a positivity bias in reviews from the Vine program. Additionally, a one-way ANOVA test on the 5-star ratings of the Vine and non-Vine reviews could show whether the difference in percentages is statistically significant.
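As one possible way to check significance, the sketch below uses a two-proportion z-test, which is a common choice when comparing a yes/no outcome (5-star or not) between two groups. This is a swapped-in technique, not the one-way ANOVA mentioned above, and it is not part of the original analysis. The function name `two_proportion_z_test` is hypothetical, and the 5-star counts are only approximations reconstructed from the rounded percentages in the summary; the exact counts from the dataset should be used instead.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two proportions.

    Returns (z statistic, two-sided p-value) using the pooled standard error.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# ASSUMPTION: counts reconstructed from the rounded percentages (57%, 46%);
# replace them with the exact 5-star counts from the dataset.
vine_total, non_vine_total = 285, 31545
vine_five_star = round(0.57 * vine_total)
non_vine_five_star = round(0.46 * non_vine_total)

z, p_value = two_proportion_z_test(
    vine_five_star, vine_total, non_vine_five_star, non_vine_total
)
print(f"z = {z:.2f}, two-sided p-value = {p_value:.4f}")
```

A small p-value would indicate that the gap between the two 5-star percentages is unlikely to be due to chance alone. It would not by itself show why Vine reviews skew higher.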

About

AWS RDS service, pgAdmin, MySQL, PySpark, Google Colab
