aeronaut2001/Movie-Rating-Analysis

Movie-Rating-Data-Analysis

aeronaut2001



Movie Rating Data Analysis using Apache Spark (pyspark)💥🐝

📝 Gain the skills

Languages and Tools:

  • Cloud: GCP

  • Version control system: Git

  • Programming language: Python

  • Big data tools and software: Hadoop, Apache Hive, Apache Spark, Linux


📙 Project Structure:

  • Problem Statement:

  • The objective of this project is to analyze movie data using Hadoop, Spark, and Hive. We aim to derive insights from the data, such as ratings, tags, and movie details, to understand user preferences and popular trends in the world of movies.

  • Project Introduction:

  • Welcome to my movie data analysis project. To kick things off, I established a Hadoop cluster with Hadoop YARN, Hive, and Apache Spark. The cluster can be set up locally with Docker or on a cloud platform such as AWS, GCP, or Azure.

  • Data Loading:

  • First and foremost, I loaded three crucial data files - movies.csv, ratings.csv, and tags.csv - into my Hadoop Distributed File System (HDFS).

  • Spark Data Analysis:

  • The heart of the project! I wrote Spark jobs to tackle specific analytical challenges in the movie data.

  • I showed the aggregated number of ratings per year.

  • I displayed the average monthly number of ratings.

  • I visualized the distribution of rating levels.

  • I identified the 18 movies that were tagged but not rated.

  • I found movies that had ratings but no tags.

  • Among movies that were rated but not tagged and that have more than 30 user ratings, I displayed the top 10 movies by average rating and number of ratings.

  • I calculated the average number of tags per movie and the average number of tags per user in tagsDF, and compared both with the average number of tags a user assigns to a single movie.

  • I also identified the users that tagged movies without rating them.

  • I calculated the average number of ratings per user in the ratings DataFrame and the average number of ratings per movie.

  • I determined the predominant (frequency-based) genre per rating level.

  • I found the predominant tag per genre and the most tagged genres.

  • I identified the predominant (popularity-based) movies.

  • Finally, I listed the top 10 movies in terms of average rating (provided more than 30 users reviewed them).

  • Data Storage:

  • At the end of each problem statement, I ensured that the output was stored neatly in a single CSV file with headers in the output HDFS path.

  • Key Takeaway:

  • This project shows how Hadoop, Spark, and Hive can be combined to extract insights from movie data. Keeping the data and the outputs well organized supports further exploration, decision-making, and a better understanding of user preferences and trends in the movie industry.
