A repository for Project 3 Data Management - utilising Cassandra Query Language and SQL
This project analyzes the MovieLens 100K Dataset using Cassandra and Spark. The dataset, collected by the GroupLens Research Project at the University of Minnesota, consists of 100,000 ratings from 943 users on 1,682 movies. The analysis uses both Cassandra Query Language (CQL) and Spark2 Structured Query Language (SQL) to answer key questions about user demographics and movie ratings.
The analysis in this project is focused on answering the following questions (a sketch of the corresponding queries follows the list):
- Calculate the average rating for each movie.
- Identify the top ten movies with the highest average ratings.
- Find users who have rated at least 50 movies and identify their favorite movie genres.
- Find all users who are less than 20 years old.
- Find all users whose occupation is "scientist" and whose age is between 30 and 40 years old.
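As a rough illustration, the questions above map onto Spark SQL as in the sketch below. This is a sketch only: it assumes the ratings, users, and item data have already been registered as temporary views named `ratings`, `users`, and `items`, with column names (`user_id`, `item_id`, `rating`, `age`, `occupation`, `title`) taken from the MovieLens 100K documentation; the actual view and column names used in the notebook may differ.

```python
from pyspark.sql import SparkSession

# In Zeppelin a SparkSession is already provided as `spark`; this line only
# matters when running the sketch standalone.
spark = SparkSession.builder.getOrCreate()

# 1. Average rating for each movie
avg_ratings = spark.sql("""
    SELECT item_id, AVG(rating) AS avg_rating
    FROM ratings
    GROUP BY item_id
""")

# 2. Top ten movies with the highest average ratings
top_ten = spark.sql("""
    SELECT i.title, AVG(r.rating) AS avg_rating
    FROM ratings r JOIN items i ON r.item_id = i.item_id
    GROUP BY i.title
    ORDER BY avg_rating DESC
    LIMIT 10
""")

# 3. Users who have rated at least 50 movies (identifying their favourite
#    genres would additionally require the genre columns of u.item, omitted here)
active_users = spark.sql("""
    SELECT user_id, COUNT(*) AS num_ratings
    FROM ratings
    GROUP BY user_id
    HAVING COUNT(*) >= 50
""")

# 4. Users younger than 20
young_users = spark.sql("SELECT * FROM users WHERE age < 20")

# 5. Scientists aged between 30 and 40
scientists = spark.sql("""
    SELECT * FROM users
    WHERE occupation = 'scientist' AND age BETWEEN 30 AND 40
""")

top_ten.show(10)
```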
To run this project, you'll need Docker installed. The project uses a containerized environment with a custom Docker network to connect Apache Zeppelin and Apache Cassandra. The setup is included in the JSON file.
- Docker: Containerization platform.
- Apache Spark2: Distributed data processing engine.
- Apache Cassandra: NoSQL database for scalable data storage.
- Apache Zeppelin: A web-based notebook for interactive data analytics.
- Python: The programming language for the wrapper script.
First, you need to download the MovieLens 100K Dataset from the official GroupLens website: https://grouplens.org/datasets/movielens/
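A minimal sketch for fetching and unpacking the dataset with Python is shown below; the archive URL is assumed to be the standard GroupLens download link for the ml-100k archive.

```python
import urllib.request
import zipfile

# Assumed download URL for the MovieLens 100K archive
DATASET_URL = "https://files.grouplens.org/datasets/movielens/ml-100k.zip"

urllib.request.urlretrieve(DATASET_URL, "ml-100k.zip")
with zipfile.ZipFile("ml-100k.zip") as archive:
    # Extracts an ml-100k/ directory containing u.user, u.item, and u.data
    archive.extractall(".")
```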
Make sure you have Docker installed and running. This project uses a Docker-based setup for Zeppelin and Cassandra.
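For reference, here is a minimal sketch of how a Spark session in Zeppelin can be pointed at the Cassandra container over the shared Docker network. It assumes the Cassandra container is reachable under the hostname `cassandra` and that the spark-cassandra-connector package is available to the interpreter; your container name and ports may differ.

```python
from pyspark.sql import SparkSession

# "cassandra" is assumed to be the container name (and therefore the hostname)
# of the Cassandra node on the shared Docker network.
spark = (
    SparkSession.builder
    .appName("movielens-analysis")
    .config("spark.cassandra.connection.host", "cassandra")
    .config("spark.cassandra.connection.port", "9042")
    .getOrCreate()
)
```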
The core of this project is a Python script that acts as a wrapper. It orchestrates the entire data flow and analysis process.
The script performs the following tasks (a condensed sketch follows the list):
- Library Import: Imports necessary Python libraries for Spark and Cassandra sessions.
- Data Parsing: Parses the `u.user`, `u.item`, and `u.data` files.
- RDD and DataFrame Creation: Creates RDD objects from the data and converts them into Spark DataFrames.
- Cassandra Integration: Writes the DataFrames into a new keyspace table in Cassandra.
- Query Execution: Reads the data back from Cassandra into a new DataFrame and executes Spark2 SQL and CQL queries to answer the project goals.
- Output: Displays the top ten results for each question.
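The sketch below condenses this flow into a few steps. The keyspace name (`movielens`), table names, and column names are illustrative assumptions, as is the Cassandra hostname `cassandra`; they are not taken from the actual script, and `u.item` parsing (also pipe-delimited) is omitted for brevity.

```python
from pyspark.sql import SparkSession, Row

spark = (
    SparkSession.builder
    .config("spark.cassandra.connection.host", "cassandra")
    .getOrCreate()
)

# --- Data parsing: build RDDs of Rows from the raw MovieLens files ---
def parse_user(line):
    fields = line.split("|")  # u.user is pipe-delimited
    return Row(user_id=int(fields[0]), age=int(fields[1]),
               gender=fields[2], occupation=fields[3], zip=fields[4])

def parse_rating(line):
    fields = line.split("\t")  # u.data is tab-delimited
    return Row(user_id=int(fields[0]), item_id=int(fields[1]),
               rating=int(fields[2]), timestamp=int(fields[3]))

users_df = spark.createDataFrame(
    spark.sparkContext.textFile("ml-100k/u.user").map(parse_user))
ratings_df = spark.createDataFrame(
    spark.sparkContext.textFile("ml-100k/u.data").map(parse_rating))

# --- Cassandra integration: write the DataFrames into keyspace tables ---
# (the keyspace and tables are assumed to have been created beforehand via CQL)
users_df.write.format("org.apache.spark.sql.cassandra") \
    .options(table="users", keyspace="movielens").mode("append").save()
ratings_df.write.format("org.apache.spark.sql.cassandra") \
    .options(table="ratings", keyspace="movielens").mode("append").save()

# --- Read the data back from Cassandra and query it with Spark SQL ---
spark.read.format("org.apache.spark.sql.cassandra") \
    .options(table="ratings", keyspace="movielens").load() \
    .createOrReplaceTempView("ratings")

spark.sql("""
    SELECT item_id, AVG(rating) AS avg_rating
    FROM ratings
    GROUP BY item_id
    ORDER BY avg_rating DESC
""").show(10)
```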