A repository for Project 3 Data Management - utilising Cassandra Query Language and SQL
This project analyzes the MovieLens 100K Dataset using Cassandra and Spark. The dataset, collected by the GroupLens Research Project at the University of Minnesota, consists of 100,000 ratings from 943 users on 1,682 movies. The analysis uses both Cassandra Query Language (CQL) and Spark2 Structured Query Language (SQL) to answer key questions about user demographics and movie ratings.
The analysis in this project is focused on answering the following questions (a sketch of the corresponding queries follows the list):
- Calculate the average rating for each movie.
- Identify the top ten movies with the highest average ratings.
- Find users who have rated at least 50 movies and identify their favorite movie genres.
- Find all users who are less than 20 years old.
- Find all users whose occupation is "scientist" and whose age is between 30 and 40 years old.
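As a rough illustration, the questions above map onto Spark SQL as in the sketch below. This is a sketch only: it assumes the ratings, users, and item data have already been registered as temporary views named `ratings`, `users`, and `items`, with column names (`user_id`, `item_id`, `rating`, `age`, `occupation`, `title`) taken from the MovieLens 100K documentation; the actual view and column names used in the notebook may differ.

```python
from pyspark.sql import SparkSession

# In Zeppelin a SparkSession is already provided as `spark`; this line only
# matters when running the sketch standalone.
spark = SparkSession.builder.getOrCreate()

# 1. Average rating for each movie
avg_ratings = spark.sql("""
    SELECT item_id, AVG(rating) AS avg_rating
    FROM ratings
    GROUP BY item_id
""")

# 2. Top ten movies with the highest average ratings
top_ten = spark.sql("""
    SELECT i.title, AVG(r.rating) AS avg_rating
    FROM ratings r JOIN items i ON r.item_id = i.item_id
    GROUP BY i.title
    ORDER BY avg_rating DESC
    LIMIT 10
""")

# 3. Users who have rated at least 50 movies (identifying their favourite
#    genres would additionally require the genre columns of u.item, omitted here)
active_users = spark.sql("""
    SELECT user_id, COUNT(*) AS num_ratings
    FROM ratings
    GROUP BY user_id
    HAVING COUNT(*) >= 50
""")

# 4. Users younger than 20
young_users = spark.sql("SELECT * FROM users WHERE age < 20")

# 5. Scientists aged between 30 and 40
scientists = spark.sql("""
    SELECT * FROM users
    WHERE occupation = 'scientist' AND age BETWEEN 30 AND 40
""")

top_ten.show(10)
```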
To run this project, you'll need Docker installed. The project uses a containerized environment with a custom Docker network to connect Apache Zeppelin and Apache Cassandra. The setup is included in the JSON file.
- Docker: Containerization platform.
- Apache Spark2: Distributed data processing engine.
- Apache Cassandra: NoSQL database for scalable data storage.
- Apache Zeppelin: A web-based notebook for interactive data analytics.
- Python: The programming language for the wrapper script.
First, you need to download the MovieLens 100K Dataset from the official GroupLens website: https://grouplens.org/datasets/movielens/
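A minimal sketch for fetching and unpacking the dataset with Python is shown below; the archive URL is assumed to be the standard GroupLens download link for the ml-100k archive.

```python
import urllib.request
import zipfile

# Assumed download URL for the MovieLens 100K archive
DATASET_URL = "https://files.grouplens.org/datasets/movielens/ml-100k.zip"

urllib.request.urlretrieve(DATASET_URL, "ml-100k.zip")
with zipfile.ZipFile("ml-100k.zip") as archive:
    # Extracts an ml-100k/ directory containing u.user, u.item, and u.data
    archive.extractall(".")
```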
Make sure you have Docker installed and running. This project uses a Docker-based setup for Zeppelin and Cassandra.
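For reference, here is a minimal sketch of how a Spark session in Zeppelin can be pointed at the Cassandra container over the shared Docker network. It assumes the Cassandra container is reachable under the hostname `cassandra` and that the spark-cassandra-connector package is available to the interpreter; your container name and ports may differ.

```python
from pyspark.sql import SparkSession

# "cassandra" is assumed to be the container name (and therefore the hostname)
# of the Cassandra node on the shared Docker network.
spark = (
    SparkSession.builder
    .appName("movielens-analysis")
    .config("spark.cassandra.connection.host", "cassandra")
    .config("spark.cassandra.connection.port", "9042")
    .getOrCreate()
)
```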
The core of this project is a Python script that acts as a wrapper. It orchestrates the entire data flow and analysis process.
The script performs the following tasks (a condensed sketch follows the list):
- Library Import: Imports necessary Python libraries for Spark and Cassandra sessions.
- Data Parsing: Parses the `u.user`, `u.item`, and `u.data` files.
- RDD and DataFrame Creation: Creates RDD objects from the data and converts them into Spark DataFrames.
- Cassandra Integration: Writes the DataFrames into a new keyspace table in Cassandra.
- Query Execution: Reads the data back from Cassandra into a new DataFrame and executes Spark2 SQL and CQL queries to answer the project goals.
- Output: Displays the top ten results for each question.
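The sketch below condenses this flow into a few steps. The keyspace name (`movielens`), table names, and column names are illustrative assumptions, as is the Cassandra hostname `cassandra`; they are not taken from the actual script, and `u.item` parsing (also pipe-delimited) is omitted for brevity.

```python
from pyspark.sql import SparkSession, Row

spark = (
    SparkSession.builder
    .config("spark.cassandra.connection.host", "cassandra")
    .getOrCreate()
)

# --- Data parsing: build RDDs of Rows from the raw MovieLens files ---
def parse_user(line):
    fields = line.split("|")  # u.user is pipe-delimited
    return Row(user_id=int(fields[0]), age=int(fields[1]),
               gender=fields[2], occupation=fields[3], zip=fields[4])

def parse_rating(line):
    fields = line.split("\t")  # u.data is tab-delimited
    return Row(user_id=int(fields[0]), item_id=int(fields[1]),
               rating=int(fields[2]), timestamp=int(fields[3]))

users_df = spark.createDataFrame(
    spark.sparkContext.textFile("ml-100k/u.user").map(parse_user))
ratings_df = spark.createDataFrame(
    spark.sparkContext.textFile("ml-100k/u.data").map(parse_rating))

# --- Cassandra integration: write the DataFrames into keyspace tables ---
# (the keyspace and tables are assumed to have been created beforehand via CQL)
users_df.write.format("org.apache.spark.sql.cassandra") \
    .options(table="users", keyspace="movielens").mode("append").save()
ratings_df.write.format("org.apache.spark.sql.cassandra") \
    .options(table="ratings", keyspace="movielens").mode("append").save()

# --- Read the data back from Cassandra and query it with Spark SQL ---
spark.read.format("org.apache.spark.sql.cassandra") \
    .options(table="ratings", keyspace="movielens").load() \
    .createOrReplaceTempView("ratings")

spark.sql("""
    SELECT item_id, AVG(rating) AS avg_rating
    FROM ratings
    GROUP BY item_id
    ORDER BY avg_rating DESC
""").show(10)
```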