This project implements a movie recommender system that combines collaborative filtering (ALS and item-item) with a supervised learning component. It is built with Apache Spark and PySpark so that large-scale movie rating data can be processed in a distributed fashion.
- Alternating Least Squares (ALS) Model
  - Implements collaborative filtering using the ALS algorithm
  - Uses cross-validation for hyperparameter tuning
  - Evaluates performance using RMSE, MSE, and MAP metrics
- Item-Item Collaborative Filtering
  - Computes movie similarities using Pearson correlation
  - Generates recommendations based on item similarities
- Hybrid Recommender System
  - Combines ALS and Item-Item CF predictions
  - Incorporates a supervised learning component (Random Forest Regressor)
  - Optimizes weights for different model components
- Evaluation Metrics
  - Root Mean Square Error (RMSE)
  - Mean Square Error (MSE)
  - Mean Average Precision (MAP)

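
The item-item step listed above can be illustrated in isolation. The following is a minimal sketch in plain Python rather than Spark, not the project's actual implementation: the function names `pearson` and `predict`, the dictionary layout, and the choice to use only positively correlated neighbours are all assumptions made for the example.

```python
from math import sqrt


def pearson(ratings_a, ratings_b):
    """Pearson correlation between two movies over the users who rated both.

    Each argument maps user id -> rating for one movie. Returns 0.0 when
    fewer than two users overlap or when either side has zero variance.
    """
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0
    xs = [ratings_a[u] for u in common]
    ys = [ratings_b[u] for u in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def predict(user, target, item_ratings):
    """Similarity-weighted average of the user's ratings on similar movies.

    Assumption for this sketch: only positively correlated movies contribute.
    Returns None when the user has rated no such movie.
    """
    num = den = 0.0
    for item, ratings in item_ratings.items():
        if item == target or user not in ratings:
            continue
        sim = pearson(item_ratings[target], ratings)
        if sim > 0:
            num += sim * ratings[user]
            den += sim
    return num / den if den else None


# Toy data (hypothetical): movie -> {user id: rating}
item_ratings = {
    "A": {1: 5.0, 2: 3.0, 3: 1.0},
    "B": {1: 4.0, 2: 3.0, 3: 2.0, 4: 5.0},
    "C": {1: 1.0, 2: 3.0, 3: 5.0, 4: 1.0},
}
prediction = predict(4, "A", item_ratings)
```

In the toy data, movie B moves with A and movie C moves against it, so user 4's prediction for A is driven entirely by their rating of B.
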
The project uses the MovieLens 25M dataset, which includes:
- User ratings for movies
- Movie metadata (title, genres)
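
In MovieLens, the genres of a movie are stored as a single pipe-separated string. A supervised model trained on genres needs them as numeric features, and one common option is a multi-hot vector. The sketch below assumes that encoding; the helper names `genre_vocabulary` and `multi_hot` are hypothetical, and the three rows are only samples in the MovieLens `movieId, title, genres` layout.

```python
def genre_vocabulary(movies):
    """Sorted list of every genre that appears in (movie_id, title, genres) rows."""
    vocab = set()
    for _, _, genres in movies:
        vocab.update(genres.split("|"))
    return sorted(vocab)


def multi_hot(genres, vocab):
    """Encode a pipe-separated genre string as a 0/1 vector aligned with vocab."""
    present = set(genres.split("|"))
    return [1 if genre in present else 0 for genre in vocab]


# Sample rows in the MovieLens metadata layout
movies = [
    (1, "Toy Story (1995)", "Adventure|Animation|Children|Comedy|Fantasy"),
    (2, "Jumanji (1995)", "Adventure|Children|Fantasy"),
    (3, "Grumpier Old Men (1995)", "Comedy|Romance"),
]

vocab = genre_vocabulary(movies)
features = {movie_id: multi_hot(genres, vocab) for movie_id, _, genres in movies}
```

Each vector has one position per genre in `vocab`, so every movie gets a feature vector of the same length regardless of how many genres it has.
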

The pipeline has the following stages:
- Data Preparation: The dataset is split into training (80%) and test (20%) sets.
- ALS Model: Implemented using Spark's MLlib with hyperparameter tuning.
- Item-Item CF: Custom implementation of similarity computation and prediction generation.
- Hybrid Model:
  - First version combines ALS and Item-Item CF
  - Second version adds a Random Forest Regressor trained on movie genres
- Evaluation: RMSE, MSE, and MAP are computed, with a focus on RMSE for the hybrid model.

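
The data preparation and ALS stages above might look roughly like the following PySpark sketch. It is an illustration under stated assumptions, not the code in the project script: the function name `tune_als`, the grid values in `ALS_GRID`, the fold count, and the seed are hypothetical, and the column names assume the MovieLens ratings layout (`userId`, `movieId`, `rating`).

```python
# Hypothetical search space; the values tuned in the project are not stated.
ALS_GRID = {"rank": [10, 50], "regParam": [0.01, 0.1]}


def tune_als(ratings, seed=42, num_folds=3):
    """Split 80/20, cross-validate ALS on the training part, and
    return (fitted CrossValidatorModel, RMSE on the held-out test set).

    `ratings` is a Spark DataFrame with userId, movieId and rating columns.
    """
    # Imported inside the function so this file can be loaded without Spark.
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.recommendation import ALS
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    train, test = ratings.randomSplit([0.8, 0.2], seed=seed)

    als = ALS(
        userCol="userId",
        itemCol="movieId",
        ratingCol="rating",
        coldStartStrategy="drop",  # drop NaN predictions for unseen users/movies
    )
    grid = (
        ParamGridBuilder()
        .addGrid(als.rank, ALS_GRID["rank"])
        .addGrid(als.regParam, ALS_GRID["regParam"])
        .build()
    )
    evaluator = RegressionEvaluator(
        metricName="rmse", labelCol="rating", predictionCol="prediction"
    )
    cv = CrossValidator(
        estimator=als,
        estimatorParamMaps=grid,
        evaluator=evaluator,
        numFolds=num_folds,
        seed=seed,
    )
    model = cv.fit(train)  # refits the best parameter set on all of train
    return model, evaluator.evaluate(model.transform(test))
```

Calling `tune_als` requires an active `SparkSession` and a ratings DataFrame, for example one read with `spark.read.csv(..., header=True, inferSchema=True)`. Setting `coldStartStrategy="drop"` matters for the RMSE: without it, users or movies that appear only in the test split produce NaN predictions and the metric becomes NaN.
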
The project demonstrates the effectiveness of hybrid approaches in recommendation systems:
- ALS Model RMSE: 0.845
- Hybrid Model (ALS + Item-Item CF) RMSE: 0.16
- Hybrid Model (ALS + Item-Item CF + Random Forest) RMSE: 0.43
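
To make the idea of optimizing component weights concrete, here is a minimal plain-Python sketch of a two-model blend whose weight is chosen by grid search on RMSE. The function names, the linear blend, and the toy numbers are assumptions for illustration; they are unrelated to the figures reported above and do not describe how the project actually fits its weights.

```python
from math import sqrt


def rmse(predicted, actual):
    """Root mean square error between two equal-length sequences."""
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def blend(als_preds, cf_preds, weight):
    """Linear blend: `weight` on ALS, `1 - weight` on item-item CF."""
    return [weight * a + (1 - weight) * c for a, c in zip(als_preds, cf_preds)]


def best_weight(als_preds, cf_preds, actual, steps=10):
    """Try weights 0, 1/steps, ..., 1 and return the one with the lowest RMSE."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda w: rmse(blend(als_preds, cf_preds, w), actual))


# Toy data (hypothetical): ALS overshoots by 0.5, item-item CF undershoots by 0.5
actual = [4.0, 3.0, 5.0, 2.0]
als_preds = [4.5, 3.5, 5.5, 2.5]
cf_preds = [3.5, 2.5, 4.5, 1.5]

weight = best_weight(als_preds, cf_preds, actual)
blended_rmse = rmse(blend(als_preds, cf_preds, weight), actual)
```

Because the two toy models err in opposite directions, an even blend cancels their errors. In practice the weight should be chosen on a validation split rather than on the test set, otherwise the reported RMSE is optimistic.
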

Technologies used:
- Apache Spark
- PySpark
- PySpark MLlib
- Python

To run the project:
- Ensure you have Apache Spark and PySpark installed.
- Download the MovieLens 25M dataset.
- Update the file paths in the `movie_recommender_system.py` script to point to your dataset location.
- Run the script with `spark-submit movie_recommender_system.py`, or execute it in a PySpark environment.

Possible future improvements:
- Incorporate more features like user demographics and movie metadata
- Experiment with deep learning models for recommendation
- Implement real-time recommendation updates

In this project, combining ALS, item-item collaborative filtering, and a genre-based Random Forest produced a lower RMSE than ALS alone.