A machine learning-based movie recommendation system that uses collaborative filtering through user and movie clustering. The system provides a REST API built with FastAPI for easy integration.
- Overview
- System Architecture
- Machine Learning Model
- API Documentation
- Installation
- Usage
- Data Requirements
- Project Structure
This recommendation system uses a dual clustering approach:
- User Clustering: Groups users with similar preferences
- Movie Clustering: Groups movies with similar characteristics
By matching a user's cluster with movies from similar users' preferred clusters, the system generates personalized recommendations.
┌─────────────────────────────────────────────────────────────┐
│                       FastAPI Server                        │
│  ┌────────────┐  ┌──────────────┐  ┌─────────────────────┐  │
│  │   Movies   │  │    Users     │  │   Recommendations   │  │
│  │  Endpoint  │  │   Endpoint   │  │      Endpoint       │  │
│  └────────────┘  └──────────────┘  └─────────────────────┘  │
└──────────────────────────────┬──────────────────────────────┘
                               │
              ┌────────────────┴────────────────┐
              │                                 │
      ┌───────▼────────┐              ┌─────────▼────────┐
      │ User Clustering│              │ Movie Clustering │
      │     Model      │              │      Model       │
      │  (model_user)  │              │  (model_movie)   │
      └────────────────┘              └──────────────────┘
The system uses two independent machine learning models:
- User Clustering Model (model_user.jlp): Clusters users based on their rating patterns, preferences, and genre interactions
  - Features:
    - Genre preferences (19 genres: Action, Adventure, Animation, etc.)
    - Tag categories (Genre & Style, Themes & Tropes, Actors & Characters, Viewing & Production)
    - Normalized rating behavior
- Movie Clustering Model (model_movie.jlp): Clusters movies based on their attributes and characteristics
  - Features:
    - Genre classification (19 genres)
    - Tag categories
    - Rating statistics
    - Relevance scores
Genre encoding converts multi-genre labels into binary features:
Genres: "Action|Adventure|Sci-Fi"
→ [Action: 1, Adventure: 1, Animation: 0, ..., Sci-Fi: 1, ...]
Supported genres:
- Action, Adventure, Animation, Children, Comedy
- Crime, Documentary, Drama, Fantasy, Film-Noir
- Horror, IMAX, Musical, Mystery, Romance
- Sci-Fi, Thriller, War, Western
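The genre encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the names `GENRES` and `encode_genres` are hypothetical, and ignoring unknown labels is an assumption.

```python
from typing import Dict

# The 19 supported genres, in the order listed above.
GENRES = [
    "Action", "Adventure", "Animation", "Children", "Comedy",
    "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir",
    "Horror", "IMAX", "Musical", "Mystery", "Romance",
    "Sci-Fi", "Thriller", "War", "Western",
]


def encode_genres(genres: str) -> Dict[str, int]:
    """Turn a pipe-separated genre string into 0/1 flags, one per known genre."""
    present = {g.strip() for g in genres.split("|") if g.strip()}
    # Assumption: labels that are not in GENRES are silently ignored.
    return {genre: int(genre in present) for genre in GENRES}


encoded = encode_genres("Action|Adventure|Sci-Fi")
print([g for g, flag in encoded.items() if flag])  # ['Action', 'Adventure', 'Sci-Fi']
```

Keeping the genre list in a fixed order means every movie and user vector has the same 19 columns, which a clustering model needs.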
User-generated tags are categorized into 4 main groups:
- Genre & Style: Action-related, horror, comedy, etc.
- Themes & Tropes: Time travel, psychological, dystopia, etc.
- Actors & Characters: Director names, character types, etc.
- Viewing & Production: Watch context, production quality, etc.
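This README does not show how tags are assigned to the four groups. One simple possibility is a keyword lookup, sketched below. The keyword sets and the names `TAG_KEYWORDS` and `categorize_tag` are hypothetical; only the four category names and a few of the example keywords come from the list above.

```python
from typing import Dict, Optional, Set

# Hypothetical keyword sets; the project's real mapping may differ.
TAG_KEYWORDS: Dict[str, Set[str]] = {
    "Genre & Style": {"action", "horror", "comedy"},
    "Themes & Tropes": {"time travel", "psychological", "dystopia"},
    "Actors & Characters": {"director", "hero", "villain"},
    "Viewing & Production": {"rewatch", "cinematography", "soundtrack"},
}


def categorize_tag(tag: str) -> Optional[str]:
    """Return the first category with a keyword contained in the tag, else None."""
    normalized = tag.strip().lower()
    for category, keywords in TAG_KEYWORDS.items():
        if any(keyword in normalized for keyword in keywords):
            return category
    return None


print(categorize_tag("Time Travel"))   # Themes & Tropes
print(categorize_tag("dark comedy"))   # Genre & Style
```

Substring matching is crude (a tag can match several groups, and the first match wins), so a real implementation would likely need a curated mapping.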
Ratings are standardized using StandardScaler:
normalized_rating = (rating - mean) / std_dev
Handles edge cases:
- Missing values → filled with mean
- Zero variance → returns zeros
- Empty data → graceful handling
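The formula and edge cases above can be sketched in plain Python. This is an illustrative stand-in rather than the project's `handle_rating()` function, whose code is not shown here: the name `normalize_ratings` and the tolerance value are assumptions. It uses the population standard deviation, which is what scikit-learn's StandardScaler computes.

```python
import math
from typing import List, Optional


def normalize_ratings(ratings: List[Optional[float]]) -> List[float]:
    """Standardize ratings as (rating - mean) / std_dev, covering the edge cases."""
    if not ratings:
        return []  # empty data: nothing to normalize

    known = [r for r in ratings if r is not None and not math.isnan(r)]
    if not known:
        return [0.0] * len(ratings)  # no usable value at all

    mean = sum(known) / len(known)
    # Missing values are filled with the mean, so they normalize to 0.
    filled = [mean if r is None or math.isnan(r) else r for r in ratings]

    # Population variance (divide by n).
    variance = sum((r - mean) ** 2 for r in filled) / len(filled)
    # Assumed tolerance: treat tiny variance as zero to avoid dividing by
    # floating-point noise.
    if variance < 1e-12:
        return [0.0] * len(filled)

    std_dev = math.sqrt(variance)
    return [(r - mean) / std_dev for r in filled]


print(normalize_ratings([1.0, 3.0]))  # [-1.0, 1.0]
print(normalize_ratings([4.0, 4.0]))  # [0.0, 0.0]
```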
1. Load user profile → Extract features → Predict user cluster
2. Find all users in same cluster
3. Get movies watched by cluster members
4. For each movie:
- Extract movie features
- Predict movie cluster
5. Return movies from predicted clusters
6. Deduplicate by movieId
7. Apply pagination/limits
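The steps above can be sketched as a single function over precomputed data. This is a simplified illustration under stated assumptions, not the project's implementation: user clusters are passed in as a dictionary instead of being predicted from features, `predict_movie_cluster` stands in for the movie model, and step 5 is read as "keep catalog movies whose cluster matches a cluster of the peers' movies". All names are hypothetical.

```python
from typing import Callable, Dict, List


def recommend(
    user_id: int,
    user_clusters: Dict[int, int],            # userId -> user cluster (step 1)
    watched: Dict[int, List[dict]],           # userId -> movies that user rated
    catalog: List[dict],                      # all known movies
    predict_movie_cluster: Callable[[dict], int],
    limit: int = 50,
) -> List[dict]:
    if user_id not in user_clusters:
        raise KeyError(f"unknown user {user_id}")

    # Step 2: other users in the same cluster.
    cluster = user_clusters[user_id]
    peers = [u for u, c in user_clusters.items() if c == cluster and u != user_id]

    # Step 3: movies watched by cluster members.
    peer_movies = [m for u in peers for m in watched.get(u, [])]

    # Step 4: predict a cluster for each of those movies.
    target_clusters = {predict_movie_cluster(m) for m in peer_movies}

    # Steps 5-6: keep catalog movies from those clusters, unique by movieId.
    seen = set()
    results = []
    for movie in catalog:
        if predict_movie_cluster(movie) not in target_clusters:
            continue
        if movie["movieId"] in seen:
            continue
        seen.add(movie["movieId"])
        results.append(movie)

    # Step 7: apply the limit.
    return results[:limit]


# Toy data: users 1 and 2 share a cluster; a stub "model" splits on genre.
user_clusters = {1: 0, 2: 0, 3: 1}
watched = {2: [{"movieId": 10, "genres": "Action"}],
           3: [{"movieId": 20, "genres": "Drama"}]}
catalog = [{"movieId": 10, "genres": "Action"}, {"movieId": 11, "genres": "Action"},
           {"movieId": 20, "genres": "Drama"}, {"movieId": 11, "genres": "Action"}]
stub_model = lambda movie: 0 if "Action" in movie["genres"] else 1

recs = recommend(1, user_clusters, watched, catalog, stub_model)
print([m["movieId"] for m in recs])  # [10, 11]
```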
- Caching: User/movie data and cluster assignments cached in memory
- Pagination: Default limit of 50 movies to prevent large payloads
- Deduplication: Ensures unique movieId in recommendations
- Lazy Loading: Models loaded once on first request
Base URL: http://localhost:8000
GET /movies?offset=0&limit=100
Query Parameters:
- offset (int, default: 0): Starting position
- limit (int, default: 100, max: 1000): Number of results
Response:
[
  {
    "movieId": 1,
    "title": "Toy Story (1995)",
    "genres": "Adventure|Animation|Children|Comedy|Fantasy",
    "rating": 4.5,
    ...
  },
  ...
]
Example:
curl "http://localhost:8000/movies?offset=0&limit=10"
GET /users?offset=0&limit=100
Query Parameters:
- offset (int, default: 0): Starting position
- limit (int, default: 100, max: 1000): Number of results
Response:
[
  {
    "userId": 1,
    "movieId": 123,
    "rating": 4.0,
    "genres": "Action|Thriller",
    ...
  },
  ...
]
Example:
curl "http://localhost:8000/users?offset=0&limit=10"
POST /movies/{user_id}?limit=50&users_limit=50
Path Parameters:
- user_id (int, required): The user ID to get recommendations for
Query Parameters:
- limit (int, default: 50): Number of recommended movies
- users_limit (int, default: 50): Number of similar users to consider
Response:
{
  "recommended_movies": [
    {
      "movieId": 456,
      "title": "The Matrix (1999)",
      "genres": "Action|Sci-Fi|Thriller",
      ...
    },
    ...
  ],
  "users_class": [
    {
      "userId": 23,
      "rating": 4.5,
      ...
    },
    ...
  ],
  "user_class_name": "2"
}
Example:
curl -X POST "http://localhost:8000/movies/1?limit=20&users_limit=30"
Error Responses:
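404 Not Found:
{
  "detail": "Utilisateur non trouvé. Ce code sera optimsé pour générer une recommandation même pour un utilisateur non présent dans la base de données"
}
The message is returned in French; in English it reads: "User not found. This code will be optimized to generate a recommendation even for a user who is not in the database."
400 Bad Request (invalid pagination):
{
  "detail": "Invalid pagination params"
}
FastAPI provides automatic interactive documentation: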
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
- Python 3.8+
- pip or conda
- Clone the repository:
  git clone https://github.com/RYV8/Recommendation_syteme.git
  cd Recommendation_syteme
- Create virtual environment:
  python -m venv env
  source env/bin/activate # On Windows: env\Scripts\activate
- Install dependencies:
  cd backend
  pip install -r requirements.txt
- Prepare data:
  Place your datasets in backend/data/:
  - movies_dataset_uncleaned.csv
  - user_dataset_uncleaned.csv
- Prepare models:
  Place trained models in backend/models/:
  - model_user.jlp
  - model_movie.jlp
cd backend/api
uvicorn main:app --reload --host 0.0.0.0 --port 8000
The API will be available at http://localhost:8000
import requests
# Get movies
response = requests.get("http://localhost:8000/movies?limit=10")
movies = response.json()
# Get recommendations for user
response = requests.post("http://localhost:8000/movies/1?limit=20")
recommendations = response.json()
print(f"User cluster: {recommendations['user_class_name']}")
print(f"Recommended {len(recommendations['recommended_movies'])} movies")
for movie in recommendations['recommended_movies'][:5]:
    print(f" - {movie['title']}")
// Get recommendations
fetch('http://localhost:8000/movies/1?limit=20', {
  method: 'POST'
})
  .then(response => response.json())
  .then(data => {
    console.log('Recommendations:', data.recommended_movies);
    console.log('Similar users:', data.users_class);
  });
# Get 10 movies
curl "http://localhost:8000/movies?limit=10"
# Get recommendations for user 42
curl -X POST "http://localhost:8000/movies/42?limit=20"
# Get users with pagination
curl "http://localhost:8000/users?offset=100&limit=50"
Movies dataset (movies_dataset_uncleaned.csv):
movieId,title,genres,rating,tag,tagId,relevance,tagger_userId,rater_userId
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,4.5,pixar,1,0.8,123,456
Columns (required unless marked optional):
- movieId: Unique movie identifier
- title: Movie title with year
- genres: Pipe-separated genres
- rating: Average rating (optional, will be normalized)
- tag: User-generated tag (optional)
User dataset (user_dataset_uncleaned.csv):
userId,movieId,rating,genres,user_tag
1,31,2.5,Crime|Drama,smart
Columns (required unless marked optional):
- userId: Unique user identifier
- movieId: Movie the user interacted with
- rating: User's rating
- genres: Movie genres
- user_tag: User's tag (optional)
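Before starting the server, it can help to check that a dataset has the expected header. The sketch below uses only the standard library and checks the non-optional columns listed above. The names `MOVIE_COLUMNS`, `USER_COLUMNS`, and `missing_columns` are hypothetical and not part of the project.

```python
import csv
import io
from typing import Iterable, List

# Columns from the lists above, leaving out the ones marked optional.
MOVIE_COLUMNS = ["movieId", "title", "genres"]
USER_COLUMNS = ["userId", "movieId", "rating", "genres"]


def missing_columns(csv_text: str, required: Iterable[str]) -> List[str]:
    """Return the required column names that are absent from the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


sample = "userId,movieId,rating,genres,user_tag\n1,31,2.5,Crime|Drama,smart\n"
print(missing_columns(sample, USER_COLUMNS))   # []
print(missing_columns(sample, MOVIE_COLUMNS))  # ['title']
```

For a file on disk, the same check works by passing the file's text, for example `missing_columns(open(path, encoding="utf-8").read(), USER_COLUMNS)`, though reading only the first line would be cheaper for large files.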
recommendation_systems/
├── README.md
├── LICENSE
├── .gitignore
├── backend/
│   ├── __init__.py
│   ├── api/
│   │   ├── __init__.py
│   │   └── main.py                 # FastAPI application
│   ├── core/
│   │   ├── __init__.py
│   │   ├── config.py               # Settings and configuration
│   │   └── errors.py               # Custom exceptions
│   ├── data/
│   │   ├── movies_dataset_uncleaned.csv
│   │   └── user_dataset_uncleaned.csv
│   ├── domain/
│   │   ├── __init__.py
│   │   ├── repositories.py         # Data access interfaces
│   │   ├── schemas.py              # Pydantic models
│   │   └── services.py             # Business logic interfaces
│   ├── infrastructure/
│   │   ├── __init__.py
│   │   ├── data_processing.py      # Feature engineering
│   │   ├── models.py               # ML model service
│   │   ├── processors.py           # Data processors
│   │   └── repositories.py         # Data access implementations
│   ├── models/
│   │   ├── model_user.jlp          # User clustering model
│   │   └── model_movie.jlp         # Movie clustering model
│   └── services/
│       ├── __init__.py
│       └── recommendations.py      # Recommendation logic
└── frontend/                       # (Future UI implementation)
Create a .env file in the root directory:
# API Configuration
API_HOST=0.0.0.0
API_PORT=8000
DEBUG=True
# Data Paths
DATA_DIR=backend/data
MODEL_DIR=backend/models
# Cache Settings
ENABLE_CACHE=True
# Pagination Defaults
DEFAULT_LIMIT=100
MAX_LIMIT=1000
Models are loaded automatically from backend/models/:
- model_user.jlp: Joblib-serialized scikit-learn model for user clustering
- model_movie.jlp: Joblib-serialized scikit-learn model for movie clustering
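The `.env` settings shown earlier could be read into a typed object along these lines. This sketch is not the contents of `backend/core/config.py`, which this README does not show. It assumes the variables are already present in the process environment (parsing the `.env` file itself is left out), and the names `Settings` and `load_settings` are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    api_host: str
    api_port: int
    debug: bool
    data_dir: str
    model_dir: str
    default_limit: int
    max_limit: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, falling back to the documented values."""
    return Settings(
        api_host=env.get("API_HOST", "0.0.0.0"),
        api_port=int(env.get("API_PORT", "8000")),
        # Assumption: "1", "true", and "yes" (any case) count as enabled.
        debug=env.get("DEBUG", "True").strip().lower() in {"1", "true", "yes"},
        data_dir=env.get("DATA_DIR", "backend/data"),
        model_dir=env.get("MODEL_DIR", "backend/models"),
        default_limit=int(env.get("DEFAULT_LIMIT", "100")),
        max_limit=int(env.get("MAX_LIMIT", "1000")),
    )


settings = load_settings({"API_PORT": "9000", "DEBUG": "false"})
print(settings.api_port, settings.debug, settings.max_limit)  # 9000 False 1000
```

Passing the environment as an argument keeps the function easy to test without touching real environment variables.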
Slow first request. Solution: Models and data are loaded on the first request. Subsequent requests use the cache and are faster.
Errors on zero-variance ratings. Solution: Already fixed! The handle_rating() function now handles zero-variance data gracefully.
Responses too large. Solution: Use pagination parameters:
curl "http://localhost:8000/movies?limit=50"
Duplicate movies in recommendations. Solution: Already fixed! Movies are deduplicated by movieId before returning.
- Use pagination: Always specify reasonable limit values
- Cache warmup: Make a test request on startup to load models
- Concurrent requests: FastAPI handles multiple requests efficiently
- Data size: Keep CSV files optimized (large files now ignored in git)
- Add user authentication
- Implement collaborative filtering with matrix factorization
- Add real-time model updates
- Create frontend dashboard
- Add A/B testing framework
- Implement recommendation explanations
- Add more sophisticated ranking algorithms
- Support for new user cold-start problem
This project is licensed under the terms included in the LICENSE file.
Contributions are welcome! Please feel free to submit a Pull Request.
For questions or support, please open an issue on GitHub: https://github.com/RYV8/Recommendation_syteme
Built with:
- FastAPI for the REST API
- scikit-learn for machine learning models
- pandas for data processing
- joblib for model serialization
- pydantic for data validation