- Problem Statement
- Overview
- How it Works
- Workflow
- Impact
- Features
- Setup
- Folder Structure
- Challenges & Solutions
- Future Improvements
- License
## Problem Statement

- With the rise of streaming services, users now have access to thousands of movies across platforms.
- As a result, many viewers spend more time browsing than watching content.
- This leads to frustration, lower satisfaction, and reduced watch time on the platform.
- Over time, this impacts both user retention and platform engagement.
## Overview

- Built a content-based movie recommender system trained on 5,000+ movie metadata records.
- Generated the top 5 similar titles for any selected movie in under 3 seconds using cosine similarity.
- Integrated the TMDB API to dynamically fetch and display movie posters, improving the user experience.
- Deployed the system as a Streamlit web app, enabling users to explore personalized movie suggestions.
## How it Works

- The dataset contains metadata for each movie, including title, keywords, genres, cast, crew, and overview.
- All the features are combined into a new column called `tags` to create a unified representation for each movie.
- Text preprocessing is applied to the `tags` column (a minimal sketch follows this list):
  - All text is converted to lowercase (e.g., `"Action, Thriller"` becomes `"action, thriller"`).
  - Spaces between words are removed (e.g., `"action movie"` becomes `"actionmovie"`).
  - Stemming is performed to reduce words to their root form (e.g., `"running"` becomes `"run"`).
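Here is a minimal sketch of this preprocessing, assuming NLTK's `PorterStemmer` for the stemming step; the function and variable names are illustrative, not necessarily the project's exact code:

```python
import pandas as pd
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def collapse_spaces(values):
    # "Science Fiction" -> "sciencefiction", keeping multi-word
    # entities as a single token
    return [value.replace(" ", "").lower() for value in values]

def stem_text(text):
    # Lowercase the text and reduce each word to its root form,
    # e.g. "running" -> "run"
    return " ".join(stemmer.stem(word) for word in text.lower().split())

# Hypothetical usage on a DataFrame with a combined "tags" column:
# movies["tags"] = movies["tags"].apply(stem_text)
```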
- `CountVectorizer` is used to convert the `tags` column into numerical feature vectors.
- `cosine_similarity` is used to calculate similarity between the vector representations of all the movies.
- The resulting similarity matrix is serialized and saved as a `.pkl` file for efficient loading during recommendation.
- A Streamlit web application is built to provide an interactive interface for movie selection and recommendation (a sketch of the full pipeline follows this list):
  - User selects a movie from the dropdown list.
  - The system recommends the top 5 most similar movies based on the similarity scores.
  - Movie posters are fetched using the TMDB API to enhance the visual appeal of the recommendations.
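The core pipeline can be sketched as follows. `movies` is an assumed DataFrame with the preprocessed `tags` column, and settings like `max_features` are illustrative choices, not necessarily the project's exact parameters:

```python
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# movies: DataFrame with "title" and preprocessed "tags" columns (assumed)

# Convert the "tags" column into sparse count vectors
vectorizer = CountVectorizer(max_features=5000, stop_words="english")
vectors = vectorizer.fit_transform(movies["tags"])

# Pairwise cosine similarity between all movie vectors
similarity = cosine_similarity(vectors)

# Serialize the matrix so the app can load it without recomputing
with open("similarity.pkl", "wb") as f:
    pickle.dump(similarity, f)

def recommend(title, n=5):
    # Find the selected movie's row, then return the n most similar
    # titles (index 0 is the movie itself, so it is skipped)
    idx = movies[movies["title"] == title].index[0]
    scores = sorted(enumerate(similarity[idx]), key=lambda x: x[1], reverse=True)
    return [movies.iloc[i]["title"] for i, _ in scores[1 : n + 1]]
```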
## Workflow

Access the Streamlit Web Application here or click on the image below.
## Impact

- Reduced browsing time by instantly suggesting the top 5 most similar movies for any selected title.
- Delivered movie recommendations in under 3 seconds, ensuring a fast and smooth user experience.
- Improved content engagement by guiding users toward relevant titles instead of manual browsing.
- Served 100+ users through a deployed web app, turning a notebook model into a live recommendation system.
## Features

- The project follows a modular approach by organizing modules into a dedicated `utils/` directory.
- Each module in the `utils/` directory is responsible for a specific task and includes:
  - Clear docstrings explaining functionality, expected inputs/outputs, returns, and raises.
  - Robust exception handling for better debugging and maintainability.
- Following the DRY (Don't Repeat Yourself) principle, this design:
  - Reuses functions across notebooks and scripts without rewriting code.
  - Saves development time and reduces redundancy.
- The `utils/` directory also includes an `__init__.py` file, which serves a few important purposes in Python:
  - The `__init__.py` file tells Python to treat the directory as a package, not just a regular folder.
  - Without it, Python won't recognize the folder as a package.
- To access these utility modules anywhere in the project, add the following snippet at the top of your script:

```python
import sys, os
sys.path.append(os.path.abspath("../utils"))
```

- This is one of the functions I added to my project as the `export_data.py` module in the `utils/` directory.
```python
import os
import pandas as pd

def export_as_csv(dataframe, folder_name, file_name):
    """
    Exports a pandas DataFrame as a CSV file to a specified folder.

    Parameters:
        dataframe (pd.DataFrame): The DataFrame to export.
        folder_name (str): Name of the folder where the CSV file will be saved.
        file_name (str): Name of the CSV file. Must end with ".csv" extension.

    Returns:
        None

    Raises:
        TypeError: If input is not a pandas DataFrame.
        ValueError: If file_name does not end with ".csv" extension.
        FileNotFoundError: If folder does not exist.
    """
    try:
        if not isinstance(dataframe, pd.DataFrame):
            raise TypeError("Input must be a pandas DataFrame.")
        if not file_name.lower().endswith(".csv"):
            raise ValueError("File name must end with '.csv' extension.")
        current_dir = os.getcwd()
        parent_dir = os.path.dirname(current_dir)
        folder_path = os.path.join(parent_dir, folder_name)
        file_path = os.path.join(folder_path, file_name)
        if not os.path.isdir(folder_path):
            raise FileNotFoundError(f"Folder '{folder_name}' does not exist.")
        dataframe.to_csv(file_path, index=False)
        print(f"Successfully exported the DataFrame as '{file_name}'")
    except (TypeError, ValueError, FileNotFoundError) as e:
        print(e)
```
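A hypothetical call to the helper above, assuming the repository's `clean_data/` folder sits in the parent of the current working directory (which is where the function looks):

```python
import pandas as pd

df = pd.DataFrame({"title": ["Inception"], "genres": ["Sci-Fi"]})
export_as_csv(df, folder_name="clean_data", file_name="movies.csv")
```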
- Instead of hardcoding file paths, the project uses Python's built-in `os` module to handle paths dynamically.
- This improves code flexibility, ensuring it runs smoothly across different systems and environments.
- Automatically adapts to the system's directory structure.
- Prevents `FileNotFoundError` caused by rigid, hardcoded paths.
- Makes deployment and collaboration easier without manual path updates.
```python
current_dir = os.getcwd()
parent_dir = os.path.dirname(current_dir)
folder_path = os.path.join(parent_dir, folder_name)
file_path = os.path.join(folder_path, file_name)
```

- The project uses Streamlit's `st.secrets` feature to securely manage the TMDB API key during development.
- A `secrets.toml` file is placed inside the `.streamlit/` directory, storing the API key in the following format:
```toml
[tmdb]
api_key = "your_api_key_here"
```

- The API key is accessed in code using:

```python
api_key = st.secrets["tmdb"]["api_key"]
```

> [!CAUTION]
> The secrets.toml file should not be pushed to a public repository to avoid exposing sensitive credentials.
> You can add it to .gitignore to ensure it's excluded from version control.
> When deploying to Streamlit, the API key must be added via the GUI, not through the secrets.toml file.
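For illustration, here is a hedged sketch of fetching a poster with that key. The endpoint and image base URL follow TMDB's public API documentation, while `fetch_poster` and its fallback behavior are assumptions rather than the project's exact code:

```python
import requests
import streamlit as st

def fetch_poster(movie_id):
    # Read the key from .streamlit/secrets.toml (or the deployment GUI)
    api_key = st.secrets["tmdb"]["api_key"]
    url = f"https://api.themoviedb.org/3/movie/{movie_id}?api_key={api_key}"
    data = requests.get(url, timeout=10).json()
    poster_path = data.get("poster_path")
    # Build a full image URL only if the movie has a poster
    return f"https://image.tmdb.org/t/p/w500{poster_path}" if poster_path else None
```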
- In the project, a similarity matrix is computed to recommend movies.
- However, due to its high dimensionality, the matrix becomes very large and exceeds GitHub's size limitations.
- GitHub restricts uploads larger than 100MB in public repositories, making it unsuitable for storing large files.
- While Git LFS (Large File Storage) is one option, it can be complex to configure and manage.
- To address this issue, the matrix file is:
  - Uploaded to Google Drive.
  - Downloaded at runtime using the `gdown` library.
  - Stored locally on the Streamlit server when the app runs.
- This approach ensures:
  - Compatibility with GitHub without needing Git LFS.
  - A hassle-free experience when cloning the repository or running the app across different environments.
```python
import os
import pickle

import gdown

# Step 1: Define the Google Drive file ID
file_id = "your_file_id_here"

# Step 2: Set the desired file name for the downloaded file
output = "similarity.pkl"

# Step 3: Construct the direct download URL from the file ID
url = f"https://drive.google.com/uc?id={file_id}"

# Step 4: Check if the file already exists locally
# If not, download it from Google Drive using gdown
if not os.path.exists(output):
    gdown.download(url, output, quiet=False)

# Step 5: Open the downloaded file in read binary mode
# and load the similarity matrix using pickle
with open(output, "rb") as f:
    similarity = pickle.load(f)
```

## Setup

Follow these steps carefully to set up and run the project on your local machine:
First, you need to clone the project from GitHub to your local system.

```bash
git clone https://github.com/themrityunjaypathak/Pickify.git
```

To avoid version conflicts and keep your project isolated, create a virtual environment.

On Windows:

```bash
python -m venv .venv
```

On macOS/Linux:

```bash
python3 -m venv .venv
```

After setting up the virtual environment, activate it to begin installing dependencies.

On Windows:

```bash
.\.venv\Scripts\activate
```

On macOS/Linux:

```bash
source .venv/bin/activate
```

Now, install all the required libraries inside your virtual environment using the `requirements.txt` file.

```bash
pip install -r requirements.txt
```

> [!TIP]
> It's a good idea to upgrade pip before installing dependencies to avoid compatibility issues.
>
> ```bash
> pip install --upgrade pip
> ```

> [!NOTE]
> The `.streamlit/` folder contains Streamlit configuration settings.
> However, it is not necessary to include it in your project unless required.

- The `config.toml` file contains configuration settings such as server settings, theme preferences, etc.

```toml
[theme]
base="dark"
primaryColor="#FD3A84"
backgroundColor="#020200"
```

- The `secrets.toml` file contains sensitive information like API keys, database credentials, etc.

```toml
[tmdb]
api_key = "your_tmdb_api_key_here"
```

> [!IMPORTANT]
> Make sure not to commit your secrets.toml to GitHub or any public repositories.
> You can add it to .gitignore to ensure it's excluded from version control.

After everything is set up, you can run the Streamlit application:

```bash
streamlit run app.py
```

Once you're done working, you can deactivate your virtual environment:

```bash
deactivate
```

## Folder Structure

```
Pickify/
│
├── .streamlit/         # Streamlit Configuration Files
├── raw_data/           # Original Dataset
├── clean_data/         # Preprocessed and Cleaned Dataset
├── notebooks/          # Jupyter Notebooks for Preprocessing and Vectorization
├── images/             # Images used in the Streamlit Application
├── utils/              # Modular Python Scripts
├── app.py              # Main Streamlit Application
├── requirements.txt    # List of required libraries for the Project
├── README.md           # Detailed documentation of the Project
├── LICENSE             # License specifying permissions and usage rights
└── .gitignore          # All files and folders excluded from Git Tracking
```
## Challenges & Solutions

- Challenge: Hardcoded file paths broke the code when it ran on different systems.
  - Solution: Used Python's `os` module for dynamic, platform-independent path handling.
- Challenge: Repeating the same helper logic across notebooks and scripts caused duplication.
  - Solution: Structured the project with modular scripts inside the `utils/` package.
- Challenge: The TMDB API key could not be committed to a public repository without exposing credentials.
  - Solution: Used Streamlit's `st.secrets` to securely store and access TMDB API credentials.
- Challenge: The similarity matrix exceeded GitHub's 100MB file size limit.
  - Solution: Used Google Drive to host the serialized similarity matrix and download it at runtime using `gdown`.
## Future Improvements

- Currently, tags are generated with equal importance from cast, crew, keywords, genres, and overview.
- This can be improved by applying feature weighting to give more importance to influential attributes.
- For example, certain columns can be scaled or repeated to increase their impact on similarity calculations, as in the sketch below.
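One lightweight way to do this, assuming the metadata columns are still Python lists before being joined into `tags` (the weights here are illustrative, not the project's settings):

```python
# Repeating a column's tokens increases their count in the
# CountVectorizer representation, effectively weighting them higher
movies["tags"] = (
    movies["overview"]
    + movies["genres"] * 2      # e.g. double weight for genres
    + movies["keywords"] * 3    # e.g. triple weight for keywords
    + movies["cast"]
    + movies["crew"]
)
```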
- Introduce user-based data to generate more personalized recommendations.
- Collaborative filtering can suggest movies based on similarities in users' preferences and behavior; a minimal sketch follows this list.
- This would make the recommender system more adaptive and user-centric.
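A minimal item-item collaborative filtering sketch on entirely hypothetical ratings data (the current project has no user data, so this is purely illustrative):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows = users, columns = movies, values = ratings
# (0 = unrated; this naive sketch simply treats it as a zero rating)
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
])

# Movies are similar if users rated them similarly
item_similarity = cosine_similarity(ratings.T)
print(item_similarity.round(2))
```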
- Fetch movie data from external sources to keep the database continuously updated.
- This would enable recommending newly released movies and automatically removing outdated content.
- Instead of relying only on cosine similarity over raw counts, experiment with additional similarity and text-representation techniques.
- Methods such as Jaccard similarity, TF-IDF, or Word2Vec could better capture semantic relationships; a TF-IDF sketch follows.
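For example, swapping `CountVectorizer` for TF-IDF down-weights tokens that appear in almost every movie's tags; a sketch under the same assumptions as the pipeline above (`movies` with a preprocessed `tags` column):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tfidf = TfidfVectorizer(max_features=5000, stop_words="english")
tfidf_vectors = tfidf.fit_transform(movies["tags"])
tfidf_similarity = cosine_similarity(tfidf_vectors)
```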
- Improve the user experience by adding filters to explore movies based on genres, actors, or directors.
## License

This project is licensed under the MIT License. You are free to use and modify the code as needed.



