Data used: https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata
Framework used: Streamlit
This project is a movie recommendation system that provides recommendations based on movie similarity using content-based filtering. It involves data preprocessing, feature engineering, vectorization, model building, and deployment using Streamlit for a user-friendly interface.
Data Ingestion -> Feature Engineering -> Vectorization -> Similarity Calculation -> Model Saving -> Streamlit Interface -> User Interaction -> Recommendation Display
Raw Data -> Convert JSON Columns -> Concatenate Tags -> Clean Text -> Generate Vectors
- Data Preparation: Merging datasets and extracting relevant features.
- Feature Engineering: Transforming raw data into meaningful tags.
- Vectorization: Converting text data into vectors using CountVectorizer.
- Model Building: Computing cosine similarity to measure movie similarity.
- Deployment: Building a web interface using Streamlit.
Two datasets, tmdb_5000_movies.csv and tmdb_5000_credits.csv, are merged on the movie title to consolidate information.
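The merge step can be sketched with pandas as follows. This is a minimal illustration rather than the project's exact code: the two small DataFrames are toy stand-ins for the CSV files (which would normally be loaded with `pd.read_csv`), and the ids and overview strings are placeholders.

```python
import pandas as pd

# Toy stand-ins for tmdb_5000_movies.csv and tmdb_5000_credits.csv.
# All values below are placeholders for illustration only.
movies = pd.DataFrame({
    "id": [1, 2],
    "title": ["Avatar", "Spectre"],
    "overview": ["placeholder overview one", "placeholder overview two"],
})
credits = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["Avatar", "Spectre"],
    "cast": ["[]", "[]"],
    "crew": ["[]", "[]"],
})

# Inner join on the shared "title" column.
merged = movies.merge(credits, on="title")
print(merged.columns.tolist())
```

Note that joining on a title can duplicate rows when two different movies share the same title, so it may be worth checking the row count after the merge.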
2.1. Extracting Key Information
Several columns contain complex, nested information. To prepare them for analysis, these fields are transformed using helper functions.
- Genres and Keywords: Extract genre and keyword names.
- Cast: Retrieve up to three main cast members.
- Crew: Extract the director's name.
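The helper functions described above might look like the sketch below. The function names `extract_names` and `extract_director` are hypothetical, and the sample strings are toy data shaped like the JSON-encoded columns (a list of objects with a `name` field, plus a `job` field for crew entries).

```python
import json

def extract_names(raw, limit=None):
    """Return the 'name' of each entry in a JSON-encoded list of objects."""
    names = [item["name"] for item in json.loads(raw)]
    return names if limit is None else names[:limit]

def extract_director(raw):
    """Return a one-element list with the director's name, or an empty list."""
    directors = [m["name"] for m in json.loads(raw) if m.get("job") == "Director"]
    return directors[:1]

# Toy inputs; the person names are placeholders.
genres_raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
cast_raw = ('[{"name": "Actor One"}, {"name": "Actor Two"}, '
            '{"name": "Actor Three"}, {"name": "Actor Four"}]')
crew_raw = ('[{"job": "Editor", "name": "Some Editor"}, '
            '{"job": "Director", "name": "Some Director"}]')

print(extract_names(genres_raw))          # genre names
print(extract_names(cast_raw, limit=3))   # first three cast members
print(extract_director(crew_raw))         # director only
```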
2.2. Text Preprocessing
Text data is tokenized and cleaned to ensure uniformity: spaces inside multi-word names are removed so that each name is treated as a single token, and all text is converted to lowercase. All tags are then combined into a single feature (tags).
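A minimal sketch of this cleaning step is shown below. The helper names `collapse_spaces` and `build_tags` are hypothetical, and including the movie overview in the tags is an assumption made for illustration.

```python
def collapse_spaces(names):
    """Turn multi-word names into single tokens, e.g. 'Sam Worthington' -> 'SamWorthington'."""
    return [name.replace(" ", "") for name in names]

def build_tags(overview, genres, keywords, cast, crew):
    """Combine all features into one lowercase, space-separated tags string."""
    tokens = overview.split()  # assumption: the overview text is part of the tags
    for group in (genres, keywords, cast, crew):
        tokens.extend(collapse_spaces(group))
    return " ".join(tokens).lower()

tags = build_tags(
    "An example overview sentence",
    ["Science Fiction"],
    ["space war"],
    ["Sam Worthington"],
    ["James Cameron"],
)
print(tags)
```

Collapsing the spaces keeps "Sam Worthington" from being split into two unrelated tokens that could match other people named "Sam".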
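CountVectorizer converts the tags text into numerical word-count vectors. Stemming with the Porter Stemmer reduces words to their root forms (for example, "loving" and "loved" both become "love"), so that variants of the same word are counted as one feature and similar movies are easier to match.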
4.1. Calculating Similarity
Cosine similarity is used to measure how similar two movies are. For a given movie, the function recommend() returns the 5 movies with the highest cosine similarity scores, excluding the movie itself.
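One possible shape for the similarity calculation and recommend() is sketched below, using scikit-learn's `cosine_similarity` on toy data. The titles, tags, and the `top_n` parameter are illustrative assumptions; the project's own function may differ in its details.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy data standing in for the processed movie table.
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
tags = [
    "space alien war future",
    "space alien future rescue",
    "romance paris wedding",
    "alien war soldier",
]

vectors = CountVectorizer().fit_transform(tags)
similarity = cosine_similarity(vectors)  # square matrix, one row per movie

def recommend(title, top_n=5):
    """Return up to top_n titles ranked by cosine similarity to the given title."""
    idx = titles.index(title)
    ranked = np.argsort(similarity[idx])[::-1]  # highest similarity first
    # Skip the query movie explicitly instead of assuming it sorts first.
    return [titles[i] for i in ranked if i != idx][:top_n]

print(recommend("Movie A", top_n=2))
```

Excluding the query index explicitly avoids returning the selected movie when another movie happens to have an identical tag vector.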
4.2. Saving Model Artifacts
The similarity matrix and processed data are saved as pickle files to be loaded in the deployment environment.
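Saving and reloading the artifacts with the standard `pickle` module could look like this sketch. The file names `movies.pkl` and `similarity.pkl` are hypothetical, the objects are toy stand-ins, and a temporary directory is used here in place of the project folder.

```python
import pickle
import tempfile
from pathlib import Path

# Toy stand-ins for the processed data and the similarity matrix.
movies = {"title": ["Movie A", "Movie B"]}
similarity = [[1.0, 0.75], [0.75, 1.0]]

out_dir = Path(tempfile.mkdtemp())  # stand-in for the project directory

# Training side: write both artifacts to disk.
with open(out_dir / "movies.pkl", "wb") as f:
    pickle.dump(movies, f)
with open(out_dir / "similarity.pkl", "wb") as f:
    pickle.dump(similarity, f)

# Deployment side: load them back before serving recommendations.
with open(out_dir / "movies.pkl", "rb") as f:
    loaded_movies = pickle.load(f)
with open(out_dir / "similarity.pkl", "rb") as f:
    loaded_similarity = pickle.load(f)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.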
5.1. Setting Up Streamlit
The web interface, created using Streamlit, provides an interactive selection menu for users to choose a movie and receive recommendations. Posters for the recommended movies are fetched using the TMDB API.
5.2. Fetching Movie Posters
The fetch_poster() function uses the TMDB API to retrieve posters of recommended movies.
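A possible fetch_poster() is sketched below using only the standard library. It assumes the TMDB v3 movie details endpoint, the `w500` image size, and an API key supplied through a `TMDB_API_KEY` environment variable; the `poster_url` helper is a hypothetical name, and the project's actual function may be written differently.

```python
import json
import os
import urllib.request

API_URL = "https://api.themoviedb.org/3/movie/{movie_id}?api_key={api_key}"
IMAGE_BASE = "https://image.tmdb.org/t/p/w500"  # assumed poster size

def poster_url(payload):
    """Build the full poster URL from a movie details payload, or None if absent."""
    path = payload.get("poster_path")
    return IMAGE_BASE + path if path else None

def fetch_poster(movie_id, api_key=None):
    """Query TMDB for a movie and return its poster URL (requires network access)."""
    api_key = api_key or os.environ["TMDB_API_KEY"]  # assumed variable name
    url = API_URL.format(movie_id=movie_id, api_key=api_key)
    with urllib.request.urlopen(url, timeout=10) as response:
        payload = json.load(response)
    return poster_url(payload)
```

Keeping the URL construction in a separate helper makes it possible to test the logic without calling the API, and handles movies that have no poster.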