Recommender systems are algorithms designed to suggest items to users based on various inputs, such as their past behavior and preferences. These systems are essential in many domains, including e-commerce, music, movies, and anime, as they enhance user experience by providing personalized recommendations.
Importance in the Context of Anime
The anime industry produces a vast and diverse array of content, making it challenging for users to find anime that align with their interests. A well-designed recommender system helps users navigate the extensive anime catalog, discovering titles they are likely to enjoy based on their unique tastes. By delivering personalized recommendations, these systems can increase user engagement and satisfaction, fostering a deeper connection with the platform.
The primary objective of this project is to develop a hybrid recommender system that combines collaborative and content-based filtering techniques. This system aims to predict how users will rate anime titles they have not yet seen, using their historical preferences and interactions. By accurately predicting user ratings, the recommender system will provide tailored anime suggestions, enhancing the user's discovery process and overall viewing experience.
This dataset contains information on user preference data from 570,355 users on 12,294 anime. It contains information about anime and user ratings, split across several CSV files. It's designed for a recommender systems project where we will predict user ratings for unseen anime for 633,686 users.
Data Summary:
- Anime Information (anime.csv): Contains details about each anime, including title, genre, type (movie, TV, etc.), number of episodes, average rating, and member count.
- User Ratings (train.csv): Records user ratings for specific anime entries. Includes user ID, anime ID, and the rating provided (or -1 if watched but not rated).
- Test Set (test.csv): Contains user IDs and anime IDs for which we will predict ratings. No ratings are provided in this file.
- Submission Format (submission.csv - for reference): Illustrates the expected format for our final predictions. It combines user ID and anime ID into a single ID and includes the predicted rating for each user-anime pair.
Additional Information:
- Original data comes from MyAnimeList.net. Source: Adapted open-source Kaggle dataset (https://www.kaggle.com/datasets/CooperUnion/anime-recommendations-database).
To carry out all the objectives for this repo, the following necessary dependencies were loaded:
Pandas 2.2.2
andNumpy 1.26
Matplotlib 3.8.4
Data cleaning is a crucial step in the data analysis process, involving the correction or removal of incorrect, corrupted, duplicate, or incomplete data within a dataset. Through various techniques such as filling missing values, removing outliers, and standardizing data formats, it ensures the accuracy and reliability of subsequent analyses and decision-making. This involves rectifying missing or erroneous values, removing duplicates, and resolving inconsistencies.
Exploratory Data Analysis (EDA) serves as a foundational step in the data preprocessing phase, aimed at understanding the underlying patterns, relationships, and structure of the data. It involves various techniques such as visualizations, summary statistics, and correlation analysis to uncover insights and identify potential issues. EDA helps in forming hypotheses, guiding further analysis, and making informed decisions about data preprocessing and modeling.
Data preprocessing is a crucial step in building any machine learning model. It involves transforming raw data into a clean, structured format that is suitable for analysis. This step includes various tasks such as encoding categorical variables, normalizing data, and feature extraction. Proper data preprocessing ensures that the data is of high quality, which in turn improves the performance of the machine learning models. This involves encoding categorical variables, ensuring uniformity and compatibility for subsequent analyses.
Various models were trained and evaluated.
- For Content-Based Filtering:
- Cosine similarity & Sigmoid kernel
- For Collaborative Filtering:
- Singular Value Decomposition (SVD)
- CoClustering
- BaselineOnly
- SlopeOne
- Mlflow
- Streamlit
- The primary evaluation metric, RMSE, was employed to assess model accuracy. Results indicate that the SVD model outperformed the BaselineOnly model in predicting user ratings.
- Continuously update the model with new data to maintain accuracy.
- Explore deep learning models like neural networks for improved performance.
- Integrate user feedback to refine and enhance the recommender system.
Name | |
---|---|
[Khululiwe Hlongwane] | [khululiwe.hlongwane9966@gmail.com] |
[Judith Kabongo] | [jkabongo19@gmail.com] |
[Ntembeko Mhlungu] | [ntembekomhlungu1@gmail.com] |
[Tselane Moeti] | [tsemoeti24@gmail.com] |
[Masixole Nondumo] | [m.nondumo@gmail.com] |
[Nolwazi Vezi] | [nolwazinvd@gmail.com] |