Recommend movies to a user based on previously watched and rated movies. Both non-personalized and personalized recommender systems are tested and compared.
Non personalized approaches
- Popularity-based: recommend the top 10 most popular movies to each user
- Average ranking: recommend the top 10 both top ranked and most popular movies to each user
- Pair-wise: recommend the top 10 most popular movies to each user by taking into account the proximity (permutations) to the last movie the user watched
Personalized approaches
- Content-based recommendations: recommend movies with similar characteristics to previously watched ones
- Genre-based recommendations, Jaccard similarity for big data
- Plot/summary-based recommendations, NLP preprocessing, TF-IDF transformation and cosine similarity
- Generate a user profile as a TFIDF embedding and produce similar recommendations
- Collaborative Filtering
- Sparse item-item matrix
- KNN similar user ratings of unseen movies (cosine similarity)
- Matrix factorization approach using SVD
- Hybrid recommender
Two datasets are loaded to showcase the recommender system techniques utilised.
The Movielens 20m dataset (ml-20m) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 20000263 ratings and 465564 tag applications across 27278 movies. These data were created by 138493 users between January 09, 1995 and March 31, 2015. This dataset was generated on March 31, 2015, and updated on October 17, 2016 to update links.csv and add genome-* files.
Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.
The data are contained in six files, genome-scores.csv
, genome-tags.csv
, links.csv
, movies.csv
, ratings.csv
and tags.csv
.
Download from: https://grouplens.org/datasets/movielens/20m/
Additionally, the Wikipedia Movie Plots dataset is used, which contains ~35K plot summary descriptions scraped from Wikipedia.
The data is contained in one file: wiki_movie_plots_deduped.csv
Download from: https://www.kaggle.com/jrobischon/wikipedia-movie-plots
Download the data and place it inside a data/ directory. Then run:
$ python prep.py
$ python recommender_systems.py
For state-of-the-art recommender engines on Movielens 20m dataset https://paperswithcode.com/sota/collaborative-filtering-on-movielens-100k