This project implements a recommendation system using the Polars DataFrame library. The system recommends products to reviewers based on the Jaccard similarity of their review histories.
To run this project, you'll need to install the following libraries:
Polars is a fast DataFrame library implemented in Rust, with bindings for Python.
!pip install polars
Polars-Distance is a separate Polars plugin package that provides additional distance functions. In this project, it is used to calculate the Jaccard similarity between two lists.
!pip install polars-distance
Note: After installing Polars and Polars-Distance, you may need to restart your Jupyter notebook or Python environment for the changes to take effect.
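For reference, the Jaccard similarity of two lists is the number of distinct items they share divided by the number of distinct items in either list. The snippet below is a minimal plain-Python sketch of that definition, not the Polars-Distance implementation; the function name and the ASIN values are hypothetical, and treating two empty lists as having similarity 0 is an assumption.

```python
def jaccard_similarity(a, b):
    """Return |A ∩ B| / |A ∪ B| for two lists of product IDs."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumption: two empty review histories are treated as dissimilar.
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical ASINs, for illustration only.
reviewer_1 = ["A1", "A2", "A3"]
reviewer_2 = ["A2", "A3", "A4"]

# Two shared products out of four distinct products overall.
similarity = jaccard_similarity(reviewer_1, reviewer_2)
print(similarity)
```

Duplicates within a list do not affect the result, because both lists are converted to sets first.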
The dataset used in this project is from the Amazon product data, specifically the Musical Instruments review dataset and its metadata. The dataset can be found here.
- Load the Datasets: The Musical Instruments review dataset and metadata are loaded into Polars DataFrames.
- Join and Select Relevant Columns: The review and metadata DataFrames are joined on the product identifier (ASIN) and the relevant columns (reviewer ID, ASIN, title, and overall rating) are selected.
- Filter Unique Reviewers: The first 3000 unique reviewers are selected, and the reviews are filtered to include only those from these reviewers.
- Remove Duplicates: Duplicates are removed based on reviewer ID and ASIN to ensure each reviewer-product pair is unique.
- Discretize Ratings: The overall ratings are discretized into categories (negative, average, positive) based on their values.
- Group by Reviewer: The reviews are grouped by reviewer ID to count the number of reviews per reviewer. Reviewers with more than five reviews are kept for further analysis.
- Create Product Lists: For each reviewer, a list of reviewed products (ASINs) is created.
- Generate Reviewer Pairs: All pairs of reviewers are generated using a cross join, excluding pairs where the reviewer ID is the same.
- Calculate Jaccard Similarity: The Jaccard similarity is calculated between the lists of products reviewed by each pair of reviewers. All pairs are processed in a single query on a lazy DataFrame.
- Define Recommendation Function: A function is defined to recommend products to a reviewer based on the Jaccard similarity. The function filters neighbors (reviewers with similar tastes), calculates weighted scores for products based on the neighbors' reviews, and aggregates these scores to recommend top products.
- Find Active Reviewers: Active reviewers (those with many neighbors whose similarity exceeds a threshold) are identified.
- Generate Recommendations: The recommendation function is used to suggest products to active reviewers, providing a list of top recommended products.
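
The last steps of the workflow (pairwise similarity, neighbor filtering, weighted scoring, and ranking) can be sketched in plain Python to make the logic concrete. This is an illustration under stated assumptions, not the project's Polars code: the function names, the sample reviewers and products, the `threshold` and `top_n` defaults, the use of numeric ratings instead of the discretized categories, the `similarity * rating` weighting, and the exclusion of products the reviewer has already reviewed are all hypothetical choices.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets of product IDs."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_similarity(products_by_reviewer):
    """Similarity for every pair of distinct reviewers, stored in both directions."""
    sims = {}
    for r1, r2 in combinations(products_by_reviewer, 2):
        s = jaccard(products_by_reviewer[r1], products_by_reviewer[r2])
        sims[(r1, r2)] = s
        sims[(r2, r1)] = s
    return sims


def recommend(target, products_by_reviewer, ratings, sims, threshold=0.1, top_n=3):
    """Rank unseen products by the sum of similarity * rating over neighbors."""
    seen = products_by_reviewer[target]
    scores = defaultdict(float)
    for neighbor, products in products_by_reviewer.items():
        if neighbor == target:
            continue
        sim = sims[(target, neighbor)]
        if sim < threshold:
            continue  # not similar enough to count as a neighbor
        for product in products:
            if product not in seen:  # assumption: skip already-reviewed products
                scores[product] += sim * ratings[(neighbor, product)]
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [product for product, _ in ranked[:top_n]]


# Hypothetical data: reviewer -> set of reviewed products.
products_by_reviewer = {
    "r1": {"A", "B", "C"},
    "r2": {"B", "C", "D"},
    "r3": {"C", "D", "E"},
    "r4": {"X", "Y"},
}
# Hypothetical numeric ratings keyed by (reviewer, product).
ratings = {
    ("r1", "A"): 5, ("r1", "B"): 4, ("r1", "C"): 5,
    ("r2", "B"): 5, ("r2", "C"): 4, ("r2", "D"): 5,
    ("r3", "C"): 3, ("r3", "D"): 4, ("r3", "E"): 2,
    ("r4", "X"): 5, ("r4", "Y"): 5,
}

sims = pairwise_similarity(products_by_reviewer)
print(recommend("r1", products_by_reviewer, ratings, sims))
```

In this toy data, `r4` shares no products with `r1`, so it falls below the threshold and contributes nothing; `r2` and `r3` both count as neighbors, and the product they both reviewed ranks first.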