Welcome to the SAH GitHub!
SAH is a repository for the problem of Reverse k-Maximum Inner Product Search (RkMIPS). Given a query (item) vector, a set of item vectors, and a set of user vectors, the problem of RkMIPS aims to find a set of user vectors whose inner products with the query vector are one of the k largest among the query and item vectors. This problem is essential for market analysis from a reverse perspective, i.e., the perspective of service providers instead of users. For example, when an e-commerce service promotes a discounted product or launches a new product, a vital issue for designing an effective advertising campaign is identifying the customers who may want to buy this product.
This repository provides the implementations and experiments of our work entitled SAH: Shifting-Aware Asymmetric Hashing for Reverse k Maximum Inner Product Search that has been published in AAAI 2023. Our arXiv version can be found here. We implement SAH for solving RkMIPS in high-dimensional Euclidean spaces and include the only known RkMIPS method Simpfer and a state-of-the-art k-Maximum Inner Product Search (kMIPS) method H2-ALSH as baselines for evaluation. To make a comprehensive yet systematic comparison, we also incorporate the RkMIPS optimizations of Simpfer into H2-ALSH and SA-ALSH as two new baselines named H2-Simpfer and SA-Simpfer, respectively.
We use five real-world recommendation datasets in our experiments for performing RkMIPS, i.e., Amazon-Auto, Amazon-CDs, MovieLens, Music100, and Netflix. For the datasets Amazon-Auto, Amazon-CDs, MovieLens, and Netflix, we first retrieve their sparse user-item rating matrix
Datasets | # Items | # Users | # Queries | # Dim | Itemset Size | Userset Size |
---|---|---|---|---|---|---|
Amazon-Auto | 925,387 | 3,873,247 | 100 | 100 | 370.1 MB | 1.5 GB |
Amazon-CDs | 64,443 | 75,258 | 100 | 100 | 25.7 MB | 30.1 MB |
MovieLens | 10,681 | 71,567 | 100 | 100 | 4.2 MB | 27.9 MB |
Music100 | 1,000,000 | 1,000,000 | 100 | 100 | 400.0 MB | 400.0 MB |
Netflix | 17,770 | 480,189 | 100 | 100 | 7.1 MB | 192.1 MB |
Moreover, we conduct a comparison of H2-ALSH and SA-ALSH for kMIPS. In addition to the five recommendation datasets used for perfroming RkMIPS, we we also choose five commonly used datasets to benchmark kMIPS algorithms, namely DEEP, Gist, GloVe, ImageNet, and Msong. The number of their vectors and the dimensionality
To reduce the I/O cost, all the datasets and queries are stored in a binary format. For a dataset (or query set) with n vectors in d dimensions, the binary file comparises with a float
array of n*d
length. Please specify the cardinality n and data dimension d in the bash scripts in advance.
- Ubuntu 18.04 (or higher version)
- g++ 8.3.1 (or higher version) and OpenMP.
- Python 3.7 (or higher version)
The source codes are implemented by C++
, which requires g++-8
with c++17
for compilation. Thus, please check whether the g++-8
is installed. If not, we recommend a way to install g++-8
in Ubuntu 18.04
(or higher versions) as follows.
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install g++-8
sudo apt-get install gcc-8 (optional)
Once the g++-8
has been installed, users can use the following command to compile the C++
source codes in parallel:
cd methods/
make -j
We have provided bash scripts to run all experiments. Please ensure you have downloaded the datasets and completed the compilation.
To reproduce the RkMIPS experiments of H2-ALSH, H2-Simpfer, H2-Cone, SA-Simpfer, and SAH, please run the following commands:
cd methods/
bash run.sh
Moreover, one can run commands below to reproduce the RkMIPS experiments of Simpfer:
cd simpfer/
bash run.sh
To reproduce the kMIPS experiments of H2-ALSH and SA-ALSH, please run the following commands:
cd kmips/
bash run.sh
You can plot the curves of query time vs. f1-score for H2-ALSH, H2-Simpfer, and SAH by running the following commands:
cd curves/
bash run.sh
Finally, we have privoded python
scripts for visualization. These scripts require python 3.7
(or higher versions) with numpy, scipy, and matplotlib. If you do not have such an environment, you might need to use anaconda
to create a new virtual one and apply pip
to install those packages. After you have completed all experiments, you can plot all figures appeared in our AAAI 2023 paper and the supplementary material with the python scripts in visualization/
.
Thank you so much for being patient to read the user manual. We will appreciate using the following BibTeX to cite this work when you use SAH in your paper.
@inproceedings{huang2023sah,
title={SAH: Shifting-Aware Asymmetric Hashing for Reverse k Maximum Inner Product Search},
author={Huang, Qiang and Wang, Yanhao and Tung, Anthony KH},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={37},
number={4},
pages={4312--4321},
year={2023}
}
It is welcome to contact me (huangq@comp.nus.edu.sg) or yanhao (yhwang@dase.ecnu.edu.cn) if you meet any issue. Thank you.