This repository contains a federated learning approach for neural collaborative filtering (idea1) on news articles. We did not find any existing open-source code for federated neural collaborative filtering. This repository can be used to test the novel approach described in [1], which combines content information with the user history, and to compare it against common recommender system models as baselines.
We expect the items to be text. The approach embeds the items into a vector space via BERT. The users are then embedded via these item embeddings by averaging over the vectors of all items the user liked (in the training data). For each user-item pair we concatenate the user and item vector and feed it into a neural network. We also sample an item the user did not like and feed it into the network. We then calculate the pairwise loss between the positive and the negative sample and backpropagate.
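The following is a minimal, illustrative sketch of this training step (not the repository's actual code), assuming TensorFlow/Keras, 768-dimensional BERT sentence embeddings, and the default layer sizes from the parameter table below; all names and data are placeholders:

```python
# Sketch only: BERT item vectors, user = mean of liked item vectors,
# MLP scores concatenated (user, item) pairs, BPR-style pairwise loss.
import numpy as np
import tensorflow as tf

emb_dim = 768                                                    # assumed BERT embedding size
item_vectors = np.random.rand(1000, emb_dim).astype("float32")   # stand-in for BERT embeddings

def user_vector(liked_item_ids):
    """User embedding = average of the item vectors the user liked."""
    return item_vectors[liked_item_ids].mean(axis=0)

# MLP that maps a concatenated (user, item) vector to a relevance score.
scorer = tf.keras.Sequential([
    tf.keras.layers.Dense(1024, activation="relu"),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1),
])

def pairwise_loss(pos_score, neg_score):
    """BPR loss: push the positive score above the negative one."""
    return -tf.reduce_mean(tf.math.log_sigmoid(pos_score - neg_score))

u = user_vector([3, 17, 42])[None, :]      # one user
pos = item_vectors[7][None, :]             # item the user liked
neg = item_vectors[99][None, :]            # sampled item the user did not like
loss = pairwise_loss(scorer(np.concatenate([u, pos], axis=1)),
                     scorer(np.concatenate([u, neg], axis=1)))
```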
Contains three columns:
- "resource_id": unique id of the article
- "text": full article body text without HTML tags.
- "publication_date": date of publication.
Contains three columns:
- "user_id": unique id of the user
- "resource_id": unique id matching the one in metadata.csv
- "time": timestamp of the click
Python 3.6.8
Packages are in requirements.txt
You can run `run_all.sh` to generate some dummy data and run all the algorithms. The Jupyter notebook `results/evaluate.ipynb` displays the results.
To work with your own data, create a folder and copy the data in the format described above. Run `DATA_FOLDER=your_data_folder python preprocessing.py` to preprocess your data. You can then run any of the algorithms with `DATA_FOLDER=your_data_folder python <algorithm>.py`.
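For example, assuming your data folder is called `my_data` (the folder name and the chosen scripts are only placeholders):

```bash
DATA_FOLDER=my_data python preprocessing.py    # create the formatted train/validation/test split
DATA_FOLDER=my_data python idea1.py            # train and evaluate the main approach
DATA_FOLDER=my_data python content_based.py    # run a baseline for comparison
```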
- content_based.py: Embeds articles with tf-idf and creates user vectors by averaging over the article vectors. Creates a ranked list of recommendations by cosine similarity. To use your own embedding, use `content_predict_custom_embedding`. A minimal sketch of this baseline follows the list below.
- idea1.py: Embeds the articles with a pretrained BERT embedding and embeds the users as the average of the article vectors the user read. Trains a neural network with a pairwise loss and creates predictions (you can use your own embedding as well). For a description of all the parameters, see the section Parameters for Idea1.
- popularity_random: Two simple baselines: predicts the most popular articles, and predicts random articles.
- mf_model: Uses ALS to embed the articles and users as latent vectors. Then predicts the top articles for each user based on the latent vectors.
- folder example_scripts:
  - idea1_timewise_sampling.py: an example of idea1 with timewise sampling instead of random sampling.
  - base_FL.py: an example of federated learning where both users have the same data.
  - idea2.py: an example where the user and item vectors are not concatenated from the start but first pass through some individual layers before being concatenated.
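Below is the minimal sketch of the content-based baseline referenced above (illustrative only; the actual `content_based.py` may differ in details such as function and variable names):

```python
# tf-idf article vectors, user = mean of read articles, ranking by cosine similarity.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = ["first article text ...", "second article text ...", "third article text ..."]
item_matrix = TfidfVectorizer().fit_transform(articles)           # (n_items, vocabulary)

read_by_user = [0, 2]                                             # indices of articles the user read
user_vector = np.asarray(item_matrix[read_by_user].mean(axis=0))  # average of the read article vectors

scores = cosine_similarity(user_vector, item_matrix).ravel()      # similarity to every article
scores[read_by_user] = -np.inf                                    # exclude already read articles
ranking = np.argsort(-scores)                                     # ranked list of recommendations
```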
Results from the evaluation are printed and saved in `results/{evaluation_name}`.
The training history of the loss and metrics is stored in `results/idea1_models/{model_name}`.
The flow through the system is as follows:
- formatted data: The preprocessing script transforms the data into horizontal format and generates a train, validation, and test split. It stores the new data in the folder given by `DATA_FOLDER` ('processed' by default). `load_data` from the preprocessing module loads the formatted train, test, and validation data. We have two formats: vertical, where each user-item pair is one row, and horizontal, where we have one row per user containing the list of clicked items (see the small illustration after this list).
- preprocessed/embedded data: Each individual algorithm processes the data and prepares it for training. For Idea1 this step embeds the users and articles into a vector representation and stores these lookup tables in the 'processed' folder.
- training: Trains and outputs the trained model.
- prediction: Takes the model and a list of users as input and returns a sorted list of recommendations for each user.
- evaluation: Takes the predicted list of recommendations and the ground truth as input, and calculates the evaluation metrics.
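To make the two interaction formats concrete, here is a small illustration (column names beyond "user_id" and "resource_id" and the exact file layout are assumptions):

```python
import pandas as pd

# Vertical format: one row per user-item pair.
vertical = pd.DataFrame({
    "user_id":     ["u1", "u1", "u2"],
    "resource_id": ["a1", "a2", "a3"],
})

# Horizontal format: one row per user containing the list of clicked items.
horizontal = vertical.groupby("user_id")["resource_id"].apply(list)
# u1 -> ["a1", "a2"]
# u2 -> ["a3"]
```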
In the data flow we see that every algorithm needs to load the data and evaluate on it. These two steps are handled by the preprocessing and evaluation modules, respectively.
- preprocessing.py: Before running any algorithm we need to generate the formatted data, which is done by running this module as a script. Afterwards, each algorithm simply calls `preprocessing.load_data` to load the formatted data.
- evaluation.py: This module expects two pd.Series indexed by user:
  - prediction: a sorted list of article IDs per user, where the first item is the top-ranked article.
  - ground_truth: a list of article IDs per user representing the actually read articles.
The script calculates Recall@k for k=5, 10, 50, 100 and NDCG@k for k=10, 100 from these two pd.Series. Make sure that the indices of the two pd.Series match! It is the responsibility of the prediction algorithm to exclude already read articles from the prediction.
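A small sketch of the inputs this module expects (the user and article IDs are made up):

```python
import pandas as pd

# Both Series are indexed by user; the indices must match.
prediction = pd.Series({
    "user_1": ["art_9", "art_3", "art_7"],   # ranked: first element = top recommendation
    "user_2": ["art_1", "art_4", "art_2"],
})
ground_truth = pd.Series({
    "user_1": ["art_3"],                     # articles the user actually read
    "user_2": ["art_2", "art_4"],
})
# e.g. Recall@2 for user_1: 1 of 1 relevant articles is in the top 2 -> 1.0
```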
Main parameters:
Parameter Name | Description | Default |
---|---|---|
lr | Learning Rate for optimizer | 0.00001 |
batch_size | Number of samples in each batch | 100 |
epochs | Number of epochs to train | 50 |
layers | Defines the layers and the number of nodes per layer. E.g. [a,b,c] results in a 3-layer network with a nodes in layer 1, b nodes in layer 2 and c nodes in layer 3. There is a second possible structure: [[a,b],[c,d]], which means the user and item vectors first pass through separate layers of sizes a and b (one tower for the user vector, one for the item vector) before being concatenated; the concatenated vector then runs through layers c and d (see the sketch after this table). | [1024, 512, 8] |
dropout | Dropout value after each layer | 0.5 |
reg | L2 Regularization applied to each layer. 0 Means no regularization. | 0 |
early_stopping | Stop training if we do not see a decrease of the validation loss in the last early_stopping training rounds. 0 means no early stopping | 0 |
stop_on_metric | If True, the early stopping criterion switches to the metrics from the evaluation step, i.e. stop if neither NDCG@100 nor Recall@10 on the validation set increased in the last early_stopping training rounds | False |
random_sampling | Whether to use random sampling (True) or timewise sampling (False). Timewise sampling expects the vertical format loaded with load_data_vertical and preprocessed negative samples | True |
folder | Only used for timewise sampling. Working folder to store negative samples | None |
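To illustrate the two possible `layers` structures from the table above, here is a sketch of how they could translate into a model; the use of Keras, the final output layer and the ReLU activations are assumptions, not necessarily the repository's exact architecture:

```python
import tensorflow as tf

def flat_model(user_dim, item_dim, layers=(1024, 512, 8)):
    """layers=[a, b, c]: user and item vectors are concatenated immediately."""
    user_in = tf.keras.Input(shape=(user_dim,))
    item_in = tf.keras.Input(shape=(item_dim,))
    x = tf.keras.layers.Concatenate()([user_in, item_in])
    for units in layers:
        x = tf.keras.layers.Dense(units, activation="relu")(x)
    return tf.keras.Model([user_in, item_in], tf.keras.layers.Dense(1)(x))

def two_tower_model(user_dim, item_dim, layers=((256, 128), (64, 8))):
    """layers=[[a, b], [c, d]]: separate user/item layers, then shared layers."""
    user_in = tf.keras.Input(shape=(user_dim,))
    item_in = tf.keras.Input(shape=(item_dim,))
    u, i = user_in, item_in
    for units in layers[0]:                              # separate towers for user and item
        u = tf.keras.layers.Dense(units, activation="relu")(u)
        i = tf.keras.layers.Dense(units, activation="relu")(i)
    x = tf.keras.layers.Concatenate()([u, i])
    for units in layers[1]:                              # shared layers after concatenation
        x = tf.keras.layers.Dense(units, activation="relu")(x)
    return tf.keras.Model([user_in, item_in], tf.keras.layers.Dense(1)(x))
```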
Other parameters:
Parameter Name | Description | Default |
---|---|---|
alpha | Proportion of pairwise loss compared to pointwise loss. Loss is calculated as alpha*pairwise_loss + (1-alpha)*pointwise_loss (see the sketch after this table) | 1
dropout_first | Dropout for the input | same as dropout |
normalize | Type of normalization to apply. 0 = no normalization. 1 = normalize the concatenated user and item vector together. 2 = normalize the user and item vector separately | 0
interval | Evaluation interval. Calculate metrics every interval epochs | 1
checkpoint_interval | Store the model every checkpoint_interval epochs | 1
loss | Type of pairwise loss to use. Can be either TOP or BPR | "BPR" |
optimizer | Which optimizer to use. Can be one of "ADAM" or "SGD" | "ADAM"
take_target_out | If set to True then the vector of the current positive sample is taken out of the user vector. | False |
workers | Number of workers to use for feeding the data to the network. | 1 |
train | Whether to train the network (True) or simply load it (False) | True |
round | Only used in Federated Learning. Current federated learning training round | False |
epsilon | Only used in Federated Learning. Epsilon value to calculate the noise while training. 0 means no noise | 0 |
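To make the interaction of `alpha` and `loss` concrete, here is a sketch of the blended objective; the BPR pairwise term follows the standard formulation, while the binary cross-entropy pointwise term is an assumption, since the exact pointwise loss is not documented here:

```python
import tensorflow as tf

def combined_loss(pos_score, neg_score, alpha=1.0):
    """loss = alpha * pairwise_loss + (1 - alpha) * pointwise_loss."""
    # Pairwise BPR term: push the positive score above the negative one.
    pairwise = -tf.reduce_mean(tf.math.log_sigmoid(pos_score - neg_score))
    # Pointwise term (assumed: binary cross-entropy on positive and negative samples).
    pointwise = tf.reduce_mean(
        tf.keras.losses.binary_crossentropy(tf.ones_like(pos_score), tf.sigmoid(pos_score))
        + tf.keras.losses.binary_crossentropy(tf.zeros_like(neg_score), tf.sigmoid(neg_score))
    )
    return alpha * pairwise + (1.0 - alpha) * pointwise

# With alpha=1 (the default) only the pairwise BPR term is used.
pos = tf.constant([[2.3], [0.1]])   # scores for liked items
neg = tf.constant([[0.5], [1.4]])   # scores for sampled negatives
print(combined_loss(pos, neg, alpha=0.7).numpy())
```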
[1] Wanyu Chen, Fei Cai, Honghui Chen, Maarten de Rijke (2019). Joint Neural Collaborative Filtering for Recommender Systems https://arxiv.org/abs/1907.03459