From the Challenge website: Ekstra Bladet and reference paper.
The dataset for the ACM RecSys Challenge 2024, named EB-NeRD, is a large-scale Danish dataset created by Ekstra Bladet to support advancements and benchmarking in news recommendation research. EB-NeRD includes data from over 2.3 million users and more than 380 million impression logs collected from Ekstra Bladet. The dataset was compiled by recording behavior logs from active users during a six-week period from April 27 to June 8, 2023. This timeframe was chosen to avoid major events, such as holidays or elections, that could result in atypical user behavior on Ekstra Bladet. To protect user privacy, users were anonymized using one-time salt mapping. In addition to user interaction data, the dataset includes news articles published by Ekstra Bladet, enriched with textual context features such as titles, abstracts, bodies, and categories. Moreover, the dataset provides features generated by proprietary models, including topics, named entity recognition (NER), and article embeddings.

Participants were asked to predict which article a user would click on from the list of articles shown during a specific impression. In particular, the challenge's objective is to estimate the likelihood of a user clicking on each article by evaluating the compatibility between the article's content and the user's preferences. The articles are ranked by these likelihood scores, and the precision of these rankings is measured against the actual selections made by users.
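To make the task concrete, the following minimal Python sketch ranks the candidate articles of one impression by predicted click likelihood and checks where the clicked article lands. The article ids, the scores, and the helper names `rank_articles` and `reciprocal_rank` are hypothetical. Reciprocal rank is used only as an illustrative ranking measure and is not claimed to be the official challenge metric.

```python
def rank_articles(scores):
    """Return article ids sorted by descending predicted click likelihood."""
    return sorted(scores, key=scores.get, reverse=True)


def reciprocal_rank(ranking, clicked_id):
    """Return 1 / (1-based position of the clicked article in the ranking)."""
    return 1.0 / (ranking.index(clicked_id) + 1)


# Hypothetical impression: candidate article ids mapped to model scores.
impression_scores = {"a1": 0.10, "a2": 0.70, "a3": 0.20}
ranking = rank_articles(impression_scores)
print(ranking)                         # ['a2', 'a3', 'a1']
print(reciprocal_rank(ranking, "a3"))  # 0.5, the clicked article is ranked second
```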
We participated in the challenge as FeatureSalad, a team of six MSc students from Politecnico di Milano.
We worked under the supervision of:
- Maurizio Ferrari Dacrema (Assistant Professor)
- Andrea Pisani (PhD student)
The first step is to download the dataset `.parquet` files and place them in the `dataset/` folder. In the end you should have the following structure:
├── Ekstra_Bladet_contrastive_vector
│ └── contrastive_vector.parquet
...
├── ebnerd_demo
│ ├── articles.parquet
│ ├── train
│ │ ├── behaviors.parquet
│ │ └── history.parquet
│ └── validation
│ ├── behaviors.parquet
│ └── history.parquet
...
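Before running the preprocessing, it can help to verify that the files are where the scripts expect them. The sketch below checks only the files named in the tree above (the elided `...` entries are deliberately not guessed). The helper name `missing_files` and the assumption that the check is launched from the repository root are illustrative.

```python
from pathlib import Path

# Files taken from the tree above; the elided entries ("...") are not listed.
EXPECTED_FILES = [
    "Ekstra_Bladet_contrastive_vector/contrastive_vector.parquet",
    "ebnerd_demo/articles.parquet",
    "ebnerd_demo/train/behaviors.parquet",
    "ebnerd_demo/train/history.parquet",
    "ebnerd_demo/validation/behaviors.parquet",
    "ebnerd_demo/validation/history.parquet",
]


def missing_files(dataset_root):
    """Return the expected relative paths that do not exist under dataset_root."""
    root = Path(dataset_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


# Assumption: the current working directory is the repository root.
missing = missing_files("dataset")
print("All files found." if not missing else f"Missing: {missing}")
```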
To build the preprocessed dataset, run:
sh ~/RecSysChallenge2024/src/polimi/scripts/run_all_preprocessing.sh
The complete preprocessed dataset is now available at `~/RecSysChallenge2024/preprocessing/train_ds.parquet`, ready to be used.
Our submission is based on stacking. To do so, we need to create the dataframe with the "first-level" model predictions. The procedure is:
- Train the level one models on the train set.
- Run inference over the validation set with these models.
- Train the level two models using the validation set augmented with the level one models' predictions.
Then, to create the level two dataset for the test set:
- Train the level one models on the train + validation set.
- Run inference over the test set with these models.
Finally, to generate the test set predictions:
- Run inference with the previously trained level two models.
- Return as the final prediction the average of the predictions of all level two models.
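The data flow of the procedure above can be sketched in plain Python. This is a toy illustration, not the repository's code: `run_stacking`, `augment`, and the three `fit_*` functions are hypothetical stand-ins for the real CatBoost, LightGBM, and neural models, and each "model" is just a callable that maps a feature row to a score.

```python
from statistics import mean


def augment(X, level1_models):
    """Append every level one prediction to each feature row."""
    return [row + [model(row) for model in level1_models] for row in X]


def run_stacking(level1_fits, level2_fits, train, valid, X_test):
    X_train, y_train = train
    X_valid, y_valid = valid

    # Phase 1: level one on train, level two on the augmented validation set.
    level1 = [fit(X_train, y_train) for fit in level1_fits]
    level2 = [fit(augment(X_valid, level1), y_valid) for fit in level2_fits]

    # Phase 2: refit level one on train + validation, then augment the test set.
    level1_full = [fit(X_train + X_valid, y_train + y_valid) for fit in level1_fits]
    X_test_aug = augment(X_test, level1_full)

    # Phase 3: average the predictions of all level two models.
    return [mean(model(row) for model in level2) for row in X_test_aug]


# Toy stand-ins for real learners (assumptions for illustration only).
def fit_click_rate(X, y):
    rate = mean(y)                     # predicts the global click rate
    return lambda row: rate


def fit_first_feature(X, y):
    return lambda row: row[0]          # echoes the first feature as a score


def fit_level2_mean(X, y):
    return lambda row: mean(row[-2:])  # averages the two appended level one scores


train = ([[0.2], [0.8]], [0, 1])
valid = ([[0.4], [0.6]], [0, 1])
final = run_stacking([fit_click_rate, fit_first_feature], [fit_level2_mean],
                     train, valid, X_test=[[1.0]])
print(final)  # [0.75]
```

The key point the sketch preserves is that the level two models never see level one predictions made on rows the level one models were trained on.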
Model | Type | Level 1 | Level 2 |
---|---|---|---|
CatBoost | Classifier | * | * |
CatBoost | Ranker | * | |
LightGBM | Classifier | * | * |
LightGBM | Ranker | * | |
MLP | Classifier | * | |
GANDALF | Classifier | * | |
DEEP & CROSS | Classifier | * | |
WIDE & DEEP | Classifier | * | |
The following table shows the configuration file with the hyperparameters used for each model and each preprocessing version. Neural models were not trained on the second version of the preprocessing due to time constraints.
Model | Type | Configuration Path |
---|---|---|
CatBoost | Classifier | ~/RecSysChallenge2024/configuration_files/catboost_classifier_recsys_best.json |
CatBoost | Ranker | ~/RecSysChallenge2024/configuration_files/catboost_ranker_new_noK_95.json |
LightGBM | Classifier | ~/RecSysChallenge2024/configuration_files/lightgbm_cls_recsys_trial_107.json |
LightGBM | Ranker | ~/RecSysChallenge2024/configuration_files/lightgbm_ranker_recsys_trial_219.json |
MLP | Classifier | ~/RecSysChallenge2024/configuration_files/mlp_tuning_new_trial_208_early_stopped_long_with_pre.json |
GANDALF | Classifier | ~/RecSysChallenge2024/configuration_files/gandalf_tuning_new_trial_130_early_stopped_with_pre.json |
DEEP & CROSS | Classifier | ~/RecSysChallenge2024/configuration_files/deep_cross_tuning_new_trial_67_early_stopped_with_pre.json |
WIDE & DEEP | Classifier | ~/RecSysChallenge2024/configuration_files/wide_deep_new_trial_72_early_stopped_with_pre.json |
Note that training each of these models requires the path of the desired preprocessing version along with the correct configuration file path. Pass both as command line arguments.
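As a rough illustration of this pattern, a training entry point could read the two paths with `argparse` and load the hyperparameters from the JSON file. The sketch below is hypothetical and not the repository's implementation: it only mirrors the single-dash flag names that appear in the commands of this README (`-dataset_path`, `-params_file`, `-output_dir`), while the default value and the helper names are assumptions.

```python
import argparse
import json


def parse_training_args(argv=None):
    """Parse the preprocessing path, the configuration file, and an output directory."""
    parser = argparse.ArgumentParser(description="Hypothetical training entry point")
    parser.add_argument("-dataset_path", required=True,
                        help="directory of the preprocessing version to train on")
    parser.add_argument("-params_file", required=True,
                        help="JSON file with the model hyperparameters")
    parser.add_argument("-output_dir", default="models",
                        help="where to store the trained model (assumed default)")
    return parser.parse_args(argv)


def load_params(path):
    """Read the hyperparameters from a JSON configuration file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Example with an explicit argument list (both paths are placeholders).
args = parse_training_args(["-dataset_path", "preprocessing/v1",
                            "-params_file", "params.json"])
print(args.dataset_path, args.params_file, args.output_dir)
```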
Moreover, all models except the ranker have been trained on a subsample of the dataset. To create a subsample of the preprocessed dataset, you can run the following script:
python ~/RecSysChallenge2024/src/polimi/preprocessing_pipelines/subsample_train.py \
-output_dir ~/RecSysChallenge2024/experiments/ \
-dataset_dir ~/RecSysChallenge2024/preprocessing/... \
-original_path ~/RecSysChallenge2024/dataset/ebnerd_small/train/behaviors.parquet
where `-dataset_dir` is the path of the directory containing the `train_ds.parquet` preprocessing file.
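The internals of `subsample_train.py` are not described here. As a generic illustration only, one common way to subsample ranking data is to sample whole impressions, so that every kept impression retains its full candidate list. The sketch assumes rows carrying an `impression_id` key; the function name, the `fraction` parameter, and the seed are hypothetical.

```python
import random


def subsample_impressions(rows, fraction, seed=42):
    """Keep all rows belonging to a random subset of impressions."""
    impression_ids = sorted({row["impression_id"] for row in rows})
    if not impression_ids:
        return []
    rng = random.Random(seed)  # fixed seed for a reproducible subsample
    n_keep = max(1, round(len(impression_ids) * fraction))
    kept = set(rng.sample(impression_ids, n_keep))
    return [row for row in rows if row["impression_id"] in kept]


# Hypothetical data: 4 impressions with 2 candidate articles each.
rows = [{"impression_id": i, "article": a} for i in range(4) for a in ("x", "y")]
sample = subsample_impressions(rows, fraction=0.5)
print(len(sample))  # 4 rows, i.e. 2 complete impressions
```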
python ~/RecSysChallenge2024/src/polimi/scripts/catboost_training.py \
-output_dir ~/RecSysChallenge2024/models \
-dataset_path ~/RecSysChallenge2024/preprocessing/... \
-catboost_params_file ~/RecSysChallenge2024/configuration_files/catboost_classifier_recsys_best.json \
-catboost_verbosity 20 \
-model_name catboost_classifier
python ~/RecSysChallenge2024/src/polimi/scripts/lightgbm_training.py \
-output_dir ~/RecSysChallenge2024/models \
-dataset_path ~/RecSysChallenge2024/preprocessing/... \
-lgbm_params_file ~/RecSysChallenge2024/configuration_files/...
python ~/RecSysChallenge2024/src/polimi/scripts/nn_training.py \
-output_dir ~/RecSysChallenge2024/models \
-dataset_path ~/RecSysChallenge2024/preprocessing/... \
-params_file ~/RecSysChallenge2024/configuration_files/... \
-model_name ...
In our solution, the CatBoost ranker was trained in batches due to memory limitations. An example of the procedure can be found in `~/RecSysChallenge2024/src/polimi/scripts/catboost_ranker_batch_training`; inside that folder, the file `_procedure_batch_training.txt` explains the procedure.
If there are no memory constraints, you can train the LightGBM and CatBoost rankers with the same scripts used for the classifiers described above, passing the argument `--ranker`.
For CatBoost/LightGBM, you can use the following script:
python ~/RecSysChallenge2024/src/polimi/scripts/inference.py \
-output_dir ~/RecSysChallenge2024/inference \
-dataset_path ~/RecSysChallenge2024/preprocessing/... \
-model_path ~/RecSysChallenge2024/models/{model_name}/model.joblib \
-behaviors_path ~/RecSysChallenge2024/dataset/ebnerd_testset/test/behaviors.parquet \
-batch_size 1000000 \
--submit
Otherwise, for NN models:
python ~/RecSysChallenge2024/src/polimi/scripts/nn_inference_batched.py \
-output_dir ~/RecSysChallenge2024/inference \
-dataset_path ~/RecSysChallenge2024/preprocessing/... \
-model_path ~/RecSysChallenge2024/models/{model_name} \
-params_file ~/RecSysChallenge2024/configuration_files/... \
-batch_size 5096 \
-behaviors_path ~/RecSysChallenge2024/dataset/ebnerd_testset/test/behaviors.parquet \
--submit
To run inference over the validation set, pass the flag `--eval`; otherwise use the flag `--submit`.
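Both inference commands take a `-batch_size` argument, which suggests that predictions are computed chunk by chunk to bound memory usage. A minimal sketch of that idea follows. The helper names and the toy `predict_fn` are hypothetical, and the actual scripts may batch differently.

```python
def iter_batches(rows, batch_size):
    """Yield consecutive slices of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def predict_in_batches(predict_fn, rows, batch_size):
    """Apply predict_fn to one batch at a time and concatenate the results."""
    predictions = []
    for batch in iter_batches(rows, batch_size):
        predictions.extend(predict_fn(batch))
    return predictions


# Toy model: the score of a row is the sum of its features.
toy_predict = lambda batch: [sum(row) for row in batch]
rows = [[i, i] for i in range(5)]
print(predict_in_batches(toy_predict, rows, batch_size=2))  # [0, 2, 4, 6, 8]
```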
python ~/RecSysChallenge2024/src/polimi/scripts/preprocessing_level_2.py \
-features_dir ~/RecSysChallenge2024/experiments \
-model_json ~/RecSysChallenge2024/configuration_files/... \
-output_dir ~/RecSysChallenge2024/stacking \
--train
Where `-features_dir` is the path of the directory that contains the features of the level 1 models.
Moreover, you can remove the flag `--train` if you are building the level 2 dataset for the test set.
The following configuration files were used for the final submission:
Model | Type | Configuration Path |
---|---|---|
CatBoost | Classifier | ~/RecSysChallenge2024/configuration_files/stacking_catboost_cls_features_double_iterations.json |
LightGBM | Classifier | ~/RecSysChallenge2024/configuration_files/stacking_lgbm_cls_features.json |
Finally, the last step is to average the predictions of the two models.
python ~/RecSysChallenge2024/src/polimi/scripts/generate_hybrid_submission.py \
-prediction_1 ~/RecSysChallenge2024/inference/Inference_stacking_Catboost/prediction_ds.parquet \
-prediction_2 ~/RecSysChallenge2024/inference/Inference_stacking_LightGBM/prediction_ds.parquet \
-original_path ~/RecSysChallenge2024/dataset/ebnerd_testset/test/behaviors.parquet \
-output_dir ~/RecSysChallenge2024/inference/Inference_stacking_Hybrid
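Conceptually, this step combines two score lists per impression. The sketch below uses a plain arithmetic mean over hypothetical in-memory dictionaries keyed by impression id. Whether `generate_hybrid_submission.py` normalizes the scores or weights the models before averaging is not stated here, so treat this as an assumption.

```python
def average_predictions(pred_1, pred_2):
    """Average the per-article scores of two models, impression by impression."""
    if pred_1.keys() != pred_2.keys():
        raise ValueError("the two prediction sets cover different impressions")
    averaged = {}
    for impression_id, scores_1 in pred_1.items():
        scores_2 = pred_2[impression_id]
        if len(scores_1) != len(scores_2):
            raise ValueError(f"candidate count mismatch in impression {impression_id}")
        averaged[impression_id] = [(a + b) / 2 for a, b in zip(scores_1, scores_2)]
    return averaged


# Hypothetical scores for one impression with two candidate articles.
catboost_pred = {101: [0.25, 0.75]}
lightgbm_pred = {101: [0.50, 0.25]}
print(average_predictions(catboost_pred, lightgbm_pred))  # {101: [0.375, 0.5]}
```

The explicit checks guard against averaging files that were produced from different behaviors files or with a different candidate order.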