This repository contains the code and resources for the research paper "NodeFlow: Towards End-to-End Flexible Probabilistic Regression on Tabular Data" by Patryk Wielopolski, Oleksii Furman, and Maciej Zięba.
We introduce NodeFlow, a flexible framework for probabilistic regression on tabular data that combines a Neural Oblivious Decision Ensemble (NODE) with a conditional Continuous Normalizing Flow (CNF). NodeFlow can model arbitrary predictive distributions, addressing the limitations of traditional parametric approaches. The NODE component captures complex relationships in tabular data through a tree-like structure, while the conditional CNF uses the NODE's output space as its conditioning factor. Training relies on standard gradient-based learning, enabling end-to-end optimization of both the NODE ensemble and the CNF-based density estimator, which makes NodeFlow performant, easy to implement, and scalable.

Extensive evaluations on benchmark datasets demonstrate NodeFlow's effectiveness, showing strong accuracy in point estimation and robust uncertainty quantification, and ablation studies justify its design choices. NodeFlow's end-to-end training process and strong performance make it a compelling solution for practitioners and researchers, and open new avenues for research and application in probabilistic regression on tabular data.
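Conceptually, training reduces to maximizing the likelihood of the targets under a flow conditioned on the tree-ensemble output. The sketch below illustrates that objective only; `node`, `cnf`, and `cnf.log_prob` are hypothetical stand-ins, not the modules implemented in this repository:

def nodeflow_nll(node, cnf, x, y):
    """Conceptual NodeFlow objective (hypothetical API, for illustration only)."""
    embedding = node(x)                            # NODE maps features to an embedding
    log_prob = cnf.log_prob(y, context=embedding)  # conditional CNF evaluates log p(y | x)
    return -log_prob.mean()                        # negative log-likelihood, optimized end-to-end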
NodeFlow and the other models in this repository use PyTorch Lightning for training and inference. We additionally used the Optuna framework for hyperparameter tuning.
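As a point of reference, a minimal Optuna study could look like the sketch below. This is an illustrative example rather than the tuning code used for the paper: the search space is made up, and it assumes a `datamodule` built as in the quick-start example further down and a model that logs a `val_loss` metric.

import optuna
from lightning.pytorch import Trainer
from probabilistic_flow_boosting.models.nodeflow import NodeFlow

def objective(trial):
    # Illustrative search space; the actual grids live under conf/base/parameters/.
    hparams = {
        "num_layers": trial.suggest_int("num_layers", 1, 4),
        "depth": trial.suggest_int("depth", 1, 4),
        "num_trees": trial.suggest_categorical("num_trees", [50, 100, 200]),
        "tree_output_dim": 1,
        "flow_hidden_dims": [8, 8],
    }
    model = NodeFlow(input_dim=10, output_dim=1, **hparams)
    trainer = Trainer(max_epochs=100, enable_checkpointing=False, inference_mode=False)
    trainer.fit(model, datamodule=datamodule)
    # Assumes a "val_loss" metric is logged during validation.
    return trainer.callback_metrics["val_loss"].item()

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)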
The full list of requirements is provided in the `requirements.in` file.
Use `pip-tools` to compile the requirements:
python3 -m venv venv               # create and activate a virtual environment
source venv/bin/activate
pip install pip-tools
pip-compile requirements.in        # generates requirements.txt from requirements.in
pip install -r requirements.txt
pip install -e .                   # install this package in editable mode
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from lightning.pytorch import Trainer
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from probabilistic_flow_boosting.models.nodeflow import NodeFlow, NodeFlowDataModule

# Load an example regression dataset (10 features, 1 target).
x, y = load_diabetes(return_X_y=True)
y = y.reshape(-1, 1)
x_train, x_test, y_train, y_test = train_test_split(x, y)

# The data module further splits the training data into train/validation parts (80/20).
datamodule = NodeFlowDataModule(x_train, y_train, x_test, y_test, split_size=0.8, batch_size=2048)

model_hyperparameters = {
    "num_layers": 3,
    "depth": 2,
    "tree_output_dim": 1,
    "num_trees": 100,
    "flow_hidden_dims": [8, 8],
}
model = NodeFlow(input_dim=10, output_dim=1, **model_hyperparameters)

# inference_mode=False runs evaluation under torch.no_grad() rather than torch.inference_mode().
trainer = Trainer(max_epochs=100, devices=1, enable_checkpointing=False, inference_mode=False)
trainer.fit(model, datamodule=datamodule)

test_datamodule = NodeFlowDataModule(x_train, y_train, x_test, y_test, split_size=0.8, batch_size=2048)
test_results = trainer.test(model, datamodule=test_datamodule)

# Draw samples from the predictive distribution and flatten the per-batch
# predictions into an array of shape (n_test_observations, 1000).
samples = trainer.predict(model, datamodule=test_datamodule, return_predictions=True)
samples = np.concatenate(samples).reshape(-1, 1000)

# Plot a KDE of the predictive distribution for the first observation, i.e. index = 0.
plt.axvline(x=y_test[0, 0], color='r', label='True value')
sns.kdeplot(samples[0, :], color='blue', label='NodeFlow')
plt.legend()
plt.show()
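The predictive samples can also be turned into point estimates and prediction intervals. The snippet below is an illustrative sketch of such post-processing, not the evaluation protocol from the paper:

# Point estimate: mean of the predictive samples for each observation.
y_pred = samples.mean(axis=1, keepdims=True)
rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))

# 90% central prediction interval from the empirical quantiles.
lower = np.quantile(samples, 0.05, axis=1)
upper = np.quantile(samples, 0.95, axis=1)
coverage = np.mean((y_test[:, 0] >= lower) & (y_test[:, 0] <= upper))

print(f"RMSE: {rmse:.2f}, 90% interval coverage: {coverage:.2%}")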
├── conf/ # Configuration files
├── data/ # Datasets and trained models
├── images/ # Images for documentation purposes
├── logs/ # Logs generated during experiments execution
├── notebooks/ # Jupyter notebooks for models analysis
├── src/ # Source code
├── README.md # This document
└── ...
The source code in the `src/probabilistic_flow_boosting` directory contains the following:
- `models` - Model implementations.
- `extras` - Kedro wrappers for the TreeFlow model and the UCI datasets.
- `pipelines` - Kedro pipelines for the experiments.
The full data folder can be found under the following link: Link. The full trained models can be found under the following link: Link. More details regarding the datasets can be found in the appendix of the paper.
Experiments are created using the Kedro framework, and their definitions are in the `src/probabilistic_flow_boosting/pipeline_registry.py` file.
The most important configuration file is the `conf/base/parameters/modeling/nodeflow.yml` file. This is where you define the hyperparameter grid for the search; it needs to be adjusted for each dataset separately.
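For orientation, such a grid could look like the following sketch. The keys and values here are illustrative assumptions; consult the shipped `nodeflow.yml` for the exact schema expected by the pipelines:

# Hypothetical hyperparameter grid (see the shipped nodeflow.yml for the real schema).
nodeflow:
  num_layers: [2, 3, 4]
  depth: [1, 2]
  num_trees: [100, 200]
  tree_output_dim: [1]
  flow_hidden_dims: [[8, 8], [16, 16]]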
Inside the container, run the following command:
python -m kedro run --pipeline <pipeline_name>
where `<pipeline_name>` is the name of a pipeline defined in the `src/probabilistic_flow_boosting/pipeline_registry.py` file.
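To see which pipelines are registered, recent Kedro versions also provide a registry command (assuming a Kedro version with the `registry` CLI group):

python -m kedro registry list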
If you use this code or the research findings in your work, please cite our paper:
@article{WielopolskiFZ24,
AUTHOR = {Wielopolski, Patryk and Furman, Oleksii and Zięba, Maciej},
TITLE = {NodeFlow: Towards End-to-End Flexible Probabilistic Regression on Tabular Data},
JOURNAL = {Entropy},
VOLUME = {26},
YEAR = {2024},
NUMBER = {7},
ARTICLE-NUMBER = {593},
URL = {https://www.mdpi.com/1099-4300/26/7/593},
PubMedID = {39056955},
ISSN = {1099-4300},
DOI = {10.3390/e26070593}
}
In case of questions or comments, please reach out via LinkedIn: Patryk Wielopolski