NodeFlow: Towards End-to-end Flexible Probabilistic Regression on Tabular Data

This repository contains the code and resources related to the research paper titled "NodeFlow: Towards End-to-end Flexible Probabilistic Regression on Tabular Data" by Anonymous Authors.

Abstract

We introduce NodeFlow, a flexible framework for probabilistic regression on tabular data that combines Neural Oblivious Decision Ensemble (NODE) and Conditional Continuous Normalizing Flows (CNF). It offers improved modeling capabilities for arbitrary probabilistic distributions, addressing the limitations of traditional parametric approaches. In NodeFlow, the NODE captures complex relationships in tabular data through a tree-like structure, while the conditional CNF utilizes the NODE's output space as a conditioning factor. The training process of NodeFlow employs standard gradient-based learning, facilitating end-to-end optimization of the NODEs and CNF-based density estimation. This approach ensures superior performance, ease of implementation, and scalability, making NodeFlow an appealing choice for practitioners and researchers. Extensive evaluations on benchmark datasets demonstrate the effectiveness of NodeFlow, showcasing its superior accuracy in point estimation and robust uncertainty quantification. Furthermore, ablation studies are conducted to justify the design choices of NodeFlow. In conclusion, NodeFlow's end-to-end training process and strong performance make it a compelling solution for practitioners and researchers. Additionally, it opens new avenues for research and application in the field of probabilistic regression on tabular data.

Prerequisites

NodeFlow and the other models in this repository use PyTorch Lightning for training and inference. We also used the Optuna framework for hyperparameter tuning. The full list of requirements is provided in the requirements.in file. Use pip-tools to compile the requirements:

python3 -m venv venv
source venv/bin/activate
pip install pip-tools
pip-compile requirements.in
pip install -r requirements.txt
pip install -e .

Getting Started

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from lightning.pytorch import Trainer
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

from probabilistic_flow_boosting.models.nodeflow import NodeFlow, NodeFlowDataModule

# Load a small regression dataset and reshape the targets to a column vector of shape (n_samples, 1).
x, y = load_diabetes(return_X_y=True)
y = y.reshape(-1, 1)
x_train, x_test, y_train, y_test = train_test_split(x, y)
datamodule = NodeFlowDataModule(x_train, y_train, x_test, y_test, split_size=0.8, batch_size=2048)

model_hyperparameters = {
    "num_layers": 3,
    "depth": 2,
    "tree_output_dim": 1,
    "num_trees": 100,
    "flow_hidden_dims": [8, 8]
}
model = NodeFlow(input_dim=10, output_dim=1, **model_hyperparameters)

trainer = Trainer(max_epochs=100, devices=1, enable_checkpointing=False, inference_mode=False)
trainer.fit(model, datamodule=datamodule)

test_datamodule = NodeFlowDataModule(x_train, y_train, x_test, y_test, split_size=0.8, batch_size=2048)
test_results = trainer.test(model, datamodule=test_datamodule)

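# Draw samples from the predictive distribution; the reshape assumes 1000 samples per test observation.
samples = trainer.predict(model, datamodule=test_datamodule, return_predictions=True)
samples = np.concatenate(samples).reshape(-1, 1000)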

# Plot a KDE of the predictive samples for the first test observation (index 0).
plt.axvline(x=y_test[0, 0], color='r', label='True value')
sns.kdeplot(samples[0, :], color='blue', label='NodeFlow')
plt.legend()
plt.show()
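
The samples array above has one row per test observation and one column per draw, so point estimates and predictive intervals can be read off with plain NumPy. The following is a minimal sketch under that assumption: the synthetic samples array stands in for the output of trainer.predict, and the helper name summarize_samples is hypothetical, not part of this repository.

```python
import numpy as np


def summarize_samples(samples, lower=0.05, upper=0.95):
    """Summarize an array of shape (n_observations, n_samples) row by row."""
    samples = np.asarray(samples, dtype=float)
    if samples.ndim != 2:
        raise ValueError("expected a 2-D array of shape (n_observations, n_samples)")
    return {
        "mean": samples.mean(axis=1),                   # point estimate
        "median": np.median(samples, axis=1),           # robust point estimate
        "lower": np.quantile(samples, lower, axis=1),   # lower interval bound
        "upper": np.quantile(samples, upper, axis=1),   # upper interval bound
    }


# Synthetic stand-in for the samples produced in the example above (assumption).
rng = np.random.default_rng(0)
samples = rng.normal(loc=[[100.0], [200.0]], scale=10.0, size=(2, 1000))

summary = summarize_samples(samples)
print(summary["mean"], summary["lower"], summary["upper"])
```

With the default bounds, the interval between lower and upper covers the central 90% of the samples for each observation.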

Code Structure

├── conf/                 # Configuration files
├── data/                 # Datasets and trained models
├── images/               # Images for documentation purposes
├── logs/                 # Logs generated during experiments execution
├── notebooks/            # Jupyter notebooks for models analysis
├── src/                  # Source code
├── README.md             # This document
└── ...

The source code in the src/probabilistic_flow_boosting directory contains the following:

  • models - Model implementations.
  • extras - Kedro wrappers for the TreeFlow model and UCI Datasets.
  • pipelines - Kedro pipelines for the experiments.

Data and Models

The full data folder can be found under the following link: Link. The full set of trained models can be found under the following link: Link. More details regarding the datasets can be found in the paper in the appendix directory.

Experiments

Experiments are built with the Kedro framework, and their definitions are in the src/probabilistic_flow_boosting/pipeline_registry.py file. The most important configuration file is conf/base/parameters/modeling/nodeflow.yml, which defines the hyperparameters for the grid search. It needs to be adjusted for each dataset separately.
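
To make the idea of a per-dataset hyperparameter grid concrete, the sketch below enumerates every combination of a small grid, using the hyperparameter names from the Getting Started example. The candidate values and the helper expand_grid are illustrative assumptions. They do not reproduce the contents or schema of nodeflow.yml, and the repository's own tuning runs through Kedro and Optuna.

```python
from itertools import product


def expand_grid(grid):
    """Yield one dict per combination of the candidate values in `grid`."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


# Hypothetical candidate values; adjust them for each dataset.
grid = {
    "num_layers": [1, 3],
    "depth": [2, 4],
    "tree_output_dim": [1],
    "num_trees": [100, 300],
    "flow_hidden_dims": [[8, 8], [16, 16]],
}

configurations = list(expand_grid(grid))
print(len(configurations))  # 2 * 2 * 1 * 2 * 2 = 16 combinations
```

Each resulting dictionary has the same keys as model_hyperparameters in the Getting Started example, so it could be unpacked into the NodeFlow constructor in the same way.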

How to run the Kedro project?

Inside the container, run the following command:

python -m kedro run --pipeline <pipeline_name>

where pipeline_name is the name of the pipeline defined in the src/probabilistic_flow_boosting/pipeline_registry.py file.

Citation

If you use this code or the research findings in your work, please cite our paper:

@article{WielopolskiFZ24,
    AUTHOR = {Wielopolski, Patryk and Furman, Oleksii and Zięba, Maciej},
    TITLE = {NodeFlow: Towards End-to-End Flexible Probabilistic Regression on Tabular Data},
    JOURNAL = {Entropy},
    VOLUME = {26},
    YEAR = {2024},
    NUMBER = {7},
    ARTICLE-NUMBER = {593},
    URL = {https://www.mdpi.com/1099-4300/26/7/593},
    PubMedID = {39056955},
    ISSN = {1099-4300},
    DOI = {10.3390/e26070593}
}

Contact

For questions or comments, please get in touch via LinkedIn: Patryk Wielopolski