This repository contains the code and resources for the research paper "NodeFlow: Towards End-to-End Flexible Probabilistic Regression on Tabular Data" by Patryk Wielopolski, Oleksii Furman, and Maciej Zięba.
We introduce NodeFlow, a flexible framework for probabilistic regression on tabular data that combines a Neural Oblivious Decision Ensemble (NODE) with a conditional Continuous Normalizing Flow (CNF). NodeFlow can model arbitrary predictive distributions, addressing the limitations of traditional parametric approaches. The NODE component captures complex relationships in tabular data through a tree-like structure, while the conditional CNF uses the NODE's output space as its conditioning factor. Training relies on standard gradient-based learning, enabling end-to-end optimization of both the NODE ensemble and the CNF-based density estimator, which makes NodeFlow performant, easy to implement, and scalable.

Extensive evaluations on benchmark datasets demonstrate NodeFlow's effectiveness, showing strong accuracy in point estimation and robust uncertainty quantification, and ablation studies justify its design choices. NodeFlow's end-to-end training process and strong performance make it a compelling solution for practitioners and researchers, and open new avenues for research and application in probabilistic regression on tabular data.
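Conceptually, training reduces to maximizing the likelihood of the targets under a flow conditioned on the tree-ensemble output. The sketch below illustrates that objective only; `node`, `cnf`, and `cnf.log_prob` are hypothetical stand-ins, not the modules implemented in this repository:

def nodeflow_nll(node, cnf, x, y):
    """Conceptual NodeFlow objective (hypothetical API, for illustration only)."""
    embedding = node(x)                            # NODE maps features to an embedding
    log_prob = cnf.log_prob(y, context=embedding)  # conditional CNF evaluates log p(y | x)
    return -log_prob.mean()                        # negative log-likelihood, optimized end-to-end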
NodeFlow and the other models in this repository use PyTorch Lightning for training and inference. We additionally used the Optuna framework for hyperparameter tuning.
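As a point of reference, a minimal Optuna study could look like the sketch below. This is an illustrative example rather than the tuning code used for the paper: the search space is made up, and it assumes a `datamodule` built as in the quick-start example further down and a model that logs a `val_loss` metric.

import optuna
from lightning.pytorch import Trainer
from probabilistic_flow_boosting.models.nodeflow import NodeFlow

def objective(trial):
    # Illustrative search space; the actual grids live under conf/base/parameters/.
    hparams = {
        "num_layers": trial.suggest_int("num_layers", 1, 4),
        "depth": trial.suggest_int("depth", 1, 4),
        "num_trees": trial.suggest_categorical("num_trees", [50, 100, 200]),
        "tree_output_dim": 1,
        "flow_hidden_dims": [8, 8],
    }
    model = NodeFlow(input_dim=10, output_dim=1, **hparams)
    trainer = Trainer(max_epochs=100, enable_checkpointing=False, inference_mode=False)
    trainer.fit(model, datamodule=datamodule)
    # Assumes a "val_loss" metric is logged during validation.
    return trainer.callback_metrics["val_loss"].item()

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)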
The full list of requirements is provided in the `requirements.in` file.
Use `pip-tools` to compile the requirements:
python3 -m venv venv               # create and activate a virtual environment
source venv/bin/activate
pip install pip-tools
pip-compile requirements.in        # generates requirements.txt from requirements.in
pip install -r requirements.txt
pip install -e .                   # install this package in editable mode
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from lightning.pytorch import Trainer
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from probabilistic_flow_boosting.models.nodeflow import NodeFlow, NodeFlowDataModule

# Load an example regression dataset (10 features, 1 target).
x, y = load_diabetes(return_X_y=True)
y = y.reshape(-1, 1)
x_train, x_test, y_train, y_test = train_test_split(x, y)

# The data module further splits the training data into train/validation parts (80/20).
datamodule = NodeFlowDataModule(x_train, y_train, x_test, y_test, split_size=0.8, batch_size=2048)

model_hyperparameters = {
    "num_layers": 3,
    "depth": 2,
    "tree_output_dim": 1,
    "num_trees": 100,
    "flow_hidden_dims": [8, 8],
}
model = NodeFlow(input_dim=10, output_dim=1, **model_hyperparameters)

# inference_mode=False runs evaluation under torch.no_grad() rather than torch.inference_mode().
trainer = Trainer(max_epochs=100, devices=1, enable_checkpointing=False, inference_mode=False)
trainer.fit(model, datamodule=datamodule)

test_datamodule = NodeFlowDataModule(x_train, y_train, x_test, y_test, split_size=0.8, batch_size=2048)
test_results = trainer.test(model, datamodule=test_datamodule)

# Draw samples from the predictive distribution and flatten the per-batch
# predictions into an array of shape (n_test_observations, 1000).
samples = trainer.predict(model, datamodule=test_datamodule, return_predictions=True)
samples = np.concatenate(samples).reshape(-1, 1000)

# Plot a KDE of the predictive distribution for the first observation, i.e. index = 0.
plt.axvline(x=y_test[0, 0], color='r', label='True value')
sns.kdeplot(samples[0, :], color='blue', label='NodeFlow')
plt.legend()
plt.show()
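The predictive samples can also be turned into point estimates and prediction intervals. The snippet below is an illustrative sketch of such post-processing, not the evaluation protocol from the paper:

# Point estimate: mean of the predictive samples for each observation.
y_pred = samples.mean(axis=1, keepdims=True)
rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))

# 90% central prediction interval from the empirical quantiles.
lower = np.quantile(samples, 0.05, axis=1)
upper = np.quantile(samples, 0.95, axis=1)
coverage = np.mean((y_test[:, 0] >= lower) & (y_test[:, 0] <= upper))

print(f"RMSE: {rmse:.2f}, 90% interval coverage: {coverage:.2%}")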
├── conf/ # Configuration files
├── data/ # Datasets and trained models
├── images/ # Images for documentation purposes
├── logs/ # Logs generated during experiments execution
├── notebooks/ # Jupyter notebooks for models analysis
├── src/ # Source code
├── README.md # This document
└── ...
The source code in the `src/probabilistic_flow_boosting` directory contains the following:
- `models` - Model implementations.
- `extras` - Kedro wrappers for the TreeFlow model and the UCI datasets.
- `pipelines` - Kedro pipelines for the experiments.
The full data folder can be found under the following link: Link. The full trained models can be found under the following link: Link. More details regarding the datasets can be found in the appendix of the paper.
Experiments are created using the Kedro framework, and their definitions are in the `src/probabilistic_flow_boosting/pipeline_registry.py` file.
The most important configuration file is the `conf/base/parameters/modeling/nodeflow.yml` file. This is where you define the hyperparameter grid for the search; it needs to be adjusted for each dataset separately.
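For orientation, such a grid could look like the following sketch. The keys and values here are illustrative assumptions; consult the shipped `nodeflow.yml` for the exact schema expected by the pipelines:

# Hypothetical hyperparameter grid (see the shipped nodeflow.yml for the real schema).
nodeflow:
  num_layers: [2, 3, 4]
  depth: [1, 2]
  num_trees: [100, 200]
  tree_output_dim: [1]
  flow_hidden_dims: [[8, 8], [16, 16]]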
Inside the container, run the following command:
python -m kedro run --pipeline <pipeline_name>
where `<pipeline_name>` is the name of a pipeline defined in the `src/probabilistic_flow_boosting/pipeline_registry.py` file.
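To see which pipelines are registered, recent Kedro versions also provide a registry command (assuming a Kedro version with the `registry` CLI group):

python -m kedro registry list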
If you use this code or the research findings in your work, please cite our paper:
@article{WielopolskiFZ24,
AUTHOR = {Wielopolski, Patryk and Furman, Oleksii and Zięba, Maciej},
TITLE = {NodeFlow: Towards End-to-End Flexible Probabilistic Regression on Tabular Data},
JOURNAL = {Entropy},
VOLUME = {26},
YEAR = {2024},
NUMBER = {7},
ARTICLE-NUMBER = {593},
URL = {https://www.mdpi.com/1099-4300/26/7/593},
PubMedID = {39056955},
ISSN = {1099-4300},
DOI = {10.3390/e26070593}
}
In case of questions or comments, please reach out via LinkedIn: Patryk Wielopolski