# pretab

pretab is a modular, extensible, and scikit-learn-compatible preprocessing library for tabular data. It supports all sklearn transformers out of the box and extends functionality with a rich set of custom encoders, splines, and neural basis expansions.
## Features

- 🔢 Numerical preprocessing:
  - Polynomial and spline expansions: B-splines, natural cubic splines, thin plate splines, tensor product splines, P-splines
  - Neural-inspired bases: RBF, ReLU, Sigmoid, Tanh
  - Custom binning: rule-based or tree-based
  - Piecewise Linear Encoding (PLE)
- 🌤 Categorical preprocessing:
  - Ordinal encodings
  - One-hot encodings
  - Language embeddings (pretrained vectorizers)
  - Custom encoders like `OneHotFromOrdinalTransformer`
- 🔧 Composable pipeline interface:
  - Fully compatible with `sklearn.pipeline.Pipeline` and `sklearn.compose.ColumnTransformer`
  - Accepts all sklearn-native transformers and parameters seamlessly
- 🧠 Smart preprocessing:
  - Automatically detects feature types (categorical vs. numerical)
  - Supports both `pandas.DataFrame` and `numpy.ndarray` inputs
- 🧪 Comprehensive test coverage
- 🤝 Community-driven and open to contributions
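As a taste of what the basis expansions above compute, here is a hedged NumPy sketch of an RBF expansion. It is illustrative only: `rbf_expand`, `n_basis`, and `gamma` are hypothetical names for this sketch, not pretab's actual API.

```python
import numpy as np

def rbf_expand(x, n_basis=5, gamma=1.0):
    """Expand a 1-D feature into Gaussian bumps around evenly spaced centers.

    Illustrative sketch of the idea behind an RBF basis expansion;
    not pretab's implementation.
    """
    # Place centers evenly across the observed range of x
    centers = np.linspace(x.min(), x.max(), n_basis)
    # One column per center: exp(-gamma * (x - c)^2), broadcast over centers
    return np.exp(-gamma * (x[:, None] - centers[None, :]) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=100)
X_rbf = rbf_expand(x, n_basis=5)
print(X_rbf.shape)  # (100, 5): one smooth bump feature per center
```

Each output column responds most strongly near its center, giving a downstream linear model a simple way to fit smooth nonlinear effects.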
## Installation

Install via pip:

```bash
pip install pretab
```

Or install in editable mode for development:

```bash
git clone https://github.com/OpenTabular/pretab.git
cd pretab
pip install -e .
```

## Quickstart

```python
import pandas as pd
import numpy as np
from pretab.preprocessor import Preprocessor

# Simulated tabular dataset
df = pd.DataFrame({
    "age": np.random.randint(18, 65, size=100),
    "income": np.random.normal(60000, 15000, size=100).astype(int),
    "job": np.random.choice(["nurse", "engineer", "scientist", "teacher", "artist", "manager"], size=100),
    "city": np.random.choice(["Berlin", "Munich", "Hamburg", "Cologne"], size=100),
    "experience": np.random.randint(0, 40, size=100),
})
y = np.random.randn(100, 1)

# Optional feature-specific preprocessing config
config = {
    "age": "ple",
    "income": "rbf",
    "experience": "quantile",
    "job": "one-hot",
    "city": "none",
}

# Initialize the Preprocessor
preprocessor = Preprocessor(
    feature_preprocessing=config,
    task="regression",
)

# Fit and transform the data into a dictionary of feature arrays
X_dict = preprocessor.fit_transform(df, y)

# Optionally get a stacked array instead of a dictionary
X_array = preprocessor.transform(df, return_array=True)

# Get feature metadata
preprocessor.get_feature_info(verbose=True)
```

## Available transformers

pretab includes both sklearn-native and custom-built transformers:
- Splines: `CubicSplineTransformer`, `NaturalCubicSplineTransformer`, `PSplineTransformer`, `TensorProductSplineTransformer`, `ThinPlateSplineTransformer`
- Neural-inspired bases: `RBFExpansionTransformer`, `ReLUExpansionTransformer`, `SigmoidExpansionTransformer`, `TanhExpansionTransformer`
- Encoders and binning: `PLETransformer`, `CustomBinTransformer`, `OneHotFromOrdinalTransformer`, `ContinuousOrdinalTransformer`, `LanguageEmbeddingTransformer`
- Utilities: `NoTransformer`, `ToFloatTransformer`

Plus: any `sklearn` transformer can be passed directly, with full support for its hyperparameters.
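For intuition about what piecewise linear encoding produces, here is a hedged plain-NumPy sketch. It is a deliberate simplification with equal-width bins; `ple_encode` is a hypothetical helper, not how `PLETransformer` chooses its bins.

```python
import numpy as np

def ple_encode(x, edges):
    """Piecewise linear encoding of a 1-D feature (illustrative sketch).

    Each output column corresponds to one bin: 1 if x lies above the bin,
    0 if below, and a linear fraction while inside it.
    """
    lo, hi = edges[:-1], edges[1:]
    # Position of each x within each bin, then clip to [0, 1]
    t = (x[:, None] - lo) / (hi - lo)
    return np.clip(t, 0.0, 1.0)

x = np.array([0.0, 2.5, 10.0])
edges = np.linspace(0.0, 10.0, 6)  # 5 equal-width bins on [0, 10]
enc = ple_encode(x, edges)
print(enc[1])  # x = 2.5 fills bin 0 completely and 25% of bin 1
```

The encoding stays monotone in `x`, which is what makes it attractive as an input representation for neural models on tabular data.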
Using the transformers follows the standard scikit-learn fit/transform API, e.g. with `PLETransformer`:

```python
import numpy as np
from pretab.transformers import PLETransformer

x = np.random.randn(100, 1)
y = np.random.randn(100, 1)
x_ple = PLETransformer(n_bins=15, task="regression").fit_transform(x, y)
assert x_ple.shape[1] == 15
```

For splines, the penalty matrices can be extracted via `.get_penalty_matrix()`:
```python
import numpy as np
from pretab.transformers import ThinPlateSplineTransformer

x = np.random.randn(100, 1)
tp = ThinPlateSplineTransformer(n_basis=15)
x_tp = tp.fit_transform(x)
assert x_tp.shape[1] == 15
penalty = tp.get_penalty_matrix()
```

## Testing

Run the test suite with:

```bash
pytest --maxfail=2 --disable-warnings -v
```

## Contributing

pretab is community-driven! Whether you're fixing bugs, adding new encoders, or improving the docs, contributions are welcome.
```bash
git clone https://github.com/OpenTabular/pretab.git
cd pretab
pip install -e ".[dev]"
```

Then create a pull request 🚀
## License

MIT License. See LICENSE for details.
## Acknowledgments

pretab builds on the strengths of: