Automatic Feature Engineering & Selection for Kaggle Playground Competitions
AutoFE-PG is a library that automatically generates, evaluates, and selects engineered features to boost your tabular ML models, with zero target leakage.
Version 0.3.0 is a complete refactor focused on general-purpose strategies that work across any tabular competition, featuring advanced binning, digit-based features, cyclical encoding, Weight of Evidence, and Genetic Programming interactions.
| Feature | Description |
|---|---|
| Genetic Programming | Generates complex non-linear interactions using gplearn |
| Digit-Based Logic | Extracts integer and decimal positions; creates digit-cross-category interactions |
| Target Representation | OOF Target Aggregation (mean, std, skew), WoE, and Entropy features |
| Cyclical Encoding | Sine/Cosine transformations for periodic numerical features |
| Advanced Binning | Both Quantile (qcut) and Equal-width (cut) discretization |
| External Signal Injection | Inject historical Priors, WoE, and Entropy from original datasets |
| Zero Target Leakage | All target-dependent features use strict out-of-fold (OOF) strategies |
| Greedy Selection | Forward selection keeps only features that improve CV score |
| GPU Acceleration | Built-in support for XGBoost GPU engines |
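As an illustration of the Cyclical Encoding row above: mapping a periodic value onto sine/cosine coordinates places the end of a period next to its start (hour 23 lands beside hour 0). The helper below is a minimal sketch, not part of the AutoFE-PG API; the function name is hypothetical.

```python
import numpy as np
import pandas as pd

def encode_cyclical(s: pd.Series, period: float) -> pd.DataFrame:
    """Project a periodic feature onto the unit circle so that
    values near the period boundary end up close together
    (e.g. hour 23 becomes adjacent to hour 0)."""
    radians = 2 * np.pi * s / period
    return pd.DataFrame({
        f"{s.name}_sin": np.sin(radians),
        f"{s.name}_cos": np.cos(radians),
    })

hours = pd.Series([0, 6, 12, 23], name="hour")
enc = encode_cyclical(hours, period=24)
```

Tree models can then split on either coordinate instead of treating 0 and 23 as maximally distant raw values.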
```bash
pip install autofepg

# Optional: for Genetic Programming features
pip install gplearn
```

```python
import pandas as pd
from autofepg import select_features

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

X_train = train.drop(columns=["id", "target"])
y_train = train["target"]
X_test = test.drop(columns=["id"])

result = select_features(
    X_train, y_train, X_test,
    task="classification",
    time_budget=3600  # 1 hour limit
)

X_train_new = result["X_train"]
X_test_new = result["X_test"]

print(f"Features added: {len(result['selected_features'])}")
print(f"CV improvement: {result['base_score']:.6f} -> {result['best_score']:.6f}")
```

If you have access to a "real world" dataset (common in Kaggle Playground synthetic competitions), you can inject its signals without leakage:
```python
result = select_features(
    X_train, y_train, X_test,
    original_df=original_df,
    original_target=original_target,
    task="classification"
)
```

- Digit Extraction: Integer positions (units, tens, etc.) and decimal positions.
- Digit Interactions: Column-wise and cross-column interactions between digits.
- Binning: Discretize continuous variables via Quantile (qcut) or Equal-width (cut) bins.
- Rounding: Rounding to various decimal places or magnitudes to find structural splits.
- Cyclical Encoding: Sin/Cos transforms for periodic data.
- Target Encoding (OOF): Out-of-fold mean target per category.
- Weight of Evidence (WoE): OOF WoE scores for binary classification.
- Entropy: OOF target entropy per value group.
- OOF Aggregation: Mean, Std, and Skew of the target grouped by feature values.
- Genetic Programming: Evolves mathematical expressions from the base features (requires `gplearn`).
- Pair Interactions: Label encoding of categorical feature bigrams.
- Numerical Products: NaN-safe products of pairs of numerical features.
- Digit Γ Category: Target encoding on the interaction of a column's digit and another category.
- Bayesian Priors: Historical `P(target|value)` from the original dataset.
- External WoE: WoE scores pre-computed from the original dataset.
- External Entropy: Group purity/impurity derived from the original dataset.
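To make the zero-leakage idea behind the OOF features concrete, here is a minimal sketch of out-of-fold target-mean encoding: each row's encoding is computed only from folds that do not contain that row. The function name and fold setup are illustrative, not AutoFE-PG's internals.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_mean(cat: pd.Series, y: pd.Series, n_folds: int = 5) -> pd.Series:
    """Out-of-fold mean target per category: a row's feature value is
    derived from the *other* folds, so its own target never leaks in."""
    out = pd.Series(np.nan, index=cat.index, dtype=float)
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    for tr_idx, val_idx in kf.split(cat):
        # Category means computed on the training folds only
        fold_means = y.iloc[tr_idx].groupby(cat.iloc[tr_idx]).mean()
        out.iloc[val_idx] = cat.iloc[val_idx].map(fold_means).to_numpy()
    # Categories unseen in a training fold fall back to the global mean
    return out.fillna(y.mean())

cat = pd.Series(["a", "a", "b", "b", "a", "b"] * 10)
y = pd.Series([1, 0, 1, 1, 0, 1] * 10)
enc = oof_target_mean(cat, y, n_folds=3)
```

The same pattern extends to the std, skew, WoE, and entropy aggregations listed above: fit the statistic on the training folds, map it onto the held-out fold.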
| Parameter | Default | Description |
|---|---|---|
| `task` | `"auto"` | `"classification"`, `"regression"`, or `"auto"` |
| `n_folds` | `5` | Number of CV folds for evaluation |
| `time_budget` | `None` | Max wall-clock seconds for the search |
| `improvement_threshold` | `1e-7` | Min score delta to keep a feature |
| `sample` | `None` | Rows to sample for evaluation (speeds up search) |
| `gp_generations` | `5` | Evolution steps for Genetic Programming |
| `gp_n_components` | `5` | Max GP features to potentially keep |
| `original_df` | `None` | External dataset for Priors/WoE/Entropy |
Version 0.3.0:
- Refactoring: Removed competition-specific features (Domain Alignment, Dataset Frequency, Rarity).
- New Features: Cyclical Features, OOF/External WoE, OOF/External Entropy, Genetic Programming (gplearn).
- Enhanced Digits: Added Decimal Digit extraction.
- Enhanced Aggregation: Added Skewness support to OOF Target Aggregation.
- Simplified API: Decoupled from specific dataset patterns; focused on universal engineering.

Earlier release:
- Added original dataset support (Domain Alignment, Bayesian Priors).
- Introduced Cross-Dataset Frequency and Rarity features.
MIT License. Copyright (c) 2026 Thomas Tschinkel.