thomastschinkel/autofepg

# 🧪 AutoFE-PG

**Automatic Feature Engineering & Selection for Kaggle Playground Competitions**

Python 3.8+ | License: MIT | Version 0.3.0

AutoFE-PG is a powerful library that automatically generates, evaluates, and selects engineered features to boost your tabular ML models, with zero target leakage.

Version 0.3.0 is a complete refactor focused on general-purpose strategies that work across any tabular competition, featuring advanced binning, digit-based features, cyclical encoding, Weight of Evidence, and Genetic Programming interactions.


## ✨ Key Features

| Feature | Description |
| --- | --- |
| Genetic Programming | Generates complex non-linear interactions using `gplearn` |
| Digit-Based Logic | Extracts integer and decimal digit positions; creates digit-cross-category interactions |
| Target Representation | OOF target aggregation (mean, std, skew), WoE, and entropy features |
| Cyclical Encoding | Sine/cosine transformations for periodic numerical features |
| Advanced Binning | Both quantile (`qcut`) and equal-width (`cut`) discretization |
| External Signal Injection | Injects historical priors, WoE, and entropy from original datasets |
| Zero Target Leakage | All target-dependent features use strict out-of-fold (OOF) strategies |
| Greedy Selection | Forward selection keeps only features that improve the CV score |
| GPU Acceleration | Built-in support for XGBoost GPU engines |
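The greedy selection step can be sketched as follows. This is a minimal illustration of forward selection with an improvement threshold; `greedy_forward_select`, `evaluate`, and the candidate list are hypothetical placeholders, not the library's actual API:

```python
def greedy_forward_select(candidates, evaluate, base_score, threshold=1e-7):
    """Keep a candidate feature only if adding it improves the CV score
    by more than `threshold` over the best score seen so far."""
    selected, best = [], base_score
    for feat in candidates:
        score = evaluate(selected + [feat])
        if score > best + threshold:
            selected.append(feat)
            best = score
    return selected, best

# Toy example: each feature contributes a fixed score delta.
gains = {"f1": 0.010, "f2": -0.020, "f3": 0.005}
evaluate = lambda feats: 0.500 + sum(gains[f] for f in feats)

selected, best = greedy_forward_select(list(gains), evaluate, base_score=0.500)
print(selected, round(best, 3))  # ['f1', 'f3'] 0.515
```

Note that because selection is greedy, a feature rejected early is never reconsidered, which keeps the search within the time budget.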

## 🚀 Quick Start

### Installation

```bash
pip install autofepg
# Optional: for Genetic Programming features
pip install gplearn
```

### Basic Usage

```python
import pandas as pd
from autofepg import select_features

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

X_train = train.drop(columns=["id", "target"])
y_train = train["target"]
X_test = test.drop(columns=["id"])

result = select_features(
    X_train, y_train, X_test,
    task="classification",
    time_budget=3600  # 1 hour limit
)

X_train_new = result["X_train"]
X_test_new = result["X_test"]

print(f"Features added: {len(result['selected_features'])}")
print(f"CV improvement: {result['base_score']:.6f} -> {result['best_score']:.6f}")
```

### Injecting Historical Signals (Original Data)

If you have access to a "real world" dataset (common in Kaggle Playground synthetic competitions), you can inject its signals without leakage:

```python
result = select_features(
    X_train, y_train, X_test,
    original_df=original_df,
    original_target=original_target,
    task="classification"
)
```

## 📖 Feature Strategies (v0.3.0)

### 1. Digits & Discretization

- **Digit Extraction**: Integer positions (units, tens, etc.) and decimal positions.
- **Digit Interactions**: Column-wise and cross-column interactions between digits.
- **Binning**: Discretizes continuous variables via quantile (`qcut`) or equal-width (`cut`) bins.
- **Rounding**: Rounds to various decimal places or magnitudes to find structural splits.
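The digit and binning strategies above can be sketched with plain pandas (column names are illustrative; the library's exact implementation may differ):

```python
import pandas as pd

df = pd.DataFrame({"price": [123.45, 67.89, 4.20, 987.65]})

# Integer digit positions: units and tens digits of the integer part.
int_part = df["price"].astype(int)
df["price_digit_units"] = int_part % 10
df["price_digit_tens"] = (int_part // 10) % 10

# First decimal digit position.
df["price_digit_dec1"] = (df["price"] * 10).astype(int) % 10

# Quantile binning (equal-population) vs. equal-width binning.
df["price_qbin"] = pd.qcut(df["price"], q=2, labels=False)
df["price_bin"] = pd.cut(df["price"], bins=2, labels=False)
```

Quantile bins put the same number of rows in each bin, while equal-width bins split the value range evenly; skewed columns often behave very differently under the two schemes.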

### 2. Specialized Encoding

- **Cyclical Encoding**: Sin/cos transforms for periodic data.
- **Target Encoding (OOF)**: Out-of-fold mean target per category.
- **Weight of Evidence (WoE)**: OOF WoE scores for binary classification.
- **Entropy**: OOF target entropy per value group.
- **OOF Aggregation**: Mean, std, and skew of the target grouped by feature values.
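A minimal sketch of two of these encoders, assuming a standard `KFold` OOF scheme and a 24-hour period for the cyclical example (`oof_target_mean` is a hypothetical helper, not the library's API):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_mean(x: pd.Series, y: pd.Series, n_folds: int = 5) -> pd.Series:
    """Out-of-fold target mean per category: each row is encoded using
    statistics from the *other* folds only, so a row's own target never
    leaks into its encoding."""
    out = pd.Series(np.nan, index=x.index, dtype=float)
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    for tr_idx, val_idx in kf.split(x):
        fold_means = y.iloc[tr_idx].groupby(x.iloc[tr_idx]).mean()
        out.iloc[val_idx] = x.iloc[val_idx].map(fold_means).to_numpy()
    return out.fillna(y.mean())  # unseen categories fall back to the global mean

# Cyclical encoding: map a periodic value (e.g. hour of day) onto a circle,
# so hour 23 ends up close to hour 0 instead of maximally far away.
hours = pd.Series([0, 6, 12, 18])
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)
```

The same OOF loop generalizes to std, skew, WoE, or entropy by swapping the `.mean()` aggregation.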

### 3. Non-Linear Interactions

- **Genetic Programming**: Evolves mathematical expressions from the base features (requires `gplearn`).
- **Pair Interactions**: Categorical label encoding of bigrams.
- **Numerical Products**: NaN-safe products of numerical feature pairs.
- **Digit × Category**: Target encoding on the interaction of one column's digit and another column's category.
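The pair-interaction and product strategies can be sketched with plain pandas; the bigram naming and the NaN-product convention below are assumptions for illustration, not the library's exact behavior:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "cat_a": ["x", "y", "x", "y"],
    "cat_b": ["p", "p", "q", "q"],
    "num_a": [1.0, 2.0, np.nan, 4.0],
    "num_b": [10.0, np.nan, 30.0, 40.0],
})

# Pair interaction: label-encode the bigram (concatenation) of two categoricals.
bigram = df["cat_a"] + "_" + df["cat_b"]
df["cat_a_cat_b"] = bigram.astype("category").cat.codes

# NaN-safe product: treat a missing factor as the neutral element 1.0,
# and stay NaN only where both inputs are missing (one plausible convention).
prod = df["num_a"].fillna(1.0) * df["num_b"].fillna(1.0)
prod[df["num_a"].isna() & df["num_b"].isna()] = np.nan
df["num_a_x_num_b"] = prod
```

A plain product of the two columns would propagate NaN whenever either factor is missing; the fill-then-mask pattern keeps partial information instead.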

### 4. External Data Signals

- **Bayesian Priors**: Historical P(target | value) from the original dataset.
- **External WoE**: WoE scores pre-computed from the original dataset.
- **External Entropy**: Group purity/impurity derived from the original dataset.
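A sketch of the historical-prior idea, assuming simple additive smoothing toward the global target rate (`alpha`, the smoothing form, and all column names are illustrative assumptions):

```python
import pandas as pd

# Hypothetical "original" (real-world) dataset and a synthetic competition frame.
original = pd.DataFrame({"city": ["A", "A", "B", "B", "B"], "target": [1, 0, 1, 1, 0]})
train = pd.DataFrame({"city": ["A", "B", "C"]})

# Historical prior P(target | value), smoothed toward the global rate so
# rare values do not receive extreme probabilities.
global_rate = original["target"].mean()
counts = original.groupby("city")["target"].agg(["sum", "count"])
alpha = 10.0  # smoothing strength (assumed hyperparameter)
prior = (counts["sum"] + alpha * global_rate) / (counts["count"] + alpha)

# Values unseen in the original data fall back to the global rate.
train["city_prior"] = train["city"].map(prior).fillna(global_rate)
```

Because the prior is computed entirely from the external dataset, it carries no information about the competition targets and cannot leak them.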

βš™οΈ Configuration

Parameter Default Description
task "auto" "classification", "regression", or "auto"
n_folds 5 Number of CV folds for evaluation
time_budget None Max wall-clock seconds for the search
improvement_threshold 1e-7 Min score delta to keep a feature
sample None Rows to sample for evaluation (speeds up search)
gp_generations 5 Evolution steps for Genetic Programming
gp_n_components 5 Max GP features to potentially keep
original_df None External dataset for Priors/WoE/Entropy
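Putting the parameters together, a fuller call might look like this (a configuration fragment; all values are illustrative, and `X_train`, `y_train`, `X_test` are assumed to be defined as in the Quick Start):

```python
result = select_features(
    X_train, y_train, X_test,
    task="classification",
    n_folds=5,
    time_budget=1800,            # 30-minute search budget
    improvement_threshold=1e-6,  # demand a larger CV gain per feature
    sample=100_000,              # evaluate on a 100k-row sample for speed
    gp_generations=5,
    gp_n_components=5,
    original_df=None,            # or an external "original" DataFrame
)
```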

πŸ“ Changelog

v0.3.0 (Current)

  • Refactoring: Removed competition-specific features (Domain Alignment, Dataset Frequency, Rarity).
  • New Features: Cyclical Features, OOF/External WoE, OOF/External Entropy, Genetic Programming (gplearn).
  • Enhanced Digits: Added Decimal Digit extraction.
  • Enhanced Aggregation: Added Skewness support to OOF Target Aggregation.
  • Simplified API: Decoupled from specific dataset patterns; focused on universal engineering.

### v0.2.0

- Added original dataset support (Domain Alignment, Bayesian Priors).
- Introduced Cross-Dataset Frequency and Rarity features.

## 📄 License

MIT License. Copyright (c) 2026 Thomas Tschinkel.