Production-grade feature engineering pipeline for financial time series. Transforms raw OHLCV data into 77 stationary, ML-ready features across 6 signal families, with full data quality validation, anti-leakage guarantees, and automated feature selection.
- Overview
- Architecture
- Stack
- Results
- Feature Catalog
- Key Design Decisions
- Installation
- Usage
- Testing
- Project Structure
- Outputs
This project solves one of the most critical and underestimated problems in quantitative machine learning: turning raw market data into predictive features that are stable, stationary, and leak-free.
Most practitioners make the same three mistakes:
| Common Mistake | This Project's Solution |
|---|---|
| Raw prices as features | All price-based features normalized as ratios or distances |
| Random train_test_split | TimeSeriesSplit enforced in all model-based evaluations |
| Removing high-VIF features | VIF used as diagnostic, not auto-pruner; interaction features added |
The pipeline processes 4 assets (BTC-USD, SPY, AAPL, GC=F) across 6 years of daily bars, producing 77 stationary features per asset, saved as compressed Parquet files for downstream ML training.
┌─────────────────────────────────────────────────────────────────────────┐
│ FEATURE ENGINEERING PIPELINE │
│ │
│ ┌──────────────┐ ┌──────────────────┐ ┌────────────────────┐ │
│ │ DATA LAYER │ │ FEATURE LAYER │ │ SELECTION LAYER │ │
│ │ │ │ │ │ │ │
│ │ yfinance │───▶│ LagTransformer │ │ Pearson/Spearman │ │
│ │ 4 symbols │ │ TechIndicators │───▶│ VIF Diagnostics │ │
│ │ 2019-2024 │ │ RollingWindows │ │ RF Importance │ │
│ │ │ │ Volatility │ │ Mutual Info │ │
│ │ Validator │ │ Temporal │ │ TimeSeriesSplit │ │
│ │ (OHLC/gaps/ │ │ Interactions │ │ │ │
│ │ flatlines/ │ │ │ │ Top-20 selector │ │
│ │ zero-vol) │ │ 77 features │ │ (avg_rank strat.) │ │
│ └──────────────┘ └──────────────────┘ └────────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ data/raw/*.parquet data/features/ outputs/figures/ │
│ *_features.parquet outputs/reports/ │
└─────────────────────────────────────────────────────────────────────────┘
┌─────────────┐ ┌─────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────┐
│ Lag │ │ Technical │ │ Rolling │ │ Volatility │ │ Temporal │
│ (no deps) │ │ (no deps) │ │ (no deps) │ │ (no deps) │ │ (no deps)│
└──────┬──────┘ └──────┬──────┘ └──────┬───────┘ └──────┬───────┘ └────┬─────┘
│ │ │ │ │
└────────────────┴────────────────┴──────────────────┴───────────────┘
│
▼
┌──────────────────┐
│ Interaction │
│ (depends on all) │
└──────────────────┘
| Library | Version | Role |
|---|---|---|
| pandas | ≥ 2.1 | Core data manipulation |
| numpy | ≥ 1.26 | Numerical computation |
| yfinance | ≥ 0.2.40 | OHLCV data source |
| pandas-ta | ≥ 0.3.14b0 | Technical indicators (pure Python, no C compile) |
| scikit-learn | ≥ 1.4 | Pipeline interface + feature selection |
| scipy | ≥ 1.12 | Statistical tests, distributions |
| statsmodels | ≥ 0.14 | VIF computation |
| pyarrow | ≥ 15.0 | Parquet I/O (columnar, compressed) |
| loguru | ≥ 0.7.2 | Structured logging |
| pydantic | ≥ 2.6 | Configuration validation |
| matplotlib / seaborn | ≥ 3.8 / 0.13 | Publication-ready visualization |
| pytest | ≥ 8.0 | 42-test suite |
Why pandas-ta instead of TA-Lib? TA-Lib requires compiling C extensions, a step that frequently fails on Windows and across Python versions. pandas-ta is pure Python, installs with a single pip install, and covers the same indicator surface area.
PIPELINE RUN SUMMARY
══════════════════════════════════════════════════════════════
Symbol Raw rows Clean rows Features Date range
──────────────────────────────────────────────────────────────
BTC-USD 2,191 2,153 77 2019-02-03 → 2024-12-25
SPY 1,509 1,471 77 2019-02-20 → 2024-12-20
AAPL 1,509 1,471 77 2019-02-20 → 2024-12-20
GC=F 1,509 1,471 77 2019-02-20 → 2024-12-20
══════════════════════════════════════════════════════════════
TOTAL 6,718 6,566 77 6 years of daily bars
╔══ BTC-USD ══════════════════════════════════════
║ Rows: 2,191 → 2,153 (after cleaning)
║ Nulls remaining: 0
║ Time gaps: 0
║ Flatlines: 0 periods
║ Zero volume: 0
║ OHLC violations: 0
║ Outlier rows: 1 ← COVID flash crash, kept as-is
╚══ Status: ✓ CLEAN
╔══ SPY / AAPL ═══════════════════════════════════
║ Rows: 1,509 → 1,471 (after cleaning)
║ Time gaps: 55 ← US market holidays (expected)
║ All other checks: 0
╚══ Status: ✓ CLEAN
╔══ GC=F (Gold Futures) ══════════════════════════
║ Rows: 1,509 → 1,471 (after cleaning)
║ Zero volume: 12 ← Exchange downtime, forward-filled
║ Time gaps: 55 ← Market holidays (expected)
╚══ Status: ✓ CLEAN
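The checks reported above (nulls, OHLC violations, zero volume, flatlines) could be implemented roughly as follows. This is a minimal sketch and not the project's validator.py: the function name `validate_ohlcv`, the `flatline_bars` threshold, and the sample bars are assumptions made for illustration.

```python
import pandas as pd


def validate_ohlcv(df: pd.DataFrame, flatline_bars: int = 5) -> dict:
    """Count basic data-quality problems in an OHLCV frame (illustrative)."""
    # OHLC consistency: High must be the bar maximum, Low the bar minimum.
    ohlc_bad = (
        (df["High"] < df[["Open", "Close", "Low"]].max(axis=1))
        | (df["Low"] > df[["Open", "Close", "High"]].min(axis=1))
    )
    # Flatline: Close unchanged across `flatline_bars` consecutive bars,
    # i.e. `flatline_bars - 1` consecutive zero differences.
    unchanged = df["Close"].diff().eq(0).astype(int)
    flat = unchanged.rolling(flatline_bars - 1).sum().eq(flatline_bars - 1)
    return {
        "nulls": int(df.isna().sum().sum()),
        "ohlc_violations": int(ohlc_bad.sum()),
        "zero_volume": int((df["Volume"] == 0).sum()),
        "flatline_bars": int(flat.sum()),
    }


# Hypothetical bars: the third bar has High below Open, the second has zero volume.
bars = pd.DataFrame(
    {
        "Open": [10.0, 10.5, 11.0, 11.0],
        "High": [10.8, 11.2, 10.9, 11.4],
        "Low": [9.9, 10.4, 10.7, 10.9],
        "Close": [10.5, 11.0, 10.8, 11.2],
        "Volume": [1200, 0, 900, 1500],
    },
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)
report = validate_ohlcv(bars)
print(report)
```

Counting problems separately from repairing them keeps the report honest: a forward-fill or a dropped row then shows up as an explicit difference between raw and clean row counts.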
Family Count Description
────────────────────────────────────────────────────────────
Lag 10 Return lags + volume ratios + momentum
Technical 20 RSI, MACD, Bollinger, ATR, CCI, Stoch, OBV, ADX
Rolling 23 dist_to_sma, z-score, range position (all windows)
Volatility 12 Close-to-Close + Parkinson + Garman-Klass
Temporal 9 Sin/cos cyclic + calendar binary flags
Interaction 6 Cross-domain composite signals
────────────────────────────────────────────────────────────
TOTAL 77 + 4 target columns + 5 OHLCV = 87 total columns
data/features/
├── BTC-USD_features.parquet 1,534 KB (snappy compressed)
├── SPY_features.parquet 1,065 KB
├── AAPL_features.parquet 1,066 KB
└── GC_F_features.parquet 1,042 KB
pytest tests/ -v
═══════════════════════════════════════════════════════════════
42 tests | 0 failed | 0 errors | 3.57s
═══════════════════════════════════════════════════════════════
tests/test_features.py
TestLagFeatureTransformer 7 / 7 PASSED
TestRollingWindowTransformer 6 / 6 PASSED
TestTechnicalIndicatorTransformer 6 / 6 PASSED
TestTemporalFeatureTransformer 4 / 4 PASSED
TestVolatilityTransformer 6 / 6 PASSED
TestInteractionFeatureTransformer 3 / 3 PASSED
tests/test_pipeline.py
TestTimeSeriesValidator 5 / 5 PASSED
TestPipelineFeatures 4 / 4 PASSED
TestNoDataLeakage 1 / 1 PASSED ← Critical
| Feature | Formula | Stationarity |
|---|---|---|
| return_lag_Nd | Close_t / Close_{t-N} - 1 | ✓ % change |
| volume_ratio_lag_Nd | Vol_{t-N} / SMA20(Vol)_{t-N} - 1 | ✓ relative ratio |
| momentum_1_21 | return_lag_1d - return_lag_21d | ✓ difference of returns |
| momentum_regime | dist_to_sma_5 - dist_to_sma_50 | ✓ difference of ratios |
N ∈ {1, 2, 3, 5, 10, 21}
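The two lag formulas in the table translate directly into pandas `shift` calls. The sketch below assumes a frame with `Close` and `Volume` columns; `add_lag_features` and the synthetic data are hypothetical and only illustrate the formulas, not the project's lag.py.

```python
import numpy as np
import pandas as pd


def add_lag_features(df: pd.DataFrame, periods=(1, 2, 3), vol_window: int = 20) -> pd.DataFrame:
    """Build return lags and lagged volume ratios from Close and Volume."""
    out = pd.DataFrame(index=df.index)
    # Relative volume: each bar's volume against its own rolling mean.
    vol_ratio = df["Volume"] / df["Volume"].rolling(vol_window).mean() - 1
    for n in periods:
        # Close_t / Close_{t-N} - 1 uses only the current and past bars.
        out[f"return_lag_{n}d"] = df["Close"] / df["Close"].shift(n) - 1
        # Shifting by N reads the ratio as it stood N bars ago.
        out[f"volume_ratio_lag_{n}d"] = vol_ratio.shift(n)
    return out


# Hypothetical data: a steady 1% daily rise with constant volume.
idx = pd.date_range("2024-01-01", periods=30, freq="D")
demo = pd.DataFrame({"Close": 100 * 1.01 ** np.arange(30), "Volume": 1000.0}, index=idx)
lags = add_lag_features(demo, periods=(1, 2), vol_window=5)
print(lags.tail(3))
```

Every `shift` here uses a positive offset, so each row depends only on bars at or before its own timestamp.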
| Feature | Range | Notes |
|---|---|---|
| rsi_14, rsi_28 | [0, 100] | Bounded by construction |
| rsi_{N}_centered | [-50, +50] | RSI shifted to center at 0 |
| macd_normalized | ~(-0.05, 0.05) | MACD line / Close (scale-free) |
| macd_signal_normalized | ~(-0.05, 0.05) | Signal line / Close |
| macd_hist_normalized | ~(-0.02, 0.02) | Histogram / Close |
| macd_hist_sign | {-1, 0, +1} | Direction of momentum |
| bb_pct_b | [~0, ~1] | Bollinger % position |
| bb_width_normalized | (0, ∞) | (Upper−Lower) / Mid |
| atr_pct | (0, 0.10] | ATR / Close (volatility %) |
| cci_20 | ~[-200, +200] | Commodity Channel Index |
| stoch_k, stoch_d | [0, 100] | Stochastic oscillators |
| stoch_kd_diff | [-100, +100] | K − D crossover signal |
| obv_normalized | ~(-3, +3) | OBV z-score (20-bar window) |
| williams_r | [-100, 0] | Williams %R |
| adx_14 | [0, 100] | Trend strength |
| adx_directional_ratio | [-1, +1] | (DM+ − DM−) / (DM+ + DM−) |
| rsi_trend_divergence | unbounded | (RSI−50) × dist_to_sma_20 |
| bb_rsi_composite | [~0, ~1] | bb_pct_b × (RSI / 100) |
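To see why dividing by Close makes the MACD rows scale-free, the sketch below computes them with plain pandas exponential moving averages and compares two series that differ only in price level. The 12/26/9 spans are the conventional MACD defaults and are an assumption here, as is the helper name `macd_features`; the project itself uses pandas-ta.

```python
import numpy as np
import pandas as pd


def macd_features(close: pd.Series, fast: int = 12, slow: int = 26, signal: int = 9) -> pd.DataFrame:
    """MACD line, signal and histogram, each divided by Close."""
    ema_fast = close.ewm(span=fast, adjust=False).mean()
    ema_slow = close.ewm(span=slow, adjust=False).mean()
    macd = ema_fast - ema_slow
    macd_signal = macd.ewm(span=signal, adjust=False).mean()
    hist = macd - macd_signal
    return pd.DataFrame(
        {
            "macd_normalized": macd / close,
            "macd_signal_normalized": macd_signal / close,
            "macd_hist_normalized": hist / close,
            "macd_hist_sign": np.sign(hist),
        }
    )


# Same 1% daily growth path at two price levels (hypothetical data).
cheap = pd.Series(10.0 * 1.01 ** np.arange(100))
expensive = cheap * 1000
macd_cheap = macd_features(cheap)
macd_expensive = macd_features(expensive)
normalized_cols = ["macd_normalized", "macd_signal_normalized", "macd_hist_normalized"]
print(macd_cheap[normalized_cols].tail(1))
print(macd_expensive[normalized_cols].tail(1))
```

The raw MACD line of the expensive series is 1000 times larger, but the normalized columns agree, which is what lets one model consume both assets.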
Per window W ∈ {5, 10, 20, 50}:
| Feature | Formula | Why not raw? |
|---|---|---|
| dist_to_sma_W | (Close / SMA_W) − 1 | Raw SMA grows with price (non-stationary) |
| rolling_std_returns_W | std(log_ret, W) | Returns already stationary |
| zscore_W | (Close − SMA) / std(Close, W) | Standardized price position |
| dist_to_high_W | (Close / max(Close, W)) − 1 | Always ≤ 0 |
| dist_to_low_W | (Close / min(Close, W)) − 1 | Always ≥ 0 |
Plus cross-window:
- rolling_skew_20, rolling_kurt_20: return distribution shape
- trend_acceleration: dist_to_sma_5 − dist_to_sma_20
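The per-window formulas above can be sketched with pandas rolling windows. The helper `rolling_features` and the random-walk prices are assumptions for illustration; the sign checks at the end correspond to the "Always ≤ 0" and "Always ≥ 0" notes in the table.

```python
import numpy as np
import pandas as pd


def rolling_features(close: pd.Series, window: int) -> pd.DataFrame:
    """Ratio-style rolling features for one window length."""
    sma = close.rolling(window).mean()
    std = close.rolling(window).std()
    log_ret = np.log(close / close.shift(1))
    return pd.DataFrame(
        {
            f"dist_to_sma_{window}": close / sma - 1,
            f"rolling_std_returns_{window}": log_ret.rolling(window).std(),
            f"zscore_{window}": (close - sma) / std,
            # The window includes the current bar, so Close never exceeds the max
            # and never falls below the min.
            f"dist_to_high_{window}": close / close.rolling(window).max() - 1,
            f"dist_to_low_{window}": close / close.rolling(window).min() - 1,
        }
    )


# Hypothetical random-walk prices.
rng = np.random.default_rng(seed=0)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 250))))
feats = pd.concat([rolling_features(close, w) for w in (5, 20)], axis=1)
feats["trend_acceleration"] = feats["dist_to_sma_5"] - feats["dist_to_sma_20"]
print(feats.dropna().describe().loc[["min", "max"]])
```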
Three estimators at windows W ∈ {5, 10, 20}, all annualized:
| Estimator | Formula | Efficiency vs C-to-C |
|---|---|---|
| hist_vol_Wd | std(log_ret, W) × √252 | 1× (baseline) |
| parkinson_vol_Wd | √( mean_W[(ln(H/L))²] / (4 ln 2) ) × √252 | ~5× more efficient |
| garman_klass_vol_Wd | √( mean_W[0.5 (ln(H/L))² − (2 ln 2 − 1)(ln(C/O))²] ) × √252 | ~8× more efficient |
Plus:
- high_low_pct: (High − Low) / Close
- vol_of_vol: std of hist_vol_20d over 20 bars
- vol_regime: hist_vol_20d / SMA_90(hist_vol_20d)
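The three estimators can be written side by side so that their inputs are easy to compare. This is a sketch under the assumption that each per-bar term is averaged over the rolling window before taking the square root; `volatility_features` and the synthetic OHLC bars are hypothetical.

```python
import numpy as np
import pandas as pd

ANNUALIZE = np.sqrt(252)  # daily to annual, as in the table above


def volatility_features(df: pd.DataFrame, window: int) -> pd.DataFrame:
    """Close-to-close, Parkinson and Garman-Klass volatility for one window."""
    log_ret = np.log(df["Close"] / df["Close"].shift(1))
    log_hl = np.log(df["High"] / df["Low"])
    log_co = np.log(df["Close"] / df["Open"])
    parkinson_var = (log_hl**2).rolling(window).mean() / (4 * np.log(2))
    gk_var = (0.5 * log_hl**2 - (2 * np.log(2) - 1) * log_co**2).rolling(window).mean()
    return pd.DataFrame(
        {
            f"hist_vol_{window}d": log_ret.rolling(window).std() * ANNUALIZE,
            f"parkinson_vol_{window}d": np.sqrt(parkinson_var) * ANNUALIZE,
            # clip guards against tiny negative values from floating-point error.
            f"garman_klass_vol_{window}d": np.sqrt(gk_var.clip(lower=0)) * ANNUALIZE,
        }
    )


# Hypothetical bars: Open is the previous Close, High/Low bracket Open and Close.
rng = np.random.default_rng(seed=1)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 120))))
open_ = close.shift(1).fillna(close.iloc[0])
wiggle = np.abs(rng.normal(0, 0.005, 120))
bars = pd.DataFrame(
    {
        "Open": open_,
        "High": np.maximum(open_, close) * (1 + wiggle),
        "Low": np.minimum(open_, close) * (1 - wiggle),
        "Close": close,
    }
)
vols = volatility_features(bars, window=20)
print(vols.dropna().tail(3))
```

Because Open and Close always lie inside the High-Low range, the per-bar Garman-Klass term is non-negative for consistent bars, which is one reason the OHLC validation step matters upstream.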
| Feature | Encoding | Period | Why not integer? |
|---|---|---|---|
| dow_sin, dow_cos | sin/cos | 5 (biz days) | Mon(0)↔Fri(4) must be neighbors |
| month_sin, month_cos | sin/cos | 12 | Dec↔Jan must be neighbors |
| quarter_sin, quarter_cos | sin/cos | 4 | Q4↔Q1 must be neighbors |
| is_month_start | binary | n/a | Turn-of-month liquidity effect |
| is_month_end | binary | n/a | Window-dressing effect |
| is_quarter_end | binary | n/a | Institutional rebalancing |
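The neighbor property in the last column can be checked numerically. The sketch below uses only the standard library; `cyclic_encode` is a hypothetical helper name, and the period of 5 follows the business-day convention in the table.

```python
import math


def cyclic_encode(value: int, period: int) -> tuple[float, float]:
    """Map a position in a cycle to a point on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


# Business-day week: Monday=0 ... Friday=4, period 5.
mon, tue, wed, fri = (cyclic_encode(d, 5) for d in (0, 1, 2, 4))

# Integer encoding: Monday-Friday looks four times further apart than Monday-Tuesday.
int_gap_mon_fri = abs(0 - 4)
int_gap_mon_tue = abs(0 - 1)

# Cyclic encoding: both pairs are one step apart on the circle.
cyc_gap_mon_fri = math.dist(mon, fri)
cyc_gap_mon_tue = math.dist(mon, tue)
cyc_gap_mon_wed = math.dist(mon, wed)

print(int_gap_mon_fri, int_gap_mon_tue)
print(round(cyc_gap_mon_fri, 4), round(cyc_gap_mon_tue, 4), round(cyc_gap_mon_wed, 4))
```

Both coordinates are needed: sine alone maps two different days to the same value, and the cosine component resolves that ambiguity.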
"Instead of removing correlated features, add interaction features. The divergence between two correlated signals is often the most predictive part."
| Feature | Formula | Economic Intuition |
|---|---|---|
| trend_over_vol | dist_to_sma_20 / hist_vol_20d | Strong trend in calm market = most reliable |
| rsi_trend_divergence | (RSI−50) × dist_to_sma_20 | Overbought + negative trend → reversal |
| momentum_regime | dist_to_sma_5 − dist_to_sma_50 | Bull/bear regime indicator |
| vol_adj_return_1d | return_lag_1d / hist_vol_20d | Bar-level Sharpe ratio |
| bb_rsi_composite | bb_pct_b × (RSI / 100) | Dual-confirmation extreme signal |
| trend_volume_confirm | dist_to_sma_20 × volume_ratio_lag_1d | Volume-confirmed trend |
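Since every interaction is simple arithmetic on columns produced by the other families, the whole table fits in a few lines. The sketch below assumes the RSI in the formulas is `rsi_14` and guards against a zero volatility denominator; both choices, and the name `add_interactions`, are assumptions rather than the project's confirmed behavior.

```python
import numpy as np
import pandas as pd


def add_interactions(features: pd.DataFrame) -> pd.DataFrame:
    """Append the six composite signals to a frame of base features."""
    out = features.copy()
    vol = out["hist_vol_20d"].replace(0, np.nan)  # avoid division by zero
    out["trend_over_vol"] = out["dist_to_sma_20"] / vol
    out["rsi_trend_divergence"] = (out["rsi_14"] - 50) * out["dist_to_sma_20"]
    out["momentum_regime"] = out["dist_to_sma_5"] - out["dist_to_sma_50"]
    out["vol_adj_return_1d"] = out["return_lag_1d"] / vol
    out["bb_rsi_composite"] = out["bb_pct_b"] * (out["rsi_14"] / 100)
    out["trend_volume_confirm"] = out["dist_to_sma_20"] * out["volume_ratio_lag_1d"]
    return out


# Two hypothetical bars: one overbought uptrend, one oversold downtrend.
base = pd.DataFrame(
    {
        "dist_to_sma_5": [0.010, -0.020],
        "dist_to_sma_20": [0.020, -0.010],
        "dist_to_sma_50": [0.050, 0.030],
        "hist_vol_20d": [0.20, 0.40],
        "rsi_14": [70.0, 30.0],
        "bb_pct_b": [0.9, 0.1],
        "return_lag_1d": [0.004, -0.008],
        "volume_ratio_lag_1d": [0.5, -0.2],
    }
)
combined = add_interactions(base)
print(combined.iloc[:, -6:])
```

This also makes the dependency in the architecture diagram concrete: the function fails with a `KeyError` unless the upstream families have already run.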
| Column | Type | Formula |
|---|---|---|
| forward_return_1d | Continuous | log(Close_{t+1} / Close_t) |
| forward_return_5d | Continuous | log(Close_{t+5} / Close_t) |
| direction_1d | Binary {0,1} | int(forward_return_1d > 0) |
| direction_5d | Binary {0,1} | int(forward_return_5d > 0) |
Data leakage prevention: targets are computed via shift(-n) and added after all features. They are explicitly excluded from feature_cols by _get_feature_cols(). The anti-leakage integration test validates this.
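A compact way to picture the target step: forward returns look ahead with a negative shift, and a separate helper keeps them out of the feature list. The sketch below is an illustration only. `add_targets` and `feature_columns` are hypothetical names (the latter stands in for the project's `_get_feature_cols()`, whose actual logic is not shown in this document), and leaving the direction label as NaN on the final bars is a choice made here, not a documented behavior.

```python
import numpy as np
import pandas as pd

OHLCV = ["Open", "High", "Low", "Close", "Volume"]
TARGET_PREFIXES = ("forward_return_", "direction_")


def add_targets(df: pd.DataFrame, periods=(1, 5)) -> pd.DataFrame:
    """Append forward log returns and direction labels (these use FUTURE bars)."""
    out = df.copy()
    for n in periods:
        fwd = np.log(out["Close"].shift(-n) / out["Close"])
        out[f"forward_return_{n}d"] = fwd
        # Keep NaN where the future bar does not exist instead of labelling it 0.
        out[f"direction_{n}d"] = (fwd > 0).astype(float).where(fwd.notna())
    return out


def feature_columns(df: pd.DataFrame) -> list[str]:
    """Everything that is neither a raw OHLCV column nor a target."""
    return [c for c in df.columns if c not in OHLCV and not c.startswith(TARGET_PREFIXES)]


# Hypothetical frame with one already-computed feature.
close = pd.Series(np.arange(100.0, 110.0))
frame = pd.DataFrame({"Close": close, "return_lag_1d": close / close.shift(1) - 1})
labelled = add_targets(frame)
X_cols = feature_columns(labelled)
print(X_cols)
```

Rows whose forward return is NaN (the last n bars) would normally be dropped before training, since their outcome is not yet known.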
# ❌ WRONG — non-stationary, causes distribution shift
rolling_mean_20 = Close.rolling(20).mean()
# ✓ CORRECT — stationary, comparable across all time
dist_to_sma_20 = (Close / rolling_mean_20) - 1

BTC traded at ~$10k in 2020 and ~$60k in 2024. A model trained on rolling means from 2020 would be useless when evaluated on 2024 data. All features in this pipeline are expressed as ratios, percentages, or z-scores.
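The distribution shift described above can be reproduced on synthetic data. The sketch assumes a price that grows about 0.2% per bar with small independent noise (hypothetical numbers, not market data) and compares how the raw moving average and the ratio feature behave across the two halves of the sample.

```python
import numpy as np
import pandas as pd

# Hypothetical trending price: exponential growth plus small independent noise.
rng = np.random.default_rng(seed=42)
t = np.arange(1000)
close = pd.Series(100 * np.exp(0.002 * t + rng.normal(0, 0.01, t.size)))

sma_20 = close.rolling(20).mean()
dist_to_sma_20 = close / sma_20 - 1

first_half, second_half = slice(20, 500), slice(500, 1000)

# Raw SMA: the second half lives on a very different scale than the first.
raw_shift = sma_20.iloc[second_half].mean() / sma_20.iloc[first_half].mean()

# Ratio feature: the mean barely moves between the two halves.
ratio_shift = dist_to_sma_20.iloc[second_half].mean() - dist_to_sma_20.iloc[first_half].mean()

print(f"raw SMA mean, second half / first half: {raw_shift:.2f}")
print(f"dist_to_sma_20 mean, second half - first half: {ratio_shift:+.5f}")
```

A model fitted on the first half would see raw SMA values in the second half that it never encountered during training, while the ratio stays inside the range it already knows.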
Random split (WRONG for time series):
Train: [t5, t10, t1, t8, ...] → future leaks into training
TimeSeriesSplit (CORRECT):
Fold 1: Train=t1..t200, Val=t201..t250
Fold 2: Train=t1..t250, Val=t251..t300
Fold 3: Train=t1..t300, Val=t301..t350
Used in: Random Forest importance, Mutual Information cross-validation.
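The expanding-window folds shown above are what scikit-learn's `TimeSeriesSplit` produces. A minimal sketch on twelve hypothetical observations (the fold sizes differ from the schematic above, which is illustrative):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve time-ordered observations; row i was observed before row i + 1.
X = np.arange(12).reshape(-1, 1)

splitter = TimeSeriesSplit(n_splits=3)
folds = list(splitter.split(X))

for i, (train_idx, val_idx) in enumerate(folds, start=1):
    print(
        f"Fold {i}: train={train_idx.min()}..{train_idx.max()}, "
        f"val={val_idx.min()}..{val_idx.max()}"
    )
```

In every fold the largest training index is smaller than the smallest validation index, so no validation bar can inform the model that is scored on it.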
High VIF between EMA_20 and SMA_20 (~0.99 correlation) does not mean one should be removed. Their 1% divergence — the spread between exponential and simple smoothing — often carries the most significant signal. This pipeline uses VIF to flag areas for investigation and guide the construction of new interaction features.
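As an illustration of "diagnose, then build an interaction", the sketch below computes VIF as the diagonal of the inverse correlation matrix (a standard identity, used here instead of the statsmodels helper the project lists) and then derives the EMA/SMA spread. The synthetic prices and the names `ema_sma_spread` and `noise` are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical drifting random-walk prices.
rng = np.random.default_rng(seed=7)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0.001, 0.01, 500))))

feats = pd.DataFrame(
    {
        "sma_20": close.rolling(20).mean(),
        "ema_20": close.ewm(span=20, adjust=False).mean(),
        "noise": rng.normal(0, 1, 500),  # unrelated control column
    }
).dropna()

# VIF_i = 1 / (1 - R_i^2), which equals the i-th diagonal entry
# of the inverse of the feature correlation matrix.
corr = feats.corr().to_numpy()
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=feats.columns)
print(vif.round(1))

# Rather than dropping one of the collinear pair, keep their divergence.
ema_sma_spread = feats["ema_20"] / feats["sma_20"] - 1
print(ema_sma_spread.describe().loc[["mean", "std"]])
```

The two moving averages flag each other with a high VIF, while the spread between them is a small, scale-free series that neither parent expresses on its own.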
| Estimator | Uses | Best when |
|---|---|---|
| Close-to-Close | Daily returns | Standard benchmark |
| Parkinson (1980) | High-Low range | Drift ≈ 0, gaps rare |
| Garman-Klass (1980) | OHLC prices | General case (~8× efficient) |
Using all three provides redundancy and lets the model learn which estimator is more reliable in each market regime.
# ❌ Integer encoding breaks the cycle
day_of_week = 0 # Monday
day_of_week = 4 # Friday
# Monday(0) and Friday(4) appear distant, but they're adjacent in market time
# ✓ Sin/cos preserves the cycle
dow_sin = sin(2π × dayofweek / 5)
dow_cos = cos(2π × dayofweek / 5)
# Monday and Friday are now as close to each other as any other pair of adjacent days

# 1. Clone the repository
git clone https://github.com/your-user/alpha-feature-engineering-timeseries.git
cd alpha-feature-engineering-timeseries
# 2. Create and activate virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate # Linux / macOS
.venv\Scripts\activate # Windows
# 3. Install dependencies
pip install -r requirements.txt

Python version: 3.11+ recommended.
No C compilation required — pandas-ta is pure Python.
# Default: use cache if available, run feature selection
python run_pipeline.py
# Re-download all data (ignore cache)
python run_pipeline.py --force-download
# Skip feature selection (faster, for debugging)
python run_pipeline.py --no-selection
# Use a custom config file
python run_pipeline.py --config config/my_config.yaml

from src.pipeline import FeatureEngineeringPipeline
# From config file
pipeline = FeatureEngineeringPipeline.from_config("config/config.yaml")
results = pipeline.run()
# Access a specific asset's feature DataFrame
btc_features = results["BTC-USD"]
print(btc_features.shape)  # (2153, 87)

import pandas as pd
from src.features.technical import TechnicalIndicatorTransformer
df = pd.read_parquet("data/raw/BTC-USD.parquet")
transformer = TechnicalIndicatorTransformer(rsi_lengths=[14, 28])
df_with_features = transformer.fit_transform(df)
print(df_with_features[["rsi_14", "atr_pct", "bb_pct_b"]].tail())

import pandas as pd
btc = pd.read_parquet("data/features/BTC-USD_features.parquet")
# Separate features from targets
OHLCV = {"Open", "High", "Low", "Close", "Volume"}
TARGETS = {c for c in btc.columns if "forward_return" in c or "direction" in c}
X = btc[[c for c in btc.columns if c not in OHLCV | TARGETS]]
y = btc["forward_return_1d"]

All parameters are centralized in config/config.yaml:
data:
  symbols: ["BTC-USD", "SPY", "AAPL", "GC=F"]
  start_date: "2019-01-01"
  end_date: "2024-12-31"
  interval: "1d"
features:
  lag:
    periods: [1, 2, 3, 5, 10, 21]
  rolling:
    windows: [5, 10, 20, 50]
  volatility:
    windows: [5, 10, 20]
target:
  forward_periods: [1, 5]

# Run full test suite
pytest tests/ -v
# Run with coverage report
pytest tests/ --cov=src --cov-report=term-missing
# Run a single test class
pytest tests/test_pipeline.py::TestNoDataLeakage -v

| Class | Tests | What is validated |
|---|---|---|
| TestLagFeatureTransformer | 7 | Columns, stationarity, look-ahead bias |
| TestRollingWindowTransformer | 6 | Ratio encoding, NaN position, bounds |
| TestTechnicalIndicatorTransformer | 6 | RSI bounds, ATR scale, MACD normalization |
| TestTemporalFeatureTransformer | 4 | Sin/cos range, zero NaNs, binary flags |
| TestVolatilityTransformer | 6 | Non-negative vol, regime mean, Parkinson |
| TestInteractionFeatureTransformer | 3 | Prerequisites, output presence |
| TestTimeSeriesValidator | 5 | Zero-vol, flatlines, forward-fill |
| TestPipelineFeatures | 4 | Target isolation, 50+ features, OHLCV preserved |
| TestNoDataLeakage | 1 | Critical: features identical after truncation |
The most important test in the project:
def test_features_unchanged_after_truncation(self):
    """
    Remove the last 20 rows of the dataset.
    Feature values for rows t1..t_{n-20} must be IDENTICAL.
    Any feature that uses future data would fail this test.
    """
    df_full = pipeline._build_features(df)              # n rows
    df_trunc = pipeline._build_features(df.iloc[:-20])  # n-20 rows
    pd.testing.assert_frame_equal(
        df_full[common_cols].iloc[60:n-20],
        df_trunc[common_cols].iloc[60:n-20],
        rtol=1e-5,
    )

alpha-feature-engineering-timeseries/
│
├── config/
│ └── config.yaml ← All pipeline parameters
│
├── data/
│ ├── raw/ ← Raw OHLCV Parquet (cache)
│ ├── processed/ ← Validated OHLCV (intermediate)
│ └── features/ ← Final feature datasets (output)
│ ├── BTC-USD_features.parquet
│ ├── SPY_features.parquet
│ ├── AAPL_features.parquet
│ └── GC_F_features.parquet
│
├── src/
│ ├── data/
│ │ ├── loader.py ← yfinance download + retry + cache
│ │ └── validator.py ← OHLC/flatline/gap/volume checks
│ │
│ ├── features/
│ │ ├── base.py ← BaseFeatureTransformer (abstract)
│ │ ├── lag.py ← Return lags, volume ratios
│ │ ├── technical.py ← RSI, MACD, BB, ATR, CCI, Stoch, ADX
│ │ ├── rolling.py ← dist_to_sma, z-score, range position
│ │ ├── volatility.py ← C-to-C, Parkinson, Garman-Klass
│ │ ├── temporal.py ← Sin/cos cyclic calendar encoding
│ │ └── interaction.py ← Cross-domain composite signals
│ │
│ ├── selection/
│ │ ├── correlation.py ← Pearson/Spearman, VIF, heatmap
│ │ └── importance.py ← RF + MI with TimeSeriesSplit
│ │
│ └── pipeline.py ← End-to-end orchestrator
│
├── notebooks/
│ ├── 01_data_exploration.ipynb ← ADF test, distributions, validation
│ ├── 02_feature_engineering.ipynb ← Full pipeline walkthrough
│ └── 03_feature_analysis.ipynb ← Heatmap, VIF, RF/MI importance
│
├── tests/
│ ├── conftest.py ← Shared make_ohlcv() fixture
│ ├── test_features.py ← 32 unit tests per transformer family
│ └── test_pipeline.py ← 10 integration + anti-leakage tests
│
├── outputs/
│ ├── figures/ ← correlation_heatmap.png, feature_importance.png
│ └── reports/ ← target_correlations.csv, rf_importance.csv
│
├── run_pipeline.py ← CLI entry point
├── pytest.ini ← Test configuration
└── requirements.txt
After a full run (python run_pipeline.py), the following files are generated:
| Path | Format | Contents |
|---|---|---|
| data/features/*_features.parquet | Parquet | 77 features + 4 targets + 5 OHLCV |
| outputs/figures/correlation_heatmap.png | PNG | Top-30 feature correlation matrix |
| outputs/figures/feature_importance.png | PNG | RF + MI side-by-side bar charts |
| outputs/figures/03_spearman_correlations.png | PNG | Spearman ρ rankings |
| outputs/reports/target_correlations.csv | CSV | Pearson + Spearman with p-values |
| outputs/reports/rf_importance_regression.csv | CSV | RF importance (mean ± std, 5 folds) |
| outputs/reports/rf_importance_classification.csv | CSV | RF importance for direction target |
| outputs/reports/mutual_information.csv | CSV | MI scores for all features |
| outputs/reports/high_correlation_pairs.csv | CSV | Feature pairs with \|corr\| ≥ 0.95 |
| outputs/reports/{symbol}_summary.json | JSON | Run metadata + top-20 features |
| outputs/reports/pipeline.log | Log | Structured loguru output |
- Parkinson, M. (1980). The Extreme Value Method for Estimating the Variance of the Rate of Return. The Journal of Business.
- Garman, M. & Klass, M. (1980). On the Estimation of Security Price Volatilities. The Journal of Business.
- López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley.
MIT License — see LICENSE for details.
Built with precision. Validated with rigor. Documented for production.