Alpha Feature Engineering — Time Series

Production-grade feature engineering pipeline for financial time series. Transforms raw OHLCV data into 77 stationary, ML-ready features across 6 signal families, with full data quality validation, anti-leakage guarantees, and automated feature selection.

Overview

This project solves one of the most critical and underestimated problems in quantitative machine learning: turning raw market data into predictive features that are stable, stationary, and leak-free.

Most practitioners make the same three mistakes:

Common Mistake	This Project's Solution
Raw prices as features	All price-based features normalized as ratios or distances
Random `train_test_split`	`TimeSeriesSplit` enforced in all model-based evaluations
Removing high-VIF features	VIF used as diagnostic, not auto-pruner; interaction features added

The pipeline processes 4 assets (BTC-USD, SPY, AAPL, GC=F) across 6 years of daily bars, producing 77 stationary features per asset, saved as compressed Parquet files for downstream ML training.

Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                    FEATURE ENGINEERING PIPELINE                         │
│                                                                         │
│  ┌──────────────┐    ┌──────────────────┐    ┌────────────────────┐    │
│  │  DATA LAYER  │    │  FEATURE LAYER   │    │  SELECTION LAYER   │    │
│  │              │    │                  │    │                    │    │
│  │  yfinance    │───▶│  LagTransformer  │    │  Pearson/Spearman  │    │
│  │  4 symbols   │    │  TechIndicators  │───▶│  VIF Diagnostics   │    │
│  │  2019-2024   │    │  RollingWindows  │    │  RF Importance     │    │
│  │              │    │  Volatility      │    │  Mutual Info       │    │
│  │  Validator   │    │  Temporal        │    │  TimeSeriesSplit   │    │
│  │  (OHLC/gaps/ │    │  Interactions    │    │                    │    │
│  │  flatlines/  │    │                  │    │  Top-20 selector   │    │
│  │  zero-vol)   │    │  77 features     │    │  (avg_rank strat.) │    │
│  └──────────────┘    └──────────────────┘    └────────────────────┘    │
│         │                    │                         │               │
│         ▼                    ▼                         ▼               │
│   data/raw/*.parquet   data/features/          outputs/figures/        │
│                        *_features.parquet      outputs/reports/        │
└─────────────────────────────────────────────────────────────────────────┘
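
The "Top-20 selector (avg_rank strat.)" box is not spelled out in this README. One plausible reading, sketched below with made-up scores, is to rank features under each scoring method and keep those with the best average rank. The function name `top_k_by_avg_rank`, the score column names, and the numbers are illustrative assumptions, not the project's code.

```python
import pandas as pd


def top_k_by_avg_rank(scores: pd.DataFrame, k: int = 20) -> list[str]:
    """Rank features per score column (1 = best), average the ranks, keep the k best."""
    # Higher score means more useful, so rank in descending order
    ranks = scores.rank(ascending=False, method="average")
    avg_rank = ranks.mean(axis=1).sort_values(kind="stable")
    return avg_rank.index[:k].tolist()


# Toy scores for four features under three hypothetical selection methods
scores = pd.DataFrame(
    {
        "abs_spearman": [0.08, 0.02, 0.05, 0.01],
        "rf_importance": [0.30, 0.10, 0.40, 0.20],
        "mutual_info": [0.04, 0.01, 0.03, 0.02],
    },
    index=["atr_pct", "dow_sin", "rsi_14", "zscore_20"],
)

selected = top_k_by_avg_rank(scores, k=2)
print(selected)
```

Averaging ranks rather than raw scores keeps methods with different scales (correlation, impurity importance, mutual information) from dominating one another.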

Transformer dependency graph

┌─────────────┐  ┌─────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────┐
│     Lag     │  │  Technical  │  │   Rolling    │  │  Volatility  │  │ Temporal │
│  (no deps)  │  │  (no deps)  │  │  (no deps)   │  │  (no deps)   │  │ (no deps)│
└──────┬──────┘  └──────┬──────┘  └──────┬───────┘  └──────┬───────┘  └────┬─────┘
       │                │                │                  │               │
       └────────────────┴────────────────┴──────────────────┴───────────────┘
                                         │
                                         ▼
                               ┌──────────────────┐
                               │   Interaction    │
                               │ (depends on all) │
                               └──────────────────┘

Stack

Library	Version	Role
`pandas`	≥ 2.1	Core data manipulation
`numpy`	≥ 1.26	Numerical computation
`yfinance`	≥ 0.2.40	OHLCV data source
`pandas-ta`	≥ 0.3.14b0	Technical indicators (pure Python — no C compile)
`scikit-learn`	≥ 1.4	Pipeline interface + feature selection
`scipy`	≥ 1.12	Statistical tests, distributions
`statsmodels`	≥ 0.14	VIF computation
`pyarrow`	≥ 15.0	Parquet I/O (columnar, compressed)
`loguru`	≥ 0.7.2	Structured logging
`pydantic`	≥ 2.6	Configuration validation
`matplotlib` / `seaborn`	≥ 3.8 / 0.13	Publication-ready visualization
`pytest`	≥ 8.0	42-test suite

Why pandas-ta instead of TA-Lib? TA-Lib wraps a C library that has to be compiled or installed separately, a step that frequently fails on Windows and across Python versions. pandas-ta is pure Python, installs with a single pip install, and covers the indicators this pipeline needs.

Results

Pipeline execution (2026-03-01)

PIPELINE RUN SUMMARY
══════════════════════════════════════════════════════════════
 Symbol     Raw rows   Clean rows   Features   Date range
──────────────────────────────────────────────────────────────
 BTC-USD     2,191      2,153        77        2019-02-03 → 2024-12-25
 SPY         1,509      1,471        77        2019-02-20 → 2024-12-20
 AAPL        1,509      1,471        77        2019-02-20 → 2024-12-20
 GC=F        1,509      1,471        77        2019-02-20 → 2024-12-20
══════════════════════════════════════════════════════════════
 TOTAL       6,718      6,566        77        6 years of daily bars

Data quality reports

╔══ BTC-USD ══════════════════════════════════════
║  Rows:             2,191 →  2,153 (after cleaning)
║  Nulls remaining:      0
║  Time gaps:            0
║  Flatlines:            0 periods
║  Zero volume:          0
║  OHLC violations:      0
║  Outlier rows:         1   ← COVID flash crash, kept as-is
╚══ Status: ✓ CLEAN

╔══ SPY / AAPL ═══════════════════════════════════
║  Rows:             1,509 →  1,471 (after cleaning)
║  Time gaps:           55   ← US market holidays (expected)
║  All other checks:     0
╚══ Status: ✓ CLEAN

╔══ GC=F (Gold Futures) ══════════════════════════
║  Rows:             1,509 →  1,471 (after cleaning)
║  Zero volume:         12   ← Exchange downtime, forward-filled
║  Time gaps:           55   ← Market holidays (expected)
╚══ Status: ✓ CLEAN
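
The counts in these reports could come from checks along the following lines. This is a minimal sketch, assuming yfinance-style column names; the function `quality_report`, its `flatline_len` parameter, and the toy data are hypothetical and not taken from `src/data/validator.py`.

```python
import pandas as pd


def quality_report(df: pd.DataFrame, flatline_len: int = 5) -> dict:
    """Count basic OHLCV quality problems in a daily bar DataFrame."""
    # High must be the largest and Low the smallest of the four prices
    ohlc_bad = (
        (df["High"] < df[["Open", "Close", "Low"]].max(axis=1))
        | (df["Low"] > df[["Open", "Close", "High"]].min(axis=1))
    )
    # A bar counts as flat when Close is identical over the last `flatline_len` bars
    unchanged = df["Close"].diff().eq(0).astype(int)
    flat = unchanged.rolling(flatline_len - 1).sum().eq(flatline_len - 1)
    return {
        "nulls": int(df.isna().sum().sum()),
        "zero_volume": int((df["Volume"] == 0).sum()),
        "ohlc_violations": int(ohlc_bad.sum()),
        "flatline_bars": int(flat.sum()),
    }


bars = pd.DataFrame(
    {
        "Open": [10.0, 11.0, 12.0, 12.5],
        "High": [11.0, 12.0, 11.5, 13.0],  # third bar: High below Open
        "Low": [9.5, 10.5, 11.0, 12.0],
        "Close": [10.5, 11.5, 11.2, 12.8],
        "Volume": [100, 0, 150, 120],      # second bar: zero volume
    },
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)

report = quality_report(bars)
print(report)
```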

Feature count breakdown

Family          Count   Description
────────────────────────────────────────────────────────────
Lag               10    Return lags + volume ratios + momentum
Technical         20    RSI, MACD, Bollinger, ATR, CCI, Stoch, OBV, ADX
Rolling           23    dist_to_sma, z-score, range position (all windows)
Volatility        12    Close-to-Close + Parkinson + Garman-Klass
Temporal           9    Sin/cos cyclic + calendar binary flags
Interaction        6    Cross-domain composite signals
────────────────────────────────────────────────────────────
TOTAL             77    + 4 target columns + 5 OHLCV = 87 total columns

Output files

data/features/
├── BTC-USD_features.parquet    1,534 KB   (snappy compressed)
├── SPY_features.parquet        1,065 KB
├── AAPL_features.parquet       1,066 KB
└── GC_F_features.parquet       1,042 KB

Test suite

pytest tests/ -v
═══════════════════════════════════════════════════════════════
 42 tests  |  0 failed  |  0 errors  |  3.57s
═══════════════════════════════════════════════════════════════

tests/test_features.py
  TestLagFeatureTransformer           7 / 7  PASSED
  TestRollingWindowTransformer        6 / 6  PASSED
  TestTechnicalIndicatorTransformer   6 / 6  PASSED
  TestTemporalFeatureTransformer      4 / 4  PASSED
  TestVolatilityTransformer           6 / 6  PASSED
  TestInteractionFeatureTransformer   3 / 3  PASSED

tests/test_pipeline.py
  TestTimeSeriesValidator             5 / 5  PASSED
  TestPipelineFeatures                4 / 4  PASSED
  TestNoDataLeakage                   1 / 1  PASSED  ← Critical

Feature Catalog

Lag Features (10)

Feature	Formula	Stationarity
`return_lag_Nd`	`Close_t / Close_{t-N} - 1`	✓ % change
`volume_ratio_lag_Nd`	`Vol_{t-N} / SMA20(Vol)_{t-N} - 1`	✓ relative ratio
`momentum_1_21`	`return_lag_1d - return_lag_21d`	✓ difference of returns
`momentum_regime`	`dist_to_sma_5 - dist_to_sma_50`	✓ difference of ratios

N ∈ {1, 2, 3, 5, 10, 21}
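
As a rough illustration of the formulas in this table, the lag family can be written in a few lines of pandas. The function name `lag_features` and the synthetic data are assumptions made for this sketch; the project's `LagTransformer` may differ in detail.

```python
import numpy as np
import pandas as pd


def lag_features(df: pd.DataFrame, periods=(1, 2, 3, 5, 10, 21), vol_window=20) -> pd.DataFrame:
    """Return lags, lagged relative volume, and the 1d-vs-21d momentum spread."""
    out = pd.DataFrame(index=df.index)
    # Volume relative to its own 20-bar average, so the level of volume drops out
    rel_vol = df["Volume"] / df["Volume"].rolling(vol_window).mean() - 1
    for n in periods:
        out[f"return_lag_{n}d"] = df["Close"] / df["Close"].shift(n) - 1
        out[f"volume_ratio_lag_{n}d"] = rel_vol.shift(n)
    out["momentum_1_21"] = out["return_lag_1d"] - out["return_lag_21d"]
    return out


# Synthetic bars: price grows 1% per bar, volume is constant
bars = pd.DataFrame(
    {"Close": 100 * 1.01 ** np.arange(60), "Volume": np.full(60, 1000.0)}
)
lags = lag_features(bars)
print(lags[["return_lag_1d", "return_lag_21d", "momentum_1_21"]].tail(3))
```

Every value at row t uses only rows up to t, which is the property the anti-leakage test relies on.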

Technical Indicators (20)

Feature	Range	Notes
`rsi_14`, `rsi_28`	[0, 100]	Bounded by construction
`rsi_{N}_centered`	[-50, +50]	RSI shifted to center at 0
`macd_normalized`	~(-0.05, 0.05)	MACD line / Close — scale-free
`macd_signal_normalized`	~(-0.05, 0.05)	Signal line / Close
`macd_hist_normalized`	~(-0.02, 0.02)	Histogram / Close
`macd_hist_sign`	{-1, 0, +1}	Direction of momentum
`bb_pct_b`	[~0, ~1]	Bollinger % position
`bb_width_normalized`	(0, ∞)	(Upper−Lower) / Mid
`atr_pct`	(0, 0.10]	ATR / Close — volatility %
`cci_20`	~[-200, +200]	Commodity Channel Index
`stoch_k`, `stoch_d`	[0, 100]	Stochastic oscillators
`stoch_kd_diff`	[-100, +100]	K − D crossover signal
`obv_normalized`	~(-3, +3)	OBV z-score (20-bar window)
`williams_r`	[-100, 0]	Williams %R
`adx_14`	[0, 100]	Trend strength
`adx_directional_ratio`	[-1, +1]	(DM+ − DM−) / (DM+ + DM−)
`rsi_trend_divergence`	unbounded	(RSI−50) × dist_to_sma_20
`bb_rsi_composite`	[~0, ~1]	bb_pct_b × (RSI / 100)
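
To show why dividing by Close makes these indicators scale-free, here is a plain-pandas sketch of the normalized MACD and Bollinger columns. It deliberately avoids pandas-ta, uses the conventional 12/26/9 and 20-bar/2-sigma parameters as assumptions, and the function name `macd_and_bollinger` is hypothetical, so values may differ slightly from the pipeline's output.

```python
import numpy as np
import pandas as pd


def macd_and_bollinger(close: pd.Series) -> pd.DataFrame:
    """Scale-free MACD and Bollinger features computed with pandas only."""
    ema_fast = close.ewm(span=12, adjust=False).mean()
    ema_slow = close.ewm(span=26, adjust=False).mean()
    macd = ema_fast - ema_slow
    signal = macd.ewm(span=9, adjust=False).mean()

    mid = close.rolling(20).mean()
    sd = close.rolling(20).std()
    upper, lower = mid + 2 * sd, mid - 2 * sd

    return pd.DataFrame(
        {
            "macd_normalized": macd / close,
            "macd_signal_normalized": signal / close,
            "macd_hist_normalized": (macd - signal) / close,
            "bb_pct_b": (close - lower) / (upper - lower),
            "bb_width_normalized": (upper - lower) / mid,
        }
    )


t = np.arange(80)
close = pd.Series(100 + 10 * np.sin(t / 5) + 0.1 * t)

feats = macd_and_bollinger(close)
feats_x10 = macd_and_bollinger(close * 10)  # same series quoted at 10x the price
print(np.allclose(feats.dropna(), feats_x10.dropna()))
```

The same features come out whether the asset trades at 100 or 1,000, which is what lets one model consume BTC and SPY rows side by side.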

Rolling Window Features (23)

Per window W ∈ {5, 10, 20, 50}:

Feature	Formula	Why not raw?
`dist_to_sma_W`	`(Close / SMA_W) − 1`	Raw SMA grows with price (non-stationary)
`rolling_std_returns_W`	`std(log_ret, W)`	Returns already stationary
`zscore_W`	`(Close − SMA_W) / std(Close, W)`	Standardized price position
`dist_to_high_W`	`(Close / max(Close, W)) − 1`	Always ≤ 0
`dist_to_low_W`	`(Close / min(Close, W)) − 1`	Always ≥ 0

Plus cross-window:

rolling_skew_20, rolling_kurt_20 — return distribution shape
trend_acceleration — dist_to_sma_5 − dist_to_sma_20
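
The per-window formulas above translate almost directly into pandas. The sketch below is an approximation under stated assumptions: the function name `rolling_features` and the random-walk input are invented for illustration, and the skew and kurtosis columns are omitted.

```python
import numpy as np
import pandas as pd


def rolling_features(close: pd.Series, windows=(5, 10, 20, 50)) -> pd.DataFrame:
    """Ratio-encoded rolling features; no raw price level survives in the output."""
    log_ret = np.log(close).diff()
    out = pd.DataFrame(index=close.index)
    for w in windows:
        sma = close.rolling(w).mean()
        out[f"dist_to_sma_{w}"] = close / sma - 1
        out[f"rolling_std_returns_{w}"] = log_ret.rolling(w).std()
        out[f"zscore_{w}"] = (close - sma) / close.rolling(w).std()
        out[f"dist_to_high_{w}"] = close / close.rolling(w).max() - 1
        out[f"dist_to_low_{w}"] = close / close.rolling(w).min() - 1
    out["trend_acceleration"] = out["dist_to_sma_5"] - out["dist_to_sma_20"]
    return out


rng = np.random.default_rng(0)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 200))))
roll = rolling_features(close)
print(roll[["dist_to_sma_20", "zscore_20", "dist_to_high_20"]].tail(3))
```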

Volatility Features (12)

Three estimators at windows W ∈ {5, 10, 20}, all annualized. E[·] denotes the mean over the last W bars:

Estimator	Formula	Efficiency vs C-to-C
`hist_vol_Wd`	`std(log_ret, W) × √252`	1× (baseline)
`parkinson_vol_Wd`	`√(E[(ln H/L)²] / (4 ln 2)) × √252`	~5× more efficient
`garman_klass_vol_Wd`	`√(E[0.5(ln H/L)² − (2 ln 2 − 1)(ln C/O)²]) × √252`	~8× more efficient

Plus:

high_low_pct — (High − Low) / Close
vol_of_vol — std of hist_vol_20d over 20 bars
vol_regime — hist_vol_20d / SMA_90(hist_vol_20d)
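
For readers who want to check the estimator formulas numerically, here is a compact sketch. The function name `volatility_features` and the synthetic bars are assumptions; the bars are built with a constant High/Low ratio and Open equal to Close so that Parkinson and Garman-Klass have closed-form values.

```python
import numpy as np
import pandas as pd


def volatility_features(df: pd.DataFrame, window: int = 20, periods_per_year: int = 252) -> pd.DataFrame:
    """Annualized close-to-close, Parkinson, and Garman-Klass volatility."""
    log_hl = np.log(df["High"] / df["Low"])
    log_co = np.log(df["Close"] / df["Open"])
    log_ret = np.log(df["Close"]).diff()
    ann = np.sqrt(periods_per_year)

    # Parkinson: mean squared log range, scaled by 1 / (4 ln 2)
    park_var = (log_hl ** 2).rolling(window).mean() / (4 * np.log(2))
    # Garman-Klass: range term minus a close-vs-open correction, averaged per window
    gk_var = (0.5 * log_hl ** 2 - (2 * np.log(2) - 1) * log_co ** 2).rolling(window).mean()

    return pd.DataFrame(
        {
            f"hist_vol_{window}d": log_ret.rolling(window).std() * ann,
            f"parkinson_vol_{window}d": np.sqrt(park_var) * ann,
            f"garman_klass_vol_{window}d": np.sqrt(gk_var) * ann,
        }
    )


close = pd.Series(100 * np.exp(np.linspace(0, 0.3, 60)))
bars = pd.DataFrame(
    {"Open": close, "High": close * 1.01, "Low": close / 1.01, "Close": close}
)
vol = volatility_features(bars, window=20)

# Closed-form values for a constant log range of 2 * ln(1.01) and ln(C/O) = 0
expected_parkinson = np.log(1.01) / np.sqrt(np.log(2)) * np.sqrt(252)
expected_gk = np.sqrt(2) * np.log(1.01) * np.sqrt(252)
print(vol.iloc[-1], expected_parkinson, expected_gk)
```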

Temporal Features (9)

Feature	Encoding	Period	Why not integer?
`dow_sin`, `dow_cos`	sin/cos	5 (biz days)	Mon(0)↔Fri(4) must be neighbors
`month_sin`, `month_cos`	sin/cos	12	Dec↔Jan must be neighbors
`quarter_sin`, `quarter_cos`	sin/cos	4	Q4↔Q1 must be neighbors
`is_month_start`	binary	—	Turn-of-month liquidity effect
`is_month_end`	binary	—	Window-dressing effect
`is_quarter_end`	binary	—	Institutional rebalancing
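
The "must be neighbors" column can be verified with a small experiment. This sketch assumes a helper named `cyclic_encode` (hypothetical) and measures distances between encoded weekdays on the unit circle.

```python
import numpy as np
import pandas as pd


def cyclic_encode(values, period: int):
    """Map integer positions in a cycle of length `period` onto the unit circle."""
    angle = 2 * np.pi * np.asarray(values) / period
    return np.sin(angle), np.cos(angle)


# Two business weeks starting on Monday 2024-01-01
idx = pd.bdate_range("2024-01-01", periods=10)
dow_sin, dow_cos = cyclic_encode(idx.dayofweek, period=5)


def distance(i: int, j: int) -> float:
    """Euclidean distance between the encoded points of rows i and j."""
    return float(np.hypot(dow_sin[i] - dow_sin[j], dow_cos[i] - dow_cos[j]))


mon_tue = distance(0, 1)
mon_fri = distance(0, 4)
mon_wed = distance(0, 2)
print(mon_tue, mon_fri, mon_wed)
```

With period 5, Monday to Friday is the same distance as Monday to Tuesday, while Monday to Wednesday is farther. An integer encoding would instead place Friday four units away from Monday.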

Interaction Features (6)

"Instead of removing correlated features, add interaction features. The divergence between two correlated signals is often the most predictive part."

Feature	Formula	Economic Intuition
`trend_over_vol`	`dist_to_sma_20 / hist_vol_20d`	Strong trend in calm market = most reliable
`rsi_trend_divergence`	`(RSI−50) × dist_to_sma_20`	Overbought + negative trend → reversal
`momentum_regime`	`dist_to_sma_5 − dist_to_sma_50`	Bull/bear regime indicator
`vol_adj_return_1d`	`return_lag_1d / hist_vol_20d`	Bar-level Sharpe ratio
`bb_rsi_composite`	`bb_pct_b × (RSI / 100)`	Dual-confirmation extreme signal
`trend_volume_confirm`	`dist_to_sma_20 × volume_ratio_lag_1d`	Volume-confirmed trend
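
Because every interaction is built from columns produced by other families, a transformer for this family needs to check its prerequisites first. The sketch below assumes the RSI in the formulas is `rsi_14` and guards against zero volatility by mapping it to NaN; both choices, and the function name `add_interactions`, are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

REQUIRED = {
    "dist_to_sma_5", "dist_to_sma_20", "dist_to_sma_50", "hist_vol_20d",
    "return_lag_1d", "rsi_14", "bb_pct_b", "volume_ratio_lag_1d",
}


def add_interactions(f: pd.DataFrame) -> pd.DataFrame:
    """Append the six interaction columns; fail early if a prerequisite is missing."""
    missing = REQUIRED - set(f.columns)
    if missing:
        raise ValueError(f"missing prerequisite columns: {sorted(missing)}")

    out = f.copy()
    vol = f["hist_vol_20d"].replace(0, np.nan)  # avoid dividing by zero volatility
    out["trend_over_vol"] = f["dist_to_sma_20"] / vol
    out["vol_adj_return_1d"] = f["return_lag_1d"] / vol
    out["rsi_trend_divergence"] = (f["rsi_14"] - 50) * f["dist_to_sma_20"]
    out["momentum_regime"] = f["dist_to_sma_5"] - f["dist_to_sma_50"]
    out["bb_rsi_composite"] = f["bb_pct_b"] * f["rsi_14"] / 100
    out["trend_volume_confirm"] = f["dist_to_sma_20"] * f["volume_ratio_lag_1d"]
    return out


base = pd.DataFrame(
    {
        "dist_to_sma_5": [0.01, 0.02],
        "dist_to_sma_20": [0.02, -0.01],
        "dist_to_sma_50": [0.03, 0.00],
        "hist_vol_20d": [0.2, 0.0],
        "return_lag_1d": [0.004, 0.001],
        "rsi_14": [70.0, 40.0],
        "bb_pct_b": [0.9, 0.3],
        "volume_ratio_lag_1d": [0.5, -0.2],
    }
)
inter = add_interactions(base)
print(inter[["trend_over_vol", "rsi_trend_divergence", "momentum_regime"]])
```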

Target Variables (4)

Column	Type	Formula
`forward_return_1d`	Continuous	`log(Close_{t+1} / Close_t)`
`forward_return_5d`	Continuous	`log(Close_{t+5} / Close_t)`
`direction_1d`	Binary {0,1}	`int(forward_return_1d > 0)`
`direction_5d`	Binary {0,1}	`int(forward_return_5d > 0)`

Data leakage prevention: targets are computed via shift(-n) and added after all features. They are explicitly excluded from feature_cols by _get_feature_cols(). The anti-leakage integration test validates this.
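
A minimal sketch of that target construction follows, with hypothetical names (`add_targets`, `TARGET_PREFIXES`). One detail worth noting: the last n rows have no future price, so this sketch leaves both the return and the direction as NaN there instead of letting the direction silently default to 0.

```python
import numpy as np
import pandas as pd

TARGET_PREFIXES = ("forward_return_", "direction_")


def add_targets(df: pd.DataFrame, forward_periods=(1, 5)) -> pd.DataFrame:
    """Append forward log returns and their sign; the last n rows stay NaN."""
    out = df.copy()
    for n in forward_periods:
        fwd = np.log(out["Close"].shift(-n) / out["Close"])
        out[f"forward_return_{n}d"] = fwd
        # Keep NaN where the future price is unknown rather than coercing to 0
        out[f"direction_{n}d"] = (fwd > 0).astype(float).where(fwd.notna())
    return out


bars = pd.DataFrame(
    {
        "Close": [100.0, 101.0, 102.0, 101.0, 103.0, 104.0, 105.0],
        "return_lag_1d": [np.nan, 0.01, 0.0099, -0.0098, 0.0198, 0.0097, 0.0096],
    }
)
labeled = add_targets(bars)

# Features are everything that is neither a raw price column nor a target
feature_cols = [
    c for c in labeled.columns
    if c != "Close" and not c.startswith(TARGET_PREFIXES)
]
print(feature_cols)
```

Rows with NaN targets would normally be dropped before training.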

Key Design Decisions

1. Stationarity over simplicity

# df is an OHLCV DataFrame with a "Close" column

# ❌ WRONG: non-stationary, causes distribution shift
rolling_mean_20 = df["Close"].rolling(20).mean()

# ✓ CORRECT: stationary, comparable across time
dist_to_sma_20 = df["Close"] / rolling_mean_20 - 1

BTC traded at ~$10k in 2020 and ~$60k in 2024. A model trained on 2020 rolling means would see inputs several times larger than anything in its training range when evaluated on 2024 data. All price-based features in this pipeline are therefore expressed as ratios, percentages, or z-scores.

2. TimeSeriesSplit — never random shuffle

Random split (WRONG for time series):
  Train: [t5, t10, t1, t8, ...]  → future leaks into training

TimeSeriesSplit (CORRECT):
  Fold 1: Train=t1..t200,  Val=t201..t250
  Fold 2: Train=t1..t250,  Val=t251..t300
  Fold 3: Train=t1..t300,  Val=t301..t350

Used in: Random Forest importance, Mutual Information cross-validation.
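
The fold layout shown above can be reproduced with scikit-learn's `TimeSeriesSplit`. The sample count of 350 and `test_size=50` are chosen here only to match the illustration; the pipeline's actual settings are not stated in this README.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(350).reshape(-1, 1)  # 350 time-ordered rows, t1..t350

tscv = TimeSeriesSplit(n_splits=3, test_size=50)
folds = list(tscv.split(X))

for k, (train_idx, val_idx) in enumerate(folds, start=1):
    # Indices are 0-based, so add 1 to match the t1..tN notation
    print(
        f"Fold {k}: Train=t{train_idx[0] + 1}..t{train_idx[-1] + 1}, "
        f"Val=t{val_idx[0] + 1}..t{val_idx[-1] + 1}"
    )
```

Each training window ends strictly before its validation window begins, so no fold is fitted on data from its own future.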

3. VIF as a diagnostic, not a scalpel

A high VIF between EMA_20 and SMA_20 (correlation ≈ 0.99) does not mean one of them should be removed. The small residual divergence between them, the spread between exponential and simple smoothing, can carry signal that neither average shows on its own. This pipeline uses VIF to flag areas for investigation and to guide the construction of new interaction features.
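
To make the diagnostic concrete, the sketch below computes VIF from the inverse of the correlation matrix, which equals 1 / (1 − R²) for each column when an intercept is included. It uses NumPy directly instead of statsmodels, and the function name `vif_from_correlation`, the random-walk prices, and the `ema_sma_spread` column are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def vif_from_correlation(X: pd.DataFrame) -> pd.Series:
    """VIF_j = 1 / (1 - R_j^2), read off the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X.to_numpy(dtype=float), rowvar=False)
    return pd.Series(np.diag(np.linalg.inv(corr)), index=X.columns, name="vif")


rng = np.random.default_rng(0)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 500))))

X = pd.DataFrame(
    {
        "sma_20": close.rolling(20).mean(),
        "ema_20": close.ewm(span=20, adjust=False).mean(),
        "noise": rng.normal(0, 1, 500),
    }
).dropna()

vifs = vif_from_correlation(X)
print(vifs)

# Instead of dropping one average, keep their divergence as a new feature
X["ema_sma_spread"] = X["ema_20"] / X["sma_20"] - 1
```

The two moving averages get a large VIF while the unrelated column stays near 1. The flag points at the pair; the spread column is one way to act on it without discarding information.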

4. Three volatility estimators

Estimator	Uses	Best when
Close-to-Close	Daily returns	Standard benchmark
Parkinson (1980)	High-Low range	Drift ≈ 0, gaps rare
Garman-Klass (1980)	OHLC prices	General case (~8× efficient)

Using all three provides redundancy and lets the model learn which estimator is more reliable in each market regime.

5. Cyclic temporal encoding

import numpy as np

# ❌ Integer encoding breaks the cycle
day_of_week = 0  # Monday
day_of_week = 4  # Friday
# Monday (0) and Friday (4) appear distant, but they are adjacent in market time

# ✓ Sin/cos preserves the cycle
dow_sin = np.sin(2 * np.pi * day_of_week / 5)
dow_cos = np.cos(2 * np.pi * day_of_week / 5)
# On the unit circle, Friday and Monday are as close as any other pair of consecutive days

Installation

# 1. Clone the repository
git clone https://github.com/your-user/alpha-feature-engineering-timeseries.git
cd alpha-feature-engineering-timeseries

# 2. Create and activate virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate      # Linux / macOS
.venv\Scripts\activate         # Windows

# 3. Install dependencies
pip install -r requirements.txt

Python version: 3.11+ recommended. No C compilation required — pandas-ta is pure Python.

Usage

Run the full pipeline

# Default: use cache if available, run feature selection
python run_pipeline.py

# Re-download all data (ignore cache)
python run_pipeline.py --force-download

# Skip feature selection (faster, for debugging)
python run_pipeline.py --no-selection

# Use a custom config file
python run_pipeline.py --config config/my_config.yaml

Use individual components in code

from src.pipeline import FeatureEngineeringPipeline

# From config file
pipeline = FeatureEngineeringPipeline.from_config("config/config.yaml")
results  = pipeline.run()

# Access a specific asset's feature DataFrame
btc_features = results["BTC-USD"]
print(btc_features.shape)  # (2153, 87)

Use a single transformer

import pandas as pd
from src.features.technical import TechnicalIndicatorTransformer

df = pd.read_parquet("data/raw/BTC-USD.parquet")
transformer = TechnicalIndicatorTransformer(rsi_lengths=[14, 28])
df_with_features = transformer.fit_transform(df)
print(df_with_features[["rsi_14", "atr_pct", "bb_pct_b"]].tail())

Load the feature dataset directly

import pandas as pd

btc = pd.read_parquet("data/features/BTC-USD_features.parquet")

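# Separate features from targets
# Match targets by prefix: a substring test on "direction" would also catch
# the feature adx_directional_ratio and silently drop it from X.
OHLCV   = {"Open", "High", "Low", "Close", "Volume"}
TARGETS = {c for c in btc.columns if c.startswith(("forward_return_", "direction_"))}
X = btc[[c for c in btc.columns if c not in OHLCV | TARGETS]]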
y = btc["forward_return_1d"]

Configuration

All parameters are centralized in config/config.yaml:

data:
  symbols: ["BTC-USD", "SPY", "AAPL", "GC=F"]
  start_date: "2019-01-01"
  end_date:   "2024-12-31"
  interval:   "1d"

features:
  lag:
    periods: [1, 2, 3, 5, 10, 21]
  rolling:
    windows: [5, 10, 20, 50]
  volatility:
    windows: [5, 10, 20]

target:
  forward_periods: [1, 5]

Testing

# Run full test suite
pytest tests/ -v

# Run with coverage report
pytest tests/ --cov=src --cov-report=term-missing

# Run a single test class
pytest tests/test_pipeline.py::TestNoDataLeakage -v

Test categories

Class	Tests	What is validated
`TestLagFeatureTransformer`	7	Columns, stationarity, look-ahead bias
`TestRollingWindowTransformer`	6	Ratio encoding, NaN position, bounds
`TestTechnicalIndicatorTransformer`	6	RSI bounds, ATR scale, MACD normalization
`TestTemporalFeatureTransformer`	4	Sin/cos range, zero NaNs, binary flags
`TestVolatilityTransformer`	6	Non-negative vol, regime mean, Parkinson
`TestInteractionFeatureTransformer`	3	Prerequisites, output presence
`TestTimeSeriesValidator`	5	Zero-vol, flatlines, forward-fill
`TestPipelineFeatures`	4	Target isolation, 50+ features, OHLCV preserved
`TestNoDataLeakage`	1	Critical: features identical after truncation

The anti-leakage test

The most important test in the project:

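def test_features_unchanged_after_truncation(self):
    """
    Remove the last 20 rows of the dataset.
    Feature values for rows t1..t_{n-20} must be IDENTICAL.
    Any feature that uses future data would fail this test.
    """
    # `pipeline`, `df` and `common_cols` come from the test setup (not shown here);
    # `common_cols` is assumed to hold the feature columns shared by both frames.
    n = len(df)
    df_full  = pipeline._build_features(df)              # n rows
    df_trunc = pipeline._build_features(df.iloc[:-20])   # n-20 rows

    # Compare rows 60..n-20 only, skipping the initial warm-up rows
    pd.testing.assert_frame_equal(
        df_full[common_cols].iloc[60:n-20],
        df_trunc[common_cols].iloc[60:n-20],
        rtol=1e-5,
    )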
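
The idea behind the truncation check can be demonstrated in isolation. The sketch below is not the project's test; it compares one causal feature with one deliberately leaky feature, and the helper names (`causal_feature`, `leaky_feature`, `unchanged_after_truncation`) are invented for the example.

```python
import numpy as np
import pandas as pd


def causal_feature(close: pd.Series) -> pd.Series:
    """Distance to a trailing 5-bar SMA: uses only past and current bars."""
    return close / close.rolling(5).mean() - 1


def leaky_feature(close: pd.Series) -> pd.Series:
    """Z-score with full-sample mean and std: every row depends on future bars."""
    return (close - close.mean()) / close.std()


def unchanged_after_truncation(fn, close: pd.Series, k: int = 20) -> bool:
    """True if dropping the last k bars leaves all earlier feature values intact."""
    full = fn(close).iloc[:-k]
    truncated = fn(close.iloc[:-k])
    return bool(np.allclose(full, truncated, rtol=1e-5, equal_nan=True))


close = pd.Series(100 + 0.5 * np.arange(100))

causal_ok = unchanged_after_truncation(causal_feature, close)
leaky_ok = unchanged_after_truncation(leaky_feature, close)
print(causal_ok, leaky_ok)
```

Full-sample normalization is a common source of this failure, since the mean and standard deviation shift whenever later rows are added or removed.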

Project Structure

alpha-feature-engineering-timeseries/
│
├── config/
│   └── config.yaml                  ← All pipeline parameters
│
├── data/
│   ├── raw/                         ← Raw OHLCV Parquet (cache)
│   ├── processed/                   ← Validated OHLCV (intermediate)
│   └── features/                    ← Final feature datasets (output)
│       ├── BTC-USD_features.parquet
│       ├── SPY_features.parquet
│       ├── AAPL_features.parquet
│       └── GC_F_features.parquet
│
├── src/
│   ├── data/
│   │   ├── loader.py                ← yfinance download + retry + cache
│   │   └── validator.py             ← OHLC/flatline/gap/volume checks
│   │
│   ├── features/
│   │   ├── base.py                  ← BaseFeatureTransformer (abstract)
│   │   ├── lag.py                   ← Return lags, volume ratios
│   │   ├── technical.py             ← RSI, MACD, BB, ATR, CCI, Stoch, ADX
│   │   ├── rolling.py               ← dist_to_sma, z-score, range position
│   │   ├── volatility.py            ← C-to-C, Parkinson, Garman-Klass
│   │   ├── temporal.py              ← Sin/cos cyclic calendar encoding
│   │   └── interaction.py           ← Cross-domain composite signals
│   │
│   ├── selection/
│   │   ├── correlation.py           ← Pearson/Spearman, VIF, heatmap
│   │   └── importance.py            ← RF + MI with TimeSeriesSplit
│   │
│   └── pipeline.py                  ← End-to-end orchestrator
│
├── notebooks/
│   ├── 01_data_exploration.ipynb    ← ADF test, distributions, validation
│   ├── 02_feature_engineering.ipynb ← Full pipeline walkthrough
│   └── 03_feature_analysis.ipynb    ← Heatmap, VIF, RF/MI importance
│
├── tests/
│   ├── conftest.py                  ← Shared make_ohlcv() fixture
│   ├── test_features.py             ← 32 unit tests, grouped by transformer family
│   └── test_pipeline.py             ← 10 integration + anti-leakage tests
│
├── outputs/
│   ├── figures/                     ← correlation_heatmap.png, feature_importance.png
│   └── reports/                     ← target_correlations.csv, rf_importance.csv
│
├── run_pipeline.py                  ← CLI entry point
├── pytest.ini                       ← Test configuration
└── requirements.txt

Outputs

After a full run (python run_pipeline.py), the following files are generated:

Path	Format	Contents
`data/features/*_features.parquet`	Parquet	77 features + 4 targets + 5 OHLCV
`outputs/figures/correlation_heatmap.png`	PNG	Top-30 feature correlation matrix
`outputs/figures/feature_importance.png`	PNG	RF + MI side-by-side bar charts
`outputs/figures/03_spearman_correlations.png`	PNG	Spearman ρ rankings
`outputs/reports/target_correlations.csv`	CSV	Pearson + Spearman with p-values
`outputs/reports/rf_importance_regression.csv`	CSV	RF importance (mean ± std, 5 folds)
`outputs/reports/rf_importance_classification.csv`	CSV	RF importance for direction target
`outputs/reports/mutual_information.csv`	CSV	MI scores for all features
`outputs/reports/high_correlation_pairs.csv`	CSV	Feature pairs with |corr| ≥ 0.95
`outputs/reports/{symbol}_summary.json`	JSON	Run metadata + top-20 features
`outputs/reports/pipeline.log`	Log	Structured loguru output

References

Parkinson, M. (1980). The Extreme Value Method for Estimating the Variance of the Rate of Return. The Journal of Business, 53(1), 61–65.
Garman, M. B. & Klass, M. J. (1980). On the Estimation of Security Price Volatilities from Historical Data. The Journal of Business, 53(1), 67–78.
López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley.

License

MIT License — see LICENSE for details.

Built with precision. Validated with rigor. Documented for production.
