Ares-Infenus/alpha-feature-engineering-timeseries

Alpha Feature Engineering — Time Series

Python 3.11+ pandas-ta sklearn Tests License: MIT

Production-grade feature engineering pipeline for financial time series. Transforms raw OHLCV data into 77 stationary, ML-ready features across 6 signal families, with full data quality validation, anti-leakage guarantees, and automated feature selection.


Table of Contents

  1. Overview
  2. Architecture
  3. Stack
  4. Results
  5. Feature Catalog
  6. Key Design Decisions
  7. Installation
  8. Usage
  9. Testing
  10. Project Structure
  11. Outputs

Overview

This project solves one of the most critical and underestimated problems in quantitative machine learning: turning raw market data into predictive features that are stable, stationary, and leak-free.

Most practitioners make the same three mistakes:

| Common mistake | This project's solution |
|---|---|
| Raw prices as features | All price-based features normalized as ratios or distances |
| Random train_test_split | TimeSeriesSplit enforced in all model-based evaluations |
| Removing high-VIF features | VIF used as diagnostic, not auto-pruner; interaction features added |

The pipeline processes 4 assets (BTC-USD, SPY, AAPL, GC=F) across 6 years of daily bars, producing 77 stationary features per asset, saved as compressed Parquet files for downstream ML training.


Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                    FEATURE ENGINEERING PIPELINE                         │
│                                                                         │
│  ┌──────────────┐    ┌──────────────────┐    ┌────────────────────┐    │
│  │  DATA LAYER  │    │  FEATURE LAYER   │    │  SELECTION LAYER   │    │
│  │              │    │                  │    │                    │    │
│  │  yfinance    │───▶│  LagTransformer  │    │  Pearson/Spearman  │    │
│  │  4 symbols   │    │  TechIndicators  │───▶│  VIF Diagnostics   │    │
│  │  2019-2024   │    │  RollingWindows  │    │  RF Importance     │    │
│  │              │    │  Volatility      │    │  Mutual Info       │    │
│  │  Validator   │    │  Temporal        │    │  TimeSeriesSplit   │    │
│  │  (OHLC/gaps/ │    │  Interactions    │    │                    │    │
│  │  flatlines/  │    │                  │    │  Top-20 selector   │    │
│  │  zero-vol)   │    │  77 features     │    │  (avg_rank strat.) │    │
│  └──────────────┘    └──────────────────┘    └────────────────────┘    │
│         │                    │                         │               │
│         ▼                    ▼                         ▼               │
│   data/raw/*.parquet   data/features/          outputs/figures/        │
│                        *_features.parquet      outputs/reports/        │
└─────────────────────────────────────────────────────────────────────────┘

Transformer dependency graph

┌─────────────┐  ┌─────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────┐
│     Lag     │  │  Technical  │  │   Rolling    │  │  Volatility  │  │ Temporal │
│  (no deps)  │  │  (no deps)  │  │  (no deps)   │  │  (no deps)   │  │ (no deps)│
└──────┬──────┘  └──────┬──────┘  └──────┬───────┘  └──────┬───────┘  └────┬─────┘
       │                │                │                  │               │
       └────────────────┴────────────────┴──────────────────┴───────────────┘
                                         │
                                         ▼
                               ┌──────────────────┐
                               │   Interaction    │
                               │ (depends on all) │
                               └──────────────────┘

Stack

| Library | Version | Role |
|---|---|---|
| pandas | ≥ 2.1 | Core data manipulation |
| numpy | ≥ 1.26 | Numerical computation |
| yfinance | ≥ 0.2.40 | OHLCV data source |
| pandas-ta | ≥ 0.3.14b0 | Technical indicators (pure Python, no C compile) |
| scikit-learn | ≥ 1.4 | Pipeline interface + feature selection |
| scipy | ≥ 1.12 | Statistical tests, distributions |
| statsmodels | ≥ 0.14 | VIF computation |
| pyarrow | ≥ 15.0 | Parquet I/O (columnar, compressed) |
| loguru | ≥ 0.7.2 | Structured logging |
| pydantic | ≥ 2.6 | Configuration validation |
| matplotlib / seaborn | ≥ 3.8 / 0.13 | Publication-ready visualization |
| pytest | ≥ 8.0 | 42-test suite |

Why pandas-ta instead of TA-Lib? TA-Lib wraps a C library, and building its extension frequently fails on Windows and across Python versions. pandas-ta is pure Python, installs with a single pip install, and covers the indicators used in this pipeline.


Results

Pipeline execution (2026-03-01)

PIPELINE RUN SUMMARY
══════════════════════════════════════════════════════════════
 Symbol     Raw rows   Clean rows   Features   Date range
──────────────────────────────────────────────────────────────
 BTC-USD     2,191      2,153        77        2019-02-03 → 2024-12-25
 SPY         1,509      1,471        77        2019-02-20 → 2024-12-20
 AAPL        1,509      1,471        77        2019-02-20 → 2024-12-20
 GC=F        1,509      1,471        77        2019-02-20 → 2024-12-20
══════════════════════════════════════════════════════════════
 TOTAL       6,718      6,566        77        6 years of daily bars

Data quality reports

╔══ BTC-USD ══════════════════════════════════════
║  Rows:             2,191 →  2,153 (after cleaning)
║  Nulls remaining:      0
║  Time gaps:            0
║  Flatlines:            0 periods
║  Zero volume:          0
║  OHLC violations:      0
║  Outlier rows:         1   ← COVID flash crash, kept as-is
╚══ Status: ✓ CLEAN

╔══ SPY / AAPL ═══════════════════════════════════
║  Rows:             1,509 →  1,471 (after cleaning)
║  Time gaps:           55   ← US market holidays (expected)
║  All other checks:     0
╚══ Status: ✓ CLEAN

╔══ GC=F (Gold Futures) ══════════════════════════
║  Rows:             1,509 →  1,471 (after cleaning)
║  Zero volume:         12   ← Exchange downtime, forward-filled
║  Time gaps:           55   ← Market holidays (expected)
╚══ Status: ✓ CLEAN
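
The checks behind these reports can be sketched roughly as follows. This is a minimal illustration, not the project's validator.py: the function name `quality_report`, the report keys, and the 5-bar flatline threshold are all assumptions made for the example.

```python
import pandas as pd

def quality_report(df: pd.DataFrame, flatline_len: int = 5) -> dict:
    """Count basic OHLCV data-quality problems (illustrative only)."""
    # OHLC consistency: High must be the bar's maximum and Low its minimum.
    ohlc_bad = (
        (df["High"] < df[["Open", "Close", "Low"]].max(axis=1))
        | (df["Low"] > df[["Open", "Close", "High"]].min(axis=1))
    )
    # Flatline: Close unchanged across `flatline_len` consecutive bars.
    unchanged = df["Close"].diff().eq(0).astype(int)
    flat = unchanged.rolling(flatline_len - 1).sum().eq(flatline_len - 1)
    return {
        "nulls": int(df.isna().sum().sum()),
        "ohlc_violations": int(ohlc_bad.sum()),
        "zero_volume": int((df["Volume"] == 0).sum()),
        "flatline_bars": int(flat.sum()),
    }

# Toy data with one deliberate OHLC violation and one zero-volume bar.
idx = pd.bdate_range("2024-01-01", periods=6)
demo = pd.DataFrame(
    {
        "Open":   [10.0, 10.5, 11.0, 11.0, 11.2, 11.1],
        "High":   [10.6, 11.2, 11.3, 10.9, 11.5, 11.4],  # 4th bar: High < Open
        "Low":    [9.8, 10.4, 10.8, 10.7, 11.0, 11.0],
        "Close":  [10.5, 11.0, 11.1, 10.8, 11.3, 11.2],
        "Volume": [1000, 1200, 0, 900, 1100, 950],        # 3rd bar: zero volume
    },
    index=idx,
)
report = quality_report(demo)
print(report)
```

Time-gap detection is left out because it depends on the asset's trading calendar (BTC trades every day, equities do not).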

Feature count breakdown

Family          Count   Description
────────────────────────────────────────────────────────────
Lag               10    Return lags + volume ratios + momentum
Technical         20    RSI, MACD, Bollinger, ATR, CCI, Stoch, OBV, ADX
Rolling           23    dist_to_sma, z-score, range position (all windows)
Volatility        12    Close-to-Close + Parkinson + Garman-Klass
Temporal           9    Sin/cos cyclic + calendar binary flags
Interaction        6    Cross-domain composite signals
────────────────────────────────────────────────────────────
TOTAL             77    + 4 target columns + 5 OHLCV = 87 total columns

Output files

data/features/
├── BTC-USD_features.parquet    1,534 KB   (snappy compressed)
├── SPY_features.parquet        1,065 KB
├── AAPL_features.parquet       1,066 KB
└── GC_F_features.parquet       1,042 KB

Test suite

pytest tests/ -v
═══════════════════════════════════════════════════════════════
 42 tests  |  0 failed  |  0 errors  |  3.57s
═══════════════════════════════════════════════════════════════

tests/test_features.py
  TestLagFeatureTransformer           7 / 7  PASSED
  TestRollingWindowTransformer        6 / 6  PASSED
  TestTechnicalIndicatorTransformer   6 / 6  PASSED
  TestTemporalFeatureTransformer      4 / 4  PASSED
  TestVolatilityTransformer           6 / 6  PASSED
  TestInteractionFeatureTransformer   3 / 3  PASSED

tests/test_pipeline.py
  TestTimeSeriesValidator             5 / 5  PASSED
  TestPipelineFeatures                4 / 4  PASSED
  TestNoDataLeakage                   1 / 1  PASSED  ← Critical

Feature Catalog

Lag Features (10)

| Feature | Formula | Stationarity |
|---|---|---|
| return_lag_Nd | Close_t / Close_{t-N} - 1 | ✓ % change |
| volume_ratio_lag_Nd | Vol_{t-N} / SMA20(Vol)_{t-N} - 1 | ✓ relative ratio |
| momentum_1_21 | return_lag_1d - return_lag_21d | ✓ difference of returns |
| momentum_regime | dist_to_sma_5 - dist_to_sma_50 | ✓ difference of ratios |

N ∈ {1, 2, 3, 5, 10, 21}
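
A minimal pandas sketch of the two lag formulas above. The function name `lag_features` is hypothetical, and applying the same period list to both returns and volume ratios is an assumption; the README does not say which periods the volume ratios use.

```python
import pandas as pd

def lag_features(df, periods=(1, 2, 3, 5, 10, 21), vol_window=20):
    """Return lags and lagged volume ratios (illustrative sketch)."""
    out = pd.DataFrame(index=df.index)
    # Volume relative to its own rolling mean, computed once and then lagged.
    vol_ratio = df["Volume"] / df["Volume"].rolling(vol_window).mean() - 1
    for n in periods:
        # Close_t / Close_{t-N} - 1: uses only current and past bars.
        out[f"return_lag_{n}d"] = df["Close"] / df["Close"].shift(n) - 1
        # Vol_{t-N} / SMA(Vol)_{t-N} - 1
        out[f"volume_ratio_lag_{n}d"] = vol_ratio.shift(n)
    return out

demo = pd.DataFrame({
    "Close":  [100.0, 110.0, 121.0, 121.0],
    "Volume": [10.0, 10.0, 30.0, 10.0],
})
# Tiny window and periods so the toy frame produces non-NaN values.
lags = lag_features(demo, periods=(1, 2), vol_window=2)
print(lags)
```

Every column relies on `shift(n)` with positive `n`, so a row never sees data from later bars.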

Technical Indicators (20)

| Feature | Range | Notes |
|---|---|---|
| rsi_14, rsi_28 | [0, 100] | Bounded by construction |
| rsi_{N}_centered | [-50, +50] | RSI shifted to center at 0 |
| macd_normalized | ~(-0.05, 0.05) | MACD line / Close (scale-free) |
| macd_signal_normalized | ~(-0.05, 0.05) | Signal line / Close |
| macd_hist_normalized | ~(-0.02, 0.02) | Histogram / Close |
| macd_hist_sign | {-1, 0, +1} | Direction of momentum |
| bb_pct_b | [~0, ~1] | Bollinger % position |
| bb_width_normalized | (0, ∞) | (Upper − Lower) / Mid |
| atr_pct | ~(0, 0.10] | ATR / Close (volatility as a fraction of price) |
| cci_20 | ~[-200, +200] | Commodity Channel Index |
| stoch_k, stoch_d | [0, 100] | Stochastic oscillators |
| stoch_kd_diff | [-100, +100] | K − D crossover signal |
| obv_normalized | ~(-3, +3) | OBV z-score (20-bar window) |
| williams_r | [-100, 0] | Williams %R |
| adx_14 | [0, 100] | Trend strength |
| adx_directional_ratio | [-1, +1] | (DM+ − DM−) / (DM+ + DM−) |
| rsi_trend_divergence | unbounded | (RSI − 50) × dist_to_sma_20 |
| bb_rsi_composite | [~0, ~1] | bb_pct_b × (RSI / 100) |
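
To illustrate why dividing by Close makes the MACD family scale-free, here is a sketch in plain pandas. It is not the pandas-ta implementation: the 12/26/9 spans and the `adjust=False` EMA seeding are assumptions, and `macd_normalized` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def macd_normalized(close, fast=12, slow=26, signal=9):
    """MACD line, signal and histogram divided by Close (illustrative sketch)."""
    ema_fast = close.ewm(span=fast, adjust=False).mean()
    ema_slow = close.ewm(span=slow, adjust=False).mean()
    macd = ema_fast - ema_slow
    sig = macd.ewm(span=signal, adjust=False).mean()
    hist = macd - sig
    return pd.DataFrame({
        "macd_normalized": macd / close,
        "macd_signal_normalized": sig / close,
        "macd_hist_normalized": hist / close,
        "macd_hist_sign": np.sign(hist),
    })

# Synthetic price path, then the same path at 4x the price level.
rng = np.random.default_rng(0)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 300))))
a = macd_normalized(close)
b = macd_normalized(close * 4)
print(np.allclose(a.to_numpy(), b.to_numpy()))
```

The raw MACD line would be four times larger on the rescaled series; the normalized columns are unchanged.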

Rolling Window Features (23)

Per window W ∈ {5, 10, 20, 50}:

| Feature | Formula | Why not raw? |
|---|---|---|
| dist_to_sma_W | (Close / SMA_W) − 1 | Raw SMA grows with price (non-stationary) |
| rolling_std_returns_W | std(log_ret, W) | Returns already stationary |
| zscore_W | (Close − SMA_W) / std(Close, W) | Standardized price position |
| dist_to_high_W | (Close / max(Close, W)) − 1 | Always ≤ 0 |
| dist_to_low_W | (Close / min(Close, W)) − 1 | Always ≥ 0 |

Plus cross-window:

  • rolling_skew_20, rolling_kurt_20: return distribution shape
  • trend_acceleration: dist_to_sma_5 − dist_to_sma_20
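
The per-window formulas translate almost directly into pandas. The sketch below is an illustration under the table's definitions; `rolling_features` is a hypothetical name, not the project's transformer.

```python
import pandas as pd

def rolling_features(close, window):
    """Price position relative to its own trailing window (illustrative sketch)."""
    sma = close.rolling(window).mean()
    return pd.DataFrame({
        f"dist_to_sma_{window}": close / sma - 1,
        f"zscore_{window}": (close - sma) / close.rolling(window).std(),
        # The trailing max includes the current bar, so this is never positive.
        f"dist_to_high_{window}": close / close.rolling(window).max() - 1,
        # The trailing min includes the current bar, so this is never negative.
        f"dist_to_low_{window}": close / close.rolling(window).min() - 1,
    })

close = pd.Series([100.0, 102.0, 101.0, 105.0, 103.0, 108.0])
feats = rolling_features(close, window=5)
print(feats.round(4))
```

The first `window - 1` rows are NaN because the trailing window is not yet full; these are the warm-up rows a pipeline would drop during cleaning.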

Volatility Features (12)

Three estimators at windows W ∈ {5, 10, 20}, all annualized:

| Estimator | Formula | Efficiency vs C-to-C |
|---|---|---|
| hist_vol_Wd | std(log_ret, W) × √252 | 1× (baseline) |
| parkinson_vol_Wd | √( mean_W[(ln H/L)²] / (4 ln 2) ) × √252 | ~5× more efficient |
| garman_klass_vol_Wd | √( mean_W[0.5 (ln H/L)² − (2 ln 2 − 1)(ln C/O)²] ) × √252 | ~8× more efficient |

Plus:

  • high_low_pct: (High − Low) / Close
  • vol_of_vol: std of hist_vol_20d over 20 bars
  • vol_regime: hist_vol_20d / SMA_90(hist_vol_20d)
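
A sketch of the three estimators in pandas/numpy, assuming the per-bar terms are averaged over the trailing window W before taking the square root. The function name `volatility_features` and the synthetic OHLC data are illustrative only.

```python
import numpy as np
import pandas as pd

ANNUALIZE = np.sqrt(252)

def volatility_features(df, window):
    """Close-to-close, Parkinson and Garman-Klass volatility (illustrative sketch)."""
    log_ret = np.log(df["Close"] / df["Close"].shift(1))
    hl2 = np.log(df["High"] / df["Low"]) ** 2     # squared log range
    co2 = np.log(df["Close"] / df["Open"]) ** 2   # squared open-to-close move
    park_var = hl2.rolling(window).mean() / (4 * np.log(2))
    gk_var = (0.5 * hl2 - (2 * np.log(2) - 1) * co2).rolling(window).mean()
    return pd.DataFrame({
        f"hist_vol_{window}d": log_ret.rolling(window).std() * ANNUALIZE,
        f"parkinson_vol_{window}d": np.sqrt(park_var) * ANNUALIZE,
        # clip guards against tiny negative values from rounding
        f"garman_klass_vol_{window}d": np.sqrt(gk_var.clip(lower=0)) * ANNUALIZE,
    })

# Synthetic bars: each Open equals the previous Close, High/Low bracket both.
rng = np.random.default_rng(1)
close = 100 * np.exp(np.cumsum(rng.normal(0, 0.01, 60)))
open_ = np.r_[close[0], close[:-1]]
high = np.maximum(open_, close) * (1 + rng.uniform(0, 0.005, 60))
low = np.minimum(open_, close) * (1 - rng.uniform(0, 0.005, 60))
bars = pd.DataFrame({"Open": open_, "High": high, "Low": low, "Close": close})
vol = volatility_features(bars, window=20)
print(vol.dropna().tail(3))
```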

Temporal Features (9)

| Feature | Encoding | Period | Why not integer? |
|---|---|---|---|
| dow_sin, dow_cos | sin/cos | 5 (biz days) | Mon(0)↔Fri(4) must be neighbors |
| month_sin, month_cos | sin/cos | 12 | Dec↔Jan must be neighbors |
| quarter_sin, quarter_cos | sin/cos | 4 | Q4↔Q1 must be neighbors |
| is_month_start | binary | n/a | Turn-of-month liquidity effect |
| is_month_end | binary | n/a | Window-dressing effect |
| is_quarter_end | binary | n/a | Institutional rebalancing |
Interaction Features (6)

"Instead of removing correlated features, add interaction features. The divergence between two correlated signals is often the most predictive part."

| Feature | Formula | Economic intuition |
|---|---|---|
| trend_over_vol | dist_to_sma_20 / hist_vol_20d | Strong trend in calm market = most reliable |
| rsi_trend_divergence | (RSI − 50) × dist_to_sma_20 | Overbought + negative trend → reversal |
| momentum_regime | dist_to_sma_5 − dist_to_sma_50 | Bull/bear regime indicator |
| vol_adj_return_1d | return_lag_1d / hist_vol_20d | Bar-level Sharpe ratio |
| bb_rsi_composite | bb_pct_b × (RSI / 100) | Dual-confirmation extreme signal |
| trend_volume_confirm | dist_to_sma_20 × volume_ratio_lag_1d | Volume-confirmed trend |
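
Because the interaction family depends on the outputs of the other five, a transformer for it has to check its prerequisites first. The sketch below shows that pattern with the six formulas from the table. The helper name `interaction_features` is hypothetical, and using `rsi_14` for "RSI" is an assumption, since the table does not name the RSI length.

```python
import pandas as pd

REQUIRED = [
    "dist_to_sma_5", "dist_to_sma_20", "dist_to_sma_50", "hist_vol_20d",
    "rsi_14", "bb_pct_b", "return_lag_1d", "volume_ratio_lag_1d",
]

def interaction_features(f):
    """Combine already-computed base features (illustrative sketch)."""
    missing = [c for c in REQUIRED if c not in f.columns]
    if missing:
        raise KeyError(f"missing prerequisite features: {missing}")
    return pd.DataFrame({
        "trend_over_vol": f["dist_to_sma_20"] / f["hist_vol_20d"],
        "rsi_trend_divergence": (f["rsi_14"] - 50) * f["dist_to_sma_20"],
        "momentum_regime": f["dist_to_sma_5"] - f["dist_to_sma_50"],
        "vol_adj_return_1d": f["return_lag_1d"] / f["hist_vol_20d"],
        "bb_rsi_composite": f["bb_pct_b"] * (f["rsi_14"] / 100),
        "trend_volume_confirm": f["dist_to_sma_20"] * f["volume_ratio_lag_1d"],
    }, index=f.index)

base = pd.DataFrame({
    "dist_to_sma_5": [0.01, -0.02], "dist_to_sma_20": [0.02, -0.01],
    "dist_to_sma_50": [0.05, 0.0], "hist_vol_20d": [0.2, 0.25],
    "rsi_14": [70.0, 40.0], "bb_pct_b": [0.9, 0.3],
    "return_lag_1d": [0.01, -0.005], "volume_ratio_lag_1d": [0.5, -0.2],
})
inter = interaction_features(base)
print(inter.round(4))
```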

Target Variables (4)

| Column | Type | Formula |
|---|---|---|
| forward_return_1d | Continuous | log(Close_{t+1} / Close_t) |
| forward_return_5d | Continuous | log(Close_{t+5} / Close_t) |
| direction_1d | Binary {0,1} | int(forward_return_1d > 0) |
| direction_5d | Binary {0,1} | int(forward_return_5d > 0) |

Data leakage prevention: targets are computed via shift(-n) and added after all features. They are explicitly excluded from feature_cols by _get_feature_cols(). The anti-leakage integration test validates this.
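
The target construction and the feature/target separation can be sketched as follows. `add_targets` and `get_feature_cols` are hypothetical stand-ins for the project's internals (including `_get_feature_cols()`), written from the description above rather than from the source code.

```python
import numpy as np
import pandas as pd

OHLCV = {"Open", "High", "Low", "Close", "Volume"}
TARGET_PREFIXES = ("forward_return_", "direction_")

def add_targets(df, forward_periods=(1, 5)):
    """Append forward log returns and direction labels (illustrative sketch)."""
    out = df.copy()
    for n in forward_periods:
        # shift(-n) pulls the FUTURE close into row t: valid as a label only.
        fwd = np.log(out["Close"].shift(-n) / out["Close"])
        out[f"forward_return_{n}d"] = fwd
        out[f"direction_{n}d"] = (fwd > 0).astype(int)
    # The last max(forward_periods) rows have no future bar yet, so drop them.
    return out.dropna(subset=[f"forward_return_{n}d" for n in forward_periods])

def get_feature_cols(df):
    """Everything that is neither raw OHLCV nor a target column."""
    return [c for c in df.columns
            if c not in OHLCV and not c.startswith(TARGET_PREFIXES)]

frame = pd.DataFrame({"Close": [100.0, 101.0, 99.0, 102.0, 103.0, 104.0, 105.0, 103.0]})
frame["return_lag_1d"] = frame["Close"] / frame["Close"].shift(1) - 1
labeled = add_targets(frame)
print(get_feature_cols(labeled))
print(labeled[["direction_1d", "direction_5d"]])
```

Dropping the trailing rows matters: without it, `direction_Nd` would silently be 0 on rows whose forward return is still unknown.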


Key Design Decisions

1. Stationarity over simplicity

# ❌ WRONG — non-stationary, causes distribution shift
rolling_mean_20 = Close.rolling(20).mean()

# ✓ CORRECT — stationary, comparable across all time
dist_to_sma_20 = (Close / rolling_mean_20) - 1

BTC traded at ~$10k in 2020 and ~$60k in 2024. A model trained on rolling means from 2020 would be useless when evaluated on 2024 data. All features in this pipeline are expressed as ratios, percentages, or z-scores.
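
A quick way to see the difference, as a sketch on synthetic data: take one return path, price it once around $10k and once at six times that level, and compare the raw rolling mean with the ratio feature. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

def dist_to_sma(close, window=20):
    return close / close.rolling(window).mean() - 1

# One return path, priced at two different levels.
rng = np.random.default_rng(42)
returns = rng.normal(0, 0.02, 500)
low_regime = pd.Series(10_000 * np.exp(np.cumsum(returns)))
high_regime = low_regime * 6

# The raw rolling mean differs by thousands of dollars between regimes...
raw_gap = (high_regime.rolling(20).mean() - low_regime.rolling(20).mean()).abs().max()
# ...while the ratio feature agrees up to floating-point rounding.
ratio_gap = (dist_to_sma(high_regime) - dist_to_sma(low_regime)).abs().max()
print(f"raw SMA gap: {raw_gap:,.0f}   dist_to_sma gap: {ratio_gap:.2e}")
```

A model sees the same `dist_to_sma_20` values in both regimes, which is the property the pipeline relies on.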

2. TimeSeriesSplit — never random shuffle

Random split (WRONG for time series):
  Train: [t5, t10, t1, t8, ...]  → future leaks into training

TimeSeriesSplit (CORRECT):
  Fold 1: Train=t1..t200,  Val=t201..t250
  Fold 2: Train=t1..t250,  Val=t251..t300
  Fold 3: Train=t1..t300,  Val=t301..t350

Used in: Random Forest importance, Mutual Information cross-validation.
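
The fold layout above can be reproduced with scikit-learn's `TimeSeriesSplit`. The sample count of 350 and `test_size=50` are chosen only to match the diagram; the project's actual split parameters are not stated here.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(350).reshape(-1, 1)  # 350 time-ordered samples
tscv = TimeSeriesSplit(n_splits=3, test_size=50)

folds = []
for train_idx, val_idx in tscv.split(X):
    folds.append((int(train_idx[0]), int(train_idx[-1]),
                  int(val_idx[0]), int(val_idx[-1])))

for i, (t0, t1, v0, v1) in enumerate(folds, start=1):
    # +1 converts 0-based indices to the t1..tN labels used in the diagram
    print(f"Fold {i}: Train=t{t0 + 1}..t{t1 + 1}, Val=t{v0 + 1}..t{v1 + 1}")
```

Every validation index is strictly greater than every training index in the same fold, so the model is never fit on bars that come after the ones it is scored on.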

3. VIF as a diagnostic, not a scalpel

High VIF between EMA_20 and SMA_20 (~0.99 correlation) does not mean one should be removed. Their 1% divergence — the spread between exponential and simple smoothing — often carries the most significant signal. This pipeline uses VIF to flag areas for investigation and guide the construction of new interaction features.
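
As an illustration of the idea, the sketch below computes VIF from the diagonal of the inverse correlation matrix (the textbook definition, VIF_j = 1 / (1 − R²_j)) rather than through statsmodels, and then builds the spread as a candidate feature. The data is synthetic and the helper name `vif` is hypothetical.

```python
import numpy as np
import pandas as pd

def vif(frame):
    """VIF per column via the inverse correlation matrix (illustrative sketch)."""
    corr = np.corrcoef(frame.to_numpy(), rowvar=False)
    return pd.Series(np.diag(np.linalg.inv(corr)), index=frame.columns)

rng = np.random.default_rng(7)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 500))))
sma = close.rolling(20).mean()
ema = close.ewm(span=20, adjust=False).mean()

levels = pd.DataFrame({"sma_20": sma, "ema_20": ema}).dropna()
vifs = vif(levels)

# Instead of dropping one of the pair, keep their relative spread as a feature.
spread = (ema / sma - 1).dropna()
print(vifs.round(1))
print(spread.describe().round(4))
```

The two level series are nearly collinear, yet the spread is a small, ratio-valued series of its own, which is the kind of signal the diagnostic is meant to surface.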

4. Three volatility estimators

| Estimator | Uses | Best when |
|---|---|---|
| Close-to-Close | Daily returns | Standard benchmark |
| Parkinson (1980) | High-Low range | Drift ≈ 0, gaps rare |
| Garman-Klass (1980) | OHLC prices | General case (~8× more efficient than close-to-close) |

Using all three provides redundancy and lets the model learn which estimator is more reliable in each market regime.

5. Cyclic temporal encoding

import numpy as np

# ❌ Integer encoding breaks the cycle
day_of_week = np.arange(5)  # Monday=0 ... Friday=4
# Monday(0) and Friday(4) appear distant, but they're adjacent in market time

# ✓ Sin/cos preserves the cycle
dow_sin = np.sin(2 * np.pi * day_of_week / 5)
dow_cos = np.cos(2 * np.pi * day_of_week / 5)
# Monday and Friday are now exactly as far apart as any two consecutive trading days
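
To check the adjacency claim numerically, here is a minimal sketch using only the standard library. The helper names `encode` and `dist` are illustrative, not part of the project.

```python
import math

def encode(day, period=5):
    """Map a 0-based weekday onto the unit circle."""
    angle = 2 * math.pi * day / period
    return (math.sin(angle), math.cos(angle))

def dist(a, b):
    """Euclidean distance between two encoded weekdays."""
    return math.dist(encode(a), encode(b))

mon_tue = dist(0, 1)  # consecutive days
mon_fri = dist(0, 4)  # wraps around the weekend
mon_wed = dist(0, 2)  # two days apart

print(f"Mon-Tue {mon_tue:.4f} | Mon-Fri {mon_fri:.4f} | Mon-Wed {mon_wed:.4f}")
```

With integer encoding the Monday to Friday gap would be 4; on the circle it equals the Monday to Tuesday gap, and only days two steps apart are farther.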

Installation

# 1. Clone the repository
git clone https://github.com/your-user/alpha-feature-engineering-timeseries.git
cd alpha-feature-engineering-timeseries

# 2. Create and activate virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate      # Linux / macOS
.venv\Scripts\activate         # Windows

# 3. Install dependencies
pip install -r requirements.txt

Python version: 3.11+ recommended. No C compilation required: pandas-ta is pure Python.


Usage

Run the full pipeline

# Default: use cache if available, run feature selection
python run_pipeline.py

# Re-download all data (ignore cache)
python run_pipeline.py --force-download

# Skip feature selection (faster, for debugging)
python run_pipeline.py --no-selection

# Use a custom config file
python run_pipeline.py --config config/my_config.yaml

Use individual components in code

from src.pipeline import FeatureEngineeringPipeline

# From config file
pipeline = FeatureEngineeringPipeline.from_config("config/config.yaml")
results  = pipeline.run()

# Access a specific asset's feature DataFrame
btc_features = results["BTC-USD"]
print(btc_features.shape)  # (2153, 87)

Use a single transformer

import pandas as pd
from src.features.technical import TechnicalIndicatorTransformer

df = pd.read_parquet("data/raw/BTC-USD.parquet")
transformer = TechnicalIndicatorTransformer(rsi_lengths=[14, 28])
df_with_features = transformer.fit_transform(df)
print(df_with_features[["rsi_14", "atr_pct", "bb_pct_b"]].tail())

Load the feature dataset directly

import pandas as pd

btc = pd.read_parquet("data/features/BTC-USD_features.parquet")

# Separate features from targets
OHLCV   = {"Open", "High", "Low", "Close", "Volume"}
TARGETS = {c for c in btc.columns if "forward_return" in c or "direction" in c}
X = btc[[c for c in btc.columns if c not in OHLCV | TARGETS]]
y = btc["forward_return_1d"]

Configuration

All parameters are centralized in config/config.yaml:

data:
  symbols: ["BTC-USD", "SPY", "AAPL", "GC=F"]
  start_date: "2019-01-01"
  end_date:   "2024-12-31"
  interval:   "1d"

features:
  lag:
    periods: [1, 2, 3, 5, 10, 21]
  rolling:
    windows: [5, 10, 20, 50]
  volatility:
    windows: [5, 10, 20]

target:
  forward_periods: [1, 5]

Testing

# Run full test suite
pytest tests/ -v

# Run with coverage report
pytest tests/ --cov=src --cov-report=term-missing

# Run a single test class
pytest tests/test_pipeline.py::TestNoDataLeakage -v

Test categories

| Class | Tests | What is validated |
|---|---|---|
| TestLagFeatureTransformer | 7 | Columns, stationarity, look-ahead bias |
| TestRollingWindowTransformer | 6 | Ratio encoding, NaN position, bounds |
| TestTechnicalIndicatorTransformer | 6 | RSI bounds, ATR scale, MACD normalization |
| TestTemporalFeatureTransformer | 4 | Sin/cos range, zero NaNs, binary flags |
| TestVolatilityTransformer | 6 | Non-negative vol, regime mean, Parkinson |
| TestInteractionFeatureTransformer | 3 | Prerequisites, output presence |
| TestTimeSeriesValidator | 5 | Zero-vol, flatlines, forward-fill |
| TestPipelineFeatures | 4 | Target isolation, 50+ features, OHLCV preserved |
| TestNoDataLeakage | 1 | Critical: features identical after truncation |

The anti-leakage test

The most important test in the project:

def test_features_unchanged_after_truncation(self):
    """
    Remove the last 20 rows of the dataset.
    Feature values for rows t1..t_{n-20} must be IDENTICAL.
    Any feature that uses future data would fail this test.
    """
    df_full  = pipeline._build_features(df)              # n rows
    df_trunc = pipeline._build_features(df.iloc[:-20])   # n-20 rows

    pd.testing.assert_frame_equal(
        df_full[common_cols].iloc[60:n-20],
        df_trunc[common_cols].iloc[60:n-20],
        rtol=1e-5,
    )
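
The same idea can be tried in isolation. The sketch below applies a truncation check to one causal feature and one deliberately leaky feature (a centered rolling mean that looks two bars ahead). All names are illustrative and independent of the project's test fixtures.

```python
import numpy as np
import pandas as pd

def causal_feature(close):
    # Trailing window: uses bars t-4..t only.
    return close / close.rolling(5).mean() - 1

def leaky_feature(close):
    # Centered window: uses bars t-2..t+2, i.e. two bars from the future.
    return close / close.rolling(5, center=True).mean() - 1

def is_truncation_invariant(feature_fn, close, cut=20):
    """True if dropping the last `cut` bars leaves earlier feature values unchanged."""
    full = feature_fn(close).iloc[:-cut]
    trunc = feature_fn(close.iloc[:-cut])
    # equal_nan=True so matching warm-up NaNs count as equal, while
    # NaN-vs-number mismatches (a symptom of look-ahead) count as a failure.
    return bool(np.allclose(full, trunc, rtol=1e-5, equal_nan=True))

rng = np.random.default_rng(3)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 200))))

causal_ok = is_truncation_invariant(causal_feature, close)
leaky_ok = is_truncation_invariant(leaky_feature, close)
print(f"causal feature passes: {causal_ok}, leaky feature passes: {leaky_ok}")
```

The leaky feature fails because its last two rows before the cut need bars that the truncated series no longer contains.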

Project Structure

alpha-feature-engineering-timeseries/
│
├── config/
│   └── config.yaml                  ← All pipeline parameters
│
├── data/
│   ├── raw/                         ← Raw OHLCV Parquet (cache)
│   ├── processed/                   ← Validated OHLCV (intermediate)
│   └── features/                    ← Final feature datasets (output)
│       ├── BTC-USD_features.parquet
│       ├── SPY_features.parquet
│       ├── AAPL_features.parquet
│       └── GC_F_features.parquet
│
├── src/
│   ├── data/
│   │   ├── loader.py                ← yfinance download + retry + cache
│   │   └── validator.py             ← OHLC/flatline/gap/volume checks
│   │
│   ├── features/
│   │   ├── base.py                  ← BaseFeatureTransformer (abstract)
│   │   ├── lag.py                   ← Return lags, volume ratios
│   │   ├── technical.py             ← RSI, MACD, BB, ATR, CCI, Stoch, ADX
│   │   ├── rolling.py               ← dist_to_sma, z-score, range position
│   │   ├── volatility.py            ← C-to-C, Parkinson, Garman-Klass
│   │   ├── temporal.py              ← Sin/cos cyclic calendar encoding
│   │   └── interaction.py           ← Cross-domain composite signals
│   │
│   ├── selection/
│   │   ├── correlation.py           ← Pearson/Spearman, VIF, heatmap
│   │   └── importance.py            ← RF + MI with TimeSeriesSplit
│   │
│   └── pipeline.py                  ← End-to-end orchestrator
│
├── notebooks/
│   ├── 01_data_exploration.ipynb    ← ADF test, distributions, validation
│   ├── 02_feature_engineering.ipynb ← Full pipeline walkthrough
│   └── 03_feature_analysis.ipynb    ← Heatmap, VIF, RF/MI importance
│
├── tests/
│   ├── conftest.py                  ← Shared make_ohlcv() fixture
│   ├── test_features.py             ← 32 unit tests, one class per transformer family
│   └── test_pipeline.py             ← 10 integration + anti-leakage tests
│
├── outputs/
│   ├── figures/                     ← correlation_heatmap.png, feature_importance.png
│   └── reports/                     ← target_correlations.csv, rf_importance.csv
│
├── run_pipeline.py                  ← CLI entry point
├── pytest.ini                       ← Test configuration
└── requirements.txt

Outputs

After a full run (python run_pipeline.py), the following files are generated:

| Path | Format | Contents |
|---|---|---|
| data/features/*_features.parquet | Parquet | 77 features + 4 targets + 5 OHLCV |
| outputs/figures/correlation_heatmap.png | PNG | Top-30 feature correlation matrix |
| outputs/figures/feature_importance.png | PNG | RF + MI side-by-side bar charts |
| outputs/figures/03_spearman_correlations.png | PNG | Spearman ρ rankings |
| outputs/reports/target_correlations.csv | CSV | Pearson + Spearman with p-values |
| outputs/reports/rf_importance_regression.csv | CSV | RF importance (mean ± std, 5 folds) |
| outputs/reports/rf_importance_classification.csv | CSV | RF importance for direction target |
| outputs/reports/mutual_information.csv | CSV | MI scores for all features |
| outputs/reports/high_correlation_pairs.csv | CSV | Feature pairs with \|corr\| ≥ 0.95 |
| outputs/reports/{symbol}_summary.json | JSON | Run metadata + top-20 features |
| outputs/reports/pipeline.log | Log | Structured loguru output |

References

  • Parkinson, M. (1980). The Extreme Value Method for Estimating the Variance of the Rate of Return. The Journal of Business.
  • Garman, M. & Klass, M. (1980). On the Estimation of Security Price Volatilities. The Journal of Business.
  • López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley.

License

MIT License — see LICENSE for details.


Built with precision. Validated with rigor. Documented for production.
