The Feature Engineering System is a flexible, plugin-based tool designed for generating and selecting features from time-series data. This system allows for the integration of various feature engineering techniques via plugins, starting with the generation of technical indicators and supporting future methods such as Singular Spectrum Analysis (SSA) and Fast Fourier Transform (FFT).
- Plugin-Based Architecture: The system uses a modular plugin architecture, allowing users to add, configure, and switch between different feature generation methods. The initial implementation includes a technical indicator generator, with support for additional plugins like SSA, FFT, and others in the future.
- Configurable Parameters: The system allows for dynamic configuration of input parameters such as input/output file paths, method-specific parameters, and other options via a command-line interface (CLI).
- Correlation and Distribution Analysis: Users can automatically compute and visualize Pearson and Spearman correlation matrices to identify relationships between features. The system also supports the visualization of feature distributions to help in manual feature selection.
- Manual Feature Selection: Users can manually select which features (e.g., technical indicators, SSA components) to include in the final output dataset, based on the results of the correlation and distribution analysis.
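The plugin idea described in the list above can be illustrated with a minimal sketch. The names `FeaturePlugin`, `register`, `PLUGINS`, `run_plugin`, and `LaggedDifference` are hypothetical and are not taken from the feature-eng codebase; the real loader may work differently (for example, through package entry points).

```python
from typing import Callable, Dict, List, Type

# Hypothetical registry mapping a plugin name to its class.
PLUGINS: Dict[str, Type["FeaturePlugin"]] = {}


class FeaturePlugin:
    """Hypothetical base class: turns one series into named feature columns."""

    def generate(self, series: List[float]) -> Dict[str, List[float]]:
        raise NotImplementedError


def register(name: str) -> Callable[[Type[FeaturePlugin]], Type[FeaturePlugin]]:
    """Class decorator that makes a plugin selectable by name."""
    def decorator(cls: Type[FeaturePlugin]) -> Type[FeaturePlugin]:
        PLUGINS[name] = cls
        return cls
    return decorator


@register("lagged_difference")
class LaggedDifference(FeaturePlugin):
    """Toy plugin used only for illustration: first difference of the series."""

    def generate(self, series: List[float]) -> Dict[str, List[float]]:
        diffs = [b - a for a, b in zip(series, series[1:])]
        return {"diff_1": diffs}


def run_plugin(name: str, series: List[float]) -> Dict[str, List[float]]:
    """Look up a plugin by name and run it, failing clearly on unknown names."""
    if name not in PLUGINS:
        raise ValueError(f"Unknown plugin {name!r}; available: {sorted(PLUGINS)}")
    return PLUGINS[name]().generate(series)


if __name__ == "__main__":
    print(run_plugin("lagged_difference", [1.0, 1.5, 1.25]))
```

With a registry like this, switching feature generation methods only requires changing the name passed to `run_plugin`, which is the role the plugin option plays on the command line.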
This tool is designed for data scientists, quantitative analysts, and machine learning practitioners working with time-series data in applications like financial modeling, trading strategies, and predictive analytics.
To install and set up the feature-eng application, follow these steps:
- Clone the Repository:
    git clone https://github.com/harveybc/feature-eng.git
    cd feature-eng
- Create and Activate a Virtual Environment:
  - Using conda:
      conda create --name feature-eng-env python=3.9
      conda activate feature-eng-env
- Install Dependencies:
    pip install --upgrade pip
    pip install -r requirements.txt
- Build the Package:
    python -m build
- Install the Package:
    pip install .
- (Optional) Run the feature-eng CLI:
  - On Windows, verify installation:
      feature-eng.bat --help
  - On Linux:
      sh feature-eng.sh --help
- (Optional) Run Tests:
  - On Windows:
      set_env.bat
      pytest
  - On Linux:
      sh ./set_env.sh
      pytest
The application provides a command-line interface to control its behavior and manage feature generation through plugins.
- input_file (str): Path to the input CSV file.
- output_file (str, optional): Path to the output CSV file. If not specified, the system will not generate an output file.
- plugin (str, default='technical_indicator'): Name of the plugin to use for feature generation. The default plugin generates technical indicators, but additional plugins such as SSA and FFT can be used.
- correlation_analysis (flag): Compute and display Pearson and Spearman correlation matrices.
- distribution_plot (flag): Plot the distributions of the generated features.
- quiet_mode (flag): Suppress output messages to reduce verbosity.
- save_log (str): Path to save the current debug log.
- username (str): Username for the remote API endpoint.
- password (str): Password for the remote API endpoint.
- remote_save_config (str): URL of a remote API endpoint for saving the configuration in JSON format.
- remote_load_config (str): URL of a remote JSON configuration file to download and execute.
- remote_log (str): URL of a remote API endpoint for saving debug variables in JSON format.
- load_config (str): Path to load a configuration file.
- save_config (str): Path to save the current configuration.
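As a rough illustration of how a subset of these parameters could be declared, here is a minimal `argparse` sketch. It is not the project's actual `cli.py`: the function name `build_parser` is hypothetical, and treating `input_file` as a positional argument is an assumption (the usage examples in this document show it both as a positional argument and as `--input_file`).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering a subset of the documented parameters."""
    parser = argparse.ArgumentParser(prog="feature-eng")
    # Assumption: input_file is positional; the real CLI may also accept --input_file.
    parser.add_argument("input_file", help="Path to the input CSV file.")
    parser.add_argument("--output_file", default=None,
                        help="Path to the output CSV file (no file is written if omitted).")
    parser.add_argument("--plugin", default="technical_indicator",
                        help="Name of the feature generation plugin.")
    parser.add_argument("--correlation_analysis", action="store_true",
                        help="Compute Pearson and Spearman correlation matrices.")
    parser.add_argument("--distribution_plot", action="store_true",
                        help="Plot the distributions of the generated features.")
    parser.add_argument("--quiet_mode", action="store_true",
                        help="Suppress output messages.")
    parser.add_argument("--load_config", default=None,
                        help="Path to load a configuration file.")
    parser.add_argument("--save_config", default=None,
                        help="Path to save the current configuration.")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["data.csv", "--plugin", "ssa"])
    print(args.plugin, args.correlation_analysis)
```

Flags declared with `action="store_true"` default to `False` and become `True` only when present on the command line, which matches the "(flag)" entries above.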
To generate technical indicators using the default plugin:
f-eng.bat tests/data/eurusd_hour_2005_2020_ohlc.csv
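For readers unfamiliar with technical indicators, the sketch below computes a simple moving average (SMA) over closing prices. It only illustrates the kind of feature such a plugin produces: which indicators the `technical_indicator` plugin actually computes, and with which library, is not specified here, and `simple_moving_average` is a hypothetical helper.

```python
from typing import List, Optional


def simple_moving_average(closes: List[float], window: int) -> List[Optional[float]]:
    """Return the SMA of `closes`; positions without a full window are None."""
    if window <= 0:
        raise ValueError("window must be positive")
    result: List[Optional[float]] = []
    running = 0.0
    for i, price in enumerate(closes):
        running += price
        if i >= window:
            # Drop the price that just left the window.
            running -= closes[i - window]
        result.append(running / window if i >= window - 1 else None)
    return result


if __name__ == "__main__":
    print(simple_moving_average([1.0, 2.0, 3.0, 4.0], window=2))
```

The output has the same length as the input, so the indicator can be attached as a new column next to the original OHLC rows.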
To perform SSA feature extraction:
f-eng.bat tests/data/eurusd_hour_2005_2020_ohlc.csv --plugin ssa
To perform FFT feature extraction:
f-eng.bat tests/data/eurusd_hour_2005_2020_ohlc.csv --plugin fft
To visualize the distributions of the generated technical indicators, enable distribution plotting. Pearson correlation is recommended when the features are approximately normally distributed (see Correlation Analysis); Spearman correlation is recommended otherwise:
f-eng.bat tests/data/eurusd_hour_2005_2020_ohlc.csv --distribution_plot
To compute and display correlation matrices for the generated features, run the command below. It calculates both Pearson and Spearman matrices; prefer Pearson when the features are normally distributed and Spearman otherwise:
f-eng.bat --input_file tests/data/eurusd_hour_2005_2020_ohlc.csv --correlation_analysis
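To make the difference between the two coefficients concrete, here is a dependency-free sketch. It is not the project's implementation; `pearson`, `average_ranks`, and `spearman` are hypothetical helpers written for this example. Spearman is computed as the Pearson correlation of the ranks, with tied values sharing their average rank. In practice, pandas offers both through `DataFrame.corr(method="pearson")` and `DataFrame.corr(method="spearman")`.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: measures linear association."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks (monotonic association)."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = [v ** 2 for v in x]  # monotonic but not linear
    print(pearson(x, y), spearman(x, y))
```

For a monotonic but non-linear relationship such as `y = x**2` on positive values, Spearman reaches 1 while Pearson stays below 1, which is why Spearman is the safer choice when features are not normally distributed.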
feature-eng/
│
├── app/ # Main application package
│ ├── cli.py # Handles command-line argument parsing
│ ├── config.py # Stores default configuration values
│ ├── config_merger.py # Merges configuration from various sources
│ ├── plugin_loader.py # Dynamically loads feature engineering plugins
│ ├── data_handler.py # Handles data loading and saving
│ ├── data_processor.py # Processes input data and runs the feature extraction pipeline
│ ├── main.py # Main entry point for the application
│ └── plugins/ # Plugin directory
│     ├── technical_indicator.py # Plugin for generating technical indicators
│     ├── ssa.py # Plugin for Singular Spectrum Analysis (future)
│     └── fft.py # Plugin for Fast Fourier Transform (future)
│
├── tests/ # Test modules for the application
│ ├── system # System tests
│ └── unit # Unit tests
│
├── README.md # Project documentation
├── requirements.txt # Python package dependencies
├── setup.py # Script for packaging and installing the project
├── set_env.bat # Batch script for environment setup
├── set_env.sh # Shell script for environment setup
└── .gitignore # Specifies untracked files to ignore
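The tree above lists `config.py` for default values and `config_merger.py` for merging configuration from several sources. A minimal sketch of such a merge follows. The precedence used here (defaults, then a loaded configuration file, then command-line values) is an assumption, and `DEFAULTS` and `merge_config` are hypothetical names rather than the project's actual API.

```python
from typing import Any, Dict, Optional

# Hypothetical defaults; only 'technical_indicator' is documented as a default value.
DEFAULTS: Dict[str, Any] = {
    "plugin": "technical_indicator",
    "output_file": None,
    "quiet_mode": False,
}


def merge_config(
    defaults: Dict[str, Any],
    file_config: Optional[Dict[str, Any]] = None,
    cli_args: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Merge configuration layers; later layers override earlier ones."""
    merged = dict(defaults)  # copy so the defaults are never mutated
    merged.update(file_config or {})
    # Only CLI values that were actually provided (not None) override earlier layers.
    merged.update({k: v for k, v in (cli_args or {}).items() if v is not None})
    return merged


if __name__ == "__main__":
    from_file = {"plugin": "ssa", "quiet_mode": True}
    from_cli = {"plugin": "fft", "output_file": None}
    print(merge_config(DEFAULTS, from_file, from_cli))
```

Skipping `None` values from the command line keeps an omitted option from erasing a value set in a configuration file.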
Contributions to the project are welcome! Please refer to the CONTRIBUTING.md
file for guidelines on how to contribute.
This project is licensed under the MIT License - see the LICENSE file for details.