A sophisticated statistical modeling system that generates synthetic datasets while preserving the statistical properties of the original data, using Gaussian Mixture Models (GMM) and Kernel Density Estimation (KDE).
- Dual Model Architecture: Implements both GMM and KDE for comprehensive data modeling
- Automated Model Selection: Uses BIC/AIC criteria for optimal model selection (see the sketch after this list)
- Comprehensive Validation: Multi-level statistical validation framework
- Scalable Generation: Generates from hundreds to tens of thousands of synthetic samples
- Statistical Preservation: Maintains mean, variance, skewness, and kurtosis
- Performance Monitoring: Real-time generation speed and quality metrics
- Rich Visualizations: 5 comprehensive visualization suites for analysis
- Cross-Platform: Works on Windows, macOS, and Linux
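The BIC/AIC-driven selection mentioned above can be pictured as a sweep over candidate component counts. A minimal illustrative sketch using scikit-learn's GaussianMixture (not the project's exact code):

```python
# Illustrative: fit GMMs with 1..max_components components and keep the lowest-BIC fit.
import numpy as np
from sklearn.mixture import GaussianMixture

def select_gmm_by_bic(values, max_components=8):
    X = np.asarray(values, dtype=float).reshape(-1, 1)   # scikit-learn expects 2-D input
    fits = [GaussianMixture(n_components=k, random_state=0).fit(X)
            for k in range(1, max_components + 1)]
    return min(fits, key=lambda m: m.bic(X))

# best = select_gmm_by_bic(original_data)
# synthetic, _ = best.sample(1000)   # draw synthetic samples from the chosen GMM
```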
- Python 3.8+ (tested on Python 3.11, 3.12, 3.13)
- 4GB+ RAM recommended
- 1GB+ free disk space
```
numpy>=1.21.0
pandas>=1.3.0
scikit-learn>=1.1.0
matplotlib>=3.5.0
seaborn>=0.11.0
scipy>=1.8.0
joblib>=1.1.0
```

```bash
# Clone the repository
git clone https://github.com/your-username/ensemble-data-generator.git
cd ensemble-data-generator

# Create virtual environment
python -m venv ensemble_env
source ensemble_env/bin/activate

# Install dependencies
pip install -r requirements.txt
```

Place your CSV data file in the data/ directory. The system expects:
- Single column of numerical data
- CSV format with headers
- Example:
data/PPE_262_Tropic.csv
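A minimal loading sketch for such a file (the column is selected by position, since the header name varies by dataset):

```python
# Read the single-column CSV (with a header row) into a 1-D NumPy array.
import pandas as pd

df = pd.read_csv("data/PPE_262_Tropic.csv")
values = df.iloc[:, 0].to_numpy()   # first (and only) numeric column
print(values.shape, values.dtype)
```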
Update the file path at line 537 of train.py, then run:

```bash
python train.py
```

This will automatically:
- ✅ Load and analyze your data
- ✅ Train GMM and KDE models
- ✅ Select the best-performing model
- ✅ Generate synthetic datasets (1K, 5K, 15K samples)
- ✅ Validate all results
- ✅ Save models and reports
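If you prefer to drive the same workflow from Python, here is a hypothetical sketch assuming the EnsembleDataGenerator API used in the data-augmentation example further below (fit / generate_samples); the actual internals of train.py may differ:

```python
import pandas as pd

from train import EnsembleDataGenerator   # assumed import path

data = pd.read_csv("data/PPE_262_Tropic.csv").iloc[:, 0].to_numpy()

generator = EnsembleDataGenerator()
generator.fit(data)                        # trains GMM and KDE, keeps the better model

for n in (1_000, 5_000, 15_000):
    synthetic = generator.generate_samples(n)
    pd.DataFrame({"value": synthetic}).to_csv(f"synthetic_{n}.csv", index=False)  # illustrative output path
```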
Update the file paths at lines 417 and 497 of generation.py, then run:

```bash
python generation.py
```

Then open Plots.ipynb and set the path to the generated CSV file.
This creates 5 comprehensive visualization suites showing your results.
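For a quick sanity check outside the notebook, a minimal comparison plot can be produced as follows (the generated-CSV path is illustrative):

```python
# Overlay the original and a generated dataset as normalized histograms.
import matplotlib.pyplot as plt
import pandas as pd

original = pd.read_csv("data/PPE_262_Tropic.csv").iloc[:, 0]
generated = pd.read_csv("ensemble_gen/PPE/PPE_generated_5000.csv").iloc[:, 0]  # hypothetical path

plt.hist(original, bins=30, density=True, alpha=0.5, label="original")
plt.hist(generated, bins=30, density=True, alpha=0.5, label="generated")
plt.xlabel("value")
plt.ylabel("density")
plt.legend()
plt.title("Original vs. generated distribution")
plt.show()
```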
```
ensemble_project/
├── README.md                  # This file
├── train.py                   # Main execution script
├── generation.py              # Visualization generation
├── requirements.txt           # Dependencies
├── data/
│   └── PPE_262_Tropic.csv     # Input data
├── models/
│   ├── tranied_model.pkl      # Trained models
│   └── reports/
├── ensemble_gen/
│   ├── CESM/                  # CESM 500, 1000, 5000 generated CSVs and their reports
│   ├── MPI/                   # MPI 500, 1000, 5000 generated CSVs and their reports
│   └── PPE/                   # PPE 500, 1000, 5000 generated CSVs and their reports
├── plots/
│   └── visualization_outputs/ # Generated plots
└── Plots.ipynb                # Code to generate plots
```
- EXCELLENT (90-100): Outstanding preservation of all statistical properties
- GOOD (70-89): Good preservation with minor deviations
- FAIR (50-69): Acceptable quality with some limitations
- POOR (0-49): Significant deviations, needs improvement
- Distribution Match (p-value): Kolmogorov-Smirnov test
  - p > 0.05: Generated data follows the original distribution ✅
  - p ≤ 0.05: Significant distribution difference ❌
- Mean Preservation: |generated_mean - original_mean| / |original_mean|
- Std Preservation: |generated_std - original_std| / original_std
- Skewness/Kurtosis: Higher-order moment preservation
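These metrics can be reproduced with SciPy; a hedged sketch (the project's exact scoring may differ):

```python
# Two-sample Kolmogorov-Smirnov test plus relative errors of mean/std and
# absolute differences of skewness/kurtosis.
import numpy as np
from scipy import stats

def validation_metrics(original, generated):
    _, p_value = stats.ks_2samp(original, generated)
    mean_error = abs(np.mean(generated) - np.mean(original)) / abs(np.mean(original))
    std_error = abs(np.std(generated) - np.std(original)) / np.std(original)
    skew_diff = abs(stats.skew(generated) - stats.skew(original))
    kurt_diff = abs(stats.kurtosis(generated) - stats.kurtosis(original))
    return {"p_value": p_value, "mean_error": mean_error, "std_error": std_error,
            "skew_diff": skew_diff, "kurt_diff": kurt_diff}
```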
| Sample Size | Generation Speed | Mean Error | Std Error | Quality Score |
|---|---|---|---|---|
| 1,000 | ~860K samples/s | ~11% | ~2% | 55-60 |
| 5,000 | ~3.5M samples/s | ~3% | ~1% | 50-55 |
| 15,000 | ~7.3M samples/s | ~1% | ~0.1% | 53-58 |
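The speed figures are straightforward to reproduce with wall-clock timing; a rough sketch assuming a fitted EnsembleDataGenerator (see the examples below):

```python
import time

def generation_speed(generator, n_samples):
    # Time a single generate_samples call and return samples per second.
    start = time.perf_counter()
    generator.generate_samples(n_samples)
    return n_samples / (time.perf_counter() - start)

# print(f"{generation_speed(generator, 15_000):,.0f} samples/s")
```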
The GMM and KDE search spaces can be tuned on the generator:

```python
# GMM configuration
generator.gmm_max_components = 8
generator.gmm_covariance_type = 'full'   # 'full', 'tied', 'diag', 'spherical'
generator.gmm_init_params = 'k-means++'

# KDE configuration
generator.kde_kernels = ['gaussian', 'tophat', 'epanechnikov', 'exponential']
generator.kde_bandwidths = np.logspace(-3, 1, 30)
```
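One way the kernel/bandwidth grid above could be searched is scikit-learn's GridSearchCV over KernelDensity, scored by cross-validated log-likelihood; this is an illustrative sketch, and the project's internal selection logic may differ:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

def select_kde(values,
               kernels=('gaussian', 'tophat', 'epanechnikov', 'exponential'),
               bandwidths=np.logspace(-3, 1, 30)):
    X = np.asarray(values, dtype=float).reshape(-1, 1)
    grid = GridSearchCV(KernelDensity(),
                        {"kernel": list(kernels), "bandwidth": list(bandwidths)},
                        cv=5)
    grid.fit(X)
    return grid.best_estimator_   # KDE with the best cross-validated log-likelihood
```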
The generator can also be used for data augmentation:

```python
# Use as data augmentation
import numpy as np

def augment_training_data(original_data, augment_factor=2):
    # EnsembleDataGenerator is the generator class provided by this project
    generator = EnsembleDataGenerator()
    generator.fit(original_data)
    augmented_size = len(original_data) * augment_factor
    synthetic = generator.generate_samples(augmented_size)
    return np.concatenate([original_data, synthetic])
```

- Best GMM: 1 component (BIC: -461.46)
- Best KDE: Epanechnikov kernel, bandwidth=0.10 (score: 225.31)
- Selected Model: GMM (better validation performance)
| Test Set | Sample Size | Quality Score | P-value | Mean Error | Std Error |
|---|---|---|---|---|---|
| Test | 32 | 61.4/100 | 0.968 | 81.9% | 21.3% |
| Ground | 30 | 53.9/100 | 0.958 | 109.2% | 14.0% |
| Demo | 1,000 | 58.6/100 | 0.440 | 12.9% | 1.3% |
| Large | 15,000 | 53.2/100 | 0.417 | 1.2% | 0.1% |
- ✅ Excellent statistical validity: All p-values > 0.4
- ✅ Scaling improves accuracy: Mean error drops from 109% to 1.2%
- ✅ High generation speed: Up to 7M+ samples/second
- ⚠️ Higher-order moments: Skewness/kurtosis preservation needs improvement