Shailesh22290/ensemble_gen

Ensemble Data Generator

A statistical modeling system that generates synthetic datasets while preserving the statistical properties of the original data, using Gaussian Mixture Models (GMM) and Kernel Density Estimation (KDE).

🎯 Features

  • Dual Model Architecture: Implements both GMM and KDE for comprehensive data modeling
  • Automated Model Selection: Uses BIC/AIC criteria for optimal model selection
  • Comprehensive Validation: Multi-level statistical validation framework
  • Scalable Generation: Generate from hundreds to thousands of synthetic samples
  • Statistical Preservation: Maintains mean, variance, skewness, and kurtosis
  • Performance Monitoring: Real-time generation speed and quality metrics
  • Rich Visualizations: 5 comprehensive visualization suites for analysis
  • Cross-Platform: Works on Windows, macOS, and Linux

📦 Requirements

System Requirements

  • Python 3.8+ (tested on Python 3.11, 3.12, 3.13)
  • 4GB+ RAM recommended
  • 1GB+ free disk space

Python Dependencies

numpy>=1.21.0
pandas>=1.3.0
scikit-learn>=1.1.0
matplotlib>=3.5.0
seaborn>=0.11.0
scipy>=1.8.0
joblib>=1.1.0

🔧 Installation

Option 1: Clone and Install

# Clone the repository
git clone https://github.com/Shailesh22290/ensemble_gen.git
cd ensemble_gen

# Create virtual environment
python -m venv ensemble_env
source ensemble_env/bin/activate  # On Windows: ensemble_env\Scripts\activate

# Install dependencies
pip install -r requirements.txt

🚀 Start

Step 1: Check Your Data

Place your CSV data file in the data/ directory. The system expects:

  • Single column of numerical data
  • CSV format with headers
  • Example: data/PPE_262_Tropic.csv
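The expected input format can be checked before training. The following is a minimal sketch, assuming pandas is used for loading; the helper name `load_single_column` and the demo column name `value` are hypothetical and are not part of the project's scripts.

```python
import io

import numpy as np
import pandas as pd


def load_single_column(csv_source):
    """Read a one-column CSV with a header row and return a 1-D float array."""
    df = pd.read_csv(csv_source)
    if df.shape[1] != 1:
        raise ValueError(f"expected exactly one column, found {df.shape[1]}")
    # Coerce to numbers and drop rows that cannot be parsed
    values = pd.to_numeric(df.iloc[:, 0], errors="coerce").dropna()
    return values.to_numpy(dtype=float)


# Small in-memory stand-in for a file such as data/PPE_262_Tropic.csv
demo_csv = io.StringIO("value\n0.12\n0.35\n0.27\n")
demo_values = load_single_column(demo_csv)
```

Passing a file path such as `data/PPE_262_Tropic.csv` instead of the in-memory buffer works the same way.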

Step 2: Run the Complete Pipeline

Set the path to your data file at line 537 in train.py, then run:

python train.py

This will automatically:

  • ✅ Load and analyze your data
  • ✅ Train GMM and KDE models
  • ✅ Select the best performing model
  • ✅ Generate synthetic datasets (1K, 5K, 15K samples)
  • ✅ Validate all results
  • ✅ Save models and reports

Step 3: Generate

Set the paths to your files at lines 417 and 497 in generation.py, then run:

python generation.py

Step 4: Plot

Open Plots.ipynb, set the path to the generated CSV file, and run the notebook.

This creates 5 comprehensive visualization suites showing your results.

📁 Project Structure

ensemble_project/
├── 📄 README.md                     # This file
├── 🐍 train.py                      # Main execution script
├── 🎨 generation.py                 # Synthetic data generation (Step 3)
├── 📋 requirements.txt              # Dependencies
├── 📊 data/
│   └── PPE_262_Tropic.csv           # Input data
├── 📈 models/
│   ├── tranied_model.pkl            # Trained models
│   └── reports/
├── 📈 ensemble_gen/
│   ├── CESM/                        # CESM 500, 1000, 5000 generated CSVs and their reports
│   ├── MPI/                         # MPI 500, 1000, 5000 generated CSVs and their reports
│   └── PPE/                         # PPE 500, 1000, 5000 generated CSVs and their reports
├── 📸 plots/
│   └── visualization_outputs/       # Generated plots
└── Plots.ipynb                      # Code to generate plots

📊 Understanding Results

Quality Metrics Explained

Overall Quality Score (0-100)

  • EXCELLENT (90-100): Outstanding preservation of all statistical properties
  • GOOD (70-89): Good preservation with minor deviations
  • FAIR (50-69): Acceptable quality with some limitations
  • POOR (0-49): Significant deviations, needs improvement
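The bands above can be expressed as a small lookup. This is an illustrative sketch only: the function name `quality_label` is hypothetical, and treating each band's lower bound as the threshold for fractional scores (for example 89.5 counts as GOOD) is an assumption.

```python
def quality_label(score):
    """Map a 0-100 quality score to the band names listed above."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 90:
        return "EXCELLENT"
    if score >= 70:
        return "GOOD"
    if score >= 50:
        return "FAIR"
    return "POOR"
```

For example, a score of 61.4 falls in the FAIR band.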

Statistical Tests

  • Distribution Match (p-value): Kolmogorov-Smirnov test
    • p > 0.05: No significant difference detected; generated data is consistent with the original distribution ✅
    • p ≤ 0.05: Significant distribution difference ❌
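As an illustration of the check described above, here is a sketch using SciPy's two-sample Kolmogorov-Smirnov test. The helper name `distribution_match`, the 0.05 default, and the synthetic demo arrays are assumptions made for this example; the project's own validation code may differ.

```python
import numpy as np
from scipy.stats import ks_2samp


def distribution_match(original, generated, alpha=0.05):
    """Return (p_value, passed) for a two-sample KS test at level alpha."""
    result = ks_2samp(original, generated)
    return float(result.pvalue), bool(result.pvalue > alpha)


rng = np.random.default_rng(0)
original = rng.normal(loc=0.0, scale=1.0, size=200)
same_dist = rng.normal(loc=0.0, scale=1.0, size=1000)  # drawn from the same distribution
shifted = rng.normal(loc=2.0, scale=1.0, size=1000)    # mean shifted by two standard deviations

p_same, ok_same = distribution_match(original, same_dist)
p_shifted, ok_shifted = distribution_match(original, shifted)
```

A p-value above the threshold means the test found no evidence of a difference; it does not prove the two distributions are identical, and small samples make the test less sensitive.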

Preservation Metrics

  • Mean Preservation: |generated_mean - original_mean| / |original_mean|
  • Std Preservation: |generated_std - original_std| / original_std
  • Skewness/Kurtosis: Higher-order moment preservation
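The preservation metrics above could be computed as follows. This is a sketch under stated assumptions: the sample standard deviation (`ddof=1`) and the use of absolute differences for skewness and kurtosis are choices made here for illustration, and `preservation_errors` is a hypothetical name.

```python
import numpy as np
from scipy.stats import kurtosis, skew


def preservation_errors(original, generated):
    """Relative mean/std errors plus absolute skewness and kurtosis differences."""
    original = np.asarray(original, dtype=float)
    generated = np.asarray(generated, dtype=float)
    orig_std = original.std(ddof=1)  # assumption: sample standard deviation
    return {
        "mean_error": abs(generated.mean() - original.mean()) / abs(original.mean()),
        "std_error": abs(generated.std(ddof=1) - orig_std) / orig_std,
        "skewness_diff": abs(skew(generated) - skew(original)),
        "kurtosis_diff": abs(kurtosis(generated) - kurtosis(original)),
    }


# Scaling every value by 1.5 changes mean and std by 50% but leaves the shape unchanged
demo_errors = preservation_errors([2.0, 4.0, 6.0], [3.0, 6.0, 9.0])
```

Note that the relative mean error divides by the original mean, so it becomes unstable when that mean is close to zero.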

Performance Benchmarks

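| Sample Size | Generation Speed | Mean Error | Std Error | Quality Score |
|-------------|------------------|------------|-----------|---------------|
| 1,000       | ~860K samples/s  | ~11%       | ~2%       | 55-60         |
| 5,000       | ~3.5M samples/s  | ~3%        | ~1%       | 50-55         |
| 15,000      | ~7.3M samples/s  | ~1%        | ~0.1%     | 53-58         |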

🔬 Advanced Usage

Custom Model Parameters

GMM Configuration

generator.gmm_max_components = 8
generator.gmm_covariance_type = 'full'  # 'full', 'tied', 'diag', 'spherical'
generator.gmm_init_params = 'k-means++'

KDE Configuration

generator.kde_kernels = ['gaussian', 'tophat', 'epanechnikov', 'exponential']
generator.kde_bandwidths = np.logspace(-3, 1, 30)
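A kernel and bandwidth grid like this is commonly searched with cross-validated log-likelihood. The sketch below uses scikit-learn's `GridSearchCV` with `KernelDensity`; the helper `select_kde` is hypothetical, and the demo restricts itself to the Gaussian kernel and a small bandwidth grid to keep it quick.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity


def select_kde(values, kernels, bandwidths, cv=5):
    """Pick the kernel/bandwidth pair with the best held-out log-likelihood."""
    X = np.asarray(values, dtype=float).reshape(-1, 1)
    search = GridSearchCV(
        KernelDensity(),
        {"kernel": list(kernels), "bandwidth": list(bandwidths)},
        cv=cv,
    )
    search.fit(X)  # scored with KernelDensity.score (total log-likelihood)
    return search.best_estimator_, search.best_params_


rng = np.random.default_rng(0)
demo = rng.normal(0.0, 1.0, 100)
demo_bandwidths = np.logspace(-1, 0, 5)
kde, kde_params = select_kde(demo, ["gaussian"], demo_bandwidths)
kde_draws = kde.sample(10, random_state=0)
```

Two caveats apply when extending the grid: kernels with finite support (such as tophat or Epanechnikov) can assign zero density to held-out points and so score negative infinity at small bandwidths, and scikit-learn's `KernelDensity.sample` is implemented only for the Gaussian and tophat kernels.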

Integration with ML Pipelines

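import numpy as np

# EnsembleDataGenerator is the project's generator class; import it from the
# module that defines it before calling this function.

# Use as data augmentation
def augment_training_data(original_data, augment_factor=2):
    """Return the original data followed by synthetic samples."""
    generator = EnsembleDataGenerator()
    generator.fit(original_data)

    # Number of synthetic samples to draw; the result holds original + synthetic
    augmented_size = len(original_data) * augment_factor
    synthetic = generator.generate_samples(augmented_size)

    return np.concatenate([original_data, synthetic])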

📈 Performance Metrics

Benchmark Results (PPE_262_Tropic.csv)

Model Selection Results

  • Best GMM: 1 component (BIC: -461.46)
  • Best KDE: Epanechnikov kernel, bandwidth=0.10 (score: 225.31)
  • Selected Model: GMM (better validation performance)

Validation Results Summary

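| Test Set | Sample Size | Quality Score | P-value | Mean Error | Std Error |
|----------|-------------|---------------|---------|------------|-----------|
| Test     | 32          | 61.4/100      | 0.968   | 81.9%      | 21.3%     |
| Ground   | 30          | 53.9/100      | 0.958   | 109.2%     | 14.0%     |
| Demo     | 1,000       | 58.6/100      | 0.440   | 12.9%      | 1.3%      |
| Large    | 15,000      | 53.2/100      | 0.417   | 1.2%       | 0.1%      |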

Key Insights

  • ✅ Statistical validity: All KS p-values exceed 0.4, so no test set shows a significant distribution difference
  • ✅ Scaling improves accuracy: Mean error drops from 109% (30 samples) to 1.2% (15,000 samples)
  • ✅ High generation speed: Up to 7M+ samples/second
  • ⚠️ Higher-order moments: Skewness/kurtosis preservation needs improvement
