A sophisticated statistical modeling system that generates synthetic datasets while preserving the statistical properties of the original data, using Gaussian Mixture Models (GMM) and Kernel Density Estimation (KDE).
- Dual Model Architecture: Implements both GMM and KDE for comprehensive data modeling
- Automated Model Selection: Uses BIC/AIC criteria for optimal model selection (see the sketch after this list)
- Comprehensive Validation: Multi-level statistical validation framework
- Scalable Generation: Generates from hundreds to tens of thousands of synthetic samples
- Statistical Preservation: Maintains mean, variance, skewness, and kurtosis
- Performance Monitoring: Real-time generation speed and quality metrics
- Rich Visualizations: 5 comprehensive visualization suites for analysis
- Cross-Platform: Works on Windows, macOS, and Linux
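The BIC/AIC-driven selection mentioned above can be pictured as a sweep over candidate component counts. A minimal illustrative sketch using scikit-learn's GaussianMixture (not the project's exact code):

```python
# Illustrative: fit GMMs with 1..max_components components and keep the lowest-BIC fit.
import numpy as np
from sklearn.mixture import GaussianMixture

def select_gmm_by_bic(values, max_components=8):
    X = np.asarray(values, dtype=float).reshape(-1, 1)   # scikit-learn expects 2-D input
    fits = [GaussianMixture(n_components=k, random_state=0).fit(X)
            for k in range(1, max_components + 1)]
    return min(fits, key=lambda m: m.bic(X))

# best = select_gmm_by_bic(original_data)
# synthetic, _ = best.sample(1000)   # draw synthetic samples from the chosen GMM
```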
- Python 3.8+ (tested on Python 3.11, 3.12, 3.13)
- 4GB+ RAM recommended
- 1GB+ free disk space
```
numpy>=1.21.0
pandas>=1.3.0
scikit-learn>=1.1.0
matplotlib>=3.5.0
seaborn>=0.11.0
scipy>=1.8.0
joblib>=1.1.0
```

```bash
# Clone the repository
git clone https://github.com/your-username/ensemble-data-generator.git
cd ensemble-data-generator

# Create virtual environment
python -m venv ensemble_env
source ensemble_env/bin/activate

# Install dependencies
pip install -r requirements.txt
```

Place your CSV data file in the data/ directory. The system expects:
- Single column of numerical data
- CSV format with headers
- Example:
data/PPE_262_Tropic.csv
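A minimal loading sketch for such a file (the column is selected by position, since the header name varies by dataset):

```python
# Read the single-column CSV (with a header row) into a 1-D NumPy array.
import pandas as pd

df = pd.read_csv("data/PPE_262_Tropic.csv")
values = df.iloc[:, 0].to_numpy()   # first (and only) numeric column
print(values.shape, values.dtype)
```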
Update the file path at line 537 of train.py, then run:

```bash
python train.py
```

This will automatically:
- ✅ Load and analyze your data
- ✅ Train GMM and KDE models
- ✅ Select the best-performing model
- ✅ Generate synthetic datasets (1K, 5K, 15K samples)
- ✅ Validate all results
- ✅ Save models and reports
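If you prefer to drive the same workflow from Python, here is a hypothetical sketch assuming the EnsembleDataGenerator API used in the data-augmentation example further below (fit / generate_samples); the actual internals of train.py may differ:

```python
import pandas as pd

from train import EnsembleDataGenerator   # assumed import path

data = pd.read_csv("data/PPE_262_Tropic.csv").iloc[:, 0].to_numpy()

generator = EnsembleDataGenerator()
generator.fit(data)                        # trains GMM and KDE, keeps the better model

for n in (1_000, 5_000, 15_000):
    synthetic = generator.generate_samples(n)
    pd.DataFrame({"value": synthetic}).to_csv(f"synthetic_{n}.csv", index=False)  # illustrative output path
```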
Update the file paths at lines 417 and 497 of generation.py, then run:

```bash
python generation.py
```

Then open Plots.ipynb and set the path to the generated CSV file.
This creates 5 comprehensive visualization suites showing your results.
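For a quick sanity check outside the notebook, a minimal comparison plot can be produced as follows (the generated-CSV path is illustrative):

```python
# Overlay the original and a generated dataset as normalized histograms.
import matplotlib.pyplot as plt
import pandas as pd

original = pd.read_csv("data/PPE_262_Tropic.csv").iloc[:, 0]
generated = pd.read_csv("ensemble_gen/PPE/PPE_generated_5000.csv").iloc[:, 0]  # hypothetical path

plt.hist(original, bins=30, density=True, alpha=0.5, label="original")
plt.hist(generated, bins=30, density=True, alpha=0.5, label="generated")
plt.xlabel("value")
plt.ylabel("density")
plt.legend()
plt.title("Original vs. generated distribution")
plt.show()
```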
```
ensemble_project/
├── README.md                  # This file
├── train.py                   # Main execution script
├── generation.py              # Visualization generation
├── requirements.txt           # Dependencies
├── data/
│   └── PPE_262_Tropic.csv     # Input data
├── models/
│   ├── tranied_model.pkl      # Trained models
│   └── reports/
├── ensemble_gen/
│   ├── CESM/                  # CESM 500, 1000, 5000 generated CSVs and their reports
│   ├── MPI/                   # MPI 500, 1000, 5000 generated CSVs and their reports
│   └── PPE/                   # PPE 500, 1000, 5000 generated CSVs and their reports
├── plots/
│   └── visualization_outputs/ # Generated plots
└── Plots.ipynb                # Code to generate plots
```
- EXCELLENT (90-100): Outstanding preservation of all statistical properties
- GOOD (70-89): Good preservation with minor deviations
- FAIR (50-69): Acceptable quality with some limitations
- POOR (0-49): Significant deviations, needs improvement
- Distribution Match (p-value): Kolmogorov-Smirnov test
  - p > 0.05: Generated data follows the original distribution ✅
  - p ≤ 0.05: Significant distribution difference ❌
- Mean Preservation: |generated_mean - original_mean| / |original_mean|
- Std Preservation: |generated_std - original_std| / original_std
- Skewness/Kurtosis: Higher-order moment preservation
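These metrics can be reproduced with SciPy; a hedged sketch (the project's exact scoring may differ):

```python
# Two-sample Kolmogorov-Smirnov test plus relative errors of mean/std and
# absolute differences of skewness/kurtosis.
import numpy as np
from scipy import stats

def validation_metrics(original, generated):
    _, p_value = stats.ks_2samp(original, generated)
    mean_error = abs(np.mean(generated) - np.mean(original)) / abs(np.mean(original))
    std_error = abs(np.std(generated) - np.std(original)) / np.std(original)
    skew_diff = abs(stats.skew(generated) - stats.skew(original))
    kurt_diff = abs(stats.kurtosis(generated) - stats.kurtosis(original))
    return {"p_value": p_value, "mean_error": mean_error, "std_error": std_error,
            "skew_diff": skew_diff, "kurt_diff": kurt_diff}
```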
| Sample Size | Generation Speed | Mean Error | Std Error | Quality Score |
|---|---|---|---|---|
| 1,000 | ~860K samples/s | ~11% | ~2% | 55-60 |
| 5,000 | ~3.5M samples/s | ~3% | ~1% | 50-55 |
| 15,000 | ~7.3M samples/s | ~1% | ~0.1% | 53-58 |
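The speed figures are straightforward to reproduce with wall-clock timing; a rough sketch assuming a fitted EnsembleDataGenerator (see the examples below):

```python
import time

def generation_speed(generator, n_samples):
    # Time a single generate_samples call and return samples per second.
    start = time.perf_counter()
    generator.generate_samples(n_samples)
    return n_samples / (time.perf_counter() - start)

# print(f"{generation_speed(generator, 15_000):,.0f} samples/s")
```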
The GMM and KDE search spaces can be tuned on the generator:

```python
# GMM configuration
generator.gmm_max_components = 8
generator.gmm_covariance_type = 'full'   # 'full', 'tied', 'diag', 'spherical'
generator.gmm_init_params = 'k-means++'

# KDE configuration
generator.kde_kernels = ['gaussian', 'tophat', 'epanechnikov', 'exponential']
generator.kde_bandwidths = np.logspace(-3, 1, 30)
```
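One way the kernel/bandwidth grid above could be searched is scikit-learn's GridSearchCV over KernelDensity, scored by cross-validated log-likelihood; this is an illustrative sketch, and the project's internal selection logic may differ:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

def select_kde(values,
               kernels=('gaussian', 'tophat', 'epanechnikov', 'exponential'),
               bandwidths=np.logspace(-3, 1, 30)):
    X = np.asarray(values, dtype=float).reshape(-1, 1)
    grid = GridSearchCV(KernelDensity(),
                        {"kernel": list(kernels), "bandwidth": list(bandwidths)},
                        cv=5)
    grid.fit(X)
    return grid.best_estimator_   # KDE with the best cross-validated log-likelihood
```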
The generator can also be used for data augmentation:

```python
# Use as data augmentation
import numpy as np

def augment_training_data(original_data, augment_factor=2):
    # EnsembleDataGenerator is the generator class provided by this project
    generator = EnsembleDataGenerator()
    generator.fit(original_data)
    augmented_size = len(original_data) * augment_factor
    synthetic = generator.generate_samples(augmented_size)
    return np.concatenate([original_data, synthetic])
```

- Best GMM: 1 component (BIC: -461.46)
- Best KDE: Epanechnikov kernel, bandwidth=0.10 (score: 225.31)
- Selected Model: GMM (better validation performance)
| Test Set | Sample Size | Quality Score | P-value | Mean Error | Std Error |
|---|---|---|---|---|---|
| Test | 32 | 61.4/100 | 0.968 | 81.9% | 21.3% |
| Ground | 30 | 53.9/100 | 0.958 | 109.2% | 14.0% |
| Demo | 1,000 | 58.6/100 | 0.440 | 12.9% | 1.3% |
| Large | 15,000 | 53.2/100 | 0.417 | 1.2% | 0.1% |
- ✅ Excellent statistical validity: All p-values > 0.4
- ✅ Scaling improves accuracy: Mean error drops from 109% to 1.2%
- ✅ High generation speed: Up to 7M+ samples/second
- ⚠️ Higher-order moments: Skewness/kurtosis preservation needs improvement