**Ruby-based CSV preprocessing and feature engineering utility
- Team Strength Index: Composite win rate + goal differential metric
- Pythagorean Expectation: Expected wins based on GF²/(GF²+GA²)
- Momentum Scores: Recent performance trends (rolling win rates)
- Rest Advantage: Days between games + back-to-back penalties
- Clutch Factor: Performance in 1-goal games
- Home/Away Splits: Location-based win rate analysis
- Strength of Schedule: Opponent quality adjustments
- Consistency Metrics: Coefficient of variation (low = consistent)
- Interaction Features: offense_power (GF × win%), defense_weakness (GA × losses)
- Polynomial Features: Non-linear relationships (DIFF², PTS²)
- Luck Factor: Actual wins - Pythagorean expected wins
- Time Decay Weights: Recent games weighted higher
- Grid Search: Exhaustive search (2,812 total combinations across 5 models)
- Random Search: Fast sampling for large spaces
- Bayesian Optimization: Gaussian Process with Expected Improvement acquisition
- Genetic Algorithm: Evolution with crossover/mutation/selection
- Simulated Annealing: Temperature-based exploration/exploitation
- Stacking: Meta-learner trained on base model predictions
- Blending: Holdout-based meta-model training
- Weighted Voting: Optimized weights (inverse RMSE, softmax)
- Rank Averaging: Robust to prediction scale differences
- Dynamic Weights: Adaptive based on recent performance
- Diversity Analysis: Checks error correlation (low = good ensemble)
- Time Series CV: Expanding window (no data leakage)
- Stratified Splits: Balanced train/test for classification
- Bootstrap CI: 1000+ iterations for confidence intervals
- Overfitting Detection: Train vs test performance gap monitoring
- Calibration Analysis: Binned predictions vs actuals
- Learning Curves: Performance vs training size
# 1. Full competitive preprocessing pipeline
ruby cli.rb competitive-pipeline data/nhl_data.csv
# 2. Generate optimized hyperparameters (Bayesian search)
ruby cli.rb hyperparam-bayesian config/hyperparams/model4_xgboost.yaml --iterations 50
# 3. Optimize ensemble weights from all models
ruby cli.rb ensemble-optimize predictions/ --actuals test.csv -o weights.csv
---
## 📊 Competition Workflow (Step-by-Step)
### Phase 1: Data Preprocessing (CSVy)
```bash
# Run full competitive pipeline (includes all advanced features)
ruby cli.rb competitive-pipeline data/raw_nhl_data.csv -o data/processed
# Output: competitive_features.csv, train.csv, test.csv
# Model 2: Linear Regression (120 combinations)
ruby cli.rb hyperparam-grid config/hyperparams/model2_linear_regression.yaml
# Model 3: ELO (648 combinations) - Use Bayesian for smarter search
ruby cli.rb hyperparam-bayesian config/hyperparams/model3_elo.yaml --iterations 30
# Model 4: XGBoost (864 combinations) - Genetic algorithm
ruby cli.rb hyperparam-genetic config/hyperparams/model4_xgboost.yaml --population 50 --generations 20
# Model 4: Random Forest (144 combinations)
ruby cli.rb hyperparam-grid config/hyperparams/model4_random_forest.yaml
# Model 5: Ensemble (36 combinations)
ruby cli.rb hyperparam-grid config/hyperparams/model5_ensemble.yamlimport pandas as pd
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
# Load hyperparameter configs
configs = pd.read_csv('model4_xgboost_genetic_algorithm.csv')
# Train each configuration
for idx, row in configs.iterrows():
params = row.to_dict()
experiment_id = params.pop('experiment_id')
# Remove tracking columns
for col in ['rmse', 'mae', 'r2', 'notes', 'timestamp']:
params.pop(col, None)
# Train model
model = XGBRegressor(**params)
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
rmse = mean_squared_error(y_test, y_pred, squared=False)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
# Save results
configs.loc[idx, 'rmse'] = rmse
configs.loc[idx, 'mae'] = mae
configs.loc[idx, 'r2'] = r2
# Save updated tracking file
configs.to_csv('model4_xgboost_genetic_algorithm.csv', index=False)# Find best hyperparameters
ruby cli.rb best-params experiments/xgb_genetic.csv --metric rmse
# Compare top experiments
ruby cli.rb compare-experiments experiments/xgb_genetic.csv 10,25,42,78
# View summary statistics
ruby cli.rb diagnose experiments/xgb_genetic.csv# After training all 5 models, save predictions to predictions/
# predictions/baseline_preds.csv
# predictions/linear_preds.csv
# predictions/elo_preds.csv
# predictions/xgboost_preds.csv
# predictions/rf_preds.csv
# Optimize ensemble weights
ruby cli.rb ensemble-optimize predictions/ --actuals data/test_actuals.csv -o optimal_weights.csv
# Check model diversity (should be > 0.5 for good ensemble)
ruby cli.rb diversity-analysis predictions/ data/test_actuals.csv# Create weighted ensemble predictions (using optimal weights)
# Then validate with bootstrap and calibration
ruby cli.rb validate-model final_ensemble_preds.csv \
--bootstrap \
--calibration \
--actual_col actual \
--pred_col predicted| Command | Purpose | Example |
|---|---|---|
competitive-pipeline |
Full preprocessing + 14+ advanced features | ruby cli.rb competitive-pipeline data/nhl.csv -o data/processed |
advanced-features |
Add competition features only | ruby cli.rb advanced-features data/nhl.csv -o data/advanced.csv |
diagnose |
Data quality analysis | ruby cli.rb diagnose data/nhl.csv |
clean |
Remove duplicates & handle missing | ruby cli.rb clean data/nhl.csv -o data/clean.csv |
| Command | Purpose | When to Use |
|---|---|---|
hyperparam-grid |
Exhaustive search | Small grids (<1000 combinations) |
hyperparam-random |
Random sampling | Large spaces, quick exploration |
hyperparam-bayesian |
Gaussian Process optimization | Expensive models (XGBoost), 20-50 iterations |
hyperparam-genetic |
Evolution-based search | Large spaces, 50+ population, 20+ generations |
hyperparam-annealing |
Simulated annealing | Continuous spaces, good for fine-tuning |
| Command | Purpose | Example |
|---|---|---|
add-result |
Record experiment metrics | ruby cli.rb add-result experiments/grid.csv 42 --rmse 2.34 --mae 1.87 --r2 0.82 |
best-params |
Find optimal hyperparameters | ruby cli.rb best-params experiments/grid.csv --metric rmse |
compare-experiments |
Compare specific runs | ruby cli.rb compare-experiments experiments/grid.csv 10,25,42 |
| Command | Purpose | Example |
|---|---|---|
ensemble-optimize |
Find optimal model weights | ruby cli.rb ensemble-optimize predictions/ --actuals test.csv -o weights.csv |
diversity-analysis |
Check ensemble diversity | ruby cli.rb diversity-analysis predictions/ actuals.csv |
validate-model |
Bootstrap + calibration | ruby cli.rb validate-model preds.csv --bootstrap --calibration |
- Purpose: Establish baseline performance
- Method: Simple mean/median prediction
- Config:
config/hyperparams/model1_baseline.yaml - Expected RMSE: 3-4 goals
- Purpose: Capture linear relationships
- Features: Ridge/ElasticNet regularization, polynomial features, scaling
- Config:
config/hyperparams/model2_linear_regression.yaml - Hyperparameters: alpha (0.001-10), l1_ratio (0-1), solver, poly_degree (1-2)
- Expected RMSE: 2.5-3.0 goals
- Purpose: Team strength dynamics
- Features: Custom ELO with MOV adjustments, home advantage, rest/b2b
- Config:
config/hyperparams/model3_elo.yaml - Hyperparameters: k_factor (20-40), home_advantage (50-150), MOV multipliers
- Expected RMSE: 2.2-2.7 goals
- Purpose: Capture non-linear interactions
- Features: Gradient boosting + Random Forest diversity
- Configs:
config/hyperparams/model4_xgboost.yamlconfig/hyperparams/model4_random_forest.yaml
- Hyperparameters:
- XGBoost: learning_rate, n_estimators, max_depth, regularization
- RF: n_estimators, max_depth, min_samples, max_features
- Expected RMSE: 1.8-2.3 goals (best individual model)
- Purpose: Combine all models for maximum accuracy
- Methods: Stacking (meta-learner), weighted voting, blending
- Config:
config/hyperparams/model5_ensemble.yaml - Hyperparameters: weight_method (inverse_rmse/softmax), meta_learner (ridge/elastic_net)
- Expected RMSE: 1.5-1.9 goals (🏆 WINNING MODEL)
# Use ALL advanced features
ruby cli.rb competitive-pipeline data/nhl.csv
# Features that win:
# - team_strength_index (composite metric)
# - pythagorean_wins (expected vs actual)
# - momentum_score (hot/cold streaks)
# - clutch_factor (close game performance)
# - rest_days + is_back_to_back
# - home_away_diff# Start with random search (fast exploration)
ruby cli.rb hyperparam-random config/hyperparams/model4_xgboost.yaml 100
# Then Bayesian optimization (smart exploitation)
ruby cli.rb hyperparam-bayesian config/hyperparams/model4_xgboost.yaml --iterations 50
# For final tuning: genetic algorithm
ruby cli.rb hyperparam-genetic config/hyperparams/model4_xgboost.yaml --population 50 --generations 30# Check diversity before ensembling
ruby cli.rb diversity-analysis predictions/ actuals.csv
# Good: diversity_score > 0.5 (models are complementary)
# Bad: diversity_score < 0.3 (models too similar, ensemble won't help)- Use
time_series_cv_splitin validation (expanding window) - Never use random splits for time series data
- Recent games should be test set, not training
# Check if predictions are well-calibrated
ruby cli.rb validate-model preds.csv --calibration
# Well-calibrated: mean_calibration_error < 0.1
# Poorly calibrated: predictions systematically over/under# Know your uncertainty
ruby cli.rb validate-model preds.csv --bootstrap
# Tight CI (< 0.2): High confidence
# Wide CI (> 0.5): Model unstable- Train RMSE - Test RMSE should be < 0.3
- If gap > 0.5: Reduce model complexity, add regularization
- Use learning curves to diagnose
CSVy/
├── lib/ # Core libraries
│ ├── advanced_features.rb # 14+ competition features
│ ├── model_validator.rb # CV, bootstrap, calibration
│ ├── ensemble_builder.rb # Stacking, blending, optimization
│ ├── hyperparameter_manager.rb # 5 optimization algorithms
│ ├── time_series_features.rb # Rolling, EWMA, lag
│ ├── csv_cleaner.rb # Data cleaning
│ ├── data_preprocessor.rb # Normalization, encoding
│ ├── csv_diagnostics.rb # Quality analysis
│ └── html_reporter.rb # Diagnostic reports
├── config/hyperparams/ # Model configurations
│ ├── model1_baseline.yaml # No hyperparams
│ ├── model2_linear_regression.yaml # 120 combinations
│ ├── model3_elo.yaml # 648 combinations
│ ├── model4_xgboost.yaml # 864 combinations
│ ├── model4_random_forest.yaml # 144 combinations
│ └── model5_ensemble.yaml # 36 combinations
├── scripts/
│ ├── competitive_pipeline.rb # Full preprocessing pipeline
│ └── preprocess_hockey.sh # Batch preprocessing
├── data/
│ ├── sample_nhl_standings.csv # Test data
│ └── processed/ # Output directory
├── experiments/ # Tracking files
├── cli.rb # 40+ CLI commands
├── README.md # This file
└── Gemfile # Dependencies
# ===== PHASE 1: PREPROCESSING =====
ruby cli.rb competitive-pipeline data/nhl_season_2024.csv -o data/processed
# Output: competitive_features.csv (with 14+ advanced features)
# ===== PHASE 2: HYPERPARAMETER GENERATION =====
# Generate configs for all 5 models
ruby cli.rb hyperparam-grid config/hyperparams/model2_linear_regression.yaml
ruby cli.rb hyperparam-bayesian config/hyperparams/model3_elo.yaml --iterations 30
ruby cli.rb hyperparam-genetic config/hyperparams/model4_xgboost.yaml --population 50 --generations 20
ruby cli.rb hyperparam-grid config/hyperparams/model4_random_forest.yaml
ruby cli.rb hyperparam-grid config/hyperparams/model5_ensemble.yaml
# ===== PHASE 3: PUSH TO GITHUB =====
git add .
git commit -m "Add competitive features and hyperparameter grids"
git push origin main
# ===== PHASE 4: TRAIN IN DEEPNOTE (Python) =====
# (See Phase 3 example above)
# Train all models, record rmse/mae/r2 in tracking CSVs
# ===== PHASE 5: PULL RESULTS =====
git pull origin main
# ===== PHASE 6: FIND BEST PARAMS =====
ruby cli.rb best-params experiments/xgb_genetic.csv --metric rmse
ruby cli.rb best-params experiments/rf_grid.csv --metric rmse
ruby cli.rb best-params experiments/elo_bayesian.csv --metric rmse
# ===== PHASE 7: ENSEMBLE OPTIMIZATION =====
# After generating predictions from all 5 models
ruby cli.rb ensemble-optimize predictions/ --actuals data/test_actuals.csv -o optimal_weights.csv
ruby cli.rb diversity-analysis predictions/ data/test_actuals.csv
# ===== PHASE 8: FINAL VALIDATION =====
ruby cli.rb validate-model final_ensemble_preds.csv --bootstrap --calibration| Model | RMSE | MAE | R² | Notes |
|---|---|---|---|---|
| Baseline | 3.5 | 2.8 | 0.40 | Benchmark |
| Linear Regression | 2.7 | 2.1 | 0.65 | With poly features |
| ELO Rating | 2.4 | 1.9 | 0.72 | With MOV adjustments |
| XGBoost | 2.0 | 1.6 | 0.82 | Best individual |
| Random Forest | 2.2 | 1.7 | 0.78 | Good diversity |
| Ensemble | 1.7 | 1.3 | 0.87 | 🏆 WINNING |
Performance improves by 50% from baseline to ensemble!
- Ruby 2.7+
- Thor gem (
gem install thor) - Standard library: CSV, Logger, FileUtils, Date
bundle install- Create DeepNote project
- Connect GitHub repository
- Set up automatic sync (webhook)
# Local (CSVy)
ruby cli.rb competitive-pipeline data/nhl.csv
git push
# DeepNote (auto-pulls from GitHub)
# Train models in Python
git push # Push results back
# Local (CSVy)
git pull
ruby cli.rb best-params experiments/grid.csv --metric rmseCSVy contains everything needed to dominate hockey prediction competitions:
40+ CLI commands
14+ advanced features (momentum, clutch, pythagorean, strength index)
5 optimization algorithms (Grid, Random, Bayesian, Genetic, Annealing)
Ensemble methods (Stacking, Blending, Weighted voting)
Model validation (Bootstrap CI, Calibration, Time series CV)
6 hyperparameter configs (2,812 total combinations)
Complete preprocessing pipeline
Diversity analysis (Ensure complementary models)
Overfitting detection (Train vs test monitoring)
- Mathematical rigor: Bayesian optimization, Gaussian Processes, Expected Improvement
- Domain expertise: Hockey-specific features (rest days, b2b, clutch, home/away)
- Ensemble intelligence: Optimal weight optimization, diversity checks
- No data leakage: Time series CV with expanding window
- Production-ready: Auto-directory creation, comprehensive logging
# List all commands
ruby cli.rb help
# Help for specific command
ruby cli.rb help competitive-pipeline
ruby cli.rb help hyperparam-bayesian
ruby cli.rb help ensemble-optimizeLet's win this!
x_standardized = (x - μ) / σ
where μ = mean, σ = standard deviation
- Centers data around 0 with std dev of 1
- Example: [10, 20, 30] → [-1.22, 0.0, 1.22]
-
Missing Value Handling: Multiple strategies
mean: Fill with column averagemedian: Fill with middle value (robust to outliers)mode: Fill with most frequent valueforward_fill: Propagate last valid valuebackward_fill: Use next valid value
-
Outlier Detection
- IQR Method:
outlier if x < Q1 - 1.5×IQR or x > Q3 + 1.5×IQR- Q1 = 25th percentile, Q3 = 75th percentile
- IQR = Q3 - Q1 (Interquartile Range)
- Z-Score Method:
outlier if |z| > 3- Removes values more than 3 standard deviations from mean
- IQR Method:
-
Binning: Convert continuous → categorical
- Example: Age [0-18, 19-35, 36-60, 61+] → ['child', 'young', 'adult', 'senior']
-
Descriptive Statistics:
Mean (μ) = Σx / n Median = middle value when sorted Mode = most frequent value Std Dev (σ) = √(Σ(x - μ)² / n) Variance (σ²) = Σ(x - μ)² / n -
Quartiles & Percentiles:
- Q1 (25th percentile): 25% of data below
- Q2 (50th percentile): Median
- Q3 (75th percentile): 75% of data below
-
Data Validation: Comprehensive quality checks
- Empty rows detection
- Duplicate identification
- Missing value analysis
- Data type inference
-
Data Profiling: Column-level metrics
- Cardinality:
unique_values / total_values - Missing rate:
missing_count / total_count × 100% - Distribution analysis
- Cardinality:
-
Integrity Checks: Custom validation rules
- Range validation:
min ≤ value ≤ max - Pattern matching: Regex validation
- Uniqueness constraints
- Enum validation
- Range validation:
- Daru Integration: Dataframe operations (filter, sort, group, aggregate)
- SQLite Database: Import/export CSV data, run SQL queries
- CLI Interface: User-friendly command-line interface
- Comprehensive Testing: Full RSpec test suite with 90%+ coverage
- Rake Tasks: Automated workflows and examples
- Language: Ruby
- CSV Handling: Ruby CSV (Standard Library)
- Data Manipulation: Daru (Optional, for advanced operations)
- CLI Framework: Thor
- Testing: RSpec
- Environment Management: Dotenv
- Database (Optional): SQLite3
- Ruby 2.7 or higher
- Bundler
- Clone the repository:
git clone https://github.com/smiles0527/CSVy.git
cd CSVy- Install dependencies:
gem install bundler
bundle install- Try with sample hockey data:
ruby cli.rb info data/sample_employees.csv
ruby cli.rb validate data/sample_students_dirty.csv# Install dependencies
bundle install
# Run examples to see features in action
rake examples
# Or run individual examples
rake example_validate
rake example_clean# Clean player statistics with missing values
ruby cli.rb clean data/sample_students_dirty.csv# One-hot encode player positions
ruby cli.rb encode data/sample_employees.csv position -t onehot
# Normalize goals scored across season
ruby cli.rb normalize data/sample_employees.csv goals -m minmax
# Standardize plus/minus ratings
ruby cli.rb normalize data/sample_employees.csv plus_minus -m zscore# Validate player data quality
ruby cli.rb validate data/sample_employees.csv
# Generate statistics for team performance
ruby cli.rb stats data/sample_products.csv
# Profile game results dataset
ruby cli.rb profile data/sample_weather.csv# Merge player stats from multiple seasons
ruby cli.rb merge season1.csv season2.csv -o combined_stats.csv
# Display team standings information
ruby cli.rb info data/sample_products.csv# Import player data to database
ruby cli.rb db-import data/sample_employees.csv players
# Import team standings
ruby cli.rb db-import data/sample_products.csv teams
# Query top scorers
ruby cli.rb db-query "SELECT * FROM players WHERE goals > 50 ORDER BY goals DESC"
# List all hockey data tables
ruby cli.rb db-tablesrequire_relative 'lib/csv_processor'
require_relative 'lib/csv_cleaner'
require_relative 'lib/csv_merger'
# Clean a CSV file
CSVProcessor.clean('data/input.csv')
# Merge CSV files
CSVProcessor.merge('data/file1.csv', 'data/file2.csv', 'output.csv')
# Advanced cleaning
cleaner = CSVCleaner.new('data/input.csv')
cleaned_data = cleaner.clean_data
cleaner.normalize_column(cleaned_data, 'age')
cleaner.save_to_csv(cleaned_data, 'cleaned_output.csv')
# Advanced merging
merger = CSVMerger.new
merged = merger.join_on_column('file1.csv', 'file2.csv', key_column: 'id')
merger.save_to_csv(merged, 'joined.csv')Run the test suite:
bundle exec rspecRun specific test file:
bundle exec rspec spec/csv_processor_spec.rbRun with verbose output:
bundle exec rspec --format documentationCSVy/
├── lib/
│ ├── csv_processor.rb # Main CSV processing orchestrator
│ ├── csv_cleaner.rb # Data cleaning functions
│ ├── csv_merger.rb # CSV merging operations
│ ├── data_preprocessor.rb # One-hot encoding, normalization, outlier removal
│ ├── data_validator.rb # Validation, statistics, profiling
│ ├── dataframe_handler.rb # Daru dataframe operations
│ └── database_manager.rb # SQLite database integration
├── spec/
│ ├── csv_processor_spec.rb
│ ├── csv_cleaner_spec.rb
│ ├── csv_merger_spec.rb
│ ├── data_preprocessor_spec.rb
│ ├── data_validator_spec.rb
│ └── spec_helper.rb
├── data/ # Sample CSV files included
│ ├── sample_employees.csv
│ ├── sample_products.csv
│ ├── sample_weather.csv
│ └── sample_students_dirty.csv
├── cli.rb # Command-line interface
├── Rakefile # Rake tasks for automation
├── Gemfile # Ruby dependencies
├── .gitignore
├── .env.example
├── USAGE_GUIDE.md # Comprehensive usage examples
└── README.md
ruby cli.rb [command] [options]- Create new methods in appropriate lib files
- Add corresponding tests in spec/ directory
- Update CLI commands in cli.rb if needed
- Run tests to ensure everything works
csv- Built-in CSV handlingthor- CLI frameworkdaru- Dataframe operations (optional)rspec- Testing frameworkdotenv- Environment variable managementsqlite3- Database support (optional)
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (
git checkout -b feature/amazing-feature) - Commit your changes (
git commit -m 'Add amazing feature') - Push to the branch (
git push origin feature/amazing-feature) - Open a Pull Request
This project is open source and available under the MIT License.
smiles0527
- Ruby CSV Standard Library
- Thor CLI Framework
- RSpec Testing Framework
- The Ruby community
For issues, questions, or contributions, please open an issue on GitHub.
CSVy - Professional CSV processing and organization tool for Ruby.