🎯 Go Rank Prediction System

A comprehensive machine learning system for predicting Go player ranks (1D-9D) from game move sequences. This project implements a multi-model ensemble approach combining traditional ML and deep learning techniques to achieve strong predictive performance.

License: MIT · Python 3.10+ · PyTorch


📋 Table of Contents

  • Features
  • Quick Start
  • Installation
  • Project Structure
  • Usage
  • Model Architecture
  • Feature Engineering
  • Performance
  • Advanced Usage
  • Troubleshooting
  • Contributing
  • License
  • Acknowledgments
  • Contact
  • Additional Resources

✨ Features

🎨 Multi-Model Ensemble

  • 9 different model architectures ranging from traditional ML to cutting-edge deep learning
  • Automatic ensemble with probability averaging for robust predictions
  • Support for both tabular and sequence-based approaches

🔧 Comprehensive Feature Engineering

  • ~389 base features with statistical aggregations
  • ~48 enhanced features with domain-specific Go metrics
  • ~437 combined features for maximum model capacity
  • Temporal dynamics, phase-specific patterns, and strategic indicators

🚀 Production-Ready Pipeline

  • Automatic model training if artifacts are missing
  • Progress tracking with tqdm progress bars
  • Cross-validation and early stopping
  • GPU acceleration with automatic fallback to CPU

📊 Advanced Sequence Modeling

  • Handles variable-length sequences (50-8000+ moves)
  • Long-sequence support with S4, PatchTST, and BigBird
  • Multiple chunking strategies for memory efficiency
  • Data augmentation with overlapping sub-games

🚀 Quick Start

# Clone the repository
git clone https://github.com/solitude6060/Go-Rank-Prediction.git
cd Go-Rank-Prediction

# Install dependencies
pip install -r requirements.txt

# Train a baseline model
python training.py --model xgboost_v2 --feature-mode combined

# Generate submission file
python Q5.py --model auto

# Output: submission.csv and submission_meta.json

Note: Training data (train_set/) and test data (test_set/) are not included in this repository due to size constraints. Place them in the project root directory before running.


📦 Installation

Prerequisites

  • Python 3.10 or higher
  • CUDA-compatible GPU (optional, for faster training)
  • 8GB+ RAM recommended

Dependencies

# Core dependencies
pip install numpy pandas scikit-learn xgboost torch tqdm

# Optional: For long-sequence models
pip install einops transformers

Or install all at once:

pip install -r requirements.txt

πŸ“ Project Structure

Go-Rank-Prediction/
├── README.md                          # This file
├── README.txt                         # Plain text documentation backup
├── LICENSE                            # MIT License
├── requirements.txt                   # Python dependencies
├── .gitignore                         # Git ignore rules
│
├── Q5.py                              # Main inference script
├── training.py                        # Main training script
│
├── src/                               # Core source modules
│   ├── __init__.py
│   ├── data.py                        # Data parsing & loading
│   ├── features.py                    # Base feature engineering (~389 features)
│   ├── features_enhanced.py           # Enhanced features (~48 features)
│   ├── augmentation.py                # Data augmentation utilities
│   ├── models_sklearn.py              # Sklearn models (Logistic Regression)
│   ├── models_xgb.py                  # XGBoost (base)
│   ├── models_xgb_enhanced.py         # XGBoost v2 (with enhanced features)
│   ├── models_torch.py                # PyTorch models (MLP/GRU/Transformer)
│   └── models_longseq.py              # Long-sequence models (S4/PatchTST/BigBird)
│
├── submissions/                       # Generated submission files
│   ├── submission_*.csv
│   └── submission_meta.json
│
├── tests/                             # Test and debug scripts
│   ├── debug_device.py
│   ├── test_gpu.py
│   └── test_game_split.sh
│
├── utils/                             # Utility scripts
│   ├── check_data_stats.py            # Dataset statistics
│   ├── compare_results.py             # Compare submission files
│   ├── diagnose_features.py           # Feature distribution analysis
│   └── data_utils_spilt.py            # Game splitting utilities
│
├── scripts/                           # Experiment automation
│   ├── run_all_experiments_with_game_split.sh
│   └── run_key_experiments_game_split.sh
│
├── docs/                              # Additional documentation
│   ├── EXPERIMENTS_README.md          # Experiment tracking guide
│   └── QUICK_START.md                 # Quick start tutorial
│
├── submission_sample.csv              # Sample submission format
│
├── train_set/                         # Training data (not in repo - too large)
├── test_set/                          # Test data (not in repo - too large)
└── models/                            # Trained models (not in repo - too large)

📚 Usage

Training Models

The training.py script provides a unified CLI for training all supported models.

Basic Models

# Logistic Regression baseline
python training.py --model sklearn

# XGBoost (base features)
python training.py --model xgboost

# XGBoost v2 (enhanced features - recommended)
python training.py --model xgboost_v2 --feature-mode combined

Deep Learning Models

# Multi-Layer Perceptron on aggregated features
python training.py --model mlp --epochs 50 --batch-size 32

# GRU sequence model
python training.py --model gru --max-len 512 --epochs 20

# Transformer sequence model
python training.py --model transformer --batch-size 8 --epochs 15

Long-Sequence Models

For games with 1000+ moves:

# S4 with Multi-Instance Learning
python training.py --model s4_mil --max-len 2048 --batch-size 16

# PatchTST encoder + XGBoost head
python training.py --model patchtst_xgb --max-len 2048

# BigBird hierarchical attention
python training.py --model bigbird_hier --max-len 8192 --batch-size 4

Advanced Training Options

# With data augmentation
python training.py --model gru --augment-data

# With game splitting
python training.py --model xgboost_v2 --use-game-split

# With cross-validation
python training.py --model xgboost_v2 --xgb-use-cv --xgb-n-splits 5

# Custom hyperparameters
python training.py --model transformer \
    --max-len 1024 \
    --batch-size 16 \
    --epochs 30 \
    --lr 1e-3 \
    --weight-decay 1e-4

Generating Submissions

The Q5.py script generates Kaggle-ready submission files.

Auto Ensemble (Recommended)

Automatically uses all available trained models:

python Q5.py --model auto

This will:

  1. Load all trained models from models/
  2. Generate predictions from each model
  3. Average probability distributions
  4. Output submission.csv and submission_meta.json
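
Step 3, the probability averaging, is a plain element-wise mean over each model's per-class output. A minimal sketch of the idea, assuming each model exposes an (n_games, 9) probability matrix per rank class; this is an illustration, not the exact Q5.py internals:

import numpy as np

def ensemble_average(prob_matrices):
    """Average per-rank probabilities from several models.

    prob_matrices: list of (n_games, 9) arrays, one per model,
    each row a probability distribution over ranks 1D-9D.
    """
    avg = np.mean(np.stack(prob_matrices, axis=0), axis=0)
    preds = avg.argmax(axis=1)  # predicted rank index per game
    return avg, preds

# avg_probs, pred_ranks = ensemble_average([probs_xgb, probs_gru, probs_tfm])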

Single Model

Use a specific model for inference:

# Use XGBoost v2
python Q5.py --model xgboost_v2

# Use Transformer
python Q5.py --model transformer

# Use GRU
python Q5.py --model gru

Custom Paths

python Q5.py \
    --data-dir /path/to/data \
    --models-dir /path/to/models \
    --submission my_submission.csv

🧠 Model Architecture

Traditional Machine Learning

Model        Description                    Training Time   Features Used
sklearn      Logistic Regression baseline   ~1 min          ~389
xgboost      Gradient Boosting (base)       ~5 min          ~389
xgboost_v2   Enhanced XGBoost               ~10 min         389/48/437

Deep Learning - Tabular

Model   Description   Training Time   Input
mlp     3-layer MLP   ~5 min          ~389 features

Deep Learning - Sequence

Model         Description                     Training Time   Max Length
gru           Bidirectional GRU + Attention   ~15 min         512-1024
transformer   Multi-head Self-Attention       ~20 min         512-1024
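
As a point of reference, the gru row corresponds to the familiar "bidirectional GRU + attention pooling" pattern. Below is a self-contained PyTorch sketch of that pattern, assuming a 376-dim per-move input and illustrative layer sizes; the real model lives in src/models_torch.py and may differ:

import torch
import torch.nn as nn

class BiGRUAttention(nn.Module):
    """Bidirectional GRU over moves, attention pooling, 9-way rank head."""

    def __init__(self, input_dim=376, hidden=128, num_classes=9):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)        # one score per move
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                            # x: (batch, moves, input_dim)
        h, _ = self.gru(x)                           # (batch, moves, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)       # attention over moves
        pooled = (w * h).sum(dim=1)                  # (batch, 2*hidden)
        return self.head(pooled)                     # rank logits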

Deep Learning - Long Sequence

Model          Description                       Training Time   Max Length
s4_mil         Structured State Space + MIL      ~30 min         2048-4096
patchtst_xgb   Patch-based TST + XGB head        ~25 min         2048-4096
bigbird_hier   Sparse Attention (hierarchical)   ~40 min         4096-8192

🔬 Feature Engineering

Per-Move Raw Features (376 dimensions)

Each move in a Go game is represented by:

  • Basic move info (9 features)
    • Color (black/white)
    • Coordinates (x, y)
    • Is pass move
    • Move fraction (position in game)
    • Strength, winrate, lead, uncertainty
  • Policy distribution (9 values)
    • Predicted rank probabilities for this move
  • Value estimates (9 values)
    • Win probability estimates per rank
  • Rank probabilities (9 values)
    • Neural network rank predictions

Aggregated Features

Base Features (~389)

Generated by features.py:

  • Global statistics: num_moves, white_fraction
  • Coordinate stats: mean, std, min, max, median, IQR for x/y
  • Per-feature aggregations: 376 features × 6 stats each
  • Temporal trends: winrate_trend, lead_trend, strength_trend
  • Phase-specific: opening/endgame statistics
  • Volatility metrics: move quality variation
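
The per-feature aggregation is the workhorse here: each raw per-move feature is collapsed into a fixed set of summary statistics, so variable-length games map to equal-length vectors. A simplified sketch of that pattern, assuming a (num_moves, n_features) NumPy matrix per game; the exact stat set and ordering in src/features.py may differ:

import numpy as np

def aggregate_per_feature(per_move):
    """per_move: (num_moves, n_features) -> 6 summary stats per feature."""
    q75, q25 = np.percentile(per_move, [75, 25], axis=0)
    stats = [
        per_move.mean(axis=0),
        per_move.std(axis=0),
        per_move.min(axis=0),
        per_move.max(axis=0),
        np.median(per_move, axis=0),
        q75 - q25,                      # interquartile range
    ]
    return np.concatenate(stats)        # fixed length regardless of game length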

Enhanced Features (~48)

Generated by features_enhanced.py:

  • Opening phase (first 20 moves): pattern recognition
  • Middle game (moves 21-100): dynamic analysis
  • Endgame (last 20 moves): precision metrics
  • Consistency metrics: move quality stability
  • Strategic indicators: aggressive/defensive indices
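
These features all follow the same recipe: slice the move sequence by game phase, then aggregate within each slice. A simplified sketch of the slicing, using the phase boundaries described above (src/features_enhanced.py computes richer metrics per slice):

def phase_slices(per_move):
    """Split one game's per-move features into opening/middle/endgame windows."""
    opening = per_move[:20]        # first 20 moves
    middle = per_move[20:100]      # moves 21-100 (empty for very short games)
    endgame = per_move[-20:]       # last 20 moves
    return opening, middle, endgame

# Aggregating each slice separately yields columns like "opening mean winrate"
# vs. "endgame mean winrate", which capture phase-specific skill signals.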

Combined Features (~437)

Base + Enhanced for maximum model capacity.

Feature Modes (XGBoost v2)

# Base features only (~389)
python training.py --model xgboost_v2 --feature-mode base

# Enhanced features only (~48)
python training.py --model xgboost_v2 --feature-mode enhanced

# All features (~437) - recommended
python training.py --model xgboost_v2 --feature-mode combined

📊 Performance

Typical Validation Accuracies

Model             Accuracy   Training Time   GPU Required
sklearn           65-70%     1 min           No
xgboost           70-75%     5 min           No
xgboost_v2        75-80%     10 min          No
mlp               68-72%     5 min           Optional
gru               72-76%     15 min          Recommended
transformer       74-78%     20 min          Recommended
s4_mil            76-80%     30 min          Recommended
patchtst_xgb      77-81%     25 min          Recommended
bigbird_hier      78-82%     40 min          Required
auto (ensemble)   80-85%     -               -

Recommendations by Use Case

  • πŸƒ Quick baseline: sklearn or xgboost
  • 🎯 Best single model: xgboost_v2 (combined mode)
  • πŸ“ Long sequences (>1000 moves): s4_mil or bigbird_hier
  • πŸ† Competition submission: auto (ensemble all models)
  • πŸ’» GPU available: transformer or gru
  • ⚑ Limited compute: sklearn or mlp

🔧 Advanced Usage

Data Augmentation

Extract overlapping sub-games to increase training samples:

python training.py --model gru --augment-data
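
The idea is that a player's strength is visible in any long-enough stretch of a game, so overlapping windows over one record become several training samples sharing the same rank label. A minimal sketch with illustrative window/stride values; see src/augmentation.py for the actual logic:

def overlapping_subgames(moves, window=200, stride=100):
    """Yield overlapping sub-sequences of one game as extra samples."""
    for start in range(0, max(len(moves) - window, 0) + 1, stride):
        yield moves[start:start + window]

# A 500-move game with window=200 and stride=100 yields sub-games starting
# at moves 0, 100, 200, and 300: four samples instead of one.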

Game Splitting

Split concatenated multi-game records into individual games:

python training.py --model xgboost_v2 --use-game-split

Chunk Strategies

Handle long sequences with different strategies:

# Take first N moves only
python training.py --model gru --chunk-strategy truncate

# Uniform sampling across sequence
python training.py --model gru --chunk-strategy sample --num-chunks 10

# Take last N moves only
python training.py --model gru --chunk-strategy tail
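
The three strategies trade coverage for memory differently: truncate and tail keep one contiguous block from either end of the game, while sample stitches together fixed-size chunks spread uniformly across it. A rough sketch of all three; argument names mirror the CLI flags, but the code is illustrative:

import numpy as np

def chunk_sequence(moves, strategy, max_len=512, num_chunks=10):
    if strategy == "truncate":            # keep the first max_len moves
        return moves[:max_len]
    if strategy == "tail":                # keep the last max_len moves
        return moves[-max_len:]
    if strategy == "sample":              # uniform chunks across the game
        chunk_len = max_len // num_chunks
        starts = np.linspace(0, max(len(moves) - chunk_len, 0),
                             num_chunks, dtype=int)
        return np.concatenate([moves[s:s + chunk_len] for s in starts])
    raise ValueError(f"unknown chunk strategy: {strategy}")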

Batch Experiments

Run multiple experiments automatically:

# Run key experiments
bash scripts/run_key_experiments_game_split.sh

# Run all experiments
bash scripts/run_all_experiments_with_game_split.sh

Utility Scripts

# Check dataset statistics
python utils/check_data_stats.py

# Compare submission files
python utils/compare_results.py submissions/submission_*.csv

# Analyze feature distributions
python utils/diagnose_features.py

πŸ› Troubleshooting

CUDA Out of Memory

Problem: RuntimeError: CUDA out of memory

Solutions:

# Reduce batch size
python training.py --model transformer --batch-size 4

# Reduce max sequence length
python training.py --model gru --max-len 256

# Use CPU instead
python training.py --model gru --device cpu

Model File Not Found

Problem: FileNotFoundError: Failed to prepare model artifact

Solutions:

# Train the model first
python training.py --model gru

# Or use auto mode to skip missing models
python Q5.py --model auto

Feature Dimension Mismatch

Problem: ValueError: Feature dimension mismatch

Solution: Ensure that inference uses the same --feature-mode (base, enhanced, or combined) that the XGBoost v2 model was trained with.

# Check .feature_names.json for the feature mode used during training
cat models/xgboost_v2_model.feature_names.json
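
To check this programmatically, compare the saved names against the feature count you expect; a quick sketch, assuming the JSON file holds a flat list of feature names:

import json

with open("models/xgboost_v2_model.feature_names.json") as f:
    trained_names = json.load(f)
print(f"model trained with {len(trained_names)} features")
# ~389 -> base, ~48 -> enhanced, ~437 -> combined; rerun Q5.py (or retrain)
# with the matching --feature-mode.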

Training Too Slow

Problem: Training takes too long

Solutions:

# Use smaller model
python training.py --model mlp --epochs 10

# Reduce epochs
python training.py --model gru --epochs 10

# Use base features only
python training.py --model xgboost_v2 --feature-mode base

Poor Validation Accuracy

Problem: Model accuracy is too low

Solutions:

# Try data augmentation
python training.py --model xgboost --augment-data

# Try game splitting
python training.py --model xgboost_v2 --use-game-split

# Use ensemble
python Q5.py --model auto

🤝 Contributing

Contributions are welcome! Here are some ways you can contribute:

  1. Report bugs - Open an issue describing the bug and how to reproduce it
  2. Suggest features - Propose new features or improvements
  3. Submit pull requests - Fix bugs or implement new features
  4. Improve documentation - Help make the docs clearer and more comprehensive

Development Setup

# Clone the repository
git clone https://github.com/solitude6060/Go-Rank-Prediction.git
cd Go-Rank-Prediction

# Install development dependencies
pip install -r requirements.txt

# Run tests
bash tests/test_game_split.sh
python tests/test_gpu.py

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


πŸ™ Acknowledgments

  • KataGo - Neural Go engine used for feature extraction
  • PyTorch Team - Deep learning framework
  • XGBoost Team - Gradient boosting library
  • scikit-learn Team - Machine learning utilities

📞 Contact

For questions, issues, or collaboration opportunities, please open an issue on the GitHub repository.


📚 Additional Resources

  • docs/QUICK_START.md - Quick start tutorial
  • docs/EXPERIMENTS_README.md - Experiment tracking guide

⭐ If you find this project helpful, please consider giving it a star! ⭐

Made with ❤️ for the Go and Machine Learning communities

About

NTU Machine Learning Class Fall 2025 Assignment 1 Q5
