🎯 Go Rank Prediction System

A comprehensive machine learning system for predicting Go player ranks (1D-9D) from game move sequences. This project implements a multi-model ensemble approach combining traditional ML and deep learning techniques to achieve strong predictive performance.

License: MIT · Python 3.10+ · PyTorch


📋 Table of Contents

  • Features
  • Quick Start
  • Installation
  • Project Structure
  • Usage
  • Model Architecture
  • Feature Engineering
  • Performance
  • Advanced Usage
  • Troubleshooting
  • Contributing
  • License
  • Acknowledgments
  • Contact
  • Additional Resources

✨ Features

🎨 Multi-Model Ensemble

  • 9 different model architectures ranging from traditional ML to cutting-edge deep learning
  • Automatic ensemble with probability averaging for robust predictions
  • Support for both tabular and sequence-based approaches

🔧 Comprehensive Feature Engineering

  • ~389 base features with statistical aggregations
  • ~48 enhanced features with domain-specific Go metrics
  • ~437 combined features for maximum model capacity
  • Temporal dynamics, phase-specific patterns, and strategic indicators

🚀 Production-Ready Pipeline

  • Automatic model training if artifacts are missing
  • Progress tracking with tqdm progress bars
  • Cross-validation and early stopping
  • GPU acceleration with automatic fallback to CPU

📊 Advanced Sequence Modeling

  • Handles variable-length sequences (50-8000+ moves)
  • Long-sequence support with S4, PatchTST, and BigBird
  • Multiple chunking strategies for memory efficiency
  • Data augmentation with overlapping sub-games

🚀 Quick Start

# Clone the repository
git clone https://github.com/solitude6060/Go-Rank-Prediction.git
cd Go-Rank-Prediction

# Install dependencies
pip install -r requirements.txt

# Train a baseline model
python training.py --model xgboost_v2 --feature-mode combined

# Generate submission file
python Q5.py --model auto

# Output: submission.csv and submission_meta.json

Note: Training data (train_set/) and test data (test_set/) are not included in this repository due to size constraints. Place them in the project root directory before running.


📦 Installation

Prerequisites

  • Python 3.10 or higher
  • CUDA-compatible GPU (optional, for faster training)
  • 8GB+ RAM recommended

Dependencies

# Core dependencies
pip install numpy pandas scikit-learn xgboost torch tqdm

# Optional: For long-sequence models
pip install einops transformers

Or install all at once:

pip install -r requirements.txt

πŸ“ Project Structure

Go-Rank-Prediction/
├── README.md                          # This file
├── README.txt                         # Plain text documentation backup
├── LICENSE                            # MIT License
├── requirements.txt                   # Python dependencies
├── .gitignore                         # Git ignore rules
│
├── Q5.py                              # Main inference script
├── training.py                        # Main training script
│
├── src/                               # Core source modules
│   ├── __init__.py
│   ├── data.py                        # Data parsing & loading
│   ├── features.py                    # Base feature engineering (~389 features)
│   ├── features_enhanced.py           # Enhanced features (~48 features)
│   ├── augmentation.py                # Data augmentation utilities
│   ├── models_sklearn.py              # Sklearn models (Logistic Regression)
│   ├── models_xgb.py                  # XGBoost (base)
│   ├── models_xgb_enhanced.py         # XGBoost v2 (with enhanced features)
│   ├── models_torch.py                # PyTorch models (MLP/GRU/Transformer)
│   └── models_longseq.py              # Long-sequence models (S4/PatchTST/BigBird)
│
├── submissions/                       # Generated submission files
│   ├── submission_*.csv
│   └── submission_meta.json
│
├── tests/                             # Test and debug scripts
│   ├── debug_device.py
│   ├── test_gpu.py
│   └── test_game_split.sh
│
├── utils/                             # Utility scripts
│   ├── check_data_stats.py            # Dataset statistics
│   ├── compare_results.py             # Compare submission files
│   ├── diagnose_features.py           # Feature distribution analysis
│   └── data_utils_spilt.py            # Game splitting utilities
│
├── scripts/                           # Experiment automation
│   ├── run_all_experiments_with_game_split.sh
│   └── run_key_experiments_game_split.sh
│
├── docs/                              # Additional documentation
│   ├── EXPERIMENTS_README.md          # Experiment tracking guide
│   └── QUICK_START.md                 # Quick start tutorial
│
├── submission_sample.csv              # Sample submission format
│
├── train_set/                         # Training data (not in repo - too large)
├── test_set/                          # Test data (not in repo - too large)
└── models/                            # Trained models (not in repo - too large)

📚 Usage

Training Models

The training.py script provides a unified CLI for training all supported models.

Basic Models

# Logistic Regression baseline
python training.py --model sklearn

# XGBoost (base features)
python training.py --model xgboost

# XGBoost v2 (enhanced features - recommended)
python training.py --model xgboost_v2 --feature-mode combined

Deep Learning Models

# Multi-Layer Perceptron on aggregated features
python training.py --model mlp --epochs 50 --batch-size 32

# GRU sequence model
python training.py --model gru --max-len 512 --epochs 20

# Transformer sequence model
python training.py --model transformer --batch-size 8 --epochs 15

Long-Sequence Models

For games with 1000+ moves:

# S4 with Multi-Instance Learning
python training.py --model s4_mil --max-len 2048 --batch-size 16

# PatchTST encoder + XGBoost head
python training.py --model patchtst_xgb --max-len 2048

# BigBird hierarchical attention
python training.py --model bigbird_hier --max-len 8192 --batch-size 4

Advanced Training Options

# With data augmentation
python training.py --model gru --augment-data

# With game splitting
python training.py --model xgboost_v2 --use-game-split

# With cross-validation
python training.py --model xgboost_v2 --xgb-use-cv --xgb-n-splits 5

# Custom hyperparameters
python training.py --model transformer \
    --max-len 1024 \
    --batch-size 16 \
    --epochs 30 \
    --lr 1e-3 \
    --weight-decay 1e-4

Generating Submissions

The Q5.py script generates Kaggle-ready submission files.

Auto Ensemble (Recommended)

Automatically uses all available trained models:

python Q5.py --model auto

This will:

  1. Load all trained models from models/
  2. Generate predictions from each model
  3. Average probability distributions
  4. Output submission.csv and submission_meta.json
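
Step 3, the probability averaging, is a plain element-wise mean over each model's per-class output. A minimal sketch of the idea, assuming each model exposes an (n_games, 9) probability matrix per rank class; this is an illustration, not the exact Q5.py internals:

import numpy as np

def ensemble_average(prob_matrices):
    """Average per-rank probabilities from several models.

    prob_matrices: list of (n_games, 9) arrays, one per model,
    each row a probability distribution over ranks 1D-9D.
    """
    avg = np.mean(np.stack(prob_matrices, axis=0), axis=0)
    preds = avg.argmax(axis=1)  # predicted rank index per game
    return avg, preds

# avg_probs, pred_ranks = ensemble_average([probs_xgb, probs_gru, probs_tfm])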

Single Model

Use a specific model for inference:

# Use XGBoost v2
python Q5.py --model xgboost_v2

# Use Transformer
python Q5.py --model transformer

# Use GRU
python Q5.py --model gru

Custom Paths

python Q5.py \
    --data-dir /path/to/data \
    --models-dir /path/to/models \
    --submission my_submission.csv

🧠 Model Architecture

Traditional Machine Learning

Model        Description                    Training Time   Features Used
sklearn      Logistic Regression baseline   ~1 min          ~389
xgboost      Gradient Boosting (base)       ~5 min          ~389
xgboost_v2   Enhanced XGBoost               ~10 min         389/48/437

Deep Learning - Tabular

Model   Description   Training Time   Input
mlp     3-layer MLP   ~5 min          ~389 features

Deep Learning - Sequence

Model         Description                     Training Time   Max Length
gru           Bidirectional GRU + Attention   ~15 min         512-1024
transformer   Multi-head Self-Attention       ~20 min         512-1024
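
As a point of reference, the gru row corresponds to the familiar "bidirectional GRU + attention pooling" pattern. Below is a self-contained PyTorch sketch of that pattern, assuming a 376-dim per-move input and illustrative layer sizes; the real model lives in src/models_torch.py and may differ:

import torch
import torch.nn as nn

class BiGRUAttention(nn.Module):
    """Bidirectional GRU over moves, attention pooling, 9-way rank head."""

    def __init__(self, input_dim=376, hidden=128, num_classes=9):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)        # one score per move
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                            # x: (batch, moves, input_dim)
        h, _ = self.gru(x)                           # (batch, moves, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)       # attention over moves
        pooled = (w * h).sum(dim=1)                  # (batch, 2*hidden)
        return self.head(pooled)                     # rank logits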

Deep Learning - Long Sequence

Model          Description                       Training Time   Max Length
s4_mil         Structured State Space + MIL      ~30 min         2048-4096
patchtst_xgb   Patch-based TST + XGB head        ~25 min         2048-4096
bigbird_hier   Sparse Attention (hierarchical)   ~40 min         4096-8192

🔬 Feature Engineering

Per-Move Raw Features (376 dimensions)

Each move in a Go game is represented by:

  • Basic move info (9 features)
    • Color (black/white)
    • Coordinates (x, y)
    • Is pass move
    • Move fraction (position in game)
    • Strength, winrate, lead, uncertainty
  • Policy distribution (9 values)
    • Predicted rank probabilities for this move
  • Value estimates (9 values)
    • Win probability estimates per rank
  • Rank probabilities (9 values)
    • Neural network rank predictions

Aggregated Features

Base Features (~389)

Generated by features.py:

  • Global statistics: num_moves, white_fraction
  • Coordinate stats: mean, std, min, max, median, IQR for x/y
  • Per-feature aggregations: 376 features × 6 stats each
  • Temporal trends: winrate_trend, lead_trend, strength_trend
  • Phase-specific: opening/endgame statistics
  • Volatility metrics: move quality variation
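
The per-feature aggregation is the workhorse here: each raw per-move feature is collapsed into a fixed set of summary statistics, so variable-length games map to equal-length vectors. A simplified sketch of that pattern, assuming a (num_moves, n_features) NumPy matrix per game; the exact stat set and ordering in src/features.py may differ:

import numpy as np

def aggregate_per_feature(per_move):
    """per_move: (num_moves, n_features) -> 6 summary stats per feature."""
    q75, q25 = np.percentile(per_move, [75, 25], axis=0)
    stats = [
        per_move.mean(axis=0),
        per_move.std(axis=0),
        per_move.min(axis=0),
        per_move.max(axis=0),
        np.median(per_move, axis=0),
        q75 - q25,                      # interquartile range
    ]
    return np.concatenate(stats)        # fixed length regardless of game length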

Enhanced Features (~48)

Generated by features_enhanced.py:

  • Opening phase (first 20 moves): pattern recognition
  • Middle game (moves 21-100): dynamic analysis
  • Endgame (last 20 moves): precision metrics
  • Consistency metrics: move quality stability
  • Strategic indicators: aggressive/defensive indices
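
These features all follow the same recipe: slice the move sequence by game phase, then aggregate within each slice. A simplified sketch of the slicing, using the phase boundaries described above (src/features_enhanced.py computes richer metrics per slice):

def phase_slices(per_move):
    """Split one game's per-move features into opening/middle/endgame windows."""
    opening = per_move[:20]        # first 20 moves
    middle = per_move[20:100]      # moves 21-100 (empty for very short games)
    endgame = per_move[-20:]       # last 20 moves
    return opening, middle, endgame

# Aggregating each slice separately yields columns like "opening mean winrate"
# vs. "endgame mean winrate", which capture phase-specific skill signals.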

Combined Features (~437)

Base + Enhanced for maximum model capacity.

Feature Modes (XGBoost v2)

# Base features only (~389)
python training.py --model xgboost_v2 --feature-mode base

# Enhanced features only (~48)
python training.py --model xgboost_v2 --feature-mode enhanced

# All features (~437) - recommended
python training.py --model xgboost_v2 --feature-mode combined

📊 Performance

Typical Validation Accuracies

Model             Accuracy   Training Time   GPU Required
sklearn           65-70%     1 min           No
xgboost           70-75%     5 min           No
xgboost_v2        75-80%     10 min          No
mlp               68-72%     5 min           Optional
gru               72-76%     15 min          Recommended
transformer       74-78%     20 min          Recommended
s4_mil            76-80%     30 min          Recommended
patchtst_xgb      77-81%     25 min          Recommended
bigbird_hier      78-82%     40 min          Required
auto (ensemble)   80-85%     -               -

Recommendations by Use Case

  • πŸƒ Quick baseline: sklearn or xgboost
  • 🎯 Best single model: xgboost_v2 (combined mode)
  • πŸ“ Long sequences (>1000 moves): s4_mil or bigbird_hier
  • πŸ† Competition submission: auto (ensemble all models)
  • πŸ’» GPU available: transformer or gru
  • ⚑ Limited compute: sklearn or mlp

🔧 Advanced Usage

Data Augmentation

Extract overlapping sub-games to increase training samples:

python training.py --model gru --augment-data
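
The idea is that a player's strength is visible in any long-enough stretch of a game, so overlapping windows over one record become several training samples sharing the same rank label. A minimal sketch with illustrative window/stride values; see src/augmentation.py for the actual logic:

def overlapping_subgames(moves, window=200, stride=100):
    """Yield overlapping sub-sequences of one game as extra samples."""
    for start in range(0, max(len(moves) - window, 0) + 1, stride):
        yield moves[start:start + window]

# A 500-move game with window=200 and stride=100 yields sub-games starting
# at moves 0, 100, 200, and 300: four samples instead of one.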

Game Splitting

Split concatenated multi-game records into individual games:

python training.py --model xgboost_v2 --use-game-split

Chunk Strategies

Handle long sequences with different strategies:

# Take first N moves only
python training.py --model gru --chunk-strategy truncate

# Uniform sampling across sequence
python training.py --model gru --chunk-strategy sample --num-chunks 10

# Take last N moves only
python training.py --model gru --chunk-strategy tail
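
The three strategies trade coverage for memory differently: truncate and tail keep one contiguous block from either end of the game, while sample stitches together fixed-size chunks spread uniformly across it. A rough sketch of all three; argument names mirror the CLI flags, but the code is illustrative:

import numpy as np

def chunk_sequence(moves, strategy, max_len=512, num_chunks=10):
    if strategy == "truncate":            # keep the first max_len moves
        return moves[:max_len]
    if strategy == "tail":                # keep the last max_len moves
        return moves[-max_len:]
    if strategy == "sample":              # uniform chunks across the game
        chunk_len = max_len // num_chunks
        starts = np.linspace(0, max(len(moves) - chunk_len, 0),
                             num_chunks, dtype=int)
        return np.concatenate([moves[s:s + chunk_len] for s in starts])
    raise ValueError(f"unknown chunk strategy: {strategy}")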

Batch Experiments

Run multiple experiments automatically:

# Run key experiments
bash scripts/run_key_experiments_game_split.sh

# Run all experiments
bash scripts/run_all_experiments_with_game_split.sh

Utility Scripts

# Check dataset statistics
python utils/check_data_stats.py

# Compare submission files
python utils/compare_results.py submissions/submission_*.csv

# Analyze feature distributions
python utils/diagnose_features.py

πŸ› Troubleshooting

CUDA Out of Memory

Problem: RuntimeError: CUDA out of memory

Solutions:

# Reduce batch size
python training.py --model transformer --batch-size 4

# Reduce max sequence length
python training.py --model gru --max-len 256

# Use CPU instead
python training.py --model gru --device cpu

Model File Not Found

Problem: FileNotFoundError: Failed to prepare model artifact

Solutions:

# Train the model first
python training.py --model gru

# Or use auto mode to skip missing models
python Q5.py --model auto

Feature Dimension Mismatch

Problem: ValueError: Feature dimension mismatch

Solution: Ensure that inference uses the same --feature-mode (base, enhanced, or combined) that the XGBoost v2 model was trained with.

# Check .feature_names.json for the feature mode used during training
cat models/xgboost_v2_model.feature_names.json
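
To check this programmatically, compare the saved names against the feature count you expect; a quick sketch, assuming the JSON file holds a flat list of feature names:

import json

with open("models/xgboost_v2_model.feature_names.json") as f:
    trained_names = json.load(f)
print(f"model trained with {len(trained_names)} features")
# ~389 -> base, ~48 -> enhanced, ~437 -> combined; rerun Q5.py (or retrain)
# with the matching --feature-mode.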

Training Too Slow

Problem: Training takes too long

Solutions:

# Use smaller model
python training.py --model mlp --epochs 10

# Reduce epochs
python training.py --model gru --epochs 10

# Use base features only
python training.py --model xgboost_v2 --feature-mode base

Poor Validation Accuracy

Problem: Model accuracy is too low

Solutions:

# Try data augmentation
python training.py --model xgboost --augment-data

# Try game splitting
python training.py --model xgboost_v2 --use-game-split

# Use ensemble
python Q5.py --model auto

🤝 Contributing

Contributions are welcome! Here are some ways you can contribute:

  1. Report bugs - Open an issue describing the bug and how to reproduce it
  2. Suggest features - Propose new features or improvements
  3. Submit pull requests - Fix bugs or implement new features
  4. Improve documentation - Help make the docs clearer and more comprehensive

Development Setup

# Clone the repository
git clone https://github.com/solitude6060/Go-Rank-Prediction.git
cd Go-Rank-Prediction

# Install development dependencies
pip install -r requirements.txt

# Run tests
bash tests/test_game_split.sh
python tests/test_gpu.py

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


πŸ™ Acknowledgments

  • KataGo - Neural Go engine used for feature extraction
  • PyTorch Team - Deep learning framework
  • XGBoost Team - Gradient boosting library
  • scikit-learn Team - Machine learning utilities

📞 Contact

For questions, issues, or collaboration opportunities, please open an issue on the GitHub repository.


📚 Additional Resources

  • docs/QUICK_START.md - Quick start tutorial
  • docs/EXPERIMENTS_README.md - Experiment tracking guide

⭐ If you find this project helpful, please consider giving it a star! ⭐

Made with ❤️ for the Go and Machine Learning communities

About

NTU Machine Learning Class Fall 2025 Assignment 1 Q5
