A comprehensive research framework implementing multiple reinforcement learning approaches for training agents to play Super Mario Bros Level 1-1. This project combines traditional deep RL methods with cutting-edge LLM-based techniques through a unified OpenEnv-compatible interface.
This repository contains four complementary components for studying Super Mario Bros through reinforcement learning:
### mario_env/ - OpenEnv-Compatible Environment Wrapper
- OpenEnv Protocol: Standardized HTTP-based environment interface (see the client sketch after this list)
- Rich RAM Features: Detailed enemy tracking, obstacle detection, powerup analysis
- Multiple Action Sets: Simple (7), complex (12), and right-only (5) action spaces
- Advanced Preprocessing: Frame stacking, grayscale conversion, downsampling
- Docker Deployment: Containerized environment server for distributed training
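A minimal client sketch of what talking to the environment server over HTTP might look like; the `/reset` and `/step` routes and the payload shape are assumptions here, not the wrapper's documented API:

```python
import requests

# Hypothetical OpenEnv-style client; check mario_env's server code for
# the authoritative routes and response schema.
BASE_URL = "http://localhost:8000"

def reset() -> dict:
    """Start a new episode and return the initial observation."""
    return requests.post(f"{BASE_URL}/reset").json()

def step(action: int) -> dict:
    """Send one action; return observation, reward, and done flag."""
    return requests.post(f"{BASE_URL}/step", json={"action": action}).json()

obs = reset()
result = step(1)  # assumed: action 1 = "move right" in the simple action set
print(result["reward"], result["done"])
```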
### mario_ppo/ - Traditional PPO Implementation
- Convolutional Neural Networks: Visual policy learning from pixel observations
- Parallel Environment Execution: 16+ parallel environments for efficient training
- Stable Training: Proximal Policy Optimization with Generalized Advantage Estimation (sketched after this list)
- Real-time Inference: 1000+ FPS execution speed
- Large-Scale Training: Learns from millions of gameplay frames
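For reference, a self-contained sketch of the Generalized Advantage Estimation step the training loop relies on; the function name and hyperparameter defaults are illustrative, not this repo's exact settings:

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    `values` carries one extra entry: the bootstrap value of the state
    after the last step. Returns (advantages, value targets).
    """
    rewards = np.asarray(rewards, dtype=np.float32)
    values = np.asarray(values, dtype=np.float32)
    advantages = np.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        mask = 1.0 - float(dones[t])       # stop bootstrapping at episode ends
        delta = rewards[t] + gamma * values[t + 1] * mask - values[t]
        gae = delta + gamma * lam * mask * gae
        advantages[t] = gae
    return advantages, advantages + values[:-1]
```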
### mario_grpo/ - LLM-Based GRPO Training
- Code Generation as Policy: LLMs generate Python strategies instead of neural policies (example after this list)
- Interpretable Strategies: Human-readable code with reasoning
- Long-term Planning: Strategic decision-making beyond reactive control
- Parallel Strategy Evaluation: Multiple strategies tested simultaneously
- Transfer Learning: Leverages pre-trained language model knowledge
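A hypothetical example of the kind of strategy such a model might emit; the observation keys and action indices below are invented for illustration and do not reflect mario_env's actual feature names:

```python
# Hypothetical LLM-generated strategy. The keys (x_pos, enemies,
# obstacle_ahead) are placeholders for mario_env's RAM features.
def strategy(obs: dict) -> int:
    """Return an action index from the simple action set."""
    RIGHT, RIGHT_JUMP = 1, 2  # assumed indices for "right" and "right + A"
    # Jump when an enemy or obstacle is close ahead, otherwise run right.
    for enemy in obs.get("enemies", []):
        if 0 < enemy["x"] - obs["x_pos"] < 32:
            return RIGHT_JUMP
    if obs.get("obstacle_ahead", False):
        return RIGHT_JUMP
    return RIGHT
```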
### mario_baseline/ - Random Agent Baseline
- Performance Reference: Establishes minimum performance thresholds
- Statistical Analysis: Comprehensive evaluation metrics
- Video Recording: Qualitative gameplay analysis
- Reproducibility: Deterministic random action selection
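A minimal sketch of the deterministic-baseline idea: seed the RNG once so every evaluation run replays the same action sequence (the seed value here is arbitrary):

```python
import numpy as np

# Fixing the RNG seed makes each baseline evaluation exactly reproducible.
rng = np.random.default_rng(seed=42)

def random_action(n_actions: int = 7) -> int:
    """Sample uniformly from the simple (7-action) space."""
    return int(rng.integers(n_actions))
```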
### Requirements

- Python 3.12+
- CUDA-compatible GPU (recommended for training)
- Docker (for environment deployment)
### Installation

```bash
# Clone repository
git clone https://github.com/3xCaffeine/mario-openenv.git
cd mario-openenv

# Install with uv (recommended)
uv sync

# For GPU support
uv sync --extra gpu
```

### Quick Start

Start the environment server:

```bash
# Start Docker environment
cd mario_env
docker build -t mario-env .
docker run -p 8000:8000 mario-env
# Or run locally
uv run python -m mario_env.server
```

Train the PPO agent:

```bash
cd mario_ppo
uv run python train.py --world 1 --stage 1
```

Run GRPO training:

```bash
cd mario_grpo
uv run python train.py
```

Evaluate the random baseline:

```bash
cd mario_baseline
uv run python mario_random.py --episodes 100
```

### Architecture

```
┌─────────────────┐    HTTP    ┌──────────────────┐
│    RL Agent     │◄──────────►│    Mario Env     │
│   (PPO/GRPO)    │            │      Server      │
└─────────────────┘            └──────────────────┘
                                        │
                                        ▼
                               ┌──────────────────┐
                               │ Super Mario Bros │
                               │  (NES Emulator)  │
                               └──────────────────┘
```
**PPO training pipeline:**

- Visual Input → CNN Feature Extraction
- Policy Network → Action Probabilities
- Value Network → State Value Estimation
- PPO Optimization → Policy Improvement
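The optimization step uses the standard PPO clipped surrogate objective. A generic sketch follows; `clip_eps=0.2` is the common default, not necessarily this repo's value:

```python
import torch

def ppo_policy_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate objective (negated for minimization)."""
    ratio = torch.exp(new_logp - old_logp)   # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```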
**GRPO training pipeline:**

- Game State → Structured Observation
- Language Model → Python Strategy Generation
- Code Execution → Strategy Evaluation
- GRPO Optimization → Strategy Improvement
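The GRPO step scores each generated strategy against the other strategies sampled for the same prompt, which removes the need for a learned value baseline. A minimal sketch of that group-relative normalization:

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO-style advantages: score each strategy against its group.

    `rewards` holds the episode returns of strategies generated for one
    prompt; normalizing within the group yields relative advantages.
    """
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# e.g. four strategies from one prompt, scored by in-game progress
print(group_relative_advantages(np.array([120.0, 450.0, 90.0, 300.0])))
```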
### Configuration

```bash
# Game settings
export MARIO_LEVEL="SuperMarioBros-1-1-Vanilla"
export MARIO_ACTION_SET="simple"      # simple/complex/right_only

# Observation settings
export MARIO_OBS_MODE="downsampled"   # rgb/grayscale/downsampled
export MARIO_OBS_SIZE="84"
export MARIO_FRAME_STACK="4"

# Training settings
export MARIO_REWARD_X_POS="true"
export MARIO_EPISODIC_LIFE="true"
```

**Model architectures:**

- PPO: Custom CNN with 32 filters and 512 hidden units (sketched after this list)
- GRPO: Qwen2.5-Coder-3B-Instruct with LoRA fine-tuning
- Training: Mixed precision, gradient accumulation, distributed execution
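A sketch of what such an actor-critic might look like in PyTorch; only the 32-filter and 512-hidden-unit sizes come from this README, while the layer layout, kernel sizes, and strides are assumptions:

```python
import torch
import torch.nn as nn

class MarioActorCritic(nn.Module):
    """Illustrative CNN actor-critic for stacked 84x84 frames."""

    def __init__(self, n_frames: int = 4, n_actions: int = 7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(n_frames, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():  # infer the flattened size from an 84x84 input
            n_flat = self.features(torch.zeros(1, n_frames, 84, 84)).shape[1]
        self.hidden = nn.Sequential(nn.Linear(n_flat, 512), nn.ReLU())
        self.policy = nn.Linear(512, n_actions)   # action logits
        self.value = nn.Linear(512, 1)            # state-value estimate

    def forward(self, x):
        h = self.hidden(self.features(x / 255.0))
        return self.policy(h), self.value(h)
```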
**Observation space:**

- Visual: 84×84 grayscale/downsampled RGB frames (preprocessing sketched after this list)
- RAM Features: Enemy positions, obstacle detection, powerup tracking
- Player State: Position, velocity, power-up status, lives
- Game State: Score, coins, time, world/stage progression
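An illustrative preprocessing pipeline matching the settings above (grayscale conversion, 84×84 downsampling, 4-frame stacking); the class and method names are invented, not mario_env's API:

```python
from collections import deque

import cv2
import numpy as np

class FrameStacker:
    """Grayscale, downsample, and stack the most recent frames."""

    def __init__(self, size: int = 84, n_frames: int = 4):
        self.size = size
        self.frames = deque(maxlen=n_frames)

    def preprocess(self, rgb_frame: np.ndarray) -> np.ndarray:
        gray = cv2.cvtColor(rgb_frame, cv2.COLOR_RGB2GRAY)
        return cv2.resize(gray, (self.size, self.size),
                          interpolation=cv2.INTER_AREA)

    def push(self, rgb_frame: np.ndarray) -> np.ndarray:
        frame = self.preprocess(rgb_frame)
        if not self.frames:                        # pad at episode start
            self.frames.extend([frame] * self.frames.maxlen)
        else:
            self.frames.append(frame)
        return np.stack(self.frames)               # shape: (4, 84, 84)
```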
**Action spaces** (see the setup sketch after this list):

- Simple (7 actions): Basic movement + jump combinations
- Complex (12 actions): Full NES controller including up/down
- Right-Only (5 actions): Forward-only movement for easier learning
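These sizes match the action lists that ship with gym-super-mario-bros, so a standard setup (assuming the forks keep the upstream API) looks like:

```python
import gym_super_mario_bros
from gym_super_mario_bros.actions import (
    RIGHT_ONLY,        # 5 actions
    SIMPLE_MOVEMENT,   # 7 actions
    COMPLEX_MOVEMENT,  # 12 actions
)
from nes_py.wrappers import JoypadSpace

env = gym_super_mario_bros.make("SuperMarioBros-1-1-v0")
env = JoypadSpace(env, SIMPLE_MOVEMENT)  # restrict to the 7-action set
print(env.action_space.n)                # -> 7
```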
**Reward structure** (an illustrative shaping function follows this list):

- Primary: Score progression and level completion
- Auxiliary: X-position advancement, enemy defeat, coin collection
- Penalties: Time expiration, life loss, backward movement
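An illustrative shaping function along these lines; the `info` keys follow gym-super-mario-bros conventions, but all coefficients are made-up placeholders rather than the repo's tuned values:

```python
# Sketch of reward shaping; coefficients are arbitrary examples.
def shaped_reward(info: dict, prev_info: dict) -> float:
    reward = 0.0
    reward += (info["score"] - prev_info["score"]) * 0.01   # score progression
    reward += (info["x_pos"] - prev_info["x_pos"]) * 0.1    # forward progress
    reward += (info["coins"] - prev_info["coins"]) * 1.0    # coin collection
    if info["time"] < prev_info["time"]:
        reward -= 0.01                                      # time pressure
    if info["life"] < prev_info["life"]:
        reward -= 15.0                                      # life-loss penalty
    if info.get("flag_get", False):
        reward += 50.0                                      # level completion
    return reward
```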
- Traditional vs LLM-based RL: Performance and efficiency trade-offs
- Sample Efficiency: Frames vs episodes required for learning
- Generalization: Transfer across levels and game variants
- Strategy Analysis: Understanding LLM-generated gameplay logic
- Decision Trees: Extracting rules from trained neural policies
- Human-AI Collaboration: Combining human expertise with learned strategies
- RAM Feature Impact: Effect of auxiliary observations on learning
- Reward Engineering: Optimal reward shaping for complex games
- Curriculum Learning: Progressive difficulty for stable training
### Development

```bash
# Install development dependencies
uv sync --extra gpu
```

Project layout:

```
mario-openenv/
├── mario_env/        # OpenEnv wrapper
├── mario_ppo/        # PPO implementation
├── mario_grpo/       # GRPO training
├── mario_baseline/   # Random baseline
├── tests/            # Test suite
└── pyproject.toml    # Project configuration
```
### Documentation

- Environment Setup: Detailed environment configuration and API
- PPO Training: Traditional RL implementation guide
- GRPO Training: LLM-based training methodology
- Baseline Evaluation: Baseline benchmarking
This project builds upon several key open-source implementations and research frameworks:
- OpenEnv: Standardized environment protocol for reinforcement learning
- Leirbag-gabrieL's gym-super-mario-bros fork: Enhanced Super Mario Bros environment with improved stability
- Leirbag-gabrieL's nes-py fork: NES emulator with Python bindings and bug fixes
- vietnh1009's PPO implementation: Original PyTorch PPO implementation for Super Mario Bros
- Unsloth: Efficient LLM fine-tuning and inference optimization
- Unsloth Zoo: Collection of pre-trained models optimized for Unsloth
- Triton: Language and compiler for writing highly efficient GPU code
- PyTorch: Deep learning framework for neural network implementations
- Transformers: Hugging Face library for LLM model handling
- TRL (Transformer Reinforcement Learning): Library for training transformer-based RL models
- Gymnasium: Modern reinforcement learning environments (successor to OpenAI Gym)
- Modal: Cloud platform for scalable ML training and deployment
- FastAPI: Modern web framework for the environment server
- OpenCV: Computer vision library for image processing
- OpenAI Gym Super Mario Bros: Original environment implementation that inspired this work
- NES emulator community: For maintaining and improving NES emulation technology
- Reinforcement learning research community: For developing the algorithms and methodologies used
This project is open source and available under the MIT License.
Built for reinforcement learning research on classic games