Mario OpenEnv: Multi-Approach Reinforcement Learning for Super Mario Bros

A comprehensive research framework implementing multiple reinforcement learning approaches for training agents to play Super Mario Bros Level 1-1. This project combines traditional deep RL methods with cutting-edge LLM-based techniques through a unified OpenEnv-compatible interface.

Overview

This repository combines four complementary components: an OpenEnv-compatible environment wrapper plus three agent implementations (PPO, GRPO, and a random baseline) for playing Super Mario Bros through reinforcement learning:

Core Components

mario_env/ - OpenEnv-Compatible Environment Wrapper

  • OpenEnv Protocol: Standardized HTTP-based environment interface (client sketch after this list)
  • Rich RAM Features: Detailed enemy tracking, obstacle detection, powerup analysis
  • Multiple Action Sets: Simple (7), complex (12), and right-only (5) action spaces
  • Advanced Preprocessing: Frame stacking, grayscale conversion, downsampling
  • Docker Deployment: Containerized environment server for distributed training
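
A minimal client sketch for the HTTP protocol listed above. The /reset and /step endpoint names and JSON payload shapes are assumptions for illustration, not the wrapper's confirmed API:

# Query a running environment server (endpoint names assumed)
import requests

BASE_URL = "http://localhost:8000"  # port published in the Docker example below

def reset() -> dict:
    # Start a new episode and return the initial observation payload
    return requests.post(f"{BASE_URL}/reset").json()

def step(action: int) -> dict:
    # Apply one action index; returns observation, reward, and done flag
    return requests.post(f"{BASE_URL}/step", json={"action": action}).json()

obs = reset()
result = step(1)  # hypothetical index, e.g. "move right" in the simple set
print(result.get("reward"), result.get("done"))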

mario_ppo/ - Traditional PPO Implementation

  • Convolutional Neural Networks: Visual policy learning from pixel observations (model sketch after this list)
  • Parallel Environment Execution: 16+ parallel environments for efficient training
  • Stable Training: Proximal Policy Optimization with Generalized Advantage Estimation
  • Real-time Inference: 1000+ FPS execution speed
  • Large-Scale Training: Learns from millions of gameplay frames
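
A minimal actor-critic sketch matching the "32 filters, 512 hidden units" noted under Model Configuration; the exact layer shapes in mario_ppo may differ, so treat this as an assumed Atari-style layout:

# PyTorch CNN actor-critic for 4-stacked 84x84 grayscale frames
import torch
import torch.nn as nn

class MarioActorCritic(nn.Module):
    def __init__(self, n_actions: int = 7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():  # infer flattened feature size from a dummy frame
            n_flat = self.features(torch.zeros(1, 4, 84, 84)).shape[1]
        self.hidden = nn.Sequential(nn.Linear(n_flat, 512), nn.ReLU())
        self.policy = nn.Linear(512, n_actions)  # action logits
        self.value = nn.Linear(512, 1)           # state-value estimate

    def forward(self, x: torch.Tensor):
        h = self.hidden(self.features(x / 255.0))  # scale pixels to [0, 1]
        return self.policy(h), self.value(h)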

mario_grpo/ - LLM-Based GRPO Training

  • Code Generation as Policy: LLMs generate Python strategies instead of neural policies (example after this list)
  • Interpretable Strategies: Human-readable code with reasoning
  • Long-term Planning: Strategic decision-making beyond reactive control
  • Parallel Strategy Evaluation: Multiple strategies tested simultaneously
  • Transfer Learning: Leverages pre-trained language model knowledge
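
To make "code generation as policy" concrete, here is the kind of human-readable strategy an LLM might emit. The act() signature and observation keys (x_pos, enemies, obstacle_ahead) are hypothetical, not the project's actual schema:

# Example of an LLM-generated strategy: readable, rule-based, and editable
def act(obs: dict) -> int:
    """Run right by default; jump over nearby enemies and obstacles."""
    RIGHT, RIGHT_JUMP = 1, 2  # assumed indices in the simple action set

    # Jump when an enemy is close ahead of Mario
    for enemy in obs.get("enemies", []):
        if 0 < enemy["x"] - obs["x_pos"] < 40:
            return RIGHT_JUMP

    # Jump over pipes and gaps flagged by the RAM features
    if obs.get("obstacle_ahead", False):
        return RIGHT_JUMP

    return RIGHT  # otherwise keep moving right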

mario_baseline/ - Random Agent Baseline

  • Performance Reference: Establishes minimum performance thresholds
  • Statistical Analysis: Comprehensive evaluation metrics
  • Video Recording: Qualitative gameplay analysis
  • Reproducibility: Deterministic, seeded random action selection (sketch below)
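
A sketch of what deterministic random play means here: a fixed per-episode seed reproduces the same action sequence on every run. The step callable stands in for whatever interface the baseline uses to talk to the environment:

# Seeded random rollout (environment interface abstracted as a callable)
import random
from typing import Callable, Tuple

def run_episode(step: Callable[[int], Tuple[float, bool]],
                seed: int, n_actions: int = 7, max_steps: int = 3000) -> float:
    rng = random.Random(seed)  # fixed seed => identical actions every run
    total, done, t = 0.0, False, 0
    while not done and t < max_steps:
        reward, done = step(rng.randrange(n_actions))
        total += reward
        t += 1
    return total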

Quick Start

Prerequisites

  • Python 3.12+
  • CUDA-compatible GPU (recommended for training)
  • Docker (for environment deployment)

Installation

# Clone repository
git clone https://github.com/3xCaffeine/mario-openenv.git
cd mario-openenv

# Install with uv (recommended)
uv sync

# For GPU support
uv sync --extra gpu

Basic Usage

Environment Server

# Start Docker environment
cd mario_env
docker build -t mario-env .
docker run -p 8000:8000 mario-env

# Or run locally
uv run python -m mario_env.server

PPO Training

cd mario_ppo
uv run python train.py --world 1 --stage 1

GRPO Training

cd mario_grpo
uv run python train.py

Baseline Evaluation

cd mario_baseline
uv run python mario_random.py --episodes 100

Architecture

Environment Interface

┌─────────────────┐    HTTP    ┌──────────────────┐
│   RL Agent      │◄──────────►│   Mario Env      │
│  (PPO/GRPO)     │            │   Server         │
└─────────────────┘            └──────────────────┘
                                        │
                                        ▼
                               ┌──────────────────┐
                               │ Super Mario Bros │
                               │  (NES Emulator)  │
                               └──────────────────┘

Training Approaches

Traditional RL Pipeline

  1. Visual Input → CNN Feature Extraction
  2. Policy Network → Action Probabilities
  3. Value Network → State Value Estimation
  4. PPO Optimization → Policy Improvement (loss sketch below)
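
Step 4 centers on PPO's clipped surrogate objective; a generic PyTorch formulation (not necessarily mario_ppo's exact code) looks like this:

# Clipped surrogate loss: limits how far each update moves the policy
import torch

def ppo_policy_loss(new_logp: torch.Tensor, old_logp: torch.Tensor,
                    advantages: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(new_logp - old_logp)        # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()  # maximize => minimize negative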

LLM-Based Pipeline

  1. Game State → Structured Observation
  2. Language Model → Python Strategy Generation
  3. Code Execution → Strategy Evaluation
  4. GRPO Optimization → Strategy Improvement (advantage sketch below)
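
Step 4 uses GRPO's group-relative advantage: each sampled strategy's return is normalized against its group's mean and standard deviation. This is the standard GRPO formulation, not necessarily mario_grpo's exact code:

# Group-relative advantages over a batch of strategies from one game state
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # rewards: shape (group_size,) with group_size > 1
    return (rewards - rewards.mean()) / (rewards.std() + eps)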

Configuration

Environment Variables

# Game settings
export MARIO_LEVEL="SuperMarioBros-1-1-Vanilla"
export MARIO_ACTION_SET="simple"  # simple/complex/right_only

# Observation settings
export MARIO_OBS_MODE="downsampled"  # rgb/grayscale/downsampled
export MARIO_OBS_SIZE="84"
export MARIO_FRAME_STACK="4"

# Training settings
export MARIO_REWARD_X_POS="true"
export MARIO_EPISODIC_LIFE="true"
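
One plausible way the server consumes this configuration (the parsing and defaults are assumptions; only the variable names come from the list above):

# Reading the documented environment variables with fallbacks
import os

LEVEL = os.environ.get("MARIO_LEVEL", "SuperMarioBros-1-1-Vanilla")
ACTION_SET = os.environ.get("MARIO_ACTION_SET", "simple")
OBS_MODE = os.environ.get("MARIO_OBS_MODE", "downsampled")
OBS_SIZE = int(os.environ.get("MARIO_OBS_SIZE", "84"))
FRAME_STACK = int(os.environ.get("MARIO_FRAME_STACK", "4"))
REWARD_X_POS = os.environ.get("MARIO_REWARD_X_POS", "true").lower() == "true"
EPISODIC_LIFE = os.environ.get("MARIO_EPISODIC_LIFE", "true").lower() == "true"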

Model Configuration

  • PPO: Custom CNN with 32 filters, 512 hidden units
  • GRPO: Qwen2.5-Coder-3B-Instruct with LoRA fine-tuning (setup sketch below)
  • Training: Mixed precision, gradient accumulation, distributed execution
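
A hedged sketch of attaching LoRA adapters to Qwen2.5-Coder-3B-Instruct with peft; the rank, alpha, and target modules are illustrative assumptions, not the repo's recorded hyperparameters:

# LoRA fine-tuning setup: only small adapter matrices are trained
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-3B-Instruct", torch_dtype=torch.bfloat16
)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # assumed values
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)  # base weights frozen, adapters trainable
model.print_trainable_parameters()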

Game Features

Observation Space

  • Visual: 84×84 grayscale/downsampled RGB frames (preprocessing sketch below)
  • RAM Features: Enemy positions, obstacle detection, powerup tracking
  • Player State: Position, velocity, power-up status, lives
  • Game State: Score, coins, time, world/stage progression
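
The visual pipeline above reduces each RGB frame to a stacked 84×84 tensor; a sketch with OpenCV (function names illustrative):

# Grayscale + resize + 4-frame stack, matching the observation settings
from collections import deque
import cv2
import numpy as np

def preprocess(frame: np.ndarray, size: int = 84) -> np.ndarray:
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    return cv2.resize(gray, (size, size), interpolation=cv2.INTER_AREA)

class FrameStack:
    """Keeps the last k processed frames as a (k, 84, 84) array."""
    def __init__(self, k: int = 4):
        self.frames = deque(maxlen=k)

    def push(self, frame: np.ndarray) -> np.ndarray:
        processed = preprocess(frame)
        if not self.frames:  # pad with copies at episode start
            self.frames.extend([processed] * self.frames.maxlen)
        else:
            self.frames.append(processed)
        return np.stack(self.frames)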

Action Space

  • Simple (7 actions): Basic movement + jump combinations (action lists below)
  • Complex (12 actions): Full NES controller including up/down
  • Right-Only (5 actions): Forward-only movement for easier learning
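
These counts line up with the standard action lists shipped with gym-super-mario-bros, which the fork presumably inherits (an assumption, not verified against the fork):

# Standard button-combination lists from the upstream package
from gym_super_mario_bros.actions import RIGHT_ONLY, SIMPLE_MOVEMENT, COMPLEX_MOVEMENT

print(len(RIGHT_ONLY))       # 5  — [['NOOP'], ['right'], ['right', 'A'], ...]
print(len(SIMPLE_MOVEMENT))  # 7  — adds ['A'] and ['left']
print(len(COMPLEX_MOVEMENT)) # 12 — full controller including up/down combos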

Reward Structure

  • Primary: Score progression and level completion
  • Auxiliary: X-position advancement, enemy defeat, coin collection
  • Penalties: Time expiration, life loss, backward movement (reward sketch below)
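
A hedged sketch of how these terms might combine per step; the coefficients and RAM field names are assumptions, not the repo's tuned values:

# Shaped reward from consecutive info dicts (field names assumed)
def shaped_reward(prev: dict, cur: dict) -> float:
    r = 0.0
    r += cur["x_pos"] - prev["x_pos"]   # forward progress; negative when moving back
    r += cur["score"] - prev["score"]   # score progression (usually rescaled)
    r -= prev["time"] - cur["time"]     # penalty as the clock ticks down
    if cur["lives"] < prev["lives"]:
        r -= 50.0                       # life-loss penalty (assumed magnitude)
    if cur.get("flag_get", False):
        r += 100.0                      # level-completion bonus (assumed)
    return r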

Research Applications

Algorithm Comparison

  • Traditional vs LLM-based RL: Performance and efficiency trade-offs
  • Sample Efficiency: Frames vs episodes required for learning
  • Generalization: Transfer across levels and game variants

Interpretability Studies

  • Strategy Analysis: Understanding LLM-generated gameplay logic
  • Decision Trees: Extracting rules from trained neural policies
  • Human-AI Collaboration: Combining human expertise with learned strategies

Environment Research

  • RAM Feature Impact: Effect of auxiliary observations on learning
  • Reward Engineering: Optimal reward shaping for complex games
  • Curriculum Learning: Progressive difficulty for stable training

Contributing

Development Setup

# Install all dependencies, including GPU extras, for development
uv sync --extra gpu

Project Structure

mario-openenv/
├── mario_env/          # OpenEnv wrapper
├── mario_ppo/          # PPO implementation
├── mario_grpo/         # GRPO training
├── mario_baseline/     # Random baseline
├── tests/              # Test suite
└── pyproject.toml      # Project configuration

Acknowledgments

This project builds upon several key open-source implementations and research frameworks:

Core Dependencies and Forks

  • gym-super-mario-bros (Leirbag-gabrieL fork): Underlying Super Mario Bros NES environment wrapped by mario_env
  • OpenEnv: Standardized environment protocol that mario_env implements

Research Frameworks and Libraries

  • PyTorch: Deep learning framework for neural network implementations
  • Transformers: Hugging Face library for LLM model handling
  • TRL (Transformer Reinforcement Learning): Library for training transformer-based RL models
  • Gymnasium: Modern reinforcement learning environments (successor to OpenAI Gym)
  • Modal: Cloud platform for scalable ML training and deployment
  • FastAPI: Modern web framework for the environment server
  • OpenCV: Computer vision library for image processing

Additional Acknowledgments

  • gym-super-mario-bros: Original environment implementation that inspired this work
  • NES emulator community: For maintaining and improving NES emulation technology
  • Reinforcement learning research community: For developing the algorithms and methodologies used

License

This project is open source and available under the MIT License.


Built for reinforcement learning research on classic games
