VLM-Guided Embodied Agent with Hierarchical Planning and Episodic Memory in MineDojo
EmbodiedMind is a research framework for building multimodal embodied agents that perceive, reason, plan, and act in open-ended 3D environments. Built on MineDojo (Minecraft), it combines Vision-Language Models (VLMs) for grounded perception, LLMs for hierarchical task planning, and an episodic memory system for experience-driven adaptation.
- Multimodal Perception: VLM-based visual grounding that converts raw game frames into structured scene descriptions (entities, resources, terrain, threats)
- Hierarchical Planning: Two-level planner in which a high-level LLM strategist decomposes goals into subgoals and a low-level action translator maps subgoals to executable MineDojo actions
- Episodic Memory: Vector-similarity retrieval of past experiences that enables in-context learning: the agent recalls what worked (and what failed) in similar situations
- Skill Library: Reusable, composable action sequences discovered through experience and stored for future retrieval
- Evaluation Suite: Automated benchmarking across MineDojo Harvest, Survival, and Tech Tree tasks with standardized metrics
```
┌─────────────────────────────────────────────────────┐
│                 EmbodiedMind Agent                  │
├──────────┬──────────────┬──────────┬────────────────┤
│  Visual  │ Hierarchical │ Episodic │     Skill      │
│ Perceiver│   Planner    │  Memory  │    Library     │
│  (VLM)   │    (LLM)     │ (Vector) │   (Code-as-    │
│          │              │          │    Action)     │
├──────────┴──────────────┴──────────┴────────────────┤
│          Action Executor (MineDojo API)             │
├─────────────────────────────────────────────────────┤
│         MineDojo Environment (Minecraft)            │
└─────────────────────────────────────────────────────┘
```
Agent Loop:
- Observe → Capture RGB frame from MineDojo environment
- Perceive → VLM extracts structured scene description (entities, inventory, threats)
- Recall → Query episodic memory for relevant past experiences
- Plan → LLM generates/updates hierarchical plan using perception + memory context
- Act → Translate plan step into MineDojo action and execute
- Reflect → Store outcome in episodic memory for future retrieval
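The loop above can be sketched in miniature; the `StubPerceiver`, `StubMemory`, and `StubPlanner` classes below are illustrative stand-ins, not the project's actual modules or API:

```python
# Minimal sketch of the observe-perceive-recall-plan-act-reflect loop.
# All component classes here are toy stubs for illustration only.

class StubPerceiver:
    def describe(self, frame):
        # Stands in for a VLM call on an RGB frame
        return f"scene with {frame['entities']} entities"

class StubMemory:
    def __init__(self):
        self.episodes = []
    def recall(self, description):
        # Stands in for vector-similarity retrieval
        return [e for e in self.episodes if e["observation"] == description]
    def store(self, episode):
        self.episodes.append(episode)

class StubPlanner:
    def plan(self, description, past_experiences):
        # Stands in for the LLM strategist/tactician
        return ["move_forward", "attack"]

def run_episode(env_frames):
    perceiver, memory, planner = StubPerceiver(), StubMemory(), StubPlanner()
    trace = []
    for frame in env_frames:                       # Observe
        desc = perceiver.describe(frame)           # Perceive
        past = memory.recall(desc)                 # Recall
        actions = planner.plan(desc, past)         # Plan
        trace.extend(actions)                      # Act
        memory.store({"observation": desc,         # Reflect
                      "actions": actions,
                      "outcome": "success"})
    return trace, memory

trace, memory = run_episode([{"entities": 2}, {"entities": 2}])
```

Each iteration runs all six stages once; in the real agent the perceive and plan stages are model calls and the act stage steps the MineDojo environment.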
- Python β₯ 3.9
- JDK 8 (for Minecraft backend)
- GPU recommended for VLM inference (or use API-based models)
```bash
# Clone the repository
git clone https://github.com/YOUR_USERNAME/embodied-mind.git
cd embodied-mind

# Create conda environment
conda create -n embodied-mind python=3.10 -y
conda activate embodied-mind

# Install MineDojo
pip install minedojo

# Install project dependencies
pip install -e .

# Configure API credentials
cp .env.example .env
# Edit .env with your API key (supports OpenAI, Google Gemini, or local models)
```

Run a single task:

```bash
python -m embodied_mind.run \
    --task "harvest_milk" \
    --model "gemini-2.0-flash" \
    --max_steps 3000 \
    --memory_enabled \
    --verbose
```

Run the evaluation suite:

```bash
python -m embodied_mind.evaluate \
    --task_suite harvest \
    --model "gemini-2.0-flash" \
    --num_episodes 5 \
    --output_dir results/
```

Visualize a replay:

```bash
python -m embodied_mind.visualize \
    --replay results/harvest_milk_ep0.json \
    --show_memory \
    --show_plan
```

🚧 Work in Progress: Evaluation experiments are currently being run on MineDojo Programmatic Tasks (Harvest, Survival, Tech Tree). Results will be updated here as they become available.
```
embodied-mind/
├── embodied_mind/
│   ├── __init__.py
│   ├── agent.py            # Main agent loop
│   ├── perceiver.py        # VLM-based visual perception
│   ├── planner.py          # Hierarchical LLM planner
│   ├── memory.py           # Episodic memory with vector retrieval
│   ├── skills.py           # Skill library management
│   ├── action_executor.py  # MineDojo action translation
│   ├── run.py              # Single-task runner
│   ├── evaluate.py         # Benchmark evaluation
│   └── visualize.py        # Replay visualization
├── configs/
│   ├── default.yaml        # Default agent config
│   └── tasks.yaml          # Task definitions
├── prompts/
│   ├── perceiver.txt       # VLM perception prompt
│   ├── planner.txt         # LLM planning prompt
│   └── reflector.txt       # Post-action reflection prompt
├── results/                # Evaluation outputs
├── assets/                 # Architecture diagrams
├── tests/
│   └── test_agent.py
├── setup.py
├── .env.example
└── README.md
```
The perceiver sends RGB frames to a VLM (Gemini, GPT-4o, or local LLaVA) with a structured prompt asking for:
- Entities: Nearby mobs, animals, villagers with estimated distances
- Resources: Visible blocks, items, craftable materials
- Terrain: Biome type, elevation, obstacles
- Threats: Hostile mobs, environmental dangers
- Inventory State: Current items and their quantities
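A minimal sketch of how such a structured reply might be parsed, assuming the VLM is prompted to answer in JSON; the `SceneDescription` fields mirror the categories above but are hypothetical, not the project's actual schema:

```python
# Illustrative parsing of a structured VLM reply into a scene description.
# Field names are assumptions mirroring the README's perception categories.
import json
from dataclasses import dataclass, field

@dataclass
class SceneDescription:
    entities: list = field(default_factory=list)   # mobs/animals with distances
    resources: list = field(default_factory=list)  # visible blocks and items
    terrain: str = ""                              # biome, elevation, obstacles
    threats: list = field(default_factory=list)    # hostile mobs, dangers
    inventory: dict = field(default_factory=dict)  # item name -> quantity

def parse_vlm_reply(reply: str) -> SceneDescription:
    """Parse the VLM's JSON reply, tolerating missing fields."""
    data = json.loads(reply)
    return SceneDescription(
        entities=data.get("entities", []),
        resources=data.get("resources", []),
        terrain=data.get("terrain", ""),
        threats=data.get("threats", []),
        inventory=data.get("inventory", {}),
    )

reply = ('{"entities": [{"type": "cow", "distance": 6}], '
         '"terrain": "plains", "inventory": {"bucket": 1}}')
scene = parse_vlm_reply(reply)
```

Defaulting missing fields keeps the agent loop robust when the VLM omits a category from its reply.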
The planner operates at two levels:
- Strategist: Decomposes the goal into an ordered list of subgoals (e.g., "craft stone pickaxe" → find_stone → mine_cobblestone × 3 → find_wood → craft_planks → craft_sticks → craft_pickaxe)
- Tactician: Converts each subgoal into a sequence of MineDojo-compatible actions, re-planning when the environment state changes unexpectedly
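The two levels can be illustrated with a toy decomposition; the subgoal and action names below are hypothetical examples, not the planner's real vocabulary (in the actual system the strategist's plan comes from an LLM, not a lookup table):

```python
# Toy two-level decomposition for "craft stone pickaxe".
# Strategist level: goal -> ordered subgoals (LLM output in the real system).
STRATEGIST_PLAN = {
    "craft_stone_pickaxe": [
        "find_stone", "mine_cobblestone_x3", "find_wood",
        "craft_planks", "craft_sticks", "craft_pickaxe",
    ],
}

# Tactician level: subgoal -> low-level action sequence (MineDojo-style
# primitive names here are illustrative placeholders).
TACTICIAN_LIBRARY = {
    "find_stone": ["look_around", "move_to_nearest:stone"],
    "mine_cobblestone_x3": ["equip:wooden_pickaxe", "attack", "attack", "attack"],
    "find_wood": ["look_around", "move_to_nearest:log"],
    "craft_planks": ["craft:planks"],
    "craft_sticks": ["craft:stick"],
    "craft_pickaxe": ["open_crafting_table", "craft:stone_pickaxe"],
}

def expand(goal: str) -> list:
    """Flatten a high-level goal into its low-level action sequence."""
    return [action
            for subgoal in STRATEGIST_PLAN[goal]
            for action in TACTICIAN_LIBRARY[subgoal]]

actions = expand("craft_stone_pickaxe")
```

Keeping the two levels separate is what allows re-planning: when a subgoal fails, only its tactician-level expansion is regenerated, not the whole strategist plan.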
Each experience is stored as:

```python
{
    "task": str,              # Task being attempted
    "observation": str,       # Scene description at decision point
    "plan": str,              # Plan that was executed
    "actions": List[str],     # Action sequence taken
    "outcome": str,           # Success/failure description
    "reward": float,          # MineDojo reward signal
    "embedding": List[float]  # Text embedding for similarity search
}
```

At decision time, the agent retrieves the top-k most similar past experiences and includes them as in-context examples for the planner.
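Top-k retrieval can be sketched with plain cosine similarity; the toy three-dimensional embeddings below stand in for real text-embedding vectors:

```python
# Minimal top-k retrieval over stored experiences by cosine similarity.
# Embeddings are toy 3-d vectors; real ones come from a text-embedding model.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_top_k(memory, query_embedding, k=2):
    """Return the k stored experiences most similar to the query."""
    ranked = sorted(memory,
                    key=lambda e: cosine(e["embedding"], query_embedding),
                    reverse=True)
    return ranked[:k]

memory = [
    {"task": "harvest_milk",  "outcome": "success", "embedding": [0.9, 0.1, 0.0]},
    {"task": "combat_zombie", "outcome": "failure", "embedding": [0.0, 0.2, 0.9]},
    {"task": "harvest_wool",  "outcome": "success", "embedding": [0.8, 0.3, 0.1]},
]
hits = retrieve_top_k(memory, [1.0, 0.0, 0.0], k=2)
```

The retrieved entries, including their `outcome` fields, are then formatted into the planner prompt as in-context examples.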
Rather than fine-tuning model weights, EmbodiedMind adapts through:
- Experience accumulation: Memory grows across episodes
- Failure avoidance: Failed strategies are explicitly noted in retrieved context
- Strategy transfer: Successful plans from similar tasks inform new situations
Contributions are welcome!
MIT License. See LICENSE for details.
- MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge (NeurIPS 2022, Outstanding Paper)
- Voyager: An Open-Ended Embodied Agent with Large Language Models
- ODYSSEY: Empowering Minecraft Agents with Open-World Skills (IJCAI 2025)
- STEVE-1: A Generative Model for Text-to-Behavior in Minecraft
Vishal Chauhan (vishalchauhan@outlook.sg), Ph.D. Candidate, The University of Tokyo