VLM-Guided Embodied Agent with Hierarchical Planning and Episodic Memory in MineDojo
EmbodiedMind is a research framework for building multimodal embodied agents that perceive, reason, plan, and act in open-ended 3D environments. Built on MineDojo (Minecraft), it combines Vision-Language Models (VLMs) for grounded perception, LLMs for hierarchical task planning, and an episodic memory system for experience-driven adaptation.
- Multimodal Perception: VLM-based visual grounding that converts raw game frames into structured scene descriptions (entities, resources, terrain, threats)
- Hierarchical Planning: Two-level planner in which a high-level LLM strategist decomposes goals into subgoals and a low-level action translator maps subgoals to executable MineDojo actions
- Episodic Memory: Vector-similarity retrieval of past experiences that enables in-context learning: the agent recalls what worked (and what failed) in similar situations
- Skill Library: Reusable, composable action sequences discovered through experience and stored for future retrieval
- Evaluation Suite: Automated benchmarking across MineDojo Harvest, Survival, and Tech Tree tasks with standardized metrics
```
┌─────────────────────────────────────────────────────┐
│                 EmbodiedMind Agent                  │
├──────────┬──────────────┬──────────┬────────────────┤
│  Visual  │ Hierarchical │ Episodic │     Skill      │
│ Perceiver│   Planner    │  Memory  │    Library     │
│  (VLM)   │    (LLM)     │ (Vector) │   (Code-as-    │
│          │              │          │    Action)     │
├──────────┴──────────────┴──────────┴────────────────┤
│          Action Executor (MineDojo API)             │
├─────────────────────────────────────────────────────┤
│         MineDojo Environment (Minecraft)            │
└─────────────────────────────────────────────────────┘
```
Agent Loop:
- Observe → Capture RGB frame from MineDojo environment
- Perceive → VLM extracts structured scene description (entities, inventory, threats)
- Recall → Query episodic memory for relevant past experiences
- Plan → LLM generates/updates hierarchical plan using perception + memory context
- Act → Translate plan step into MineDojo action and execute
- Reflect → Store outcome in episodic memory for future retrieval
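The loop above can be sketched in miniature; the `StubPerceiver`, `StubMemory`, and `StubPlanner` classes below are illustrative stand-ins, not the project's actual modules or API:

```python
# Minimal sketch of the observe-perceive-recall-plan-act-reflect loop.
# All component classes here are toy stubs for illustration only.

class StubPerceiver:
    def describe(self, frame):
        # Stands in for a VLM call on an RGB frame
        return f"scene with {frame['entities']} entities"

class StubMemory:
    def __init__(self):
        self.episodes = []
    def recall(self, description):
        # Stands in for vector-similarity retrieval
        return [e for e in self.episodes if e["observation"] == description]
    def store(self, episode):
        self.episodes.append(episode)

class StubPlanner:
    def plan(self, description, past_experiences):
        # Stands in for the LLM strategist/tactician
        return ["move_forward", "attack"]

def run_episode(env_frames):
    perceiver, memory, planner = StubPerceiver(), StubMemory(), StubPlanner()
    trace = []
    for frame in env_frames:                       # Observe
        desc = perceiver.describe(frame)           # Perceive
        past = memory.recall(desc)                 # Recall
        actions = planner.plan(desc, past)         # Plan
        trace.extend(actions)                      # Act
        memory.store({"observation": desc,         # Reflect
                      "actions": actions,
                      "outcome": "success"})
    return trace, memory

trace, memory = run_episode([{"entities": 2}, {"entities": 2}])
```

Each iteration runs all six stages once; in the real agent the perceive and plan stages are model calls and the act stage steps the MineDojo environment.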
- Python β₯ 3.9
- JDK 8 (for Minecraft backend)
- GPU recommended for VLM inference (or use API-based models)
```bash
# Clone the repository
git clone https://github.com/YOUR_USERNAME/embodied-mind.git
cd embodied-mind

# Create conda environment
conda create -n embodied-mind python=3.10 -y
conda activate embodied-mind

# Install MineDojo
pip install minedojo

# Install project dependencies
pip install -e .

# Configure API credentials
cp .env.example .env
# Edit .env with your API key (supports OpenAI, Google Gemini, or local models)
```

Run a single task:

```bash
python -m embodied_mind.run \
    --task "harvest_milk" \
    --model "gemini-2.0-flash" \
    --max_steps 3000 \
    --memory_enabled \
    --verbose
```

Run the evaluation suite:

```bash
python -m embodied_mind.evaluate \
    --task_suite harvest \
    --model "gemini-2.0-flash" \
    --num_episodes 5 \
    --output_dir results/
```

Visualize a replay:

```bash
python -m embodied_mind.visualize \
    --replay results/harvest_milk_ep0.json \
    --show_memory \
    --show_plan
```

🚧 Work in Progress: Evaluation experiments are currently being run on MineDojo Programmatic Tasks (Harvest, Survival, Tech Tree). Results will be updated here as they become available.
```
embodied-mind/
├── embodied_mind/
│   ├── __init__.py
│   ├── agent.py            # Main agent loop
│   ├── perceiver.py        # VLM-based visual perception
│   ├── planner.py          # Hierarchical LLM planner
│   ├── memory.py           # Episodic memory with vector retrieval
│   ├── skills.py           # Skill library management
│   ├── action_executor.py  # MineDojo action translation
│   ├── run.py              # Single-task runner
│   ├── evaluate.py         # Benchmark evaluation
│   └── visualize.py        # Replay visualization
├── configs/
│   ├── default.yaml        # Default agent config
│   └── tasks.yaml          # Task definitions
├── prompts/
│   ├── perceiver.txt       # VLM perception prompt
│   ├── planner.txt         # LLM planning prompt
│   └── reflector.txt       # Post-action reflection prompt
├── results/                # Evaluation outputs
├── assets/                 # Architecture diagrams
├── tests/
│   └── test_agent.py
├── setup.py
├── .env.example
└── README.md
```
The perceiver sends RGB frames to a VLM (Gemini, GPT-4o, or local LLaVA) with a structured prompt asking for:
- Entities: Nearby mobs, animals, villagers with estimated distances
- Resources: Visible blocks, items, craftable materials
- Terrain: Biome type, elevation, obstacles
- Threats: Hostile mobs, environmental dangers
- Inventory State: Current items and their quantities
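A minimal sketch of how such a structured reply might be parsed, assuming the VLM is prompted to answer in JSON; the `SceneDescription` fields mirror the categories above but are hypothetical, not the project's actual schema:

```python
# Illustrative parsing of a structured VLM reply into a scene description.
# Field names are assumptions mirroring the README's perception categories.
import json
from dataclasses import dataclass, field

@dataclass
class SceneDescription:
    entities: list = field(default_factory=list)   # mobs/animals with distances
    resources: list = field(default_factory=list)  # visible blocks and items
    terrain: str = ""                              # biome, elevation, obstacles
    threats: list = field(default_factory=list)    # hostile mobs, dangers
    inventory: dict = field(default_factory=dict)  # item name -> quantity

def parse_vlm_reply(reply: str) -> SceneDescription:
    """Parse the VLM's JSON reply, tolerating missing fields."""
    data = json.loads(reply)
    return SceneDescription(
        entities=data.get("entities", []),
        resources=data.get("resources", []),
        terrain=data.get("terrain", ""),
        threats=data.get("threats", []),
        inventory=data.get("inventory", {}),
    )

reply = ('{"entities": [{"type": "cow", "distance": 6}], '
         '"terrain": "plains", "inventory": {"bucket": 1}}')
scene = parse_vlm_reply(reply)
```

Defaulting missing fields keeps the agent loop robust when the VLM omits a category from its reply.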
The planner operates at two levels:
- Strategist: Decomposes the goal into an ordered list of subgoals (e.g., "craft stone pickaxe" → find_stone → mine_cobblestone × 3 → find_wood → craft_planks → craft_sticks → craft_pickaxe)
- Tactician: Converts each subgoal into a sequence of MineDojo-compatible actions, re-planning when the environment state changes unexpectedly
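The two levels can be illustrated with a toy decomposition; the subgoal and action names below are hypothetical examples, not the planner's real vocabulary (in the actual system the strategist's plan comes from an LLM, not a lookup table):

```python
# Toy two-level decomposition for "craft stone pickaxe".
# Strategist level: goal -> ordered subgoals (LLM output in the real system).
STRATEGIST_PLAN = {
    "craft_stone_pickaxe": [
        "find_stone", "mine_cobblestone_x3", "find_wood",
        "craft_planks", "craft_sticks", "craft_pickaxe",
    ],
}

# Tactician level: subgoal -> low-level action sequence (MineDojo-style
# primitive names here are illustrative placeholders).
TACTICIAN_LIBRARY = {
    "find_stone": ["look_around", "move_to_nearest:stone"],
    "mine_cobblestone_x3": ["equip:wooden_pickaxe", "attack", "attack", "attack"],
    "find_wood": ["look_around", "move_to_nearest:log"],
    "craft_planks": ["craft:planks"],
    "craft_sticks": ["craft:stick"],
    "craft_pickaxe": ["open_crafting_table", "craft:stone_pickaxe"],
}

def expand(goal: str) -> list:
    """Flatten a high-level goal into its low-level action sequence."""
    return [action
            for subgoal in STRATEGIST_PLAN[goal]
            for action in TACTICIAN_LIBRARY[subgoal]]

actions = expand("craft_stone_pickaxe")
```

Keeping the two levels separate is what allows re-planning: when a subgoal fails, only its tactician-level expansion is regenerated, not the whole strategist plan.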
Each experience is stored as:

```python
{
    "task": str,              # Task being attempted
    "observation": str,       # Scene description at decision point
    "plan": str,              # Plan that was executed
    "actions": List[str],     # Action sequence taken
    "outcome": str,           # Success/failure description
    "reward": float,          # MineDojo reward signal
    "embedding": List[float]  # Text embedding for similarity search
}
```

At decision time, the agent retrieves the top-k most similar past experiences and includes them as in-context examples for the planner.
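Top-k retrieval can be sketched with plain cosine similarity; the toy three-dimensional embeddings below stand in for real text-embedding vectors:

```python
# Minimal top-k retrieval over stored experiences by cosine similarity.
# Embeddings are toy 3-d vectors; real ones come from a text-embedding model.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_top_k(memory, query_embedding, k=2):
    """Return the k stored experiences most similar to the query."""
    ranked = sorted(memory,
                    key=lambda e: cosine(e["embedding"], query_embedding),
                    reverse=True)
    return ranked[:k]

memory = [
    {"task": "harvest_milk",  "outcome": "success", "embedding": [0.9, 0.1, 0.0]},
    {"task": "combat_zombie", "outcome": "failure", "embedding": [0.0, 0.2, 0.9]},
    {"task": "harvest_wool",  "outcome": "success", "embedding": [0.8, 0.3, 0.1]},
]
hits = retrieve_top_k(memory, [1.0, 0.0, 0.0], k=2)
```

The retrieved entries, including their `outcome` fields, are then formatted into the planner prompt as in-context examples.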
Rather than fine-tuning model weights, EmbodiedMind adapts through:
- Experience accumulation: Memory grows across episodes
- Failure avoidance: Failed strategies are explicitly noted in retrieved context
- Strategy transfer: Successful plans from similar tasks inform new situations
Contributions are welcome!
MIT License. See LICENSE for details.
- MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge (NeurIPS 2022, Outstanding Paper)
- Voyager: An Open-Ended Embodied Agent with Large Language Models
- ODYSSEY: Empowering Minecraft Agents with Open-World Skills (IJCAI 2025)
- STEVE-1: A Generative Model for Text-to-Behavior in Minecraft
Vishal Chauhan (vishalchauhan@outlook.sg), Ph.D. Candidate, The University of Tokyo