
🧠 EmbodiedMind

VLM-Guided Embodied Agent with Hierarchical Planning and Episodic Memory in MineDojo

Python 3.9+ | License: MIT

EmbodiedMind is a research framework for building multimodal embodied agents that perceive, reason, plan, and act in open-ended 3D environments. Built on MineDojo (Minecraft), it combines Vision-Language Models (VLMs) for grounded perception, LLMs for hierarchical task planning, and an episodic memory system for experience-driven adaptation.

🎯 Key Features

  • Multimodal Perception: VLM-based visual grounding that converts raw game frames into structured scene descriptions (entities, resources, terrain, threats)
  • Hierarchical Planning: A two-level planner in which a high-level LLM strategist decomposes goals into subgoals and a low-level action translator maps subgoals to executable MineDojo actions
  • Episodic Memory: Vector-similarity retrieval of past experiences to enable in-context learning; the agent recalls what worked (and what failed) in similar situations
  • Skill Library: Reusable, composable action sequences discovered through experience and stored for future retrieval
  • Evaluation Suite: Automated benchmarking across MineDojo Harvest, Survival, and Tech Tree tasks with standardized metrics

πŸ—οΈ Architecture

┌─────────────────────────────────────────────────────┐
│                  EmbodiedMind Agent                 │
├──────────┬──────────────┬──────────┬────────────────┤
│  Visual  │ Hierarchical │ Episodic │     Skill      │
│ Perceiver│   Planner    │  Memory  │    Library     │
│  (VLM)   │    (LLM)     │ (Vector) │   (Code-as-    │
│          │              │          │    Action)     │
├──────────┴──────────────┴──────────┴────────────────┤
│           Action Executor (MineDojo API)            │
├─────────────────────────────────────────────────────┤
│          MineDojo Environment (Minecraft)           │
└─────────────────────────────────────────────────────┘

Agent Loop:

  1. Observe → Capture RGB frame from MineDojo environment
  2. Perceive → VLM extracts structured scene description (entities, inventory, threats)
  3. Recall → Query episodic memory for relevant past experiences
  4. Plan → LLM generates/updates hierarchical plan using perception + memory context
  5. Act → Translate plan step into MineDojo action and execute
  6. Reflect → Store outcome in episodic memory for future retrieval
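The six-step loop above can be sketched in Python. All class, function, and parameter names here (`Experience`, `EpisodicMemory`, `run_episode`, the `perceive`/`make_plan`/`translate` callables) are illustrative stand-ins, not the actual `embodied_mind` API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Experience:
    scene: str
    plan: str
    action: Any
    reward: float


@dataclass
class EpisodicMemory:
    entries: List[Experience] = field(default_factory=list)

    def retrieve(self, scene: str, k: int = 3) -> List[Experience]:
        # Placeholder: the real system ranks entries by embedding similarity.
        return self.entries[-k:]

    def store(self, exp: Experience) -> None:
        self.entries.append(exp)


def run_episode(env_step: Callable, perceive: Callable, make_plan: Callable,
                translate: Callable, memory: EpisodicMemory,
                first_obs: Any, max_steps: int = 3000) -> float:
    """One Observe -> Perceive -> Recall -> Plan -> Act -> Reflect episode."""
    obs, total_reward = first_obs, 0.0
    for _ in range(max_steps):
        scene = perceive(obs)                     # 2. Perceive (VLM)
        recalled = memory.retrieve(scene)         # 3. Recall past experiences
        plan = make_plan(scene, recalled)         # 4. Plan (LLM)
        action = translate(plan)                  # map plan step to env action
        obs, reward, done = env_step(action)      # 5. Act, then 1. Observe again
        memory.store(Experience(scene, plan, action, reward))  # 6. Reflect
        total_reward += reward
        if done:
            break
    return total_reward
```

The loop is deliberately agnostic to the underlying models: swapping the VLM or LLM only changes the callables passed in, not the control flow.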

📦 Installation

Prerequisites

  • Python ≥ 3.9
  • JDK 8 (for Minecraft backend)
  • GPU recommended for VLM inference (or use API-based models)

Setup

git clone https://github.com/YOUR_USERNAME/embodied-mind.git
cd embodied-mind

# Create conda environment
conda create -n embodied-mind python=3.10 -y
conda activate embodied-mind

# Install MineDojo
pip install minedojo

# Install project dependencies
pip install -e .

Configure API Keys

cp .env.example .env
# Edit .env with your API key (supports OpenAI, Google Gemini, or local models)

🚀 Quick Start

Run a single task

python -m embodied_mind.run \
    --task "harvest_milk" \
    --model "gemini-2.0-flash" \
    --max_steps 3000 \
    --memory_enabled \
    --verbose

Run the evaluation suite

python -m embodied_mind.evaluate \
    --task_suite harvest \
    --model "gemini-2.0-flash" \
    --num_episodes 5 \
    --output_dir results/

Visualize agent behavior

python -m embodied_mind.visualize \
    --replay results/harvest_milk_ep0.json \
    --show_memory \
    --show_plan

📊 Benchmark Results

🚧 Work in Progress: Evaluation experiments are currently being run on MineDojo Programmatic Tasks (Harvest, Survival, Tech Tree). Results will be updated here as they become available.

πŸ“ Project Structure

embodied-mind/
├── embodied_mind/
│   ├── __init__.py
│   ├── agent.py              # Main agent loop
│   ├── perceiver.py          # VLM-based visual perception
│   ├── planner.py            # Hierarchical LLM planner
│   ├── memory.py             # Episodic memory with vector retrieval
│   ├── skills.py             # Skill library management
│   ├── action_executor.py    # MineDojo action translation
│   ├── run.py                # Single-task runner
│   ├── evaluate.py           # Benchmark evaluation
│   └── visualize.py          # Replay visualization
├── configs/
│   ├── default.yaml          # Default agent config
│   └── tasks.yaml            # Task definitions
├── prompts/
│   ├── perceiver.txt         # VLM perception prompt
│   ├── planner.txt           # LLM planning prompt
│   └── reflector.txt         # Post-action reflection prompt
├── results/                  # Evaluation outputs
├── assets/                   # Architecture diagrams
├── tests/
│   └── test_agent.py
├── setup.py
├── .env.example
└── README.md

🔬 Research Details

Visual Perception

The perceiver sends RGB frames to a VLM (Gemini, GPT-4o, or local LLaVA) with a structured prompt asking for:

  • Entities: Nearby mobs, animals, villagers with estimated distances
  • Resources: Visible blocks, items, craftable materials
  • Terrain: Biome type, elevation, obstacles
  • Threats: Hostile mobs, environmental dangers
  • Inventory State: Current items and their quantities
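As a sketch, the perceiver's reply could be parsed into a dictionary keyed by these five categories. The JSON field names below are an assumption for illustration, not the project's actual schema.

```python
import json

# The five categories the perceiver is prompted for (field names assumed).
SCENE_FIELDS = ("entities", "resources", "terrain", "threats", "inventory")


def parse_scene(vlm_reply: str) -> dict:
    """Parse the VLM's JSON reply, defaulting any missing category to empty."""
    scene = json.loads(vlm_reply)
    for key in SCENE_FIELDS:
        scene.setdefault(key, [])
    return scene


scene = parse_scene(
    '{"entities": [{"type": "cow", "distance_m": 6}], "terrain": ["plains"]}'
)
```

Defaulting missing categories keeps downstream planning code simple even when the VLM omits empty fields.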

Hierarchical Planning

The planner operates at two levels:

  • Strategist: Decomposes the goal into an ordered list of subgoals (e.g., "craft stone pickaxe" → find_stone → mine_cobblestone × 3 → find_wood → craft_planks → craft_sticks → craft_pickaxe)
  • Tactician: Converts each subgoal into a sequence of MineDojo-compatible actions, re-planning when the environment state changes unexpectedly
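A toy illustration of the two levels, using hypothetical subgoal and action names; the real tactician queries an LLM and re-plans online rather than consulting fixed lookup tables:

```python
# Strategist level: goal -> ordered subgoals (hypothetical decomposition).
STRATEGIST = {
    "craft stone pickaxe": [
        "find_stone", "mine_cobblestone", "mine_cobblestone",
        "mine_cobblestone", "find_wood", "craft_planks",
        "craft_sticks", "craft_pickaxe",
    ],
}

# Tactician level: subgoal -> low-level action names (hypothetical mapping).
TACTICIAN = {
    "find_stone": ["look_around", "move_forward"],
    "mine_cobblestone": ["attack"],
    "find_wood": ["look_around", "move_forward"],
    "craft_planks": ["craft"],
    "craft_sticks": ["craft"],
    "craft_pickaxe": ["craft"],
}


def plan(goal: str) -> list:
    """Decompose a goal into (subgoal, action sequence) pairs."""
    return [(subgoal, TACTICIAN[subgoal]) for subgoal in STRATEGIST[goal]]
```

Separating the two tables mirrors the design choice in the text: the strategist reasons about *what* to do, the tactician about *how*, so either level can be revised without touching the other.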

Episodic Memory

Each experience is stored as:

{
    "task": str,               # Task being attempted
    "observation": str,        # Scene description at decision point
    "plan": str,               # Plan that was executed
    "actions": List[str],      # Action sequence taken
    "outcome": str,            # Success/failure description
    "reward": float,           # MineDojo reward signal
    "embedding": List[float]   # Text embedding for similarity search
}

At decision time, the agent retrieves the top-k most similar past experiences and includes them as in-context examples for the planner.
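Top-k retrieval over the stored `embedding` field can be sketched with plain cosine similarity. This is a stand-in: the real system would obtain embeddings from a text-embedding model and likely use a vector index rather than a linear scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query_embedding, experiences, k=3):
    """Return the k stored experiences most similar to the query embedding."""
    return sorted(
        experiences,
        key=lambda exp: cosine(query_embedding, exp["embedding"]),
        reverse=True,
    )[:k]
```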

Adaptation via In-Context Learning

Rather than fine-tuning model weights, EmbodiedMind adapts through:

  1. Experience accumulation: Memory grows across episodes
  2. Failure avoidance: Failed strategies are explicitly noted in retrieved context
  3. Strategy transfer: Successful plans from similar tasks inform new situations
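For illustration, retrieved experiences might be serialized into the planner prompt along these lines; the template text and field names are an assumption, not the project's actual prompt format.

```python
def format_memory_context(experiences) -> str:
    """Render retrieved experiences as in-context examples for the planner."""
    blocks = []
    for i, exp in enumerate(experiences, 1):
        # Label failures explicitly so the planner can avoid them.
        tag = "SUCCEEDED" if exp["reward"] > 0 else "FAILED"
        blocks.append(
            f"[Past experience {i}: {tag}]\n"
            f"Situation: {exp['observation']}\n"
            f"Plan tried: {exp['plan']}\n"
            f"Outcome: {exp['outcome']}"
        )
    return "\n\n".join(blocks)
```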

🤝 Contributing

Contributions are welcome!

📄 License

MIT License. See LICENSE for details.

📚 References

✉️ Contact

Vishal Chauhan (vishalchauhan@outlook.sg)
Ph.D. Candidate, The University of Tokyo
