
When Agents Get Lost: Dissecting Failure Modes in Graph-Based Navigation Instruction Evaluation

Farzad Shami, Kimia Abedini, Seyed Hossein Hosseini, Henrikki Tenkanen

Aalto University

[Paper] [Project Page] [Data Explorer]

Abstract

Vision-and-Language Navigation (VLN) requires agents to interpret natural language instructions for spatial reasoning, yet evaluating instruction quality remains challenging when agents fail. This gap highlights a critical need for a principled understanding of why navigation instructions fail. To address this, we first present a taxonomy of navigation instruction failures that clusters failure cases into four categories: (i) linguistic properties, (ii) topological constraints, (iii) agent limitations, and (iv) execution barriers. We then introduce a dataset of over 450 annotated navigation failure traces collected from GROKE, a vision-free evaluation framework that utilizes OpenStreetMap (OSM) data. Our analysis demonstrates that agent limitations (74.2%) constitute the dominant error category, with stop-location errors and planning failures as the most frequent subcategories. The dataset and taxonomy together provide actionable insights that enable instruction generation systems to identify and avoid under-specification patterns while allowing evaluation frameworks to systematically distinguish between instruction quality issues and agent-specific artifacts.

Failure Taxonomy

We developed a hierarchical four-axis categorization framework:

| Dimension | Code | Prevalence | Top Subcategory |
| --- | --- | --- | --- |
| Agent Limitations | A | 74.2% | Stop-location errors (A5, 49.9%) |
| Execution Failures | E | 46.5% | Timing/temporal errors (E4, 49.8%) |
| Linguistic Properties | L | 23.2% | Over-specification (L2, 42.1%) |
| Topological Constraints | T | 12.4% | Junction complexity (T5, 73.8%) |
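Each prevalence figure is that dimension's trace count divided by the 492 annotated traces; because a trace can carry labels on several axes, the shares sum to more than 100%. A quick sanity check (counts taken from the detailed tables in this README):

```python
# Recompute the prevalence column from the per-dimension trace counts.
TOTAL_TRACES = 492
dimension_counts = {"A": 365, "E": 229, "L": 114, "T": 61}

for dim, n in dimension_counts.items():
    # e.g. A: 365 / 492 = 74.2%
    print(f"{dim}: {100 * n / TOTAL_TRACES:.1f}%")
```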

Linguistic (L)

  • L1 Under-specification: Spatial under-specification, landmark under-specification, missing stopping criteria
  • L2 Over-specification: Visual-only features (e.g., "red building" not in OSM), transient objects
  • L3 Temporal ambiguity: Overlapping spatial regions, implicit sequencing, conditional dependencies
  • L4 Numerical inconsistency: Numerical mismatch, distance quantification errors
  • L5 Referential complexity: Deictic references, anaphora chains, implicit references
  • L6 Negation confusion: Action negation, landmark negation, double negation
  • L7 Directional ambiguity: Relative direction confusion, conflicting cues, cardinal-relative mismatch
  • L8 Scale/Distance vagueness: Temporal vagueness, perceptual scale issues
  • L9 Landmark co-reference: Synonym variation, partial references

Topological (T)

  • T1 Path ambiguity: Parallel routes, loop alternatives
  • T2 Landmark displacement: Premature mention, post-turn reference, opposite-side displacement
  • T3 Scale mismatch: Distance underestimation, landmark density
  • T4 Connectivity violation
  • T5 Junction complexity: Five-way+ intersections, offset intersections, roundabout navigation

Agent Limitations (A)

  • A1 POI grounding failure: Similarity thresholds
  • A2 Heading initialization: Compass calibration, map orientation mismatch
  • A3 Sub-goal segmentation: Over-segmentation, under-segmentation, incorrect action primitives
  • A4 Context window saturation: Token overflow, information overload, visibility threshold miscalibration
  • A5 Stop-location errors: Stop-too-late (overshoot), stop-too-early (undershoot), missed stopping cue
  • A6 Planning and reasoning: Goal confusion, cascading failures, premature termination
  • A7 Memory and state tracking: Visited location amnesia, sub-goal state loss, instruction forgetting
  • A8 Multi-agent coordination: Incompatible formats, status misalignment

Execution (E)

  • E1 Looping behavior: Immediate loops, cycle loops, area circling
  • E2 Exploration inefficiency: Random walk, boundary avoidance
  • E3 Action execution errors: Turn angle errors, distance errors
  • E4 Timing and temporal: Premature actions, delayed actions, simultaneous action conflicts
  • E5 Verification failures: False positive completion, false negative completion, no verification
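As a sketch of how the four-axis codes can be applied in practice, the snippet below validates a trace's failure labels and reports which dimensions they span. The code values follow the taxonomy above; the helper function and the example labels are ours, not part of the released tooling.

```python
# All valid codes in the four-axis taxonomy: L1-L9, T1-T5, A1-A8, E1-E5.
VALID_CODES = (
    {f"L{i}" for i in range(1, 10)}
    | {f"T{i}" for i in range(1, 6)}
    | {f"A{i}" for i in range(1, 9)}
    | {f"E{i}" for i in range(1, 6)}
)

def dimensions(labels):
    """Return the set of axes (L/T/A/E) covered by a trace's failure codes."""
    unknown = set(labels) - VALID_CODES
    if unknown:
        raise ValueError(f"unknown failure codes: {sorted(unknown)}")
    return {code[0] for code in labels}

# A trace labelled with a stop-location error (A5) that cascades into a
# verification failure (E5) spans two dimensions:
print(sorted(dimensions({"A5", "E5"})))  # ['A', 'E']
```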

Dataset

The dataset contains 492 annotated navigation failure traces collected from GROKE's evaluation of Map2Seq test sets, representing 35.14% of evaluated instructions. Each trace includes:

  • Agent's step-by-step reasoning
  • Identified sub-goals
  • Extracted POIs
  • Map with annotated path
  • Graph network showing traversed path with marked POIs
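A hypothetical record layout for one trace, covering the five components listed above. All field names and values are invented for illustration; the actual files under data/ and annotations/ may use a different schema.

```python
# Illustrative trace record (NOT the dataset's actual schema).
trace = {
    "instruction_id": "map2seq_example",          # invented identifier
    "reasoning_steps": [                          # agent's step-by-step reasoning
        "Start facing north on the main street.",
        "The bank should be on the right.",
    ],
    "sub_goals": ["turn left at the cafe", "stop at the bank"],
    "extracted_pois": ["cafe", "bank"],           # POIs grounded against OSM
    "failure_codes": ["A5", "E5"],                # taxonomy labels, see above
    "map_image": "annotations/example_map.png",   # map with annotated path
    "graph_path": [101, 102, 107],                # traversed graph nodes
}

print(sorted(trace))
```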

Key Findings

  • Agent limitations (74.2%) are the dominant error category, with stop-location errors and planning failures as the most frequent subcategories.
  • Execution failures (46.5%) are the second most common, dominated by timing and temporal errors.
  • Half of all failures exhibit multi-dimensional error patterns, indicating compounded issues rather than isolated problems.
  • Six design implications are derived for improving vision-free navigation systems: spatial representation improvements, ambiguous-terminology handling, landmark detection enhancement, action timing refinement, junction complexity handling, and stop-location refinement.

Detailed Error Counts

Agent Dimension (n=365)

| Error Type | Count | % of Agent Errors |
| --- | --- | --- |
| Stop-location errors | 182 | 49.9% |
| Planning and reasoning | 117 | 32.1% |
| POI grounding failure | 100 | 27.4% |
| Context window saturation | 32 | 8.8% |
| Sub-goal segmentation | 24 | 6.6% |
| Heading initialization | 12 | 3.3% |
| Multi-agent coordination | 11 | 3.0% |
| Memory and state tracking | 3 | 0.8% |

Execution Dimension (n=229)

| Error Type | Count | % of Execution Errors |
| --- | --- | --- |
| Timing and temporal | 114 | 49.8% |
| Verification failures | 90 | 39.3% |
| Exploration inefficiency | 75 | 32.8% |
| Action execution errors | 9 | 3.9% |
| Looping behavior | 1 | 0.4% |

Linguistic Dimension (n=114)

| Error Type | Count | % of Linguistic Errors |
| --- | --- | --- |
| Over-specification | 48 | 42.1% |
| Under-specification | 28 | 24.6% |
| Directional ambiguity | 17 | 14.9% |
| Scale/Distance vagueness | 16 | 14.0% |
| Referential complexity | 9 | 7.9% |
| Numerical inconsistency | 7 | 6.1% |
| Temporal ambiguity | 6 | 5.3% |
| Negation confusion | 2 | 1.8% |
| Landmark co-reference | 1 | 0.9% |

Topological Dimension (n=61)

| Error Type | Count | % of Topological Errors |
| --- | --- | --- |
| Junction complexity | 45 | 73.8% |
| Path ambiguity | 11 | 18.0% |
| Landmark displacement | 3 | 4.9% |
| Scale mismatch | 3 | 4.9% |
| Connectivity violation | 1 | 1.6% |

Key Co-occurrence Statistics

Dimension-Level Co-occurrences

| Dimension Pair | Count |
| --- | --- |
| Agent + Execution | 153 |
| Agent + Linguistic | 72 |
| Execution + Linguistic | 41 |
| Execution + Topological | 25 |
| Agent + Topological | 24 |
| Linguistic + Topological | 14 |
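Dimension-level co-occurrence counts like these can be recomputed from per-trace label sets. A toy sketch with three invented traces (not the actual dataset):

```python
from collections import Counter
from itertools import combinations

# Each trace is the set of taxonomy codes assigned to it.
traces = [
    {"A5", "E5"},        # agent + execution
    {"A6", "E4", "L2"},  # agent + execution + linguistic
    {"T5"},              # single-dimension failure
]

pair_counts = Counter()
for labels in traces:
    dims = sorted({code[0] for code in labels})  # axes touched by this trace
    for pair in combinations(dims, 2):           # every co-occurring axis pair
        pair_counts[pair] += 1

print(pair_counts.most_common())
```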

Most Frequent Agent → Execution Cascades

| Agent Error | Execution Error | Count |
| --- | --- | --- |
| Planning and reasoning | Verification failures | 45 |
| Planning and reasoning | Timing and temporal | 36 |
| POI grounding failure | Verification failures | 30 |
| Planning and reasoning | Exploration inefficiency | 29 |
| Stop-location errors | Verification failures | 22 |
| POI grounding failure | Exploration inefficiency | 22 |

Most Frequent Linguistic → Agent Cascades

| Linguistic Error | Agent Error | Count |
| --- | --- | --- |
| Over-specification | Stop-location errors | 22 |
| Over-specification | POI grounding failure | 16 |
| Scale/Distance vagueness | Stop-location errors | 8 |
| Over-specification | Planning and reasoning | 8 |
| Under-specification | Planning and reasoning | 6 |

Project Structure

  • data/ - Navigation data
  • annotations/ - Annotated failure traces
  • annotation_data/ - Raw annotation data
  • annotation-tool/ - Web-based annotation tool
  • explorer-tool/ - Interactive data exploration tool
  • src/ - Source code
  • results/ - Evaluation results
  • evaluation_metrics.py - Evaluation metrics
  • inter-annotator.py - Inter-annotator agreement computation
  • index.html - Project landing page (GitHub Pages)
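The repository's inter-annotator.py computes annotator agreement. As a generic sketch of one common agreement measure, here is Cohen's kappa for two annotators' per-trace labels; the implementation and the toy labels are ours, not taken from the repository.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' nominal labels (paired lists)."""
    assert len(a) == len(b) and a
    n = len(a)
    # Observed agreement: fraction of traces labelled identically.
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal label distribution.
    ca, cb = Counter(a), Counter(b)
    p_exp = sum(ca[k] * cb[k] for k in set(a) | set(b)) / n**2
    if p_exp == 1.0:          # both annotators used a single identical label
        return 1.0
    return (p_obs - p_exp) / (1 - p_exp)

# Invented dimension labels for five traces:
ann1 = ["A", "A", "E", "L", "A"]
ann2 = ["A", "E", "E", "L", "A"]
print(round(cohens_kappa(ann1, ann2), 3))  # 0.688
```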

Citation

GROKE original paper:

@misc{shami2026grokevisionfreenavigationinstruction,
      title={GROKE: Vision-Free Navigation Instruction Evaluation via Graph Reasoning on OpenStreetMap}, 
      author={Farzad Shami and Subhrasankha Dey and Nico Van de Weghe and Henrikki Tenkanen},
      year={2026},
      eprint={2601.07375},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.07375}, 
}

About

Language-Oriented Spatial Taxonomy (LOST)
