Farzad Shami, Kimia Abedini, Seyed Hossein Hosseini, Henrikki Tenkanen
Aalto University
[Paper] [Project Page] [Data Explorer]
Vision-and-Language Navigation (VLN) requires agents to interpret natural language instructions for spatial reasoning, yet evaluating instruction quality remains challenging when agents fail: the failure may stem from the instruction, the agent, or both. This gap highlights a critical need for a principled understanding of why navigation instructions fail. To address it, we first present a taxonomy of navigation instruction failures that clusters failure cases into four categories: (i) linguistic properties, (ii) topological constraints, (iii) agent limitations, and (iv) execution failures. We then introduce a dataset of over 450 annotated navigation failure traces collected from GROKE, a vision-free evaluation framework built on OpenStreetMap (OSM) data. Our analysis shows that agent limitations (74.2%) are the dominant error category, with stop-location errors and planning failures as the most frequent subcategories. Together, the taxonomy and dataset provide actionable insights: instruction generation systems can identify and avoid under-specification patterns, and evaluation frameworks can systematically distinguish instruction quality issues from agent-specific artifacts.
We developed a hierarchical four-axis categorization framework; the dimensions are summarized in the table below, the full subcategory list follows, and a schematic encoding of the code list is sketched after that.
| Dimension | Code | Prevalence (% of traces) | Top Subcategory |
|---|---|---|---|
| Agent Limitations | A | 74.2% | Stop-location errors (A5, 49.9%) |
| Execution Failures | E | 46.5% | Timing/temporal errors (E4, 49.8%) |
| Linguistic Properties | L | 23.2% | Over-specification (L2, 42.1%) |
| Topological Constraints | T | 12.4% | Junction complexity (T5, 73.8%) |

Linguistic properties (L):
- L1 Under-specification: Spatial under-specification, landmark under-specification, missing stopping criteria
- L2 Over-specification: Visual-only features (e.g., "red building" not in OSM), transient objects
- L3 Temporal ambiguity: Overlapping spatial regions, implicit sequencing, conditional dependencies
- L4 Numerical inconsistency: Numerical mismatch, distance quantification errors
- L5 Referential complexity: Deictic references, anaphora chains, implicit references
- L6 Negation confusion: Action negation, landmark negation, double negation
- L7 Directional ambiguity: Relative direction confusion, conflicting cues, cardinal-relative mismatch
- L8 Scale/Distance vagueness: Temporal vagueness, perceptual scale issues
- L9 Landmark co-reference: Synonym variation, partial references

Topological constraints (T):
- T1 Path ambiguity: Parallel routes, loop alternatives
- T2 Landmark displacement: Premature mention, post-turn reference, opposite-side displacement
- T3 Scale mismatch: Distance underestimation, landmark density
- T4 Connectivity violation
- T5 Junction complexity: Five-way+ intersections, offset intersections, roundabout navigation

Agent limitations (A):
- A1 POI grounding failure: Similarity thresholds
- A2 Heading initialization: Compass calibration, map orientation mismatch
- A3 Sub-goal segmentation: Over-segmentation, under-segmentation, incorrect action primitives
- A4 Context window saturation: Token overflow, information overload, visibility threshold miscalibration
- A5 Stop-location errors: Stop-too-late (overshoot), stop-too-early (undershoot), missed stopping cue
- A6 Planning and reasoning: Goal confusion, cascading failures, premature termination
- A7 Memory and state tracking: Visited location amnesia, sub-goal state loss, instruction forgetting
- A8 Multi-agent coordination: Incompatible formats, status misalignment

Execution failures (E):
- E1 Looping behavior: Immediate loops, cycle loops, area circling
- E2 Exploration inefficiency: Random walk, boundary avoidance
- E3 Action execution errors: Turn angle errors, distance errors
- E4 Timing and temporal: Premature actions, delayed actions, simultaneous action conflicts
- E5 Verification failures: False positive completion, false negative completion, no verification
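
For reference, the code-to-name mapping above can be written down directly. The sketch below is an illustrative Python encoding; the plain-dict layout and the `describe` helper are our own convenience assumptions, not the repository's actual schema, but the codes and names follow the lists above.

```python
# Illustrative encoding of the four-axis taxonomy as nested dicts.
TAXONOMY = {
    "Linguistic properties": {
        "L1": "Under-specification",
        "L2": "Over-specification",
        "L3": "Temporal ambiguity",
        "L4": "Numerical inconsistency",
        "L5": "Referential complexity",
        "L6": "Negation confusion",
        "L7": "Directional ambiguity",
        "L8": "Scale/Distance vagueness",
        "L9": "Landmark co-reference",
    },
    "Topological constraints": {
        "T1": "Path ambiguity",
        "T2": "Landmark displacement",
        "T3": "Scale mismatch",
        "T4": "Connectivity violation",
        "T5": "Junction complexity",
    },
    "Agent limitations": {
        "A1": "POI grounding failure",
        "A2": "Heading initialization",
        "A3": "Sub-goal segmentation",
        "A4": "Context window saturation",
        "A5": "Stop-location errors",
        "A6": "Planning and reasoning",
        "A7": "Memory and state tracking",
        "A8": "Multi-agent coordination",
    },
    "Execution failures": {
        "E1": "Looping behavior",
        "E2": "Exploration inefficiency",
        "E3": "Action execution errors",
        "E4": "Timing and temporal",
        "E5": "Verification failures",
    },
}

def describe(code: str) -> str:
    """Resolve a subcategory code like 'A5' to 'dimension / subcategory'."""
    for dimension, subcats in TAXONOMY.items():
        if code in subcats:
            return f"{dimension} / {subcats[code]}"
    raise KeyError(f"unknown taxonomy code: {code}")

print(describe("A5"))  # Agent limitations / Stop-location errors
```
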
The dataset contains 492 annotated navigation failure traces collected from GROKE's evaluation of the Map2Seq test sets, representing 35.14% of evaluated instructions. Each trace includes the following (an illustrative record layout follows the list):
- Agent's step-by-step reasoning
- Identified sub-goals
- Extracted POIs
- Map with annotated path
- Graph network showing traversed path with marked POIs
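
The sketch below shows one plausible shape for a trace record. The field names and the JSON layout are assumptions for illustration only; consult the files under `data/` and `annotations/` for the actual format.

```python
import json
from dataclasses import dataclass, field

@dataclass
class FailureTrace:
    instruction: str      # original navigation instruction
    reasoning: list[str]  # agent's step-by-step reasoning
    sub_goals: list[str]  # identified sub-goals
    pois: list[str]       # extracted points of interest
    map_image: str        # path to the map with annotated path (assumed artifact)
    graph_image: str      # path to the graph network with marked POIs (assumed artifact)
    error_codes: list[str] = field(default_factory=list)  # e.g. ["A5", "E4"]

def load_traces(path: str) -> list[FailureTrace]:
    """Load annotated traces from a JSON array file (hypothetical layout)."""
    with open(path, encoding="utf-8") as f:
        return [FailureTrace(**record) for record in json.load(f)]
```
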
- Agent limitations (74.2%) are the dominant error category, with stop-location errors and planning failures as the most frequent subcategories.
- Execution failures (46.5%) are the second most common, dominated by timing and temporal errors.
- Half of all failures exhibit multi-dimensional error patterns, indicating compounded issues rather than isolated problems.
- Six design implications are derived for improving vision-free navigation systems: spatial representation improvements, ambiguous-terminology handling, landmark detection enhancement, action timing refinement, junction complexity handling, and stop-location refinement.

The four tables below break each dimension down by subcategory. A trace can carry several labels, so the percentages, computed over traces with at least one error in that dimension, can sum to more than 100%.

Agent limitation errors:

| Error Type | Count | % of Agent Errors |
|---|---|---|
| Stop-location errors | 182 | 49.9% |
| Planning and reasoning | 117 | 32.1% |
| POI grounding failure | 100 | 27.4% |
| Context window saturation | 32 | 8.8% |
| Sub-goal segmentation | 24 | 6.6% |
| Heading initialization | 12 | 3.3% |
| Multi-agent coordination | 11 | 3.0% |
| Memory and state tracking | 3 | 0.8% |

Execution failure errors:

| Error Type | Count | % of Execution Errors |
|---|---|---|
| Timing and temporal | 114 | 49.8% |
| Verification failures | 90 | 39.3% |
| Exploration inefficiency | 75 | 32.8% |
| Action execution errors | 9 | 3.9% |
| Looping behavior | 1 | 0.4% |

Linguistic property errors:

| Error Type | Count | % of Linguistic Errors |
|---|---|---|
| Over-specification | 48 | 42.1% |
| Under-specification | 28 | 24.6% |
| Directional ambiguity | 17 | 14.9% |
| Scale/Distance vagueness | 16 | 14.0% |
| Referential complexity | 9 | 7.9% |
| Numerical inconsistency | 7 | 6.1% |
| Temporal ambiguity | 6 | 5.3% |
| Negation confusion | 2 | 1.8% |
| Landmark co-reference | 1 | 0.9% |

Topological constraint errors:

| Error Type | Count | % of Topological Errors |
|---|---|---|
| Junction complexity | 45 | 73.8% |
| Path ambiguity | 11 | 18.0% |
| Landmark displacement | 3 | 4.9% |
| Scale mismatch | 3 | 4.9% |
| Connectivity Violation | 1 | 1.6% |

Most frequent cross-dimension co-occurrences (traces tagged with errors from both dimensions):

| Dimension Pair | Count |
|---|---|
| Agent + Execution | 153 |
| Agent + Linguistic | 72 |
| Execution + Linguistic | 41 |
| Execution + Topological | 25 |
| Agent + Topological | 24 |
| Linguistic + Topological | 14 |
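
Pair counts like these can be reproduced from multi-label annotations with a simple tally. The sketch below assumes each trace carries a list of subcategory codes (as in the hypothetical `error_codes` field above):

```python
from collections import Counter
from itertools import combinations

def dimension_pairs(traces):
    """Count how often two error dimensions co-occur on the same trace.

    `traces` is an iterable of code lists, e.g. [["A5", "E4"], ["L2", "A1"]].
    A trace tagged with codes from k dimensions contributes once to each
    of its C(k, 2) dimension pairs.
    """
    counts = Counter()
    for codes in traces:
        dims = sorted({c[0] for c in codes})  # distinct dimension letters
        for a, b in combinations(dims, 2):
            counts[(a, b)] += 1
    return counts

# Example: one trace with agent + execution errors, one agent-only trace.
print(dimension_pairs([["A5", "E4", "E5"], ["A6"]]))
# Counter({('A', 'E'): 1})
```
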

Most frequent agent-execution subcategory pairs:

| Agent Error | Execution Error | Count |
|---|---|---|
| Planning and reasoning | Verification failures | 45 |
| Planning and reasoning | Timing and temporal | 36 |
| POI grounding failure | Verification failures | 30 |
| Planning and reasoning | Exploration inefficiency | 29 |
| Stop-location errors | Verification failures | 22 |
| POI grounding failure | Exploration inefficiency | 22 |

Most frequent linguistic-agent subcategory pairs:

| Linguistic Error | Agent Error | Count |
|---|---|---|
| Over-specification | Stop-location errors | 22 |
| Over-specification | POI grounding failure | 16 |
| Scale/Distance vagueness | Stop-location errors | 8 |
| Over-specification | Planning and reasoning | 8 |
| Under-specification | Planning and reasoning | 6 |
Repository structure:

- `data/` - Navigation data
- `annotations/` - Annotated failure traces
- `annotation_data/` - Raw annotation data
- `annotation-tool/` - Web-based annotation tool
- `explorer-tool/` - Interactive data exploration tools
- `src/` - Source code
- `results/` - Evaluation results
- `evaluation_metrics.py` - Evaluation metrics
- `inter-annotator.py` - Inter-annotator agreement computation (a minimal sketch follows)
- `index.html` - Project landing page (GitHub Pages)
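
`inter-annotator.py` computes agreement between annotators; its exact method isn't reproduced here, but a minimal Cohen's kappa sketch over two annotators' labels for the same traces illustrates the kind of computation involved:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labelled independently
    # according to their empirical label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy example with dimension-level labels for four traces.
print(cohens_kappa(["A", "A", "E", "L"], ["A", "E", "E", "L"]))  # ~0.636
```
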
GROKE original paper:
    @misc{shami2026grokevisionfreenavigationinstruction,
      title={GROKE: Vision-Free Navigation Instruction Evaluation via Graph Reasoning on OpenStreetMap},
      author={Farzad Shami and Subhrasankha Dey and Nico Van de Weghe and Henrikki Tenkanen},
      year={2026},
      eprint={2601.07375},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.07375},
    }