From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation
[Website] [Paper] [Models] [Datasets] [Demo]
We present FSD (From Seeing to Doing) with:
- Embodied-FSD Model: FSD is a novel vision-language model that generates intermediate representations through spatial relationship reasoning, providing fine-grained guidance for robotic manipulation. It integrates Spatial Relationship-Focused Chain-of-Thought (SrCoT) reasoning while retaining strong general capabilities.
- VABench: A challenging benchmark for evaluating visual aids generation capabilities in robotic manipulation scenarios.
Figure 1: Overview of FSD
Figure 2: Spatial relationship-focused reasoning process (SrCoT).
- [2025-07] We have updated the SIMPLER ENV branch and the LLM-based evaluation!
- [2025-07] We have released the detailed training, inference, and evaluation code and README. The VABench evaluation benchmark is officially released!
- [2025-05] The code repository is now public - welcome to try FSD for robotic manipulation!
- Clone this repository and navigate to the Embodied-FSD folder

```bash
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
```

- Install the package

```bash
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
# we recommend transformers==4.31.0
pip install transformers==4.31.0
```

- Install additional packages for training

```bash
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
```
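After installation, a quick check that the key packages import with the expected versions can save time later. The snippet below is a minimal sketch, not part of the repository:

```python
# Minimal environment sanity check (sketch; adjust to your setup).
import torch
import transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)  # 4.31.0 is recommended above

try:
    import flash_attn  # only needed for training
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed (only required for training)")
```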
Task instruction: Move the yellow block in the middle of the table.
Before prediction (original image):
Run the example code:

```bash
cd Embodied-FSD/
python affordance_point_inference_example.py
```

After prediction (visualization result):
Task instruction: put carrot on plate.
Before prediction (original image):
Run the example code:

```bash
cd Embodied-FSD/
python visual_trace_inference_example.py
```

After prediction (visualization result):
We mainly use the LLaVA and ASMv2 codebases to develop FSD. We appreciate these excellent works. The training process of FSD is divided into two stages: the first stage focuses on embodied reasoning and general spatial reasoning, while the second stage focuses on visual aids generation.
Please download the constituent datasets listed below and organize them under `./data`:
- COCO: train2017, train2014
- GQA: images
- OCR-VQA: download script; we save all files as `.jpg`
- TextVQA: train_val_images
- VisualGenome: part1, part2
- CLEVR_v1.0: images
- Visual7W: images
- Flickr30K: images
- SA-1B: images (Only sa_000000-sa_000003)
- st_vqa(cauldron,llava_format), raven(cauldron), vsr(cauldron,llava_format), CLEVR-Math(MathV360K), Super-CLEVR(MathV360K): FSD-Dataset (derived from LLaVA-OneVision-Data)
- kitti, 2d3ds: FSD-Dataset (derived from SpatialQA)
- object_ref, region_ref: FSD-Dataset (derived from RoboPoint)
- bridge_data_v2: images (derived from BridgeDataV2)
- droid: FSD-Dataset (derived from DROID)
- rtx: FSD-Dataset (derived from Open X-Embodiment)
After downloading all datasets, organize the data as follows in ./data:
```
├── coco
│   ├── train2014
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
├── vg
│   ├── VG_100K
│   └── VG_100K_2
├── CLEVR_v1.0
│   └── images
├── Visual7W
│   └── images
├── flickr30k
│   └── images
├── sam
│   ├── sa_000000
│   ├── sa_000001
│   ├── sa_000002
│   └── sa_000003
├── st_vqa(cauldron,llava_format)
├── raven(cauldron)
├── vsr(cauldron,llava_format)
├── CLEVR-Math(MathV360K)
├── Super-CLEVR(MathV360K)
├── SAT_images
├── kitti
├── 2d3ds
├── bridge_data_v2
│   ├── bridge_data_v1
│   ├── bridge_data_v2
│   ├── flap
│   ├── rss
│   └── icra
├── droid
│   ├── ILIAD+j807b3f8+2023-05-11-17h-34m-39s
│   └── ...
├── rtx
│   ├── fractal20220817_data
│   ├── ucsd_kitchen_dataset_converted_externally_to_rlds
│   ├── jaco_play
│   └── ucsd_pick_and_place_dataset_converted_externally_to_rlds
├── object_ref
└── region_ref
```
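Before launching training, it can be handy to verify that the layout above is in place. The snippet below is a small sketch (not part of the repository) that checks the expected top-level folders under `./data`:

```python
# Check that the expected dataset folders exist under ./data (sketch).
from pathlib import Path

DATA_ROOT = Path("./data")
EXPECTED = [
    "coco/train2014", "coco/train2017", "gqa/images", "ocr_vqa/images",
    "textvqa/train_images", "vg/VG_100K", "vg/VG_100K_2", "CLEVR_v1.0/images",
    "Visual7W/images", "flickr30k/images", "sam/sa_000000",
    "st_vqa(cauldron,llava_format)", "raven(cauldron)", "vsr(cauldron,llava_format)",
    "CLEVR-Math(MathV360K)", "Super-CLEVR(MathV360K)", "SAT_images",
    "kitti", "2d3ds", "bridge_data_v2", "droid", "rtx", "object_ref", "region_ref",
]

missing = [p for p in EXPECTED if not (DATA_ROOT / p).exists()]
print("All expected folders found." if not missing else f"Missing: {missing}")
```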
In this stage, we train the model to enhance its spatial reasoning ability. FSD is fine-tuned from ASMv2.
The JSON data used in Stage 1: Dataset Link
```bash
# Stage 1: spatial reasoning
bash scripts_fsd/stage1-fsd.sh
```

In the second stage, we enhance the model with robotic manipulation data and advanced visual aids generation.
The JSON data used in Stage 2: Dataset Link
```bash
# Stage 2: visual aids generation
bash scripts_fsd/stage2-fsd.sh
```

- Level-1-2-3-Dataset: FSD spatial reasoning dataset
- Level-4-5-Dataset: FSD visual aids generation dataset
As in ASMv2, our dataset uses `<ref></ref>` tags to annotate target objects and dedicated relation tags to annotate spatial relations. Each bounding box is normalized to integer values in the range [0, 1000). Note: during training and when outputting coordinates, the image is first padded to a square, and the normalized coordinates are expressed on that square image. Pay special attention to this conversion.
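The sketch below illustrates this convention; the function names are ours, and the even split of the padding on both sides of the shorter dimension is an assumption (adapt it to the exact padding used in your preprocessing):

```python
# Sketch of the coordinate convention: pad the image to a square,
# then express coordinates as integers in [0, 1000) on the padded image.

def to_model_coords(x, y, width, height):
    """Pixel (x, y) on the original image -> normalized coords on the padded square."""
    side = max(width, height)
    # Assumption: padding is split evenly on both sides of the shorter dimension.
    pad_x = (side - width) / 2
    pad_y = (side - height) / 2
    return int((x + pad_x) / side * 1000), int((y + pad_y) / side * 1000)

def to_pixel_coords(nx, ny, width, height):
    """Normalized coords on the padded square -> pixel (x, y) on the original image."""
    side = max(width, height)
    pad_x = (side - width) / 2
    pad_y = (side - height) / 2
    return nx / 1000 * side - pad_x, ny / 1000 * side - pad_y

# Example: the centre of a 640x480 image maps to (500, 500) and back.
print(to_model_coords(320, 240, 640, 480))   # -> (500, 500)
print(to_pixel_coords(500, 500, 640, 480))   # -> (320.0, 240.0)
```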
We used the lmms-eval framework to complete the evaluation of all benchmarks, and we are grateful for their outstanding work!
`vabench_point_dataset.parquet` and `vabench_visual_trace_dataset.parquet` are used for VABench-Point and VABench-Visual Trace, respectively. In the parquet files, the `instruction` column contains the task instructions, the `images` column contains the images, and the `answer` column contains the answers.

For VABench-Point, accuracy is the proportion of predicted points that fall within the answer bounding boxes. For VABench-Visual Trace, we compute the MAE and RMSE between the predicted and ground-truth trajectories. To ensure a fair comparison across images of different sizes, both predictions and ground truth are converted to the 0-1000 normalized coordinate system of the padded image (FSD predictions are already in this format, so no conversion is needed for them).
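For reference, the two metrics can be computed along these lines; this is a sketch that assumes the predicted and ground-truth trajectories have already been aligned to the same number of points and converted to the 0-1000 padded-image coordinate system:

```python
import numpy as np

def point_accuracy(pred_points, gt_boxes):
    """VABench-Point: fraction of predicted points falling inside their answer boxes.
    pred_points: (N, 2) array of (x, y); gt_boxes: (N, 4) array of (x1, y1, x2, y2)."""
    pred_points, gt_boxes = np.asarray(pred_points), np.asarray(gt_boxes)
    inside = (
        (pred_points[:, 0] >= gt_boxes[:, 0]) & (pred_points[:, 0] <= gt_boxes[:, 2])
        & (pred_points[:, 1] >= gt_boxes[:, 1]) & (pred_points[:, 1] <= gt_boxes[:, 3])
    )
    return inside.mean()

def trace_errors(pred_trace, gt_trace):
    """VABench-Visual Trace: MAE and RMSE between aligned trajectories of equal length."""
    diff = np.asarray(pred_trace, dtype=float) - np.asarray(gt_trace, dtype=float)
    mae = np.abs(diff).mean()
    rmse = np.sqrt((diff ** 2).mean())
    return mae, rmse
```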
We have also provided a method for evaluating visual trace generation using LLM-based evaluation.
Step 1️⃣: Get Model Output
Task instruction: Put the orange object inside the basket.
FSD output:
<Description>The image shows an <ref>orange object</ref><box>[[622, 424, 763, 583]]</box> sitting in a blue sink. To the left of the sink is a <ref>yellow dish rack</ref><box>[[19, 174, 494, 477]]</box>. A white spatula is positioned in front of the orange object.\n
</Description>
<Reasoning>\nTo move the orange object into the yellow dish rack, start by identifying the current position of the orange object at <point>[[694, 540]]</point>. \nLift the object slightly upwards and to the left, moving towards the dish rack. \nThe path should curve gently to avoid any obstacles, passing through intermediate points like <point>[[639, 440]]</point> and <point>[[513, 340]]</point>. \nFinally, lower the object into the dish rack, ending at the target position <box>[[213, 273, 339, 419]]</box> with the final point at <point>[[257, 390]]</point>.
</Reasoning>
<Answer>The visual trace for placing the orange object into the yellow dish rack is \n<point>[[694, 540], [682, 515], [639, 440], [597, 377], [513, 340], [419, 330], [337, 343], [257, 390]]</point>.
</Answer>
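To reuse such an output for visualization or metric computation, the trajectory can be pulled out of the `<Answer>` block with a simple regex. The helper below is a sketch; the repository may parse the output differently:

```python
import ast
import re

def parse_trace(model_output: str):
    """Return the list of [x, y] points from the <Answer> ... <point>[[...]]</point> block."""
    answer = re.search(r"<Answer>(.*?)</Answer>", model_output, re.DOTALL)
    section = answer.group(1) if answer else model_output
    match = re.search(r"<point>(\[\[.*?\]\])</point>", section, re.DOTALL)
    return ast.literal_eval(match.group(1)) if match else []

output = "<Answer>... <point>[[694, 540], [682, 515], [639, 440]]</point>.</Answer>"
print(parse_trace(output))  # [[694, 540], [682, 515], [639, 440]]
```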
Step 2️⃣: Visualization
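As an illustration of what this step produces, a trajectory (already converted back to pixel coordinates, see the coordinate note above) can be drawn over the image with a red circle at the start and a blue diamond at the end, matching the convention used in the evaluation prompt below. This is a sketch; the repository's own visualization code may differ:

```python
import matplotlib.pyplot as plt
from PIL import Image

def draw_trace(image_path, trace, out_path="trace_vis.png"):
    """Overlay a predicted trace: red circle = start, blue diamond = end (pixel coords)."""
    img = Image.open(image_path)
    xs, ys = zip(*trace)
    plt.imshow(img)
    plt.plot(xs, ys, color="lime", linewidth=2)             # the path itself
    plt.scatter(xs[0], ys[0], c="red", marker="o", s=80)    # start point
    plt.scatter(xs[-1], ys[-1], c="blue", marker="D", s=80) # end point
    plt.axis("off")
    plt.savefig(out_path, bbox_inches="tight")
    plt.close()
```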
Step 3️⃣: LLM Evaluation
We feed the LLM a prompt containing the task instruction together with the visualized image, and have the LLM perform the scoring.
Here is the prompt:
You are an expert evaluator in robotic manipulation and visual reasoning. Your job is to assess the quality of predicted trajectories based on task instructions and visual inputs.
You are given:
- A task instruction describing an object manipulation task.
- An image showing a predicted trajectory.
**Note:**
- In the image, the red circle indicates the start point, and the blue diamond indicates the end point.
- The trajectory represents the predicted movement path of the manipulated object, not the robot or end-effector.
- You should **evaluate the predicted trajectory as a proposed motion for the object that is supposed to be moved**, based on the task instruction, **not based on the static positions of objects in the image**. The objects have not actually moved.
**Evaluation Criteria (listed in order of importance):**
1. **Task Alignment and Success (most important)**
- Does the trajectory clearly and accurately fulfill the task instruction?
- **The trajectory must start at the correct location and end at a target position that aligns with the task goal.**
- Large deviations in the starting or ending point (e.g., wrong object, wrong destination, or stopping short of the goal) should result in a low score, even if the rest of the trajectory is smooth.
- If the task is not accomplished (due to incorrect goal interpretation or spatial execution), the score should be low regardless of other qualities.
2. **Feasibility**
- Is the movement physically plausible, smooth, and continuous?
- Are there any unrealistic discontinuities, sharp turns, or impossible transitions?
- Even if the movement is feasible, it should not receive a high score if the task is not completed.
3. **Obstacle Avoidance / Safety**
- Does the trajectory reasonably avoid collisions with surrounding objects?
- Minor risks may be tolerated if the task is completed successfully, but major or clear collisions should reduce the score.
**Scoring Guideline:**
- If the task is **not accomplished**, or if the start or end point is significantly incorrect, the score should typically be **4 or below**.
- If the task is completed but the trajectory has issues (e.g., roughness, minor risk of collision), a score in the **6-8** range is appropriate.
- A **score of 9-10** should be given only when the trajectory clearly completes the task, with good start/end accuracy, smooth motion, and reasonable safety.
Based on these criteria, provide a single overall score from 1 (very poor) to 10 (excellent), reflecting how well the task is accomplished.
Respond strictly in the following format:
Score: <1-10>
Explanation: <brief justification>
The task instruction is:
{task_instruction}
Please give your response.
LLM Output:
Score: 10
Explanation: The trajectory starts at the orange object and ends inside the basket, accurately fulfilling the instruction to put the orange object inside the basket. The path is smooth, continuous, and physically plausible, with no sharp turns or unrealistic movements. There is no significant risk of collision with other objects. Task is fully accomplished with good safety and feasibility.
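As an illustration (not the repository's gpt_score_example.py), a scoring call with an OpenAI-style chat API might look like the sketch below; the model name and file paths are placeholders:

```python
import base64
from openai import OpenAI  # assumes the `openai` package and an OPENAI_API_KEY are set up

# EVAL_PROMPT is the evaluation prompt shown above, with {task_instruction}
# kept as a format placeholder (paste it in or load it from a file).
EVAL_PROMPT = "..."

client = OpenAI()

with open("trace_vis.png", "rb") as f:  # the visualization image from Step 2
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; use whichever vision-capable judge you prefer
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": EVAL_PROMPT.format(task_instruction="Put the orange object inside the basket.")},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)  # e.g. "Score: 10\nExplanation: ..."
```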
The complete visualization and evaluation code can be run as follows:

```bash
cd Embodied-FSD-Github/visual_trace_llm_score
python gpt_score_example.py
```

For SimplerEnv evaluation, please refer to the SIMPLERENV FSD branch.
We sincerely thank the following outstanding open-source projects and research works, which have provided an important foundation and support for the development of FSD:
This project is licensed under the Apache 2.0 License. For details, please see the LICENSE file.
If you use FSD in your research, please cite our paper:
```bibtex
@misc{yuan2025seeingdoingbridgingreasoning,
  title={From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation},
  author={Yifu Yuan and Haiqin Cui and Yibin Chen and Zibin Dong and Fei Ni and Longxin Kou and Jinyi Liu and Pengyi Li and Yan Zheng and Jianye Hao},
  year={2025},
  eprint={2505.08548},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2505.08548},
}
```






