Weizhen Wang, Chenda Duan, Zhenghao Peng, Yuxin Liu, Bolei Zhou
This is the official repository for Embodied Scene Understanding for Vision Language Models via MetaVQA from CVPR 2025. It contains the necessary toolkit for creating this benchmark, including both VQA datasets and closed-loop challenges.
Clone the repository and create a virtual/Conda environment with Python 3.11:
$ git clone
$ cd MetaVQA
$ conda create -n metavqa python=3.11 -y
$ conda activate metavqa
Install the MetaDrive dependencies by running
$ pip install -e .
MetaVQA needs some extra packages. You can install them with
$ pip install PyYAML
$ pip install imageio[ffmpeg]
$ pip install scipy
Once the previous steps are finished, use the following command to verify the installation:
$ python -m metadrive.examples.drive_in_single_agent_env
For visually diverse simulation environments, download and unzip asset_v0.0.4.zip and adj_parameter_folder_v0.0.4.zip from this link. Move the test folder inside asset_v0.0.4.zip into metadrive/assets/models. Note that the vanilla MetaDrive assets must be pulled first; this happens automatically when you run the verification command above.
You should have the following file structure
-MetaVQA
-metadrive
-assets
-models
-test/*
Lastly, modify the path_config.yaml by overwriting
...
# Specify location of the asset within metadrive, download "asset-model.zip" from github release and put it at corresponding location.
metadriveasset: <absolute path to MetaVQA's parent folder>/MetaVQA/metadrive/assets/models/test
# The parent path for the subfolders below
parentfolder: <absolute path to the parameter folder>/adj_parameter_folder_v0.0.4
...
You can verify the installation of the additional assets by running
$ python -m metadrive.examples.drive_in_real_env_diverse
Since the MetaVQA dataset partially leverages the nuScenes dataset, we provide a brief tutorial on setting it up. Go to the nuScenes official webpage to download the dataset. Additionally, this website provides details on the dataset composition.
Much of the data collection is done using the nuScenes-Devkit. We recommend creating a dedicated virtual environment:
$ conda create -n nusc python=3.7 -y
$ conda activate nusc
$ pip install nuscenes-devkit
In case of confusion, check out the devkit's implementation here.
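For a quick sanity check that the devkit can read your nuScenes download, here is a minimal sketch; the dataroot below is a placeholder pointing at wherever you placed the dataset.

from nuscenes.nuscenes import NuScenes

# Placeholder path/version; point dataroot at your nuScenes download.
nusc = NuScenes(version='v1.0-trainval', dataroot='nusc', verbose=True)
print(len(nusc.scene), "scenes loaded")  # each entry is a scene record
nusc.list_scenes()                       # prints a one-line summary per scene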
We prepared two distinct pipelines for curating real-world scenarios and simulator-rendered scenarios. Check out scripts_cvpr/scene_gen/nusc_real.sh and scripts_cvpr/scene_gen/waymo_sim.sh for examples. You can also see sample_scripts/test_scengen.sh.
Please download the nuScenes dataset via the official website if you want to utilize nuScenes scenarios for VQA generation. For the Waymo Open Motion Dataset (WOMD), you can refer to ScenarioNet to pre-process the tfrecords into pkl files compatible with the MetaDrive simulator.
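For the WOMD route, a rough sketch of the ScenarioNet conversion step is shown below; the module name and flags follow ScenarioNet's README at the time of writing and may differ in your installed version, so treat them as assumptions and consult the ScenarioNet documentation.

$ python -m scenarionet.convert_waymo -d /path/to/scenarionet_waymo_db --raw_data_path /path/to/womd/tfrecords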
We utilize the nuScenes-Devkit tools to prepare nuScenes scenarios. You can check out vqa/scenegen/nusc_devkit_annotation.py for implementation details. Supposing you downloaded the nuScenes dataset into the nusc folder, you should overwrite vqa/scenegen/macros.py with
NUSC_PATH = 'nusc'
NUSC_VERSION = 'v1.0-trainval'
The output will have the following structure:
nusc_scenarios/
|-scene-0510_0_40/
| |-1_0/
| | |-world_1_0.json (the recorded scene graph)
| | |-rgb_front_1_0.json (CAM_FRONT RGB image)
| | |-mask_front_1_0.png (instance segmentation mask, in boxes)
| | |-id2corner_1_0.json (maps an object id in the scene graph to 2D pixel coordinates)
| | |-id2c_1_0.json (maps an object id to an instance color, with floats rounded to 5 decimal places)
| |-1_1/
| |-...
| |-<seed_id>_<keyframe_idx>/
|-scene-0515_0_40/*
|...
|-<scene_id>_<keyframe_start>_<keyframe_end>
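To make the per-keyframe files concrete, below is a rough sketch that loads one keyframe folder and recovers the pixels of a single object from the instance mask. The file names follow the layout above, but the JSON schemas (in particular that id2c stores per-object RGB colors as floats in [0, 1]) are assumptions, so treat this purely as an illustration.

import json
import numpy as np
import imageio.v2 as imageio

# Hypothetical keyframe folder following the layout above.
frame_dir = "nusc_scenarios/scene-0510_0_40/1_0"

world = json.load(open(f"{frame_dir}/world_1_0.json"))    # recorded scene graph
id2c = json.load(open(f"{frame_dir}/id2c_1_0.json"))      # object id -> instance color
mask = imageio.imread(f"{frame_dir}/mask_front_1_0.png")  # instance segmentation mask

# Assuming id2c maps each object id to an (r, g, b) color with floats in [0, 1],
# recover the pixels of one object by color matching against the mask.
some_id = next(iter(id2c))
color = np.round(np.array(id2c[some_id]) * 255).astype(np.uint8)
object_pixels = np.all(mask[..., :3] == color, axis=-1)
print(some_id, "covers", int(object_pixels.sum()), "pixels")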
Check out vqa/scenegen/metadrive_annotation.py for more details.
Note that nuScenes scenarios can also be converted to this format, so you can essentially create a "digital twin" of the same traffic layout. Check out vqa/scenegen/nusc_metadrive_annotation.py for paired aggregation. You have to collect the camera data first in order to use it; see vqa/scenegen/macros.py (replace the example absolute paths with your own):
PAIRING_PATH = "/bigdata/weizhen/data_nusc_multiview_trainval.json"
NUSCENES_SN_PATH = "/bigdata/datasets/scenarionet/nuscenes/trainval"
Check out vqa/vqagen/set_of_marks.py for implementation details. You can freely specify the annotation style (bounding boxes vs. contours vs. masks, etc.). Note that Set-of-Marks (SoM) annotation will be applied automatically during the VQA generation process.
Check out sample_scripts/test_vqagen.sh for sample code and vqa/vqa_gen/static_question_generation.py for implementation details. All the question templates are defined in vqa/vqa_gen/questions_templates.json.
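If you want a quick look at the available question types before generating VQAs, here is a small sketch that just lists the template names; it assumes the top level of questions_templates.json is a mapping from template name to template definition.

import json

# Path from the section above; the internal template structure is an assumption.
with open("vqa/vqa_gen/questions_templates.json") as f:
    templates = json.load(f)

print(len(templates), "question templates")
for name in list(templates)[:10]:  # show the first few template names
    print(" -", name)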
The Training-Validation and Testing sets used in our CVPR 2025 paper have been released on Hugging Face as JSON files. See the table below.
| Split | URL | Size (#VQAs) |
|---|---|---|
| Train-Val | https://huggingface.co/datasets/Weizhen011210/MetaVQA-Train | 150 K |
| Test | https://huggingface.co/datasets/Weizhen011210/MetaVQA-Eval | 9,375 |
A much larger version will be released soon.
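You can fetch the JSON files through the web UI or programmatically; here is a minimal sketch using the huggingface_hub package (install it with pip install huggingface_hub if needed; the local directory name is a placeholder):

from huggingface_hub import snapshot_download

# Download the test split repository; local_dir is a placeholder of your choosing.
local_path = snapshot_download(
    repo_id="Weizhen011210/MetaVQA-Eval",
    repo_type="dataset",
    local_dir="metavqa_eval",
)
print("Downloaded to", local_path)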
To evaluate your VLM's performance on the test set, simply download the dataset from the link above (suppose you name it test.json) and prepare your generated responses in a single JSON file (say, response.json) with the following structure:
{
"0": {
"question": "Suppose our current speed is moderate(10-30 mph), and we perform action \"BRAKE\" for 2.0 seconds. How far will we end up from our current position? Select the best option from: (A) Very close(0-2m); (B) Close(2-10m); (C) Medium(10-30m); (D) Far(30m-).",
"answer": "B",
"model_response": "B",
"explanation": "",
"type": "embodied_distance",
"objects": [],
"world": ["/bigdata/weizhen/metavqa_cvpr/scenarios/nusc_real/scene-0042_0_40/14_19"],
"obs": "/data_weizhen/metavqa_cvpr/datasets/test/test/obs/0.png"
},
"1":...
}
In this example, "0" is the <qid> recorded in test.json, and all fields other than model_response are likewise taken from test.json. We keep this additional meta-information for collecting statistics; it will not impact your model's test accuracy.
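As a reference for producing response.json, here is a rough sketch that walks test.json, queries your model, and fills in model_response; my_vlm_answer is a hypothetical stand-in for whatever inference call your VLM exposes.

import json

def my_vlm_answer(question: str, image_path: str) -> str:
    # Hypothetical placeholder: run your own VLM here and return a single
    # option letter such as "A", "B", "C", or "D".
    raise NotImplementedError

with open("test.json") as f:
    test_set = json.load(f)

responses = {}
for qid, record in test_set.items():
    answer = my_vlm_answer(record["question"], record["obs"])
    # Keep the original fields and attach the model's response.
    responses[qid] = {**record, "model_response": answer}

with open("response.json", "w") as f:
    json.dump(responses, f, indent=2)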
To calculate the test accuracy, simply modify the first lines of vqa/eval/analyze_response.py:
IGNORED_Q_TYPE = ["describe_scenario"] #describe_scenario is only used for training
path_template = <path to "response.json">
merged_path = <arbitrary path A>
stat_path = <arbitrary path B>
domained_path = <path to "test.json">
and run python -m vqa.eval.analyze_response. You will find the collected statistics in <arbitrary path B>.
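If you just want a quick sanity check of the overall accuracy before running the full analysis, here is a small sketch that naively compares answer against the first character of model_response in response.json, skipping the ignored question type; the official script parses responses more carefully, so numbers may differ slightly.

import json

IGNORED_Q_TYPE = {"describe_scenario"}  # only used for training, as above

with open("response.json") as f:
    responses = json.load(f)

graded = [r for r in responses.values() if r["type"] not in IGNORED_Q_TYPE]
correct = sum(
    r["model_response"].strip().upper()[:1] == r["answer"].strip().upper()
    for r in graded
)
print(f"accuracy: {correct / len(graded):.3f} over {len(graded)} questions")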
We prepared 60 real-world scenarios as well as 60 safety-critical scenarios for closed-loop evaluation. You can find them in closed_loop/assets/scenarios
The following additional packages have been validated on our Ubuntu 24.04.1 LTS machine with
NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6
$ pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
$ pip install transformers==4.45.2 #(for compatibility with InternVL2 models)
$ pip install einops==0.8.1 #(for using pre-trained InternVL2 models)
$ pip install timm==1.0.19 #(for using pre-trained InternVL2 models)
$ pip install sentencepiece==0.2.0 #(for using pre-trained InternVL2 models)
$ pip install flash-attn==2.8.2 --no-build-isolation #(for using pre-trained InternVL2 models)
You can define your own load_model and inference functions with the following signatures:
def load_model(*args, **kwargs):
    """
    You are free to modify the arguments, but you have to return:
        model: AutoModel, processor: AutoProcessor, tokenizer: AutoTokenizer
    """
    ...
    return model, processor, tokenizer
def inference(*args, **kwargs):
    """
    Generate responses based on the current observation and navigation prompt.
    Return:
        response: str, the generated token sequence. Parsing will be taken care of later.
    """
    ...
    return response
Once these methods are defined, simply run python -m closed_loop.closed_loop_benchmark to evaluate your model in the closed-loop driving task. We've prepared sample scripts in scripts_cvpr/closed_loops for illustration; check out sample_scripts/closed_loop_pretrained*.sh for examples. A reference sketch of the two functions follows.
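Below is a rough sketch of the two functions using Qwen2-VL as the backbone; the model name, prompt formatting, and generation settings are placeholders and differ across VLMs (InternVL2, for instance, ships its own chat interface), so adapt it to whatever model you benchmark.

import torch
from PIL import Image
from transformers import AutoProcessor, AutoTokenizer, Qwen2VLForConditionalGeneration

# Placeholder backbone; swap in the VLM you actually want to evaluate.
MODEL_NAME = "Qwen/Qwen2-VL-7B-Instruct"

def load_model():
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(MODEL_NAME)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    return model, processor, tokenizer

def inference(model, processor, tokenizer, image_path, prompt):
    image = Image.open(image_path).convert("RGB")
    conversation = [{
        "role": "user",
        "content": [{"type": "image"}, {"type": "text", "text": prompt}],
    }]
    text = processor.apply_chat_template(conversation, add_generation_prompt=True)
    inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=64)
    # Decode only the newly generated tokens, not the prompt.
    new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]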
You can find fine-tuned checkpoints for the models referred to in the paper here. You can simply load them using the Transformers package and run inference following the same paradigm as their base models; a loading sketch follows the table below.
| Table | Code Name | URL |
|---|---|---|
| 4 | Qwen2-finetuned | https://huggingface.co/Weizhen011210/Qwen2-VL_MetaVQA-Train |
| 4 | Llama3.2-finetuned | https://huggingface.co/Weizhen011210/Llama3.2_MetaVQA-Train |
| 5 | Qwen2-tuned | https://huggingface.co/Weizhen011210/Qwen2-VL_MetaVQA-Closed-Loop |
| 5 | Llama3.2-tuned | https://huggingface.co/Weizhen011210/Llama3.2_MetaVQA-Closed-Loop |
| 5 | InternVL2-4B-tuned | https://huggingface.co/Weizhen011210/InternVL2-4B_MetaVQA-Closed-Loop |
| 5 | InternVL2-8B-tuned | https://huggingface.co/Weizhen011210/InternVL2-8B_MetaVQA-Closed-Loop |
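As an illustration, here is a minimal sketch of loading one of the fine-tuned checkpoints (the Table 4 Qwen2-VL model); whether the checkpoint repository also ships the processor/tokenizer configs is an assumption, so fall back to the corresponding base model if loading them fails.

import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Fine-tuned checkpoint from the table above.
ckpt = "Weizhen011210/Qwen2-VL_MetaVQA-Train"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    ckpt, torch_dtype=torch.bfloat16, device_map="auto"
)
# Assumption: processor files are included in the checkpoint repo; if not,
# load the processor from the base Qwen2-VL model instead.
processor = AutoProcessor.from_pretrained(ckpt)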
MetaVQA is built on top of the MetaDrive simulator. Safety-critical scenarios are generated using CAT.
If you find our work useful, please cite as follows:
@inproceedings{wang2025metavqa,
title={Embodied Scene Understanding for Vision Language Models via MetaVQA},
author={Wang, Weizhen and Duan, Chenda and Peng, Zhenghao and Liu, Yuxin and Zhou, Bolei},
booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2025},
}