# 🛋️ Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities
This repository provides the code and instructions for the COnsistent Multilingual Frame Of Reference Test (COMFORT), an evaluation protocol for systematically assessing the spatial reasoning capabilities of vision-language models (VLMs). Follow the steps below to set up the environment, prepare the data (generating it yourself is optional), and run the experiments and evaluations. Feel free to open an issue if you encounter any problems. We also welcome pull requests.
- Setup Environment
- Prepare Data
- Add API Credentials
- Run Experiments
- Run Evaluations
- Evaluate More Models
- Common Problems and Solutions

## Setup Environment

Clone the repository and create a conda environment using the provided `environment.yml` file:

```bash
git clone https://github.com/sled-group/COMFORT.git
cd COMFORT
conda env create -f environment.yml
```
After creating the environment, activate it:

```bash
conda activate comfort
```
Then, install the editable packages:

```bash
pip install -e models/GLAMM
pip install -e models/llava
pip install -e models/InternVL/internvl_chat
```
Alternatively, you can use Poetry to set up the environment.

## Prepare Data

First, make a data directory:

```bash
mkdir data
```

Download and unzip the datasets:

```bash
wget https://huggingface.co/datasets/sled-umich/COMFORT/resolve/main/comfort_ball.zip?download=true -O data/comfort_ball.zip
unzip data/comfort_ball.zip -d data/
wget https://huggingface.co/datasets/sled-umich/COMFORT/resolve/main/comfort_car_ref_facing_left.zip?download=true -O data/comfort_car_ref_facing_left.zip
unzip data/comfort_car_ref_facing_left.zip -d data/
wget https://huggingface.co/datasets/sled-umich/COMFORT/resolve/main/comfort_car_ref_facing_right.zip?download=true -O data/comfort_car_ref_facing_right.zip
unzip data/comfort_car_ref_facing_right.zip -d data/
```
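
If you prefer a programmatic download, the same zips can be fetched from the Hugging Face Hub with the `huggingface_hub` Python package. This is an optional convenience sketch, not part of the repository's scripts; it assumes `huggingface_hub` is installed (`pip install huggingface_hub`).

```python
# Optional alternative to the wget/unzip commands above (not part of the repo's scripts).
# Downloads the three dataset zips from the Hugging Face Hub and extracts them into data/.
import zipfile

from huggingface_hub import hf_hub_download

DATASET_ZIPS = [
    "comfort_ball.zip",
    "comfort_car_ref_facing_left.zip",
    "comfort_car_ref_facing_right.zip",
]

for name in DATASET_ZIPS:
    zip_path = hf_hub_download(
        repo_id="sled-umich/COMFORT",
        repo_type="dataset",
        filename=name,
        local_dir="data",
    )
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall("data")
```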

Optionally, you can generate the dataset yourself instead of downloading it:

```bash
pip install gdown
python download_assets.py
chmod +x generate_dataset.sh
./generate_dataset.sh
```

## Add API Credentials

Create the API keys file:

```bash
touch comfort_utils/model_utils/api_keys.py
```

- Prepare OpenAI and DeepL API keys and add them to `api_keys.py`:

  ```python
  APIKEY_OPENAI = "<YOUR_API_KEY>"
  APIKEY_DEEPL = "<YOUR_API_KEY>"
  ```

- Prepare Google Cloud Translate API credentials (a `.json` key file).
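
Optionally, you can sanity-check the translation credentials before running the multilingual experiments. The snippet below is a standalone sketch, not part of the repository's scripts: it assumes the `deepl` and `google-cloud-translate` Python packages are installed, that it is run from the repository root so `comfort_utils` is importable, and that `GOOGLE_APPLICATION_CREDENTIALS` points at your `.json` key file.

```python
# Standalone sanity check for the translation credentials (illustrative only).
# Assumes: pip install deepl google-cloud-translate, and that
# GOOGLE_APPLICATION_CREDENTIALS is set to your .json key file.
import deepl
from google.cloud import translate_v2

from comfort_utils.model_utils.api_keys import APIKEY_DEEPL

sentence = "The ball is to the left of the car."

# DeepL: translate a short English sentence to German.
deepl_result = deepl.Translator(APIKEY_DEEPL).translate_text(sentence, target_lang="DE")
print("DeepL:", deepl_result.text)

# Google Cloud Translate: same sentence, authenticated via the service-account file.
google_result = translate_v2.Client().translate(sentence, target_language="de")
print("Google:", google_result["translatedText"])
```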

## Run Experiments

Run the English experiments:

```bash
./run_english_ball_experiments.sh
./run_english_car_left_experiments.sh
./run_english_car_right_experiments.sh
```

Run the multilingual experiments (point `GOOGLE_APPLICATION_CREDENTIALS` to your Google Cloud credentials file first):

```bash
export GOOGLE_APPLICATION_CREDENTIALS="your_google_application_credentials_path.json"
./run_multilingual_ball_experiments.sh
./run_multilingual_car_left_experiments.sh
./run_multilingual_car_right_experiments.sh
```

## Run Evaluations

- Preferred Coordinate Transformation (Table 2 & Table 7):

  ```bash
  python gather_results.py --mode cpp --cpp convention
  ```

- Preferred Frame of Reference (Table 3 & Table 8):

  ```bash
  python gather_results.py --mode cpp --cpp preferredfor
  ```

- Perspective Taking (Table 4 & Table 9):

  ```bash
  python gather_results.py --mode cpp --cpp perspective
  ```

- Comprehensive Evaluation (Table 5):

  ```bash
  python gather_results.py --mode comprehensive
  ```

- Multilingual Evaluation:

  ```bash
  python gather_results_multilingual.py
  ```

After evaluation completes:

```bash
cd results/eval
python eval_multilingual_preferredfor_raw.py
```

## Evaluate More Models

To evaluate additional models, we refer to the existing Model Wrapper implementations in this repository; new models should follow the same interface.
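
As a rough illustration of what a wrapper typically provides, the sketch below exposes a model behind a single (image, prompt) → text interface. The class and method names are hypothetical and are not this repository's actual wrapper API; check the existing wrapper code for the exact signatures to implement.

```python
# Hypothetical wrapper sketch -- names and signatures are illustrative only;
# mirror the interface used by the existing wrappers in this repository instead.
from PIL import Image


class MyVLMWrapper:
    """Minimal shape of a VLM wrapper: load the model once, then answer (image, prompt) queries."""

    def __init__(self, model_name: str, device: str = "cuda"):
        self.model_name = model_name
        self.device = device
        # Load your model and processor here (e.g., via transformers).

    def generate(self, image: Image.Image, prompt: str) -> str:
        """Return the model's free-form answer to `prompt` about `image`."""
        # Replace this placeholder with the actual model call.
        return f"[{self.model_name}] response to: {prompt}"


if __name__ == "__main__":
    wrapper = MyVLMWrapper("my-vlm")
    dummy_image = Image.new("RGB", (224, 224))  # placeholder image just to exercise the interface
    print(wrapper.generate(dummy_image, "Is the red ball to the left of the blue ball?"))
```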

## Common Problems and Solutions

- `ImportError: libcupti.so.11.7: cannot open shared object file: No such file or directory`

  Reinstall PyTorch built against CUDA 11.8:

  ```bash
  pip install torch==2.0.0+cu118 torchvision==0.15.1+cu118 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu118
  ```

## Citation

```bibtex
@misc{zhang2024visionlanguagemodelsrepresentspace,
      title={Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities},
      author={Zheyuan Zhang and Fengyuan Hu and Jayjun Lee and Freda Shi and Parisa Kordjamshidi and Joyce Chai and Ziqiao Ma},
      year={2024},
      eprint={2410.17385},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.17385},
}
```