# MLCD-Embodied 🤖

Model weights: Hugging Face (DeepGlint-AI/MLCD-Embodied-7B)

MLCD-Embodied is comparable to GPT-4V in embodied capabilities and also possesses strong general capabilities. Detailed evaluation results are shown below.


### Embodied Ability Evaluation: Performance on RoboVQA and OpenEQA

| Benchmark | Metric | MLCD-Embodied-7B | LLaVA OneVision-7B | GPT-4V | RoboMamba |
| --- | --- | --- | --- | --- | --- |
| RoboVQA | BLEU1 | 73.16 | 38.12 | - | 54.9 |
| RoboVQA | BLEU2 | 66.39 | 33.56 | - | 44.2 |
| RoboVQA | BLEU3 | 60.61 | 31.76 | - | 39.5 |
| RoboVQA | BLEU4 | 56.56 | 30.97 | - | 36.3 |
| OpenEQA | OBJECT-STATE-RECOGNITION | 71.83 | - | 63.2 | - |
| OpenEQA | OBJECT-RECOGNITION | 49.46 | - | 43.4 | - |
| OpenEQA | FUNCTIONAL-REASONING | 54.38 | - | 57.4 | - |
| OpenEQA | SPATIAL-UNDERSTANDING | 48.64 | - | 33.6 | - |
| OpenEQA | ATTRIBUTE-RECOGNITION | 67.08 | - | 57.2 | - |
| OpenEQA | WORLD-KNOWLEDGE | 53.87 | - | 50.7 | - |
| OpenEQA | OBJECT-LOCALIZATION | 43.06 | - | 42.0 | - |

### General Ability Evaluation: Comparison with LLaVA OneVision-7B, GPT-4V, and GPT-4o

| Dataset | Split | MLCD-Embodied-7B | LLaVA OneVision-7B | GPT-4V | GPT-4o |
| --- | --- | --- | --- | --- | --- |
| AI2D | test | 79.9 | 81.4 | 78.2 | 94.2 |
| ChartQA | test | 83.0 | 80.0 | 78.5 | 85.7 |
| DocVQA | test | 91.6 | 87.5 | 88.4 | 92.8 |
| InfoVQA | val | 73.9 | 70.7 | - | - |
| InfoVQA | test | 70.0 | 68.8 | - | - |
| MMMU | val | 47.3 | 48.8 | 56.8 | 69.1 |
| MMStar | test | 58.5 | 61.7 | 57.1 | 63.9 |
| OCRBench | - | 749.0 | 697.0 | 656.0 | 805.0 |
| RealWorldQA | test | 68.9 | 66.3 | 61.4 | 58.6 |
| SeedBench | image | 74.9 | 75.4 | 49.9 | 76.2 |
| MMBench | en-dev | 81.1 | 83.2 | 81.3 | 83.4 |
| MMBench | en-test | 80.1 | 80.8 | 75.0 | - |
| MME | test | 578/1603 | 418/1580 | 517/1409 | - |

## Usage

### A. Installation

```bash
git clone https://github.com/deepglint/unicom
cd unicom

# Upgrade pip and install the necessary dependencies
pip install --upgrade pip
pip install -e ".[train]"
```
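
The pretrained weights can be fetched from the Hugging Face Hub. A minimal sketch, assuming the `huggingface_hub` CLI is installed and using `./MLCD-Embodied-7B` as an example target directory:

```bash
# Sketch: download the released MLCD-Embodied-7B weights from Hugging Face.
# ./MLCD-Embodied-7B is only an example local directory.
pip install -U "huggingface_hub[cli]"
huggingface-cli download DeepGlint-AI/MLCD-Embodied-7B --local-dir ./MLCD-Embodied-7B
```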

### B. Inference

```bash
CUDA_VISIBLE_DEVICES=0 python infer.py --model_dir /path/to/your/model

# example:
# >> Enter 'exit' to end the conversation, 'reset' to clear the chat history.
# >> Enter image file paths (comma-separated): ./asserts/logo.png
# >> User: <image>What kind of animal is it in this picture?
# >> Assistant: The image features a stylized representation of a cat, characterized by its vibrant and abstract depiction.
# >> User: What color is this cat?
# >> Assistant: The cat in the image is primarily white with blue, orange and pink accents, creating a visually appealing and unique appearance.
```

### C. Evaluation for Embodied Ability

#### Step 1

Download the raw data by following the instructions from OpenEQA and RoboVQA (validation split).

#### Step 2

Convert the raw data into the format required for model evaluation:

```bash
# Convert the OpenEQA benchmark. Note: replace the paths with your own.
python llava/benchmark/make_openeqa_bmk.py

# Convert the RoboVQA benchmark. Note: replace the paths with your own.
python llava/benchmark/make_robovqa_bmk.py
```
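
As an optional sanity check, the generated parquet files can be inspected before evaluation. This is a minimal sketch, assuming pandas and pyarrow are installed and that the paths below match the output locations configured in the conversion scripts:

```bash
# Sketch: confirm a converted benchmark file loads and report its size.
# The path is a placeholder for your configured output location.
python -c "import pandas as pd; df = pd.read_parquet('/path/to/your/benchmarks/RoboVQA/robovqa.parquet'); print(len(df), 'rows:', list(df.columns))"
```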

#### Step 3

Make sure your top-level directory structure looks like this:

```
|--/path/to/your/benchmarks
|  |--OpenEQA
|  |  |--openeqa_scannet.parquet
|  |  |--openeqa_hm3d.parquet
|  |--RoboVQA
|     |--robovqa.parquet
|--/path/to/your/images
   |--openeqa_val
   |  |--scannet-v0
   |  |  |--002-scannet-scene0709_00
   |  |  |--xxx-scannet-scenexxxx_xx
   |  |--hm3d-v0
   |     |--000-hm3d-BFRyYbPCCPE
   |     |--xxx-hm3d-xxxxxxxxxxx
   |--robovqa_val
      |--robovqa_221911
      |--robovqa_xxxxxx
```
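
If the folders do not exist yet, the skeleton can be created up front. A sketch, with both top-level paths as placeholders:

```bash
# Sketch: create the expected directory skeleton before moving the converted
# parquet files and extracted image folders into place.
mkdir -p /path/to/your/benchmarks/{OpenEQA,RoboVQA}
mkdir -p /path/to/your/images/openeqa_val/{scannet-v0,hm3d-v0}
mkdir -p /path/to/your/images/robovqa_val
```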

#### Step 4

Run the evaluation script:

```bash
# Note: replace 'YOUR_API_KEY', 'YOUR_ENDPOINT', 'bmk_root', 'image_folder' with your own.
bash scripts/eval/eval_robo.sh /path/to/your/model
```
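
The API key and endpoint are presumably consumed by the benchmark's LLM-based answer scoring (OpenEQA scores open-ended answers with an LLM judge), so they must be filled in inside `scripts/eval/eval_robo.sh` before launching. For example, if the weights were downloaded to `./MLCD-Embodied-7B` as in the earlier sketch:

```bash
# Example invocation (a sketch; ./MLCD-Embodied-7B is an assumed local weights
# directory, and the variables inside eval_robo.sh must already be configured).
bash scripts/eval/eval_robo.sh ./MLCD-Embodied-7B
```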

### D. Evaluation for General Ability

Install the evaluation tool and execute the evaluation script:

```bash
pip install lmms-eval==0.2.0
PYTHONPATH=./ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m accelerate.commands.launch \
    --main_process_port=12444 \
    --num_processes=8 \
    -m lmms_eval \
    --model llava \
    --model_args pretrained=DeepGlint-AI/MLCD-Embodied-7B,conv_template=qwen_1_5 \
    --tasks mme \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix mlcd \
    --output_path ./eval_log/
```
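
The command above evaluates MME only. The `--tasks` flag accepts a comma-separated list, so several of the benchmarks from the table can be run in one pass; the extra task names below (`ai2d`, `ocrbench`) are assumptions and should be checked against the tasks shipped with your lmms-eval version:

```bash
# Sketch: run multiple benchmarks in a single launch; only --tasks changes
# relative to the command above (task names may differ across versions).
PYTHONPATH=./ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m accelerate.commands.launch \
    --main_process_port=12444 \
    --num_processes=8 \
    -m lmms_eval \
    --model llava \
    --model_args pretrained=DeepGlint-AI/MLCD-Embodied-7B,conv_template=qwen_1_5 \
    --tasks mme,ai2d,ocrbench \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix mlcd \
    --output_path ./eval_log/
```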