MLCD-Embodied-7B is comparable to GPT-4V in embodied capabilities while retaining strong general multimodal performance. The detailed evaluation results are shown below, first on the embodied benchmarks (RoboVQA and OpenEQA) and then on general multimodal benchmarks.
| Dataset | Metric / Category | MLCD-Embodied-7B | LLaVA OneVision-7B | GPT-4V | RoboMamba |
|---|---|---|---|---|---|
| RoboVQA | BLEU1 | 73.16 | 38.12 | - | 54.9 |
| RoboVQA | BLEU2 | 66.39 | 33.56 | - | 44.2 |
| RoboVQA | BLEU3 | 60.61 | 31.76 | - | 39.5 |
| RoboVQA | BLEU4 | 56.56 | 30.97 | - | 36.3 |
| OpenEQA | Object-State Recognition | 71.83 | - | 63.2 | - |
| OpenEQA | Object Recognition | 49.46 | - | 43.4 | - |
| OpenEQA | Functional Reasoning | 54.38 | - | 57.4 | - |
| OpenEQA | Spatial Understanding | 48.64 | - | 33.6 | - |
| OpenEQA | Attribute Recognition | 67.08 | - | 57.2 | - |
| OpenEQA | World Knowledge | 53.87 | - | 50.7 | - |
| OpenEQA | Object Localization | 43.06 | - | 42.0 | - |

| Dataset | Split | MLCD-Embodied-7B | LLaVA OneVision-7B | GPT-4V | GPT-4o |
|---|---|---|---|---|---|
| AI2D | test | 79.9 | 81.4 | 78.2 | 94.2 |
| ChartQA | test | 83.0 | 80.0 | 78.5 | 85.7 |
| DocVQA | test | 91.6 | 87.5 | 88.4 | 92.8 |
| InfoVQA | val | 73.9 | 70.7 | - | - |
| InfoVQA | test | 70.0 | 68.8 | - | - |
| MMMU | val | 47.3 | 48.8 | 56.8 | 69.1 |
| MMStar | test | 58.5 | 61.7 | 57.1 | 63.9 |
| OCRBench | - | 749.0 | 697.0 | 656.0 | 805.0 |
| RealWorldQA | test | 68.9 | 66.3 | 61.4 | 58.6 |
| SeedBench | image | 74.9 | 75.4 | 49.9 | 76.2 |
| MMBench | en-dev | 81.1 | 83.2 | 81.3 | 83.4 |
| MMBench | en-test | 80.1 | 80.8 | 75.0 | - |
| MME | test | 578/1603 | 418/1580 | 517/1409 | - |

MME results are reported as cognition score / perception score.
```bash
git clone https://github.com/deepglint/unicom
cd unicom

# Upgrade pip and install the necessary dependencies
pip install --upgrade pip
pip install -e ".[train]"
```
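If you want a local copy of the MLCD-Embodied-7B checkpoint to pass to `--model_dir` below, one option (an assumption, not part of this repository's tooling) is to pull it from the Hugging Face Hub:

```bash
# Download the released checkpoint from the Hugging Face Hub
# (assumes huggingface_hub is installed; the target directory is only an example).
pip install -U "huggingface_hub[cli]"
huggingface-cli download DeepGlint-AI/MLCD-Embodied-7B --local-dir /path/to/your/model
```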
```bash
CUDA_VISIBLE_DEVICES=0 python infer.py --model_dir /path/to/your/model

# Example session:
# >> Enter 'exit' to end the conversation, 'reset' to clear the chat history.
# >> Enter image file paths (comma-separated): ./asserts/logo.png
# >> User: <image>What kind of animal is it in this picture?
# >> Assistant: The image features a stylized representation of a cat, characterized by its vibrant and abstract depiction.
# >> User: What color is this cat?
# >> Assistant: The cat in the image is primarily white with blue, orange and pink accents, creating a visually appealing and unique appearance.
```
To evaluate the embodied capabilities, first download the raw data by following the OpenEQA and RoboVQA (val split) instructions.
Then convert the raw data into the format required for model evaluation:
```bash
# Convert the OpenEQA benchmark. Note: replace the paths with your own.
python llava/benchmark/make_openeqa_bmk.py

# Convert the RoboVQA benchmark. Note: replace the paths with your own.
python llava/benchmark/make_robovqa_bmk.py
```
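As an optional sanity check (not part of the conversion scripts), the converted parquet files can be opened with pandas to confirm they are readable; the columns are simply whatever the scripts above emit:

```bash
# Quick look at a converted benchmark file
# (assumes pandas and pyarrow are installed; replace the path with your own).
python -c "import pandas as pd; df = pd.read_parquet('/path/to/your/benchmarks/OpenEQA/openeqa_scannet.parquet'); print(df.shape); print(df.columns.tolist())"
```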
Make sure your top-level directory structure looks like this:
```
|--/path/to/your/benchmarks
|  |--OpenEQA
|  |  |--openeqa_scannet.parquet
|  |  |--openeqa_hm3d.parquet
|  |--RoboVQA
|  |  |--robovqa.parquet
|--/path/to/your/images
|  |--openeqa_val
|  |  |--scannet-v0
|  |  |  |--002-scannet-scene0709_00
|  |  |  |--xxx-scannet-scenexxxx_xx
|  |  |--hm3d-v0
|  |  |  |--000-hm3d-BFRyYbPCCPE
|  |  |  |--xxx-hm3d-xxxxxxxxxxx
|  |--robovqa_val
|  |  |--robovqa_221911
|  |  |--robovqa_xxxxxx
```
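Before launching the evaluation, a small check like the sketch below (using the same placeholder roots as above) can catch missing files or misplaced folders:

```bash
# Verify the expected benchmark files and image folders exist.
# Replace the two placeholder roots with your own paths.
BMK_ROOT=/path/to/your/benchmarks
IMG_ROOT=/path/to/your/images
for f in OpenEQA/openeqa_scannet.parquet OpenEQA/openeqa_hm3d.parquet RoboVQA/robovqa.parquet; do
    [ -f "$BMK_ROOT/$f" ] && echo "found   $f" || echo "MISSING $f"
done
for d in openeqa_val/scannet-v0 openeqa_val/hm3d-v0 robovqa_val; do
    [ -d "$IMG_ROOT/$d" ] && echo "found   $d/" || echo "MISSING $d/"
done
```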
Run the evaluation script:
```bash
# Note: replace 'YOUR_API_KEY', 'YOUR_ENDPOINT', 'bmk_root', and 'image_folder' with your own values.
bash scripts/eval/eval_robo.sh /path/to/your/model
```
For the general multimodal benchmarks, install the evaluation tool and run the evaluation script:
```bash
pip install lmms-eval==0.2.0

PYTHONPATH=./ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m accelerate.commands.launch \
    --main_process_port=12444 \
    --num_processes=8 \
    -m lmms_eval \
    --model llava \
    --model_args pretrained=DeepGlint-AI/MLCD-Embodied-7B,conv_template=qwen_1_5 \
    --tasks mme \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix mlcd \
    --output_path ./eval_log/
```
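Multiple benchmarks can be scored in one run by passing a comma-separated list to `--tasks` (for example `--tasks mme,mmbench_en_dev`); the exact task identifiers depend on the installed lmms-eval release, so it can help to list what is registered first:

```bash
# Print the task names registered in the installed lmms-eval version
# before editing the --tasks argument above.
python -m lmms_eval --tasks list
```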