Datasets

We will update this file as soon as possible!

We provide links to download our preprocessed datasets. If you would like to process the data yourself, we will soon provide scripts for doing so.

Note that you must replace the image, video, and speech file paths in the JSON files with paths that match your own storage location.

Likewise, replace the original paths in xllm/configs/datasets/*/*.yaml or xllm/projects/train/*.yaml with your own file paths.

Image Interface

The pretraining datasets used in X-LLM are all publicly available, and we provide public links to them here. We recommend that you first download the images from the links, then link the image paths to the dataset JSON files (Chinese) that we provide.

Dataset	Image	Data	Language
CC3M	Image Url	Data Json	ZH
MSCOCO	Image Url	Data Json	ZH
Visual Genome	Image Url	Data Json	ZH
Flickr30k	Image Url	Data Json	ZH
SBU	Image Url	Data Json	ZH
AI Challenger captions	Image Url	Data Json	ZH
Wukong captions	Image Url	Data Json	ZH

Note that for the Wukong dataset, we filtered the first 50 million images using Chinese-CLIP (ViT-B-16 model) and kept only samples with an image-text similarity score greater than 0.475. You will also need to match the captions to their corresponding images based on the image captions.
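
The thresholding step described above can be sketched as follows. This is only a minimal illustration, not the script used to build the released data: it assumes the Chinese-CLIP similarity scores have already been computed and stored in a hypothetical "score" field on each sample, and the sample entries are made up for the example.

```python
from typing import Dict, Iterable, Iterator

# Threshold stated in the text above.
SIMILARITY_THRESHOLD = 0.475


def filter_by_similarity(
    samples: Iterable[Dict], threshold: float = SIMILARITY_THRESHOLD
) -> Iterator[Dict]:
    """Yield only samples whose similarity score is strictly greater than threshold.

    Assumption: each sample carries a precomputed score under the hypothetical
    key "score"; the released JSON files do not necessarily contain this field.
    """
    for sample in samples:
        if sample["score"] > threshold:
            yield sample


# Made-up samples for illustration only.
scored = [
    {"image": "a.jpg", "caption": "示例一", "score": 0.52},
    {"image": "b.jpg", "caption": "示例二", "score": 0.475},
    {"image": "c.jpg", "caption": "示例三", "score": 0.31},
]
kept = list(filter_by_similarity(scored))
print([s["image"] for s in kept])
```

Because the comparison is strict, a sample scoring exactly 0.475 is dropped, which matches the "greater than 0.475" wording.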

Data Format

[
    {
        "image": "train2014/COCO_train2014_000000013356.jpg",
        "caption": [
            "一个站在玻璃附近的白衣男子",
            "一个人在破旧的浴室里穿着防护服和面具",
            "一个人从头到脚穿着白色涂在房间里",
            "浴室正在装修，一个人在墙上画画",
            "一个穿着防护服的人在房间里工作"
        ],
        "image_id": "train2014/COCO_train2014_000000013356.jpg",
        "dataset": "coco_zh"
    }
]

or

[
    {
        "image": "/raid/cfl/en_pretraining/data/images/sbu/pythonDownload/subpic/5eda85e140.jpg",
        "caption": "谢菲尔德公园花园苏塞克斯湖边的老树",
        "image_id": "5eda85e140.jpg",
        "dataset": "sbu_zh"
    }
]

We do not use the "image_id" field, which is identical to "image" in most cases. Note that you must replace the image paths in the JSON files with paths that match your own storage location.
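
As a rough sketch of how the two entry formats and the path replacement could be handled together, the snippet below turns "caption" into a list in both cases and re-roots the "image" path. The helper names (remap_image_path, normalize_entry), the new root /data/images, and the choice to drop "image_id" are assumptions made for illustration; they are not part of the X-LLM code base.

```python
import json
import posixpath
from typing import Dict


def remap_image_path(path: str, new_root: str, old_prefix: str = "") -> str:
    """Strip old_prefix (if present) and place the remainder under new_root."""
    if old_prefix and path.startswith(old_prefix):
        path = path[len(old_prefix):]
    # Leading slashes are removed so the path is always joined under new_root.
    return posixpath.join(new_root, path.lstrip("/"))


def normalize_entry(entry: Dict, new_root: str, old_prefix: str = "") -> Dict:
    """Return a copy with a list-valued caption and a remapped image path."""
    caption = entry["caption"]
    captions = [caption] if isinstance(caption, str) else list(caption)
    # "image_id" is omitted because the text above says it is not used.
    return {
        "image": remap_image_path(entry["image"], new_root, old_prefix),
        "caption": captions,
        "dataset": entry["dataset"],
    }


raw = """[
    {
        "image": "/raid/cfl/en_pretraining/data/images/sbu/pythonDownload/subpic/5eda85e140.jpg",
        "caption": "谢菲尔德公园花园苏塞克斯湖边的老树",
        "image_id": "5eda85e140.jpg",
        "dataset": "sbu_zh"
    }
]"""

entries = [
    normalize_entry(
        e,
        new_root="/data/images",  # hypothetical local storage root
        old_prefix="/raid/cfl/en_pretraining/data/images/",
    )
    for e in json.loads(raw)
]
print(json.dumps(entries, ensure_ascii=False, indent=4))
```

For entries that already use relative paths, such as the COCO example, leaving old_prefix empty simply joins the relative path onto new_root.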

Speech Interface

We provide public links to the speech data (*.wav files and features). We recommend that you first download the data from the links, then link the speech data paths to the dataset JSON files that we provide.

Dataset	Audio/Features	Data	Language
AISHELL-2	Audio/Features	Data Json	ZH
VSDial-CN	Audio/Features	Data Json	ZH

Video Interface

The pretraining datasets used in X-LLM are all publicly available, and we provide public links to them here. We recommend that you first download the videos from the links, then link the video paths to the dataset JSON files (Chinese) that we provide.

Dataset	Video	Data
MSRVTT	Video Url	Data Json
ActivityNet	Video Url	Data Json

Evaluation

We provide a Chinese version of the LLaVA test set, an evaluation dataset built from 30 unseen images. Each image is associated with three types of instructions: conversation, detailed description, and complex reasoning.
