We will update this file as soon as possible!
We provide links to download our preprocessed dataset. If you would like to process the data on your own, we will soon provide scripts for you to do so.
Note that you should replace the image/video/speech file paths in the JSON files with paths that match your own storage location.
Likewise, replace the original paths in xllm/configs/datasets/*/*.yaml or xllm/projects/train/*.yaml with your own file paths.
The pretraining datasets used in X-LLM are all publicly available. Here we provide the public links to these data. We recommend that you first download the images from the links, and then match the image paths to the dataset JSON files (Chinese) that we provide.
Dataset | Image | Data | Language |
---|---|---|---|
CC3M | Image Url | Data Json | ZH |
MSCOCO | Image Url | Data Json | ZH |
Visual Genome | Image Url | Data Json | ZH |
Flickr30k | Image Url | Data Json | ZH |
SBU | Image Url | Data Json | ZH |
AI Challenger captions | Image Url | Data Json | ZH |
Wukong captions | Image Url | Data Json | ZH |
Please note that for the Wukong dataset, we filtered the first 50 million images using Chinese-CLIP (ViT-B-16 model) and kept only samples with an image-text similarity score greater than 0.475. Additionally, you will need to pair the captions with the corresponding images yourself, based on the image captions.
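The thresholding step can be sketched as follows. This is a minimal illustration, not the script we used: it assumes the similarity scores have already been computed with Chinese-CLIP and stored under a hypothetical `score` field, and the sample records are made up for the example.

```python
THRESHOLD = 0.475  # similarity cutoff stated above


def filter_by_similarity(samples, threshold=THRESHOLD):
    """Keep samples whose image-text similarity is strictly greater than threshold."""
    return [s for s in samples if s["score"] > threshold]


# Hypothetical records; "score" is an assumed field name, not part of the released JSON.
samples = [
    {"caption": "示例一", "score": 0.51},
    {"caption": "示例二", "score": 0.475},
    {"caption": "示例三", "score": 0.30},
]
kept = filter_by_similarity(samples)
print(len(kept))
```

Because the comparison is strict, a sample scoring exactly 0.475 is dropped.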
Data Format
[
{
"image": "train2014/COCO_train2014_000000013356.jpg",
"caption": [
"一个站在玻璃附近的白衣男子",
"一个人在破旧的浴室里穿着防护服和面具",
"一个人从头到脚穿着白色涂在房间里",
"浴室正在装修,一个人在墙上画画",
"一个穿着防护服的人在房间里工作"
],
"image_id": "train2014/COCO_train2014_000000013356.jpg",
"dataset": "coco_zh"
},
]
or
[
{
"image": "/raid/cfl/en_pretraining/data/images/sbu/pythonDownload/subpic/5eda85e140.jpg",
"caption": "谢菲尔德公园花园苏塞克斯湖边的老树",
"image_id": "5eda85e140.jpg",
"dataset": "sbu_zh"
},
]
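In the two samples above, "caption" is either a list of strings or a single string. A loader may want to handle both cases uniformly; the helper below is a small sketch of one way to do that (the name `captions_of` is ours for illustration and is not part of the X-LLM code).

```python
def captions_of(record):
    """Return the captions of a record as a list, whichever form they were stored in."""
    caption = record["caption"]
    if isinstance(caption, (list, tuple)):
        return list(caption)
    return [caption]


# Records shaped like the two samples above (captions shortened to one or two entries).
records = [
    {"image": "train2014/COCO_train2014_000000013356.jpg",
     "caption": ["一个站在玻璃附近的白衣男子", "一个穿着防护服的人在房间里工作"],
     "dataset": "coco_zh"},
    {"image": "5eda85e140.jpg",
     "caption": "谢菲尔德公园花园苏塞克斯湖边的老树",
     "dataset": "sbu_zh"},
]
for record in records:
    print(record["dataset"], len(captions_of(record)))
```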
The "image_id" item is not used; in most cases it is the same as "image". Note that you should replace the image paths in the JSON files with paths that match your own storage location.
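Replacing the image paths can be done with a few lines of Python. The sketch below is one possible approach under stated assumptions: the function names, the `keep_parts` parameter, and the target directory `/data/sbu/images` are hypothetical and should be adapted to your own layout.

```python
import json
import os


def relocate_image_paths(records, image_root, keep_parts=1):
    """Return copies of the records with "image" re-rooted under image_root.

    keep_parts is the number of trailing path components to preserve,
    e.g. 2 keeps "train2014/COCO_....jpg" for the MSCOCO entries.
    """
    relocated = []
    for record in records:
        parts = record["image"].replace("\\", "/").split("/")
        new_record = dict(record)  # shallow copy; the input is left untouched
        new_record["image"] = os.path.join(image_root, *parts[-keep_parts:])
        relocated.append(new_record)
    return relocated


def rewrite_json_file(src_path, dst_path, image_root, keep_parts=1):
    """Load a dataset JSON file, relocate its image paths, and save a copy."""
    with open(src_path, encoding="utf-8") as f:
        records = json.load(f)
    with open(dst_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps the Chinese captions readable in the output
        json.dump(relocate_image_paths(records, image_root, keep_parts), f,
                  ensure_ascii=False, indent=2)


# In-memory demo using the SBU sample above; "/data/sbu/images" is a made-up target.
sbu = [{"image": "/raid/cfl/en_pretraining/data/images/sbu/pythonDownload/subpic/5eda85e140.jpg",
        "caption": "谢菲尔德公园花园苏塞克斯湖边的老树",
        "image_id": "5eda85e140.jpg", "dataset": "sbu_zh"}]
relocated = relocate_image_paths(sbu, "/data/sbu/images")
print(relocated[0]["image"])
```

Writing to a separate output file rather than overwriting the download makes it easy to redo the step if the storage location changes.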
We provide public links to the speech data (*.wav files and features). We recommend that you first download the data from the links, and then match the speech data paths to the dataset JSON files that we provide.
Dataset | Audio/Features | Data | Language |
---|---|---|---|
AISHELL-2 | Audio/Features | Data Json | ZH |
VSDial-CN | Audio/Features | Data Json | ZH |
The pretraining datasets used in X-LLM are all publicly available. Here we provide the public links to these data. We recommend that you first download the videos from the links, and then match the video paths to the dataset JSON files (Chinese) that we provide.
Dataset | Video | Data |
---|---|---|
MSRVTT | Video Url | Data Json |
ActivityNet | Video Url | Data Json |
We provide a Chinese version of the LLaVA test set, an evaluation dataset constructed from 30 unseen images. Each image is associated with three types of instructions: conversation, detailed description, and complex reasoning.