- This open-source project aims to train a small-parameter language model with visual capabilities, MiniMind-V, from scratch in as fast as 3 hours.
- MiniMind-V is extremely lightweight, with the smallest version being about $\frac{1}{7000}$ the size of GPT-3, aiming to be quickly inferable and trainable even on a personal GPU.
- This is not only an implementation of an open-source model but also a tutorial for getting started with Visual Language Models (VLMs).
- We hope this project can provide researchers with a starting example, helping everyone get up to speed and spark more exploration and innovation in the VLM field.
To avoid misunderstanding, "from scratch" specifically refers to further developing the pure language model MiniMind (which is a fully from-scratch trained GPT-like model) with visual capabilities. For more details on the latter, please refer to the twin project MiniMind.
To avoid misunderstanding, "as fast as 3 hours" means you need to have a machine with a hardware configuration higher than mine. The detailed specifications will be provided below.
The demo has been deployed to ModelScope's creative space, where you can experience it on this website:
Visual Language Models (VLMs) like GPT-4V, Qwen-VL, LlaVA, etc., although impressive in performance, often require extremely high hardware configurations. On a personal device, the GPU memory is not only far from sufficient for training, but even inference can be very difficult. We learn about these relatively novel VLMs by reading papers or WeChat official-account explainers, but often come away with only a vague understanding. What we really want to know is: Are multimodal large models really as complex as they seem? What does their code implementation look like? Is the training process really that difficult? Can I start training from scratch with just a single 2080Ti GPU?
Through MiniMind-V, this project hopes to answer these questions and help researchers understand the core principles of visual language models under limited hardware conditions.
Tip
(As of 2024-10-04) The MiniMind-V series has completed pre-training of 2 model versions; the smallest requires only 27M (0.027B) parameters for image recognition and dialogue capabilities!
Model (Size) | Tokenizer Length | Inference Usage | Release | Subjective Rating (/100) |
---|---|---|---|---|
minimind-v-v1-small (27M) | 6400 | 0.6 GB | 2024.10.04 | 50' |
minimind-v-v1 (109M) | 6400 | 1.1 GB | 2024.10.04 | 60' |
This analysis was conducted on 2×RTX 3090 GPUs with Torch 2.1.2, CUDA 12.2, and Flash Attention 2.
2024-10-05 (newest 🎉)
- MiniMind-V arrives as scheduled, first open-source release
This is my personal software and hardware configuration; adjust as necessary:
CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
Memory: 128 GB
GPU: NVIDIA GeForce RTX 3090(24GB) * 2
Environment: python 3.9 + Torch 2.1.2 + DDP single-machine multi-GPU training
- Ubuntu == 20.04
- Python == 3.9
- Pytorch == 2.1.2
- CUDA == 12.2
- requirements.txt
BTW: If you don't have Git LFS installed, please install it first with `sudo apt-get update` and `sudo apt-get install git-lfs`.
- Clone the project

```bash
git clone https://github.com/jingyaogong/minimind-v
cd minimind-v
```
- Install the environment

```bash
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
```

```python
# Test whether torch can use CUDA
import torch
print(torch.cuda.is_available())
```

If it is not available, please go to torch_stable to download a suitable whl file for installation. Refer to this link.
- Download the pre-trained model weights `minimind-v-v1` to the project root directory

```bash
git clone https://huggingface.co/jingyaogong/minimind-v-v1
```
- Download the pre-trained `clip-vit-base-patch32` model to the `model/clip_model` directory:

```bash
cd model/clip_model
git clone https://hf-mirror.com/openai/clip-vit-base-patch32
```
- Start the chat web server for testing conversations

```bash
python web_server.py
```
BTW: If you don't have Git LFS installed, please install it first with `sudo apt-get update` and `sudo apt-get install git-lfs`.
0. Clone the project code

```bash
git clone https://github.com/jingyaogong/minimind-v && cd minimind-v
```
1. Environment setup

```bash
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
```

```python
# Test whether torch can use CUDA
import torch
print(torch.cuda.is_available())
```

If it is not available, please go to torch_stable to download a suitable whl file for installation. Refer to this link.
2. Download the `clip-vit-base-patch32` model and place it in the `./model/clip_model` directory:

```bash
cd ./model/clip_model && git clone https://hf-mirror.com/openai/clip-vit-base-patch32
```
3. If you want to train it yourself

   - 3.1 Download all contents of the dataset (Baidu Netdisk or HuggingFace) to the `./dataset` directory, and unzip `pretrain_images.zip` and `sft_images.zip`
   - 3.2 Adjust the `dim` and `n_layers` parameters in `./model/LMConfig.py`: (512+8) or (768+16), corresponding to `minimind-v-v1-small` and `minimind-v-v1` respectively (see the sketch after this list)
   - 3.3 Download the pre-trained weight file of the MiniMind language model (Baidu Netdisk or HuggingFace) and place it in the `./out/` directory, named `*_llm.pth`
   - 3.4 Run `python 1-pretrain_vlm.py` for pre-training, producing `*_vlm_pretrain.pth` as the pre-training output weights
   - 3.5 Run `python 2-sft_vlm.py` for instruction fine-tuning, producing `*_vlm_sft.pth` as the fine-tuning output weights
   - 3.6 Run `python 2-sft_vlm.py --multi True` for multi-image instruction fine-tuning on top of instruction fine-tuning, producing `*_vlm_sft_multi.pth` as the multi-image fine-tuning output weights; GPU memory usage is approximately 8198 MiB for the 512+8 model
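As a quick illustration of step 3.2, here is a minimal sketch of the two parameters in question. It assumes `LMConfig` accepts `dim` and `n_layers` as constructor arguments, mirroring the defaults you would otherwise edit directly in `./model/LMConfig.py`:

```python
from model.LMConfig import LMConfig

# minimind-v-v1-small: dim=512, n_layers=8
# minimind-v-v1:       dim=768, n_layers=16
lm_config = LMConfig(dim=512, n_layers=8)
print(lm_config.dim, lm_config.n_layers)
```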
4. Test the inference effect of the self-trained model

   - Ensure that the trained parameter weight files (`*.pth`) you want to use are located in the `./out/` directory
   - You can also skip training and directly download and use the `*.pth` weight files I have trained:

     ```text
     minimind-v/out
     ├── 512_llm.pth
     ├── 512_vlm_pretrain.pth
     ├── 512_vlm_sft.pth
     ├── 768_llm.pth
     ├── 768_vlm_pretrain.pth
     ├── 768_vlm_sft.pth
     ```

   - Use `python 3-eval_chat.py` to test the model's conversation ability; the test images are in `./dataset/eval_images` and can be replaced as needed
   - Use `python 3-eval_chat.py` with the `multi` variable adjusted to test the model's multi-image conversation ability. The test images are in `./dataset/eval_multi_images` and can be replaced as needed (the multi-image dataset is relatively small, contains English dialogues, and only covers two-image comparison scenarios, so the fine-tuning effect is limited)
🍭 【Tip】Both pretraining and full-parameter instruction fine-tuning (pretrain and sft) support multi-GPU acceleration
- Single-machine multi-GPU training launch (DDP, N cards)

```bash
torchrun --nproc_per_node N 1-pretrain_vlm.py
# and
torchrun --nproc_per_node N 2-sft_vlm.py
```
- Record the training process

```bash
torchrun --nproc_per_node N 1-pretrain_vlm.py --use_wandb
# and
python 1-pretrain_vlm.py --use_wandb
```

By adding the `--use_wandb` argument, the training process is recorded, and after training completes you can view it on the wandb website. You can specify the project name and run name by modifying the `wandb_project` and `wandb_run_name` parameters.
The base language model MiniMind (LLM) for MiniMind-V (VLM) comes from the twin project minimind. For specific details on the model architecture, training specifics, principles, and test results, please refer to the minimind project. To avoid redundancy, we will not discuss the LLM-related parts here, assuming you have a basic understanding of MiniMind (LLM).
PS: Even if you do not wish to delve into the details of MiniMind (LLM), you can refer directly to Quick Test and Quick Start above to quickly test or train MiniMind-V; doing so is largely unaffected.
The structure of MiniMind-V remains almost unchanged: only two sub-modules are added, the Visual Encoder and a feature projection, plus a multimodal fusion branch, in order to support input from multiple modalities:
At this point, it's interesting to ponder two questions: What is a Large Language Model (LLM)? And what is a multimodal model?
- This article perfectly articulates my thoughts, suggesting that the term LLM is quite inaccurate!

  > Although Large Language Models (LLMs) carry the word "language" in their name, they are actually not very related to language; this is merely a historical artifact. A more accurate name would be "autoregressive Transformer" or something similar. LLMs are more of a general statistical modeling technique that primarily uses autoregressive Transformers to model token streams, and these tokens can represent text, images, audio, action choices, or even molecules, among other things. Therefore, in theory, any problem that can be framed as modeling a stream of discrete tokens can be addressed with an LLM. In fact, as the large language model technology stack matures, we may see more and more problems brought into this modeling paradigm. That is, the problem is fixed as using an LLM to "predict the next token", with only the usage and meaning of the tokens varying across domains.
- Professor Li Xi similarly corroborates my view (the exact wording is not recalled, but the gist is as follows):

  > Text, video, speech, and actions, which appear to humans as "multimodal" signals, are essentially just a human classification scheme for storing information. Like `.txt` and `.png` files, although they differ in visual presentation and higher-level representation, there is no fundamental difference at their core. The notion of "multimodality" arises simply because humans need to categorize these signals at different perceptual levels. For machines, however, regardless of a signal's "modality", it ultimately presents as a string of binary "unimodal" digits. Machines do not differentiate the modality a signal comes from; they simply process and analyze the information content carried by these sequences.
I personally believe that Generative Pretrained Transformer (GPT) is a more fitting term than Large Language Model (LLM), and thus I prefer to use "GPT" to represent LLM/VLM/GPT-like architectures, rather than to piggyback on OpenAI's popularity.
In summary, we can encapsulate what GPT does in one sentence: GPT models predict the next, and the next, and the next token... until the model outputs an end token; here, the "token" does not necessarily have to be text!
- For an LLM, if it needs to understand "images", we can treat images as a special kind of "foreign language" never seen before, translate them through a "foreign dictionary" into a special language, and feed that as input to the LLM.
- For an LLM, if it needs to understand "audio", we can treat audio as a special kind of "foreign language" never seen before, translate it through a "foreign dictionary" into a special language, and feed that as input to the LLM.
- ...
So, to get MiniMind-V, we only need to accomplish two things:
- Use a "foreign dictionary" proficient in translating images to translate the "foreign language" of images into the "LLM language" that the model can understand.
- Fine-tune the LLM so that it goes through a period of adjustment with the "foreign dictionary," thereby better understanding images.
The "foreign dictionary" is generally referred to as the Visual Encoder model. Similar to visual-language models such as LlaVA and Qwen-VL, MiniMind-V also selects open-source Clip series models as the Visual Encoder. Specifically, it uses clip-vit-base-patch32, a classic Visual Encoder based on the ViT-B/32 architecture, for describing image-text information. The input image size is 224x224, and since the patches are 32×32, it generates 7*7+1(cls_token)=50 tokens as input to the encoder layer, ultimately producing a 1×768 dimensional embedding vector for calculating error with text. We do not need the final embedding representation, so we only take the output of the encoder layer, which is the output features of the VIT backbone. In the code, this corresponds to the hook function in ./model/vision_utils.py's get_img_embedding. It retrieves the 50×768 dimensional features from the previous layer, which we then input as 50 visual tokens into MiniMind-V. There are also larger Clip models like clip-vit-large-patch14, which have a stronger image understanding capability, but a single image would generate 257 tokens, which, for a model of MiniMind's scale, would result in too long a context of image tokens, which is not conducive to training.
After obtaining the image encoder features, the 768-dimensional visual tokens must be aligned with the LLM's text tokens: the image features need to be mapped into the same space as the text embeddings, because native visual tokens and text tokens cannot simply be treated as equivalent. This is called cross-modal feature alignment. LlaVA-1 accomplished it with a simple unbiased linear transformation, and the results were excellent; MiniMind-V does the same.
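A minimal sketch of such a projection, assuming an LLM embedding dimension of 512 (the minimind-v-v1-small setting); this mirrors the idea of an unbiased linear transformation rather than reproducing the repository's exact module:

```python
import torch
import torch.nn as nn

class VisionProjection(nn.Module):
    """Map 768-d CLIP visual tokens into the LLM's embedding space via a bias-free linear layer."""
    def __init__(self, vision_dim: int = 768, llm_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim, bias=False)

    def forward(self, img_tokens: torch.Tensor) -> torch.Tensor:
        # img_tokens: [batch, 50, 768] -> [batch, 50, llm_dim]
        return self.proj(img_tokens)

projection = VisionProjection(vision_dim=768, llm_dim=512)
vision_tokens = projection(torch.randn(1, 50, 768))
print(vision_tokens.shape)  # torch.Size([1, 50, 512])
```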
With this, the internal structural changes of MiniMind-V have been presented.
Next, we briefly discuss the changes in the external input and output of MiniMind-V.
The input for VLM is still a piece of text, which includes a special placeholder. After computing the text embedding, the vector generated by the image encoder can be projected into the part of the embedding corresponding to the placeholder, replacing the original placeholder embedding. For example:
<image>\nWhat is the content of this image?
minimind-v uses a 50-character `<<<...>>>` placeholder to stand in for the image. The reason for 50 characters was mentioned earlier: any image is encoded by the CLIP model into 50×768-dimensional tokens. Therefore, a minimind-v prompt looks like:

```text
<<<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>\nWhat is the description of this picture?
```
After computing the embedding and projection, and replacing the image part tokens, the entire computation process to output is no different from the LLM part.
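A minimal sketch of that replacement step, with hypothetical tensor names (the repository performs the equivalent splice inside its model code):

```python
import torch

def merge_image_features(text_embeds, vision_tokens, placeholder_mask):
    """
    text_embeds:      [batch, seq_len, dim]   embeddings of the prompt, placeholder included
    vision_tokens:    [batch, 50, dim]        projected CLIP features for one image
    placeholder_mask: [batch, seq_len] (bool) True at the 50 placeholder positions
    """
    merged = text_embeds.clone()
    # Boolean indexing visits positions row by row, so the 50 projected tokens of each
    # sample land exactly on that sample's 50 placeholder positions.
    merged[placeholder_mask] = vision_tokens.reshape(-1, vision_tokens.size(-1))
    return merged
```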
Multi-image processing is implemented by injecting multiple `<image>` placeholders; no framework modifications are needed.
ps: The only point worth noting is that if different conversations contain different numbers of images during training, you need to pad the shorter feature tensors with empty features (corresponding to line 267 of the dataset) so that the dataloader can read them at a uniform size.
pps: This is not done in the prompt itself; placeholders are still injected only according to the number of images actually inserted, so the input features given to the LLM are not affected by the padded features.
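A sketch of that padding step under assumed shapes (per-sample features of shape [num_images, 50, 768]); the real dataset code may differ:

```python
import torch

def pad_image_features(features: torch.Tensor, max_images: int) -> torch.Tensor:
    """Pad a sample's image features with zero ("empty") features up to max_images slots."""
    num_images, num_tokens, dim = features.shape
    if num_images < max_images:
        padding = torch.zeros(max_images - num_images, num_tokens, dim, dtype=features.dtype)
        features = torch.cat([features, padding], dim=0)
    return features  # [max_images, 50, 768]

# e.g. a conversation with a single image, padded to a two-image batch layout
padded = pad_image_features(torch.randn(1, 50, 768), max_images=2)
print(padded.shape)  # torch.Size([2, 50, 768])
```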
Considerations for Implementing Video Understanding Capabilities
For video understanding in multimodal large models, a feasible approach is to follow the existing Python example for video understanding in MiniCPM-V 2.6: extract key frames from the video and then run multi-image inference. So if you want to add video understanding to MiniMind-V, you can build on the existing multi-image training, refer to the key-frame extraction in the script below, and increase the number of images supported in the training files. The larger MAX_NUM_FRAMES is, the higher the memory consumption.
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
from decord import VideoReader, cpu  # pip install decord

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
                                  attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

MAX_NUM_FRAMES = 64  # if cuda OOM set a smaller number


def encode_video(video_path):
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # FPS
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames


video_path = "video_test.mp4"
frames = encode_video(video_path)
question = "Describe the video"
msgs = [
    {'role': 'user', 'content': frames + [question]},
]

# Set decode params for video
params = {}
params["use_image_id"] = False
params["max_slice_nums"] = 2  # If cuda OOM and video resolution is greater than 448*448, set to 1

answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    **params
)
print(answer)
```
With this, all the details of MiniMind-V have been presented.
The implementation of MiniMind-V did not reference any third-party code; it is based on MiniMind with minimal modifications, hence the code implementation is significantly different from models like LlaVA. The core changes between MiniMind-V and MiniMind do not exceed 100 lines, making it easy to get started.
Source: Chinese-LLaVA-Vision. It contains approximately 600,000 pre-training images and fewer than 100,000 instruction fine-tuning images, derived from CC-3M and COCO 2014, with the Q&A content translated for better Chinese support. The dataset has been further resized, organized, and compressed.
Pre-training dataset format:

```json
{
  "id": "GCC_train_000644518",
  "image": "GCC_train_000644518.jpg",
  "conversations": [
    {
      "from": "human",
      "value": "Write a brief but informative image summary.\n<image>"
    },
    {
      "from": "gpt",
      "value": "Adding saltwater to a metal pot with a black background, in slow motion fps"
    }
  ]
}
```
Instruction fine-tuning dataset format:

```json
{
  "id": "000000334872",
  "image": "000000334872.jpg",
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nAre people in the photo going downhill skiing or cross-country skiing?"
    },
    {
      "from": "gpt",
      "value": "People in the photo are cross-country skiing in the forest because they are on a trail rather than on a steep slope."
    }
  ]
}
```
Multi-image instruction fine-tuning dataset format:

```json
{
  "id": "000000000000",
  "image": "0.jpg, 1.jpg",
  "conversations": [
    {
      "from": "human",
      "value": "context: Source Image: <image> Target Image: <image> Instruction: What is the correct image edit instruction that can transform the source image to the target image?"
    },
    {
      "from": "gpt",
      "value": "take the people out of the back in the photo. Remove the two people behind the woman in the white dress and the man in the blue suit. remove people behind the couple in the centre"
    }
  ]
}
```

```json
{
  "id": "000000013633",
  "image": "27266.jpg, 27267.jpg",
  "conversations": [
    {
      "from": "human",
      "value": "<image>\n<image>\nName all the differences between these two birds."
    },
    {
      "from": "gpt",
      "value": "animal1 is brown with white tuft while animal2 is white with dark brown wings"
    }
  ]
}
```
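For illustration, here is a small sketch that parses one record in the format above and checks that the number of `<image>` placeholders in the human turn matches the number of listed images; the record is copied from the sample shown above, and the check itself is an assumption about how the data is meant to be consumed:

```python
import json

# Hypothetical in-memory record, copied from the multi-image sample above.
record = json.loads("""
{
  "id": "000000013633",
  "image": "27266.jpg, 27267.jpg",
  "conversations": [
    {"from": "human", "value": "<image>\\n<image>\\nName all the differences between these two birds."},
    {"from": "gpt", "value": "animal1 is brown with white tuft while animal2 is white with dark brown wings"}
  ]
}
""")

images = [name.strip() for name in record["image"].split(",")]
human_turn = record["conversations"][0]["value"]
num_placeholders = human_turn.count("<image>")

# One <image> placeholder per listed image is expected for multi-image samples.
assert num_placeholders == len(images), (num_placeholders, len(images))
print(images, num_placeholders)
```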
Notes:
- For instruction fine-tuning, only one round of conversation is retained, training a single-turn dialogue model, to prevent the small model's performance from being dragged down by long texts.
- The multi-image dataset is relatively small, contains English dialogues, and only covers two-image comparison scenarios, so the fine-tuning effect is limited; it is provided only as a reference approach.
Final dataset download links: Baidu Netdisk | HuggingFace. Multi-image dataset (multi_image_dataset): HuggingFace.
Pre-training learns general knowledge about images, such as what a deer or a dog is, from the 595K-sample dataset.
Instruction fine-tuning learns the real question-and-answer format for asking about images from the 230K-sample real dialogue dataset.
Two datasets are provided for multi-image fine-tuning: an image-transformation dataset and a bird-comparison dataset, with 3.5K and 13.6K samples respectively, in real Q&A format.
`1-pretrain_vlm.py` executes pre-training, yielding `*_vlm_pretrain.pth` as the pre-training output weights.
`2-sft_vlm.py` performs instruction fine-tuning, yielding `*_vlm_sft.pth` as the instruction fine-tuning output weights.
During training, the visual encoder, which is the CLIP model, is frozen, and only the Projection and LLM parts are fine-tuned.
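A minimal sketch of that freezing strategy (an illustration, not the training scripts' exact code; `nn.Transformer` stands in for the MiniMind LLM, and the CLIP path is assumed from the setup steps):

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

vision_encoder = CLIPVisionModel.from_pretrained("./model/clip_model/clip-vit-base-patch32")
projection = nn.Linear(768, 512, bias=False)   # cross-modal projection
llm = nn.Transformer(d_model=512, nhead=8)     # placeholder for the MiniMind language model

# Freeze the CLIP visual encoder; only the Projection and LLM parts receive gradients.
for p in vision_encoder.parameters():
    p.requires_grad = False

trainable_params = list(projection.parameters()) + list(llm.parameters())
optimizer = torch.optim.AdamW(trainable_params, lr=1e-4)
print(sum(p.numel() for p in trainable_params), "trainable parameters")
```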
Training time and loss reference charts: Pretrain 512+8 model, Pretrain 768+16 model, SFT 512+8 model, SFT 768+16 model.
(`.pth` weight files) Download link: Baidu Netdisk

(`transformers` model files) Download link: HuggingFace

Note: The HuggingFace versions are all post-instruction-fine-tuned MiniMind-V models
Model Name | params | Config | file_name |
---|---|---|---|
minimind-v-v1-small | 27M | d_model=512<br/>n_layers=8 | Pre-trained: 512_vlm_pretrain.pth<br/>Fine-tuned: 512_vlm_sft.pth |
minimind-v-v1 | 109M | d_model=768<br/>n_layers=16 | Pre-trained: 768_vlm_pretrain.pth<br/>Fine-tuned: 768_vlm_sft.pth |
Image1 | Image2 | 512_sft_multi |
---|---|---|
(image) | (image) | animal1 has a brown and black head with a black and white striped head. animal2 has a black head with a white stripe on its wings. |
```bash
python web_server.py
```
Based on the provided table data, the performance of the four models can be summarized as follows:
- 512_pretrain:
  - Brief and inaccurate descriptions: most descriptions fail to clearly convey the image content and often provide unrelated narratives. For example, the starfish image is described as "fossils in water", which is far off the mark.
  - Lack of detail: in most cases only simple, vague descriptions are given, failing to delve into the details or context of the image. For instance, for the tiger image it simply says "looking at the camera in the water".
- 512_sft:
  - More specific descriptions: compared to 512_pretrain, 512_sft provides more detailed explanations and attempts to capture specific elements of the scene. For example, when describing the woman image, it mentions "suit" and "tie", giving a clearer depiction.
  - Occasional errors or redundancy: some descriptions are overly complex or even irrelevant to the image, such as mentioning seagulls, nesting, etc. in the dolphin image, which are unrelated.
- 768_pretrain:
  - Incoherent information: this model's output is quite scattered, with descriptions often vague and incomplete. For example, in describing the woman image it only mentions "a human-made actor's adventure movie", without clearly explaining the image content.
  - Partially accurate but with little overall information: some descriptions, although relevant to the image, are very brief. For example, the starfish description only states "starfish and tentacles", lacking a full sense of the scene.
- 768_sft:
  - Comprehensive and specific descriptions: this model's descriptions are the most detailed and precise of the four. For instance, when describing the bear image, it mentions "standing in an open field of grass, surrounded by trees and bushes, with a backpack", capturing multiple elements of the image.
  - Stronger comprehension: this model can identify the scene and context of the image and provide reasonable interpretations and speculations. For example, describing a "family gathering" or "celebration" gives the image a more contextual connection.
- 512_pretrain performs the worst, with simple and inaccurate descriptions.
- 512_sft has improved detail in descriptions but occasionally includes irrelevant information.
- 768_pretrain has poor coherence in information, yet provides basic descriptions in some aspects.
- 768_sft performs the best, offering detailed, accurate descriptions, and is able to make good context-based inferences.
- Visual signals are a special kind of foreign language for LLMs, so the ability to "learn foreign languages" largely depends on the capabilities of the LLM.
- The stronger the performance of the LLM, the stronger the corresponding VLM will be, and the performance gain will be significant.
- Areas for improvement:
- The simple projection used here for cross-modal feature alignment loses more performance than approaches such as Cross-Attention.
- Larger, more powerful CLIP-large models could be tried, using more fine-grained token representations of image features; the current representation is quite coarse.
- The resolution is not high, theoretically only 224×224 (minimind-v dataset is set to 128×128 to save space).
- ...
Tip
If you find MiniMind-V helpful, please add a ⭐ on GitHub.
Given the document's length and the author's limited proficiency, there may be oversights; feel free to discuss corrections in Issues or submit PRs to improve the project.
Your support is the driving force behind the project's continuous improvement!
@xinyanghuang7: 🔗 Implemented the complete multi-image branch
Reference Links & Thanks to the following excellent papers or projects
- In no particular order
- LlaVA
- LlaVA-VL
- Chinese-LLaVA-Vision-Instructions
This repository is licensed under the Apache-2.0 License.