DocVQA #3
Hi, great work and thanks for sharing the code and weights!
I tried the OCR on your sample and it works well. However, may I ask how we could perform VQA? Something similar to your paper example: "<VQA> What's the title?". Could you perhaps give us a snippet for this please?
Hi, @MichaelHypS. Thanks for your attention. I am very sorry that, due to limited computation resources, I have no plans to re-train the downstream VQA model, which was trained with 8 V100 (32G) GPUs. Here, I provide some instructions on how to fine-tune the VQA model.
Another tricky part of dataset preparation is writing code to obtain the answer bounding boxes with heuristic rules based on 1) the ground-truth answers and 2) the document OCR results, which may require some effort.
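A rough sketch of what such a heuristic could look like (only an illustration of the idea, not the actual preprocessing code; the normalization and matching rules here are placeholders you would tune yourself):

```python
# Hypothetical sketch: match a ground-truth answer against OCR words to recover its bounding box.
# OCR results are assumed to be a list of (word, (x0, y0, x1, y1)) tuples; real data may need
# fuzzy matching and handling of multi-word OCR entries.
import re

def normalize(text):
    # Lowercase and drop punctuation so minor OCR/answer differences still match.
    return re.sub(r'[^a-z0-9 ]', '', text.lower()).split()

def find_answer_bbox(answer, ocr_words):
    answer_tokens = normalize(answer)
    if not answer_tokens:
        return None, None
    doc_tokens = [(normalize(w) or [''])[0] for w, _ in ocr_words]
    n = len(answer_tokens)
    for i in range(len(doc_tokens) - n + 1):
        if doc_tokens[i:i + n] == answer_tokens:
            boxes = [box for _, box in ocr_words[i:i + n]]
            # Union of the matched words' boxes
            bbox = (min(b[0] for b in boxes), min(b[1] for b in boxes),
                    max(b[2] for b in boxes), max(b[3] for b in boxes))
            return bbox, (i, i + n)
    return None, None  # no exact match; a fuzzy pass would be the next step

# Toy example
ocr = [('Total:', (10, 10, 60, 22)), ('$12.50', (65, 10, 120, 22))]
print(find_answer_bbox('$12.50', ocr))
```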
@Veason-silverbullet Thanks for the answer! Funny, finding the answer bounding box seemed to me the part I was least afraid of :) But I can imagine it will take me more time than anticipated. Anyway, I decided to give it a shot and create my own VQA preprocessing script. As I like to understand the basics before adding any extra complexity, may I ask if the following seems alright to you so far? I started by creating a small json file with two questions from your GPT-4 example, such as: `[
]` Then, I tried to mimic your preprocessing script by concatenating the questions and their respective answers (word) after adding the special token for "<VQA>" (token: 50267). This led to the following:
and
Once this is done, I simply flatten each question-answer pair by adding to each the <start_token_id> (2), the <locate_token_id> (50265) and the <eos> (2), resulting in:
or in other words:
Does this seem correct to you?
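In code, the flattening I have in mind is roughly the following (just a sketch to illustrate my understanding; the ordering of <locate> relative to the answer is my guess from this thread, not necessarily your actual format):

```python
# Hypothetical sketch of flattening one question-answer pair into a fine-tuning sequence.
# The token IDs are the ones mentioned in this thread; the ordering is a guess, not the official format.
DECODER_START_TOKEN_ID = 2   # <start>
VQA_TOKEN_ID = 50267         # <VQA>
LOCATE_TOKEN_ID = 50265      # <locate>
EOS_TOKEN_ID = 2             # <eos> (shares ID 2 in this tokenizer)

def build_vqa_sequence(question_ids, answer_ids):
    """Flatten one QA pair as: <start> <VQA> question <locate> answer <eos>."""
    return ([DECODER_START_TOKEN_ID, VQA_TOKEN_ID]
            + list(question_ids)
            + [LOCATE_TOKEN_ID]
            + list(answer_ids)
            + [EOS_TOKEN_ID])

# Toy example with made-up word-piece IDs
print(build_vqa_sequence([100, 101, 102], [200]))
# -> [2, 50267, 100, 101, 102, 50265, 200, 2]
```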
Finally, perhaps a bit of a silly question, but I saw that you have a "ViTLPForDocVQA" model. Should I use this one instead of "ViTLPForPreTraining" in your finetune.py script? Could you perhaps say something about their differences please?
@MichaelHypS I appreciate you checking the code carefully. TL;DR: your understanding is basically right. In my experience, any format of VQA fine-tuning sequence is fine as long as the fine-tuning and inference sequence patterns are consistent. For each of your questions, my suggestions are as follows:
I really appreciate your effort in checking the code. I have decided to rewrite the VQA data preprocessing and fine-tuning code. However, since I am swamped on weekdays, I plan to do this over the next two weekends. Please stay tuned.
@Veason-silverbullet I also appreciate your answers :) Thanks a lot! So I made the changes to follow your sequences. I now understand better why the first DECODER_START_TOKEN_ID = 2, thanks. I am currently looking into your "DocVQATrainDataset" and have one small question about it: is the <locate_token_id> part of the labels within the "qa_span"? To illustrate the question, let's take my former example again (with your edits). I have:
My current understanding from looking at your code is that you construct a "qa_span" that would resemble the following:
Here we would set the question tokens to the normal "word_token_type" and the answer tokens to the "answer_span_type", similar to the OCR training where the bboxes also have a special "localization_token_type". But I'm not sure whether the <locate_token_id> should be part of it or not. My intuition would actually have been to put the entire answer, all the way to the <eos> token, within the label, such as:
But then this would be inconsistent with the OCR training script. I am asking because I have created a "qa_span" array for simplicity. But looking at your code, it seems that we could perhaps use your "token_type" array to encode every task together, such that my final array could resemble
if we keep <locate_token_type> = 2 and set a new "<answer_token_type>" = 3, for example. Since you're swamped, once I have something running I could also share my code. I made quite a few edits to yours, such as using a yaml file directly instead of argparse, so it's not entirely straightforward to use within your implementation.
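To make the idea concrete, here is a sketch of how I would build such a combined token_type array (the type codes 0/1/2/3 are only my proposal, not the repository's actual convention):

```python
# Hypothetical per-token type array for the sequence <start> <VQA> question <locate> answer <eos>.
# The type codes below are only a proposal for illustration:
WORD_TYPE = 0      # question / ordinary word tokens
SPECIAL_TYPE = 1   # <start>, <VQA>, <eos>
LOCATE_TYPE = 2    # <locate> token (as in the OCR pre-training)
ANSWER_TYPE = 3    # proposed new type for the answer span

def build_token_types(num_question_tokens, num_answer_tokens):
    return ([SPECIAL_TYPE, SPECIAL_TYPE]          # <start>, <VQA>
            + [WORD_TYPE] * num_question_tokens   # question words
            + [LOCATE_TYPE]                       # <locate>
            + [ANSWER_TYPE] * num_answer_tokens   # answer span
            + [SPECIAL_TYPE])                     # <eos>

print(build_token_types(3, 1))
# -> [1, 1, 0, 0, 0, 2, 3, 1]
```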
@MichaelHypS I have prepared the DocVQA fine-tuning code at https://github.com/Veason-silverbullet/ViTLP/tree/main/finetuning.
After fine-tuning, you may need to prepare the inference code, or I will provide it next weekend.
Thanks a lot for the scripts! I see a couple of differences from what I tried. Most notably, you are not concatenating all the QAs together, and you also split the answer per bbox and token.
Hi @Veason-silverbullet, thank you for sharing your work! It's a really cool model.
@MichaelHypS @SongDoHou, the DocVQA inference code is updated at https://github.com/Veason-silverbullet/ViTLP/blob/main/finetuning/inference_docvqa.py. @SongDoHou Due to company policy, only the base model can be open-sourced. I have no plans to share the DocVQA model's weights for the time being, as I am struggling with my work and have no GPU resources to fine-tune the DocVQA model at the university. Nevertheless, I may rent GPUs to do it next month if I am free at that time.
Thanks a lot! It's nice to be able to compare your code with what I did. I do have yet another question. I tried to train my model (with the 2 extra tokens added to the dictionary size, as you mentioned earlier) as well as your pipeline on my small dummy dataset to see if I can simply overfit it (2 images with 2 questions each), running 1000 iterations. This is a simple sanity check to make sure that everything learns correctly. I noticed that both codes lead to an lm_loss starting at around 11 that then goes down and hovers around 2.6. I also tried to look at the generated tokens while training, but honestly I can't tell much, apart from the fact that both models seem to generate only locate_id tokens...
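For reference, the kind of check I have in mind looks roughly like this (a quick sketch, not the actual training loop; the locate token ID and tensor shapes are assumptions):

```python
# Quick sanity check (sketch): what fraction of the supervised positions does the model
# greedily predict as the <locate> token?
import torch

LOCATE_TOKEN_ID = 50265

def locate_token_ratio(logits, label_mask):
    # logits: [batch, seq_len, vocab_size]; label_mask: bool tensor marking supervised positions
    preds = logits.argmax(dim=-1)
    supervised = preds[label_mask]
    return (supervised == LOCATE_TOKEN_ID).float().mean().item()

# Toy example with random logits
logits = torch.randn(1, 8, 50270)
mask = torch.ones(1, 8, dtype=torch.bool)
print(f"locate-token ratio: {locate_token_ratio(logits, mask):.2f}")
```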
@MichaelHypS Thanks for your attention. I will provide the training logs and checkpoints for your reference next weekend.
@MichaelHypS I've tested the following DocVQA fine-tuning script:

```bash
# Step 1: Clone/pull the latest code (updated on 01/09/2024)
git clone https://github.com/Veason-silverbullet/ViTLP.git
cd ViTLP
mkdir -p ckpts/ViTLP-medium
git clone https://huggingface.co/veason/ViTLP-medium ckpts/ViTLP-medium

# Step 2: Manually download DocVQA document images from https://rrc.cvc.uab.es/?ch=17&com=downloads
cd finetuning
# Download and extract DocVQA document images into ./DocVQA/documents from https://rrc.cvc.uab.es/?ch=17&com=downloads
ls ./DocVQA  # The `documents` should be located at `./DocVQA`
# bboxes-train-80.npy  images.txt  qa_span_types-train-80.npy  token_types-train-80.npy  train-mapping.txt  train_v1.0_withQT.json
# documents  link.py  test_v1.0.json  tokens-train-80.npy  train-metadata.json  val_v1.0_withQT.json

# Step 3: Fine-tune on DocVQA
# Effective batch size = num_nodes * num_gpus * batch_size * gradient_accumulation_steps
# Make the effective batch size 128 by setting `gradient_accumulation_steps` in `./misc/zero1_fp16-grad_acc-16.json` according to your computation resources.
# Since I only have 4 Nvidia-3090 (24G), I have to set gradient_accumulation_steps = 16.
nohup deepspeed --num_nodes 1 --num_gpus 4 finetune_docvqa.py --batch_size=2 --deepspeed_config=misc/zero1_fp16-grad_acc-16.json --output_dir=DocVQA-outputs > nohup.out 2>&1 &
```

Since I only have 4 Nvidia-3090 (24G) at hand, the fine-tuning takes ~6 days. I can only release the full training logs and checkpoints next week. Ideally, if 8 A100 GPUs are available, the fine-tuning can be done in hours with a correspondingly larger batch size and smaller `gradient_accumulation_steps`.
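For reference, the effective-batch-size arithmetic above works out like this (a small illustrative helper; the A100 batch size is hypothetical):

```python
# Sketch of the effective-batch-size arithmetic:
#   effective_bs = num_nodes * num_gpus * batch_size * gradient_accumulation_steps
def grad_acc_steps(target_effective_bs, num_nodes, num_gpus, per_gpu_batch_size):
    world_batch = num_nodes * num_gpus * per_gpu_batch_size
    assert target_effective_bs % world_batch == 0, "effective batch size must be divisible"
    return target_effective_bs // world_batch

# 1 node x 4 RTX 3090, batch_size=2  -> gradient_accumulation_steps = 16 (the setting above)
print(grad_acc_steps(128, 1, 4, 2))   # 16
# 1 node x 8 A100, batch_size=16 (hypothetical) -> gradient_accumulation_steps = 1
print(grad_acc_steps(128, 1, 8, 16))  # 1
```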
Thanks a lot for your continuous support, really appreciated. I will try to run your script as a sanity check, and then mine. My setup is rather similar to yours, so it may take a bit of time until you hear from me as well :)
@MichaelHypS The DocVQA checkpoint is available at https://drive.google.com/drive/folders/1zZNw76DQTBPBv4Uuw-Bvuba_poYqc8ZK?usp=drive_link. Please feel free to give it a shot. Also, we have some important updates; please pull the latest commit (10/09/24). The updates include:
@MichaelHypS, as requested, the training loss curve is below:
Thanks a lot for the amazing work!