DocVQA #3 (Open)
MichaelHypS opened this issue Aug 6, 2024 · 15 comments

MichaelHypS commented Aug 6, 2024

Hi, great work and thanks for sharing the code and weights!

I tried the OCR on your sample and it works well. However, may I ask how we could go about performing VQA, something similar to your paper's example "<VQA> What's the title?"? Could you perhaps give us a snippet for this?

@Veason-silverbullet (Owner)

Hi, @MichaelHypS. Thanks for your attention. I am very sorry that, due to computational resource limits, I have no plans to re-train the downstream VQA model, which was trained with 8 V100 (32G) GPUs. Here, I provide some instructions on how to fine-tune the VQA model.

Another tricky part of dataset preparation is writing code to obtain the answer bounding boxes with heuristic rules based on 1) the ground-truth answers and 2) the document OCR results, which may take some effort.
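As a rough illustration, a heuristic along these lines could look like the following (just a sketch with hypothetical data structures, not the exact code used to build the released data):

# Sketch of a heuristic answer-to-bbox linker: find the OCR word span that
# matches the ground-truth answer and merge its word boxes into one box.
def link_answer_to_bbox(answer, ocr_words, ocr_bboxes):
    # ocr_words: recognized words; ocr_bboxes: matching [x0, y0, x1, y1] boxes
    answer_tokens = answer.lower().split()
    n = len(answer_tokens)
    for i in range(len(ocr_words) - n + 1):
        window = [w.lower().strip('.,:;*') for w in ocr_words[i:i + n]]
        if window == answer_tokens:
            boxes = ocr_bboxes[i:i + n]
            # Merge the matched word boxes into a single answer bounding box
            return [min(b[0] for b in boxes), min(b[1] for b in boxes),
                    max(b[2] for b in boxes), max(b[3] for b in boxes)]
    return None  # unmatched QA pairs may need manual handling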

MichaelHypS (Author) commented Aug 8, 2024

@Veason-silverbullet Thanks for the answer! Funny, finding the answer bounding boxes seemed like the part I was least afraid of :) But I can imagine it will take me more time than anticipated.

Anyway, I decided to give it a shot and write my own VQA preprocessing script. As I like to understand the basics before adding any extra complexity, may I ask if the following seems alright to you so far?

I started by creating a small JSON file with two questions from your GPT-4 example:

[
    {
        "question": "<VQA> What's the title?",
        "word": "GPT-4 Technical Report",
        "bbox": [328, 49, 681, 81]
    },
    {
        "question": "<VQA> Who is the author(s)?",
        "word": "OpenAI*",
        "bbox": [470, 143, 545, 161]
    }
]

Then, I tried to mimic your preprocessing script by concatenating each question and its respective answer (word) after adding the special token for "<VQA>" (token ID 50267). This led to the following:

"<VQA> What's the title? GPT-4 Technical Report" -> [50267, 653, 18, 5, 1270, 116, 272, 10311, 12, 306, 12920, 2872]

and

"<VQA> Who is the author(s)? OpenAI*" -> [50267, 3394, 16, 5, 2730, 1640, 29, 26610, 2117, 15238, 3226]

Once this is done, I simply flatten the question-answer pairs by adding, for each, the <start_token_id> (2), the <locate_token_id> (50265), and the <eos> (2), resulting in:

[2, 50267, 653, ... ,2872, 50265, 2, 2, 50267, 3394, 16, ... , 15238, 3226, 50265, 2, 1, 1, ... , 1]

or in other words:

[<start> <sequence0> <locate> <eos> <start> <sequence1> <locate> <eos> <padding>]
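To make the construction explicit, this is roughly how I build the flattened sequence (a sketch; the special-token IDs are hard-coded from my setup):

# Sketch of the flattening step described above
# (<start>/<eos> = 2, <locate> = 50265, <pad> = 1 in my setup).
START_ID, EOS_ID, LOCATE_ID, PAD_ID = 2, 2, 50265, 1

def flatten_qa(qa_token_ids, max_len):
    # qa_token_ids: one token-id list per concatenated question + answer
    sequence = []
    for token_ids in qa_token_ids:
        sequence += [START_ID] + token_ids + [LOCATE_ID, EOS_ID]
    sequence += [PAD_ID] * (max_len - len(sequence))
    return sequence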

Does this seem correct to you? Specifically:

  • is the sequence: [<locate> <eos> <start>] correct? I feel funny about the [2, 2]
  • is the bbox put only once, at the <locate_token_id> position, together with a "locate_token_type"?
  • what would you recommend if, for example, the answer spans two lines, i.e. the end of one line and the beginning of the next?

Finally, perhaps a bit of a silly question, but I saw that you have a "ViTLPForDocVQA" model. Should I use this one instead of "ViTLPForPreTraining" in your finetune.py script? Could you perhaps say something about their differences?

Veason-silverbullet (Owner) commented Aug 8, 2024

@MichaelHypS I appreciate you checking the code carefully. TL;DR: your understanding is basically right. In my experience, any VQA fine-tuning sequence format is OK as long as the fine-tuning and inference sequence patterns are consistent.

For each of your questions, here are my suggestions:

  • is the sequence: [<locate> <eos> <start>] correct? I feel funny about the [2, 2]

    My implementation is [<decoder_start_token_id> <question_sequence> <eos> <answer_sequence> <locate> <eos> <padding>] (see the sketch after this list). Of course, the implementation you mentioned is OK too; you just need to keep the format consistent at inference.

    The bos_token (DECODER_START_TOKEN_ID = 2) is a legacy inherited from the BART/T5 BPE tokenizer.

  • is the bbox put only once, at the <locate_token_id> position, together with a "locate_token_type"?

    Yes, <locate_token_id> and <locate_token_type> appear at the same position; this is used for computing the pre-training loss.

  • what would you recommend if, for example, the answer spans two lines, i.e. the end of one line and the beginning of the next?

    Thanks for your accurate understanding and question. For such cases, I treat the whole area of these two lines as the answer bounding box.

  • I saw that you have a "ViTLPForDocVQA" model. Should I use this one instead of "ViTLPForPreTraining" in your finetune.py script?

    Yes, use ViTLPForDocVQA instead of ViTLPForPreTraining. Also, please refer to the data loader at https://github.com/Veason-silverbullet/ViTLP/blob/main/dataset/docvqa.py.
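As a sketch of the sequence layout from the first point above (a hypothetical helper; the question and answer are assumed to be already tokenized, with pad = 1 as in the example earlier):

# Target layout: [<decoder_start> <question> <eos> <answer> <locate> <eos> <padding>]
DECODER_START_TOKEN_ID, EOS_TOKEN_ID, LOCATE_TOKEN_ID, PAD_TOKEN_ID = 2, 2, 50265, 1

def build_vqa_target(question_ids, answer_ids, max_len):
    target = ([DECODER_START_TOKEN_ID] + question_ids + [EOS_TOKEN_ID]
              + answer_ids + [LOCATE_TOKEN_ID, EOS_TOKEN_ID])
    return target + [PAD_TOKEN_ID] * (max_len - len(target))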

I really appreciate your effort in checking the code. I have decided to rewrite the VQA data pre-processing and fine-tuning code. However, since I am swamped on weekdays, I plan to do this over the next two weekends. Please stay tuned.

MichaelHypS (Author) commented Aug 12, 2024

@Veason-silverbullet I also appreciate your answers :) Thanks a lot!

So I made the changes to follow your sequence format. I now understand better why the first token is DECODER_START_TOKEN_ID = 2, thanks. I am now looking into your "DocVQATrainDataset" and have one small question about it: is the <locate_token_id> part of the labels within the "qa_span"? To illustrate the question, let's again take my former example (with your edits). I have:

"<VQA> What's the title? GPT-4 Technical Report" -> [50267, 653, 18, 5, 1270, 116] + [2] + [272, 10311, 12, 306, 12920, 2872] + [50265] + [2]

My current understanding from looking at your code is that you construct a "qa_span" that would resemble the following:

[1, ..., 1] + [0] + [2, ...,2] + [0] + [0]

Here the question tokens get the normal "word_token_type" and the answer tokens get the "answer_span_type", similar to the OCR training where the bboxes also have a special "localization_token_type". But I'm not sure whether the <locate_token_id> should be part of it or not. My intuition would actually have been to include the entire answer, all the way to the <eos> token, in the label, such as:

[1, ..., 1] + [0] + [2, ...,2] + [2] + [2]

But then this would be inconsistent with the OCR training script. I am asking because I created a separate "qa_span" array for simplicity, but looking at your code, it seems that we could perhaps use your "token_type" array to encode every task together, such that my final array could resemble

[1, ..., 1] + [0] + [3, ...,3] + [2] + [0]

if we keep <locate_token_type> = 2 and, for example, set a new <answer_token_type> = 3.
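To make this concrete, a minimal sketch of what I have in mind (the type IDs and names here are my own hypothetical scheme, not necessarily matching yours):

# Combined token_type array for one QA target:
# 1 = word (question), 0 = special, 2 = locate, 3 = answer span.
WORD_TYPE, SPECIAL_TYPE, LOCATE_TYPE, ANSWER_TYPE = 1, 0, 2, 3

def build_token_types(num_question_tokens, num_answer_tokens):
    return ([WORD_TYPE] * num_question_tokens    # question tokens
            + [SPECIAL_TYPE]                     # <eos> after the question
            + [ANSWER_TYPE] * num_answer_tokens  # answer tokens
            + [LOCATE_TYPE]                      # <locate> carrying the bbox
            + [SPECIAL_TYPE])                    # final <eos>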

Since you're swamped, once I have something running I could also share my code. I have made quite a few edits to yours, though, such as using a YAML file directly instead of argparse, so it's not entirely straightforward to drop into your implementation.
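For reference, the YAML change is essentially just something like this (with a hypothetical config.yaml holding the former argparse flags):

import argparse
import yaml  # requires PyYAML

# Hypothetical example: load hyperparameters from config.yaml instead of argparse
with open('config.yaml') as f:
    config = yaml.safe_load(f)        # e.g. {'batch_size': 2, 'output_dir': 'DocVQA-outputs'}
args = argparse.Namespace(**config)   # keep the rest of the training code unchanged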

Veason-silverbullet (Owner) commented Aug 18, 2024

@MichaelHypS I have prepared the DocVQA fine-tuning code at https://github.com/Veason-silverbullet/ViTLP/tree/main/finetuning.

After fine-tuning, you may need to prepare the inference code, or I will provide it next weekend.

@MichaelHypS (Author)

Thanks a lot for the scripts! I see a couple of differences from what I tried, most notably that you do not concatenate all the QA pairs together and that you split the answer per bbox and token.

@SongDoHou

Hi @Veason-silverbullet, thank you for sharing your work! It's a really cool model.
Do you have any plans to share the DocVQA model's weight file?
I look forward to the inference code too! :)

Veason-silverbullet (Owner) commented Aug 26, 2024

@MichaelHypS @SongDoHou , the DocVQA inference code is updated at https://github.com/Veason-silverbullet/ViTLP/blob/main/finetuning/inference_docvqa.py.

@SongDoHou Due to the company's policy, only the base model can be open-sourced. I have no plans to share the DocVQA model's weights for the time being, as I am struggling with my work and have no GPU resources to fine-tune the DocVQA model at the university. Nevertheless, I may rent GPUs to do it next month if I am free at that time.

MichaelHypS (Author) commented Aug 28, 2024

Thanks a lot! It's nice to be able to compare your code with what I did.
I have a very small note on your code at line 196. I would suggest changing it to:
bboxes = bboxes.tolist() if bboxes is not None else []
This way, if the first generated token is the EOS, the code doesn't break, which actually happened to me with my small dummy training set.
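In context, the guarded post-processing would look roughly like this (a sketch; the function name is hypothetical):

def postprocess_bboxes(bboxes):
    # If generation stops at EOS right away, no <locate> token is produced and
    # `bboxes` can be None; return an empty list instead of crashing on .tolist().
    return bboxes.tolist() if bboxes is not None else []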

I do have yet another question. I tried to train my model (with the 2 extra tokens added to the dictionary size, as you mentioned earlier) as well as your pipeline on my small dummy dataset to see if I can simply overfit it (2 images with 2 questions each) for 1000 iterations. This is a simple sanity check to make sure that everything learns correctly. I noticed that both codes lead to an lm_loss that starts at around 11, then goes down and hovers around 2.6. I also tried to look at the generated tokens while training, but honestly I can't tell much apart from the fact that both models seem to generate only locate_id tokens...
Anyway, after training, my pipeline generates a single locate_token followed by the EOS, while yours generates a single letter and then also the locate_token followed by the EOS. Therefore I am not that confident that the model has learned correctly. Do you have any input on this behavior and on how I could make sure that my training works?

@Veason-silverbullet (Owner)

@MichaelHypS Thanks for your attention. I will provide the training logs and checkpoints for your reference next weekend.

Veason-silverbullet (Owner) commented Sep 2, 2024

@MichaelHypS I've tested the following DocVQA fine-tuning script:

# Step 1: Clone/pull the latest code (updated on 01/09/2024)
git clone https://github.com/Veason-silverbullet/ViTLP.git
cd ViTLP
mkdir -p ckpts/ViTLP-medium
git clone https://huggingface.co/veason/ViTLP-medium ckpts/ViTLP-medium


# Step 2: Manually download DocVQA document images from https://rrc.cvc.uab.es/?ch=17&com=downloads
cd finetuning
# Download and extract DocVQA document images into ./DocVQA/documents from https://rrc.cvc.uab.es/?ch=17&com=downloads
ls ./DocVQA
# The `documents` should be located at `./DocVQA`
# bboxes-train-80.npy  images.txt  qa_span_types-train-80.npy  token_types-train-80.npy  train-mapping.txt    train_v1.0_withQT.json
# documents            link.py     test_v1.0.json              tokens-train-80.npy       train-metadata.json  val_v1.0_withQT.json


# Step 3: Fine-tuning DocVQA
# Effective batch size = num_nodes * num_gpus * batch_size * gradient_accumulation_steps
# To make the effective batch size 128, set `gradient_accumulation_steps` in `./misc/zero1_fp16-grad_acc-16.json` depending on your computational resources
# Since I only have 4 Nvidia-3090 (24G), I have to set gradient_accumulation_steps = 16.
nohup deepspeed --num_nodes 1 --num_gpus 4 finetune_docvqa.py --batch_size=2 --deepspeed_config=misc/zero1_fp16-grad_acc-16.json --output_dir=DocVQA-outputs > nohup.out 2>&1 &

Since I only have 4 Nvidia 3090s (24G) at hand, the fine-tuning takes ~6 days. I can only release the full training logs and checkpoints next week. Ideally, if 8 A100s were available, the fine-tuning could be done in hours with gradient_accumulation_steps = 1.
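For reference, the effective batch size for the setup above works out as:

# Effective batch size = num_nodes * num_gpus * batch_size * gradient_accumulation_steps
num_nodes, num_gpus, batch_size, grad_acc_steps = 1, 4, 2, 16
print(num_nodes * num_gpus * batch_size * grad_acc_steps)  # 128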

@MichaelHypS (Author)

Thanks a lot for your continuous support, really appreciated. I will try to run your script as a sanity check and then mine. My setup is rather similar to yours, so it may take a bit of time until you hear from me :)

Veason-silverbullet (Owner) commented Sep 9, 2024

@MichaelHypS The DocVQA checkpoint is available at https://drive.google.com/drive/folders/1zZNw76DQTBPBv4Uuw-Bvuba_poYqc8ZK?usp=drive_link. Please feel free to give it a shot.

Also, we have some important updates. Please pull the latest commit (10/09/24). The updates include:

  • Increased the fine-tuning resolution compared to pre-training, which is key to DocVQA performance.

  • Updated the fine-tuning data. Previously, some fine-tuning data was missing because the heuristic-rule code finetuning/DocVQA/link.py could not link some boxes to answers. I updated link.py and the fine-tuning data last week.

  • Please check the latest README for the DocVQA inference instructions. The checkpoint provided above is only for running finetuning/inference_docvqa.py. Since it was fine-tuned with the old data, its performance might be a little inferior. I will fine-tune it with the new data this week (and put the official checkpoint on Hugging Face later).

Veason-silverbullet (Owner) commented Sep 9, 2024

@MichaelHypS As you requested, the training loss curve is below:
[Image: training loss curve]
The training log is also provided below:
log.zip

@MichaelHypS (Author)

Thanks a lot for the amazing work!
