Pretraining dataset #4
Comments
This never happened to us before, so I am not quite sure what is happening. A few suggestions: (1) Can you plot the loss curve and see if it is going down before it collapses? Please keep us updated. Thanks
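As a rough illustration of that suggestion, a loss curve can be recovered by scraping the loss values out of the training log and plotting them. The snippet below is only a sketch; the log file name and the `loss: <value>` line format are assumptions that need to be adapted to whatever pretrain.py actually prints:

```python
import re
import matplotlib.pyplot as plt

LOG_FILE = "pretrain.log"                                    # placeholder path
LOSS_PATTERN = re.compile(r"loss[:=]\s*([0-9]*\.?[0-9]+)")   # assumed log line format

losses = []
with open(LOG_FILE) as f:
    for line in f:
        match = LOSS_PATTERN.search(line)
        if match:
            losses.append(float(match.group(1)))

plt.plot(losses)
plt.xlabel("logged step")
plt.ylabel("training loss")
plt.title("Pretraining loss curve")
plt.savefig("loss_curve.png")
```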
Thank you @intersun for your response. (1) I will print the lr curve to see what happened.
Also, I get these logs just before loading the data (pretrain.py); I don't know if they are related to the issue we are seeing.
@ChenRocks Can you help @ghaddarAbs verify this ZeroDivisionError? Did this also happen in UNITER pretraining? In my pre-training this never happened :(
You should not see the apex loss scaler reducing the loss scale to less than 1. The training probably went wrong way earlier than the ZeroDivisionError. The data downloaded from UNITER should be compatible with this repo; the only difference is the name change. In UNITER/LightningDOT you should never see this loss scaler error if you follow the original code/config. In my other projects, I have seen this issue because I used some fp16-unsafe layer.
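For context, the usual workaround for an fp16-unsafe layer is to force just that computation to run in fp32 and cast the result back. The wrapper below is a generic PyTorch sketch of that idea, not a module taken from UNITER or LightningDOT:

```python
import torch.nn as nn


class FP32Wrapper(nn.Module):
    """Run a numerically sensitive sub-module in fp32 under mixed precision.

    Generic illustration of the "fp16-unsafe layer" workaround; which module
    to wrap (and where) is up to the caller.
    """

    def __init__(self, module: nn.Module):
        super().__init__()
        self.module = module.float()  # keep this sub-module's weights in fp32

    def forward(self, x):
        out = self.module(x.float())  # upcast inputs, compute in fp32
        return out.to(x.dtype)        # cast back so the rest of the model stays in fp16
```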
Hi,
Thank you very much for the great work, and for making your code publicly available.
I am trying to run the code to reproduce the results; however, the pre-training datasets are missing from the download script.
Is it possible to upload the pretraining data, similar to what you did for the fine-tuning ones last week?
In fact, I tried to use the `coco` and `vg` datasets distributed by the UNITER code, while adjusting the train/val datasets in `./config/pretrain-alldata-base.json` to point to them. Surprisingly, the pretraining code worked, but I got another issue: gradient overflow at the beginning of the training, and then this error at 3%: `ZeroDivisionError: float division by zero`.
Here are some logs of the gradient overflow:
and here is the log of the error:
I understand why this error is happening: the loss scale gradually gets smaller until it becomes 0. However, I can't figure out what to do to solve this error. I looked at the issues in apex, and it seems that I have some bad input that is causing the issue. So my conclusion was that I am not using the correct pretraining dataset.
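To make the failure mode concrete: with dynamic loss scaling, every step whose gradients overflow is skipped and the loss scale is halved, so if the overflows never stop the scale eventually underflows to 0.0, and the later gradient unscaling (a division by the scale) raises exactly this `ZeroDivisionError`. A simplified, illustrative sketch of that mechanism (not apex's actual code):

```python
def update_loss_scale(scale: float, overflow: bool) -> float:
    """Toy dynamic-loss-scaling update: halve on overflow, otherwise keep.

    (Real scalers also grow the scale again after a stretch of clean steps.)
    """
    return scale / 2.0 if overflow else scale


scale = 2.0 ** 15          # a typical initial dynamic loss scale
for _ in range(1200):      # pretend every single step overflows
    scale = update_loss_scale(scale, overflow=True)

print(scale)               # 0.0 -- dividing gradients by this scale now fails
```

In other words, the error itself is only a symptom: something upstream keeps producing inf/NaN gradients, which is consistent with the suspicion about the input data.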
Can you please share the pretraining data?
Thanks