Pretraining dataset #4
Comments
This never happened to us before, so I am not quite sure what is happening. A few suggestions: (1) Can you plot the loss curve and see if it is going down before it collapses? Please keep us updated. Thanks
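As a rough illustration of that suggestion, a loss curve can be recovered by scraping the loss values out of the training log and plotting them. The snippet below is only a sketch; the log file name and the `loss: <value>` line format are assumptions that need to be adapted to whatever pretrain.py actually prints:

```python
import re
import matplotlib.pyplot as plt

LOG_FILE = "pretrain.log"                                    # placeholder path
LOSS_PATTERN = re.compile(r"loss[:=]\s*([0-9]*\.?[0-9]+)")   # assumed log line format

losses = []
with open(LOG_FILE) as f:
    for line in f:
        match = LOSS_PATTERN.search(line)
        if match:
            losses.append(float(match.group(1)))

plt.plot(losses)
plt.xlabel("logged step")
plt.ylabel("training loss")
plt.title("Pretraining loss curve")
plt.savefig("loss_curve.png")
```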
Thank you @intersun for your response. (1) I will print the lr curve to see what happened.
Also, I get these logs just before loading the data (pretrain.py); I don't know if they are related to the issue we are seeing.
@ChenRocks Can you help @ghaddarAbs verify this ZeroDivisionError? Did this also happen in UNITER pretraining? In my pre-training this never happened :(
You should not see the apex loss scaler reducing the loss scale to less than 1. The training probably went wrong way earlier than the ZeroDivisionError. The data downloaded from UNITER should be compatible with this repo; the only difference is the name change. In UNITER/LightningDOT you should never see this loss scaler error if you follow the original code/config. In my other projects, I have seen this issue because I used some fp16-unsafe layer.
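For context, the usual workaround for an fp16-unsafe layer is to force just that computation to run in fp32 and cast the result back. The wrapper below is a generic PyTorch sketch of that idea, not a module taken from UNITER or LightningDOT:

```python
import torch.nn as nn


class FP32Wrapper(nn.Module):
    """Run a numerically sensitive sub-module in fp32 under mixed precision.

    Generic illustration of the "fp16-unsafe layer" workaround; which module
    to wrap (and where) is up to the caller.
    """

    def __init__(self, module: nn.Module):
        super().__init__()
        self.module = module.float()  # keep this sub-module's weights in fp32

    def forward(self, x):
        out = self.module(x.float())  # upcast inputs, compute in fp32
        return out.to(x.dtype)        # cast back so the rest of the model stays in fp16
```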
Hi,
Thank you very much for the great work, and for making your code publicly available.
I am trying to run the code to reproduce the results; however, the pre-training datasets are missing from the download script.
Is it possible to upload the pretraining data, similar to what you did for the fine-tuning ones last week?
In fact, I tried to use the `coco` and `vg` datasets distributed by the UNITER code, while adjusting the train/val datasets in `./config/pretrain-alldata-base.json` to point to them. Surprisingly, the pretraining code worked, but I got another issue: gradient overflow at the beginning of the training, and then this error at 3%: `ZeroDivisionError: float division by zero`.
Here are some logs of the gradient overflow:
and here is the log of the error:
I understand why this error is happening: the loss scale gradually gets smaller until it becomes 0. However, I can't figure out what to do to solve this error. I looked at the issues in apex, and it seems that I have some bad input that is causing the issue. So my conclusion was that I am not using the correct pretraining dataset.
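To make the failure mode concrete: with dynamic loss scaling, every step whose gradients overflow is skipped and the loss scale is halved, so if the overflows never stop the scale eventually underflows to 0.0, and the later gradient unscaling (a division by the scale) raises exactly this `ZeroDivisionError`. A simplified, illustrative sketch of that mechanism (not apex's actual code):

```python
def update_loss_scale(scale: float, overflow: bool) -> float:
    """Toy dynamic-loss-scaling update: halve on overflow, otherwise keep.

    (Real scalers also grow the scale again after a stretch of clean steps.)
    """
    return scale / 2.0 if overflow else scale


scale = 2.0 ** 15          # a typical initial dynamic loss scale
for _ in range(1200):      # pretend every single step overflows
    scale = update_loss_scale(scale, overflow=True)

print(scale)               # 0.0 -- dividing gradients by this scale now fails
```

In other words, the error itself is only a symptom: something upstream keeps producing inf/NaN gradients, which is consistent with the suspicion about the input data.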
Can you please share the pretraining data?
Thanks