Hi,
Thank you very much for the great work, and for making your code publicly available.
I am trying to run the code to reproduce the results, however, the pre-training datasets are missing from the download script.
Is it possible to upload the pretraining data, similar to what you did for the fine-tuning ones last week?
In the meantime, I tried to use the COCO and VG datasets distributed with the UNITER code, adjusting the train/val datasets in ./config/pretrain-alldata-base.json as follows:
```json
"train_datasets": [
    {
        "name": "coco_cap",
        "db": [
            "/path/to/uniter/txt_db/pretrain_coco_train.db/",
            "/path/to/uniter/txt_db/pretrain_coco_val.db/"
        ],
        "img": [
            "/path/to/uniter/img_db/coco_train2014/",
            "/path/to/uniter/img_db/coco_val2014/"
        ],
        "tasks": ["itm", "mlm", "mrfr", "mrckl"],
        "mix_ratio": [16, 8, 4, 4]
    },
    {
        "name": "vg_cap",
        "db": ["/path/to/uniter/txt_db/pretrain_vg_train.db/"],
        "img": ["/path/to/uniter/img_db/vg/"],
        "tasks": ["itm", "mlm", "mrfr", "mrckl"],
        "mix_ratio": [16, 12, 6, 6]
    }
],
"val_datasets": [
    {
        "name": "coco_cap",
        "db": ["/path/to/uniter/txt_db/pretrain_coco_val.db/"],
        "img": ["/path/to/uniter/img_db/coco_val2014/"],
        "tasks": ["itm", "mlm", "mrfr", "mrckl"]
    },
    {
        "name": "vg_cap",
        "db": ["/path/to/uniter/txt_db/pretrain_vg_val.db/"],
        "img": ["/path/to/uniter/img_db/vg/"],
        "tasks": ["itm", "mlm", "mrfr", "mrckl"]
    }
]
```
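As an aside, here is how I am reading the config (purely my guess, not the repo's actual sampler): I assume each "mix_ratio" entry is a relative sampling weight, so the task for each batch is drawn in proportion to it. A minimal sketch of that assumption:

```python
import random

# Hypothetical illustration of "mix_ratio" as relative sampling weights
# (my assumption, not the repository's actual code).
tasks = ["itm", "mlm", "mrfr", "mrckl"]
mix_ratio = [16, 8, 4, 4]  # weights from the coco_cap entry above

rng = random.Random(0)  # seeded for reproducibility
counts = {t: 0 for t in tasks}
for _ in range(32_000):
    # Draw one task per batch, with probability proportional to its weight.
    task = rng.choices(tasks, weights=mix_ratio, k=1)[0]
    counts[task] += 1

# With weights 16:8:4:4, "itm" is sampled roughly twice as often as "mlm".
```

If that reading is wrong, please correct me.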
Surprisingly, the pretraining code ran, but I hit another issue: gradient overflow warnings from the very beginning of training, and then this error at 3%: `ZeroDivisionError: float division by zero`.
Here are some of the gradient overflow logs:

```
[1,2]<stdout>:Gradient overflow. Skipping step, loss scaler 5 reducing loss scale to 4.3601508761683463e-106
[1,1]<stdout>:Gradient overflow. Skipping step, loss scaler 5 reducing loss scale to 4.3601508761683463e-106
[1,3]<stdout>:Gradient overflow. Skipping step, loss scaler 5 reducing loss scale to 4.3601508761683463e-106
[1,0]<stdout>:Gradient overflow. Skipping step, loss scaler 5 reducing loss scale to 4.3601508761683463e-106
3%|▎ | 8792/300000 [2:51:23<79:18:44, 1.02it/s][1,1]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,0]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,3]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,2]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,3]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,2]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,1]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,0]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,1]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,2]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,0]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,3]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,3]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,2]<stdout>:Inf/Nan in loss/mrfr_coco_cap
```
And here is the log of the error itself:

```
[1,0]<stderr>:ZeroDivisionError: float division by zero
3%|▎ | 8856/300000 [2:52:34<94:33:17, 1.17s/it]--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
```
I understand why this error happens: the loss scale keeps shrinking until it reaches 0. However, I can't figure out how to fix it. Looking at the issues in apex, it seems that bad input is causing the overflows, so my conclusion is that I am not using the correct pretraining dataset.
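To illustrate what I believe is happening (a rough sketch of apex-style dynamic loss scaling, not apex's actual code; the constants are made up):

```python
# Sketch of how dynamic loss scaling can collapse (illustrative only).
# Each overflow skips the optimizer step and halves the loss scale, so a
# long run of Inf/NaN gradients drives the scale below the smallest
# subnormal double and it underflows to exactly 0.0. Any later gradient
# unscaling that divides by the scale then raises ZeroDivisionError.
scale = 2.0 ** 15  # a typical initial loss scale (assumed value)

for _ in range(1200):  # ~1200 consecutive overflow steps
    scale *= 0.5       # "Gradient overflow. Skipping step, ... reducing loss scale"

print(scale)  # 0.0 -- underflowed to zero
# grad / scale would now raise: ZeroDivisionError: float division by zero
```

The scale of 4.36e-106 in my logs suggests I am well on the way down this curve, which is why I suspect the inputs rather than the scaler.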
Can you please share the pretraining data?
Thanks