Pretraining dataset #4

@ghaddarAbs

Description

Hi,

Thank you very much for the great work, and for making your code publicly available.
I am trying to run the code to reproduce the results; however, the pretraining datasets are missing from the download script.
Would it be possible to upload the pretraining data, similar to what you did for the fine-tuning datasets last week?

In the meantime, I tried to use the COCO and VG datasets distributed by the UNITER code, adjusting the train/val datasets in ./config/pretrain-alldata-base.json as follows:

{
       "name": "coco_cap",
       "db": [
           "/path/to//uniter/txt_db/pretrain_coco_train.db/",
           "/path/to//uniter/txt_db/pretrain_coco_val.db/"
       ],
       "img": [
           "/path/to//uniter/img_db/coco_train2014/",
           "/path/to//uniter/img_db/coco_val2014/"
       ],
       "tasks": [
           "itm",
           "mlm",
           "mrfr",
           "mrckl"
       ],
       "mix_ratio": [
           16,
           8,
           4,
           4
       ]
   },
   {
       "name": "vg_cap",
       "db": [
           "/path/to//uniter/txt_db/pretrain_vg_train.db/"
       ],
       "img": [
           "/path/to//uniter/img_db/vg/"
       ],
       "tasks": [
           "itm",
           "mlm",
           "mrfr",
           "mrckl"
       ],
       "mix_ratio": [
           16,
           12,
           6,
           6
       ]
   }
],
"val_datasets": [
   {
       "name": "coco_cap",
       "db": [
           "/path/to//uniter/txt_db/pretrain_coco_val.db/"
       ],
       "img": [
           "/path/to//uniter/img_db/coco_val2014/"
       ],
       "tasks": [
           "itm",
           "mlm",
           "mrfr",
           "mrckl"
       ]
   },
   {
       "name": "vg_cap",
       "db": [
           "/path/to//uniter/txt_db/pretrain_vg_val.db/"
       ],
       "img": [
           "/path/to//uniter/img_db/vg/"
       ],
       "tasks": [
           "itm",
           "mlm",
           "mrfr",
           "mrckl"
       ]
   }
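(As a side note, before launching pretraining it can help to sanity-check that every db/img path in the config actually points to an existing directory, to rule out path typos. This is a minimal sketch, assuming the snippet above sits under top-level "train_datasets"/"val_datasets" keys in the config file; the function name is my own, not part of the codebase.)

```python
import json
import os

def check_paths(config_file):
    """Return (dataset name, path) pairs for every db/img path that is missing."""
    with open(config_file) as f:
        cfg = json.load(f)
    missing = []
    for key in ("train_datasets", "val_datasets"):
        for dset in cfg.get(key, []):
            for path in dset.get("db", []) + dset.get("img", []):
                if not os.path.isdir(path):
                    missing.append((dset["name"], path))
    return missing

# for name, path in check_paths("config/pretrain-alldata-base.json"):
#     print(f"{name}: missing {path}")
```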


Surprisingly, the pretraining code ran, but I hit another issue: gradient overflow at the beginning of training, and then this error at 3% progress: ZeroDivisionError: float division by zero

Here are some of the gradient overflow logs:

[1,2]<stdout>:Gradient overflow.  Skipping step, loss scaler 5 reducing loss scale to 4.3601508761683463e-106
[1,1]<stdout>:Gradient overflow.  Skipping step, loss scaler 5 reducing loss scale to 4.3601508761683463e-106
[1,3]<stdout>:Gradient overflow.  Skipping step, loss scaler 5 reducing loss scale to 4.3601508761683463e-106
[1,0]<stdout>:Gradient overflow.  Skipping step, loss scaler 5 reducing loss scale to 4.3601508761683463e-106
  3%|▎         | 8792/300000 [2:51:23<79:18:44,  1.02it/s][1,1]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,0]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,3]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,2]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,3]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,2]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,1]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,0]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,1]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,2]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,0]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,3]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,3]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,2]<stdout>:Inf/Nan in loss/mrfr_coco_cap

and here is the log of the error:

[1,0]<stderr>:ZeroDivisionError: float division by zero
  3%|▎         | 8856/300000 [2:52:34<94:33:17,  1.17s/it]--------------------------------------------------------
------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.

I understand why this error happens: every step overflows, so the loss scale keeps shrinking until it becomes 0. However, I can't figure out how to fix it. I looked at the issues in apex, and it seems that bad input is causing the overflow, so my conclusion is that I am not using the correct pretraining data.
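For what it's worth, the failure mode is easy to reproduce in isolation: apex-style dynamic loss scaling cuts the scale on every overflow, and if every step overflows the scale eventually underflows to 0.0, after which any division by it raises ZeroDivisionError. A simplified sketch (my own toy scaler, not the actual apex code):

```python
# Toy dynamic loss scaler mimicking apex's overflow behavior
# (hypothetical simplification, not apex's real implementation).
class DynamicLossScaler:
    def __init__(self, init_scale=2.0 ** 15, scale_factor=2.0):
        self.scale = init_scale
        self.scale_factor = scale_factor

    def update(self, overflow):
        if overflow:
            # Skip the step and reduce the scale, as in the logs above.
            self.scale /= self.scale_factor
        # (Real scalers also grow the scale back after N overflow-free steps,
        # which is why healthy training never reaches zero.)

scaler = DynamicLossScaler()
for _ in range(2000):          # every single step overflows
    scaler.update(overflow=True)
print(scaler.scale)            # the float has underflowed to 0.0
```

Once the scale is 0.0, unscaling the gradients (dividing by the scale) produces the float division by zero seen in the traceback; the real problem is whatever makes the loss (mrfr_coco_cap here) Inf/NaN on every step.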

Can you please share the pretraining data?

Thanks
