Hi,
Thank you very much for the great work, and for making your code publicly available.
I am trying to run the code to reproduce the results, however, the pre-training datasets are missing from the download script.
Is it possible to upload the pretraining data, similar to what you did for the fine-tuning ones last week?
In the meantime, I tried to use the COCO and VG datasets distributed with the UNITER code, adjusting the train/val datasets in ./config/pretrain-alldata-base.json as follows:
```json
"train_datasets": [
    {
        "name": "coco_cap",
        "db": [
            "/path/to/uniter/txt_db/pretrain_coco_train.db/",
            "/path/to/uniter/txt_db/pretrain_coco_val.db/"
        ],
        "img": [
            "/path/to/uniter/img_db/coco_train2014/",
            "/path/to/uniter/img_db/coco_val2014/"
        ],
        "tasks": ["itm", "mlm", "mrfr", "mrckl"],
        "mix_ratio": [16, 8, 4, 4]
    },
    {
        "name": "vg_cap",
        "db": ["/path/to/uniter/txt_db/pretrain_vg_train.db/"],
        "img": ["/path/to/uniter/img_db/vg/"],
        "tasks": ["itm", "mlm", "mrfr", "mrckl"],
        "mix_ratio": [16, 12, 6, 6]
    }
],
"val_datasets": [
    {
        "name": "coco_cap",
        "db": ["/path/to/uniter/txt_db/pretrain_coco_val.db/"],
        "img": ["/path/to/uniter/img_db/coco_val2014/"],
        "tasks": ["itm", "mlm", "mrfr", "mrckl"]
    },
    {
        "name": "vg_cap",
        "db": ["/path/to/uniter/txt_db/pretrain_vg_val.db/"],
        "img": ["/path/to/uniter/img_db/vg/"],
        "tasks": ["itm", "mlm", "mrfr", "mrckl"]
    }
]
```
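As an aside, here is how I am reading the config (purely my guess, not the repo's actual sampler): I assume each "mix_ratio" entry is a relative sampling weight, so the task for each batch is drawn in proportion to it. A minimal sketch of that assumption:

```python
import random

# Hypothetical illustration of "mix_ratio" as relative sampling weights
# (my assumption, not the repository's actual code).
tasks = ["itm", "mlm", "mrfr", "mrckl"]
mix_ratio = [16, 8, 4, 4]  # weights from the coco_cap entry above

rng = random.Random(0)  # seeded for reproducibility
counts = {t: 0 for t in tasks}
for _ in range(32_000):
    # Draw one task per batch, with probability proportional to its weight.
    task = rng.choices(tasks, weights=mix_ratio, k=1)[0]
    counts[task] += 1

# With weights 16:8:4:4, "itm" is sampled roughly twice as often as "mlm".
```

If that reading is wrong, please correct me.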
Surprisingly, the pretraining code ran, but I hit another issue: gradient overflow warnings from the very beginning of training, and then this error at 3%: `ZeroDivisionError: float division by zero`.
Here are some of the gradient overflow logs:

```
[1,2]<stdout>:Gradient overflow. Skipping step, loss scaler 5 reducing loss scale to 4.3601508761683463e-106
[1,1]<stdout>:Gradient overflow. Skipping step, loss scaler 5 reducing loss scale to 4.3601508761683463e-106
[1,3]<stdout>:Gradient overflow. Skipping step, loss scaler 5 reducing loss scale to 4.3601508761683463e-106
[1,0]<stdout>:Gradient overflow. Skipping step, loss scaler 5 reducing loss scale to 4.3601508761683463e-106
3%|▎ | 8792/300000 [2:51:23<79:18:44, 1.02it/s][1,1]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,0]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,3]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,2]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,3]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,2]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,1]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,0]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,1]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,2]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,0]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,3]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,3]<stdout>:Inf/Nan in loss/mrfr_coco_cap
[1,2]<stdout>:Inf/Nan in loss/mrfr_coco_cap
```
And here is the log of the error itself:

```
[1,0]<stderr>:ZeroDivisionError: float division by zero
3%|▎ | 8856/300000 [2:52:34<94:33:17, 1.17s/it]--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
```
I understand why this error happens: the loss scale keeps shrinking until it reaches 0. However, I can't figure out how to fix it. Looking at the issues in apex, it seems that bad input is causing the overflows, so my conclusion is that I am not using the correct pretraining dataset.
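To illustrate what I believe is happening (a rough sketch of apex-style dynamic loss scaling, not apex's actual code; the constants are made up):

```python
# Sketch of how dynamic loss scaling can collapse (illustrative only).
# Each overflow skips the optimizer step and halves the loss scale, so a
# long run of Inf/NaN gradients drives the scale below the smallest
# subnormal double and it underflows to exactly 0.0. Any later gradient
# unscaling that divides by the scale then raises ZeroDivisionError.
scale = 2.0 ** 15  # a typical initial loss scale (assumed value)

for _ in range(1200):  # ~1200 consecutive overflow steps
    scale *= 0.5       # "Gradient overflow. Skipping step, ... reducing loss scale"

print(scale)  # 0.0 -- underflowed to zero
# grad / scale would now raise: ZeroDivisionError: float division by zero
```

The scale of 4.36e-106 in my logs suggests I am well on the way down this curve, which is why I suspect the inputs rather than the scaler.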
Can you please share the pretraining data?
Thanks