Optimal Learning Rate and Training Steps for Large Batch Size

First of all, thank you for sharing this great work!

I was wondering how you would recommend choosing optimal hyperparameters for a large batch size?

For example, if I train the ELECTRA-Large model on a TPU v3-128, a batch size of 4096 is feasible. In this case, what `learning rate` and number of `training steps` would you suggest? As for the data, I'm planning to train the model on my own dataset, which is ~300 GB of tfrecords.

Do you have any rough ideas?

Thank you

Optimal Learning Rate and Training Steps for Large Batch Size #129
