Staged batchsize training #80

@ClashLuke

Description

Some papers, such as "Don't Decay the Learning Rate, Increase the Batch Size" (Smith et al., 2017), have shown that training with progressively larger batch sizes instead of progressively smaller learning rates helps models find a better local minimum by improving stability in the final stages of training. It also speeds up training, since throughput (tokens/s) grows with the batch size.
Intuitively, the model can take many small updates early on, when the gradients of all samples in a batch point in a similar direction. In later stages of training, per-sample gradients disagree more, so larger batches (or lower learning rates) are needed to average out the conflicting directions.
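A minimal sketch of how such a staged schedule could look, mirroring a classic step-wise learning-rate decay; the base batch size, milestones, and growth factor below are illustrative placeholders, not values from this project:

```python
def staged_batch_size(step: int,
                      base_batch_size: int = 32,
                      milestones: tuple = (10_000, 20_000, 40_000),
                      growth_factor: int = 2) -> int:
    """Return the batch size to use at a given training step.

    Instead of dividing the learning rate by `growth_factor` at each
    milestone (step decay), the batch size is multiplied by it, so the
    gradient noise shrinks in a similar way while tokens/s goes up.
    """
    stage = sum(step >= m for m in milestones)  # milestones already passed
    return base_batch_size * growth_factor ** stage


for step in (0, 10_000, 25_000, 50_000):
    print(step, staged_batch_size(step))  # -> 32, 64, 128, 256
```

Gradient accumulation would be one low-friction way to realize the growing batch without changing the per-device micro-batch shape (relevant on accelerators that compile for static shapes); whether that or true re-batching fits better depends on the training loop.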

Labels

- core: improves core model while keeping core idea intact
- engineering: software-engineering problems that don't require ML expertise
- research: creative project that might fail but could give high returns
