I am training the T5 variant of the model (grail_train_t5.jsonnet).
The config specifies:
- training batch size 1
- gradient accumulation of 8
- epochs 30
I am running on an A100 with 40 GB of memory. It shows a total training time of around 120 hours (at roughly 70 seconds per step).
Is this the expected training time? Is there any room for optimization? Can I increase the batch size, and would that affect model performance?
Edit: It actually shows 120 hours for a single epoch, not for all 30. Am I making a mistake somewhere?
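For reference, a quick back-of-the-envelope check of whether 120 hours per epoch is internally consistent with the observed step time. This assumes the ~70 s figure is per optimizer step and that each optimizer step consumes batch_size × accumulation = 8 examples; these are inferences from the numbers above, not confirmed facts about the config:

```python
SECONDS_PER_STEP = 70    # observed step time (from the question)
HOURS_PER_EPOCH = 120    # reported time for one epoch
BATCH_SIZE = 1           # from the config
GRAD_ACCUM = 8           # from the config

# Implied number of optimizer steps in one epoch.
steps_per_epoch = HOURS_PER_EPOCH * 3600 / SECONDS_PER_STEP
print(f"optimizer steps per epoch: {steps_per_epoch:.0f}")   # ~6171

# Each optimizer step consumes batch_size * grad_accum examples,
# so this implies roughly this many training examples:
examples = steps_per_epoch * BATCH_SIZE * GRAD_ACCUM
print(f"implied training examples: {examples:.0f}")          # ~49371
```

If the actual training-set size is far smaller than the implied ~49k examples, the 70 s may be per forward/backward pass rather than per optimizer step, or something else (e.g. evaluation inside the loop) is inflating the step time.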