Problems encountered when using distributed training #1

@WangLedi

Description

Hi, thank you very much for the code. I recently tried to train this model, but I ran into a problem when enabling distributed training. When I run the train_torch.py file with the plain python command, I get the error KeyError: 'LOCAL_RANK'. It seems that LOCAL_RANK, RANK, and WORLD_SIZE are not set in my environment, and when I set them manually I can only use a single GPU. Did you set some additional parameters to enable distributed training when training this model? Or do you have any insight into the problem I encountered? Thank you very much for your time.
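For context, here is a minimal sketch of what I believe is happening (this assumes train_torch.py reads these variables straight from the environment, which I have not verified against the actual script): LOCAL_RANK, RANK, and WORLD_SIZE are normally injected by a distributed launcher such as torchrun, not by a plain python invocation, which would explain the KeyError.

```python
import os

# These variables are set automatically per-process by the PyTorch
# launcher (e.g. `torchrun --nproc_per_node=2 train_torch.py`).
# Under a plain `python train_torch.py` run they are absent, so
# os.environ["LOCAL_RANK"] raises KeyError. Falling back to a
# single-process default reproduces the one-GPU behavior I saw:
local_rank = int(os.environ.get("LOCAL_RANK", 0))
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

print(f"local_rank={local_rank} rank={rank} world_size={world_size}")
```

Setting the variables by hand gives only one process (hence one GPU), since the launcher is also responsible for spawning one worker per device.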
