pytorch cifar example doesn't quit gracefully #47

@yaroslavvb

Right now the pytorch-cifar example, run on a single p3.16xlarge, finishes its last epoch with the following error, which comes from all of the training processes.

cc @bearpelican

terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what():  [enforce fail at /opt/conda/conda-bld/pytorch_1524586445097/work/third_party/gloo/gloo/cuda_private.h:40] error == cudaSuccess. 29 vs 0. Error at: /opt/conda/conda-bld/pytorch_1524586445097/work/third_party/gloo/gloo/cuda_private.h:40: driver shutting down
