Distributed training doesn't work. #18

@dimeldo

This happens at least with the XLNet model. With a high max_len, training crashes without printing any error. Training on 1 GPU works fine. With a low max_len, I get the error below. I'm using 4 Nvidia V100 GPUs.

Traceback (most recent call last):
  File "src/train.py", line 830, in <module>
    main()
  File "src/train.py", line 690, in main
    train_step(dummy_batch)
  File "src/train.py", line 566, in train_step
    loss, acc, ppl = forward_step(batch)
  File "src/train.py", line 556, in forward_step
    acc = reduce_tensor(acc)
  File "src/train.py", line 530, in reduce_tensor
    reduced = tensor.clone()
AttributeError: 'float' object has no attribute 'clone'
[The same traceback is printed by each of the other three worker processes, partly interleaved.]
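
The traceback suggests that `acc` reaches `reduce_tensor` as a plain Python float, while `reduce_tensor` expects a tensor and calls `.clone()` on it. The body of `reduce_tensor` in `src/train.py` is not shown here, so the following is only a minimal sketch of one possible workaround, assuming the function is meant to average a scalar metric across ranks with `all_reduce`. The `device` parameter is an assumption added for illustration; with the NCCL backend the tensor would need to live on the current rank's GPU.

```python
import torch
import torch.distributed as dist


def reduce_tensor(value, device=None):
    """Average a scalar metric across all ranks (hypothetical rewrite).

    Accepts either a tensor or a plain Python number, so a float
    accuracy no longer fails on .clone().
    """
    if torch.is_tensor(value):
        reduced = value.detach().clone()
    else:
        # Wrap plain Python numbers so they can take part in all_reduce.
        reduced = torch.tensor(float(value), device=device)

    # Without an initialized process group (e.g. a single-GPU run)
    # there is nothing to reduce, so return the value unchanged.
    if not (dist.is_available() and dist.is_initialized()):
        return reduced

    # Sum over all ranks, then divide by the number of ranks.
    dist.all_reduce(reduced, op=dist.ReduceOp.SUM)
    return reduced / dist.get_world_size()
```

An alternative would be to keep `reduce_tensor` as it is and make `forward_step` return `acc` as a tensor; which is preferable depends on how `acc` is computed and used elsewhere in the script.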
