
When testing the PipeDream code after upgrading PyTorch from 1.1.0 to 1.11.0, I encountered the following error: #78

@lengien

Description

  File "main_with_runtime_1.py", line 580, in <module>
    main()
  File "main_with_runtime_1.py", line 307, in main
    train(train_loader, r, optimizer, epoch)
  File "main_with_runtime_1.py", line 356, in train
    r.run_forward()
  File "../runtime_3.py", line 511, in run_forward
    self._run_forward(tensors)
  File "../runtime_3.py", line 559, in _run_forward
    for input_name in input_names])
  File "/opt/conda/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/pipeline/runtime/image_classification/models/alexnet/gpus=4_straight/stage2.py", line 25, in forward
    out5 = self.layer5(out4)
  File "/opt/conda/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/torch/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 447, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/envs/torch/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 444, in _conv_forward
    self.padding, self.dilation, self.groups)
 (Triggered internally at  ../torch/csrc/autograd/python_anomaly_mode.cpp:104.)
  allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
  File "main_with_runtime_1.py", line 580, in <module>
    main()
  File "main_with_runtime_1.py", line 307, in main
    train(train_loader, r, optimizer, epoch)
  File "main_with_runtime_1.py", line 407, in train
    r.run_backward()
  File "../runtime_3.py", line 648, in run_backward
    for output_name in outputs]))
  File "/opt/conda/envs/torch/lib/python3.7/site-packages/torch/autograd/__init__.py", line 175, in backward
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [256, 256, 3, 3]] is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

Under PipeDream's schedule, some stages perform multiple forward passes before running a backward pass, so weights can be updated in place while a backward pass is still pending. This pattern appears to break under the newer version of PyTorch. How can this problem be avoided?
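For what it's worth, the error can be reproduced outside the PipeDream runtime. This is my own minimal sketch, not the project's code: autograd records a version counter for every tensor it saves for backward, and an in-place update between forward and backward (e.g. an optimizer step) invalidates it. The second half sketches one workaround in the spirit of PipeDream's weight stashing, using a detached clone per forward pass; all names here are hypothetical.

```python
import torch

# --- Repro: in-place update between forward and backward ---
w = torch.randn(3, requires_grad=True)
x = torch.randn(3, requires_grad=True)

y = (w * x).sum()        # forward pass; autograd saves w (needed for x's grad)
with torch.no_grad():
    w -= 0.1             # simulated optimizer step bumps w's version counter
try:
    y.backward()         # backward now sees a stale saved tensor
except RuntimeError as e:
    print("reproduced:", "inplace operation" in str(e))

# --- Workaround sketch: run the forward pass on a detached clone, so the
# live parameter can be updated in place without touching the saved tensor ---
w2 = torch.randn(3, requires_grad=True)
x2 = torch.randn(3, requires_grad=True)
w_stash = w2.detach().clone().requires_grad_(True)  # private copy ("stashed" weight)

y2 = (w_stash * x2).sum()
with torch.no_grad():
    w2 -= 0.1            # update to the live weight no longer affects the graph
y2.backward()            # succeeds; the gradient lands on w_stash
print("grad on stash:", w_stash.grad is not None)
```

In the real runtime one would still need to copy `w_stash.grad` back onto the live parameter before the optimizer step; this only illustrates why the version check fires and how stashing sidesteps it.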

Versions
Collecting environment information...
PyTorch version: 1.11.0+cu115
Is debug build: False
CUDA used to build PyTorch: 11.5
ROCM used to build PyTorch: N/A

OS: Ubuntu 16.04.6 LTS (x86_64)
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
Clang version: Could not collect
CMake version: version 3.5.1
Libc version: glibc-2.17

Python version: 3.7.13 (default, Mar 29 2022, 02:18:16) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-4.15.0-204-generic-x86_64-with-debian-stretch-sid
Is CUDA available: True
CUDA runtime version: 10.1.163
CUDA_MODULE_LOADING set to:
GPU models and configuration:
GPU 0: Tesla P100-PCIE-12GB
GPU 1: Tesla P100-PCIE-12GB
GPU 2: Tesla P100-PCIE-12GB
GPU 3: Tesla P100-PCIE-12GB
GPU 4: Tesla P100-PCIE-12GB
GPU 5: Tesla P100-PCIE-12GB
GPU 6: Tesla P100-PCIE-12GB
GPU 7: Tesla P100-PCIE-12GB

Nvidia driver version: 515.65.01
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 20
On-line CPU(s) list: 0-19
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 20
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
Stepping: 1
CPU MHz: 2200.102
BogoMIPS: 4404.71
Hypervisor vendor: vertical
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
L3 cache: 16384K
NUMA node0 CPU(s): 0-19
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat

Versions of relevant libraries:
[pip3] msgpack-numpy==0.4.3.2
[pip3] numpy==1.21.5
[pip3] torch==1.11.0+cu115
[pip3] torchvision==0.12.0+cu115
[conda] cudatoolkit 10.2.89 hfd86e86_1
[conda] magma-cuda100 2.1.0 5 local
[conda] mkl 2019.1 144
[conda] mkl-include 2019.1 144
[conda] msgpack-numpy 0.4.3.2 py37_0
[conda] numpy 1.21.5 py37h7a5d4dd_2
[conda] numpy-base 1.21.5 py37hb8be1f0_2
[conda] torch 1.11.0+cu115 pypi_0 pypi
[conda] torchvision 0.12.0+cu115 pypi_0 pypi
