how to launch ddp #1

@X1nyuLu

Dear developer:
Hello! This is excellent work. I am currently reproducing your model's training process, and I would like to know whether it is possible to enable data parallelism in the code; my understanding of the current multi-GPU path is sketched below. When I tried to use two GPUs, the error shown after the sketch occurred.
Could you please tell me what changes I need to make to the existing code? Thank you.
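
For context, the traceback suggests the multi-GPU path currently goes through `torch.nn.DataParallel`. A minimal sketch of that pattern, using a placeholder model rather than the real one from architecture.py:

```python
import torch
import torch.nn as nn

# Placeholder module standing in for the actual model in architecture.py
model = nn.Linear(512, 2048)

if torch.cuda.device_count() > 1:
    print("No of GPUs available", torch.cuda.device_count())
    # DataParallel replicates the module on every visible GPU and splits
    # each batch across the replicas; the failing call in the traceback
    # below happens inside one of those replicas (replica 1 on device 1)
    model = nn.DataParallel(model)

model = model.cuda()
```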

wandb: W&B syncing is set to `offline` in this directory. Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
No of GPUs available 2
121944it [00:05, 22267.63it/s]
121944it [00:05, 22510.61it/s]
Normalizing each spectrum individually
wandb: logging graph, to disable use `wandb.watch(log_graph=False)`
SMILES WILL BE RANDOMIZED
no of batches  250
no of batches  5
no of batches  50
Starting Training
/250 | Loss: 5.1767811775207525
Traceback (most recent call last):
  File "/fs-computility/ai4chem/luxinyu.p/Spectra2Structure/run.py", line 92, in run
    train_total(config, model, dataloaders, optimizer, loss_fn, logs, 000,1000)
  File "/fs-computility/ai4chem/luxinyu.p/Spectra2Structure/train_utils.py", line 308, in train_total
    vl = validate(config, model, dataloaders['val'], epoch, optimizer, loss_fn )
  File "/fs-computility/ai4chem/luxinyu.p/Spectra2Structure/train_utils.py", line 160, in validate
    mol_latents, spec_latents, smile_preds, logit_scale, ids = model(data)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1603, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 186, in forward
    outputs = self.parallel_apply(replicas, inputs, module_kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 201, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 109, in parallel_apply
    output.reraise()
  File "/root/miniconda3/lib/python3.10/site-packages/torch/_utils.py", line 706, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 84, in _worker
    output = module(*input, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1603, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/fs-computility/ai4chem/luxinyu.p/Spectra2Structure/architecture.py", line 126, in forward
    smile_preds = self.forward_decoder(data, spec_latents)
  File "/fs-computility/ai4chem/luxinyu.p/Spectra2Structure/architecture.py", line 111, in forward_decoder
    pred = self.smiles_decoder(spec_latents, smi)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/fs-computility/ai4chem/luxinyu.p/Spectra2Structure/models/decoder.py", line 109, in forward
    x = self.trfmencoder(
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/transformer.py", line 416, in forward
    output = mod(output, src_mask=mask, is_causal=is_causal, src_key_padding_mask=src_key_padding_mask_for_layers)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/transformer.py", line 720, in forward
    return torch._transformer_encoder_layer_fwd(
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 2048 n 14000 k 512 mat1_ld 512 mat2_ld 512 result_ld 2048 abcType 0 computeType 77 scaleType 0
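
What I would actually like is a proper DistributedDataParallel launch. My rough understanding of the standard PyTorch recipe is sketched below; the model, the data, and the script name `train_ddp.py` are placeholders, not this repo's code:

```python
# Hypothetical standalone script; launch with:
#   torchrun --nproc_per_node=2 train_ddp.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    # torchrun sets LOCAL_RANK, RANK, and WORLD_SIZE for every process
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # Placeholder model and data standing in for the real training setup
    model = torch.nn.Linear(512, 2048).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(1000, 512), torch.randn(1000, 2048))
    sampler = DistributedSampler(dataset)  # shards the dataset across ranks
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()  # DDP all-reduces gradients across ranks here
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If I understand correctly, under DDP each process owns one GPU and the `DistributedSampler` splits the data, instead of `DataParallel` splitting every batch inside the model. Is that the right direction for this codebase?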
