Dear developer,
Hello! This is excellent work. I have recently been reproducing your model's training process, and I wonder whether it is possible to enable data parallelism in the code. When I tried to use two GPUs, the error shown below occurred.
Could you please tell me what changes I need to make to the existing code? Thank you.
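For context, I enabled multi-GPU training in the usual PyTorch way. This is a minimal sketch of the standard `torch.nn.DataParallel` pattern; `build_model(config)` is a hypothetical stand-in for however the repo constructs its model, and only the wrapping step is the point:

```python
import torch
import torch.nn as nn

# Hypothetical constructor; not the repo's exact code.
model = build_model(config)
print("No of GPUs available", torch.cuda.device_count())
if torch.cuda.device_count() > 1:
    # Replicates the module on each GPU and splits each input batch
    # along dim 0 at every forward call.
    model = nn.DataParallel(model)
model = model.cuda()
```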
wandb: W&B syncing is set to `offline` in this directory. Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
No of GPUs available 2
121944it [00:05, 22267.63it/s]
121944it [00:05, 22510.61it/s]
Normalizing each spectrum individually
wandb: logging graph, to disable use `wandb.watch(log_graph=False)`
SMILES WILL BE RANDOMIZED
no of batches 250
no of batches 5
no of batches 50
Starting Training
/250 | Loss: 5.1767811775207525
Traceback (most recent call last):
File "/fs-computility/ai4chem/luxinyu.p/Spectra2Structure/run.py", line 92, in run
train_total(config, model, dataloaders, optimizer, loss_fn, logs, 000,1000)
File "/fs-computility/ai4chem/luxinyu.p/Spectra2Structure/train_utils.py", line 308, in train_total
vl = validate(config, model, dataloaders['val'], epoch, optimizer, loss_fn )
File "/fs-computility/ai4chem/luxinyu.p/Spectra2Structure/train_utils.py", line 160, in validate
mol_latents, spec_latents, smile_preds, logit_scale, ids = model(data)
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1603, in _call_impl
result = forward_call(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 186, in forward
outputs = self.parallel_apply(replicas, inputs, module_kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 201, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 109, in parallel_apply
output.reraise()
File "/root/miniconda3/lib/python3.10/site-packages/torch/_utils.py", line 706, in reraise
raise exception
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 84, in _worker
output = module(*input, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1603, in _call_impl
result = forward_call(*args, **kwargs)
File "/fs-computility/ai4chem/luxinyu.p/Spectra2Structure/architecture.py", line 126, in forward
smile_preds = self.forward_decoder(data, spec_latents)
File "/fs-computility/ai4chem/luxinyu.p/Spectra2Structure/architecture.py", line 111, in forward_decoder
pred = self.smiles_decoder(spec_latents, smi)
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/fs-computility/ai4chem/luxinyu.p/Spectra2Structure/models/decoder.py", line 109, in forward
x = self.trfmencoder(
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/transformer.py", line 416, in forward
output = mod(output, src_mask=mask, is_causal=is_causal, src_key_padding_mask=src_key_padding_mask_for_layers)
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/transformer.py", line 720, in forward
return torch._transformer_encoder_layer_fwd(
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 2048 n 14000 k 512 mat1_ld 512 mat2_ld 512 result_ld 2048 abcType 0 computeType 77 scaleType 0
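(Asynchronous CUDA errors such as this CUBLAS_STATUS_EXECUTION_FAILED are often attributed to the wrong stack frame. A generic way to localize the real failing op, not specific to this repo, is to force synchronous kernel launches before rerunning:)

```python
# Generic PyTorch debugging step, not repo-specific: synchronous CUDA
# launches make the traceback point at the kernel that actually failed.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before any CUDA work

# ... then run the training entry point (run.py) as usual.
```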