KeyError: 'LOCAL_RANK' #24
Hello,
When attempting to fine-tune in a multi-GPU SLURM environment, I get the following error:
Mon Sep 18 09:20:06 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla V100-SXM2-32GB On | 00000000:1A:00.0 Off | 0 |
| N/A 31C P0 44W / 300W | 0MiB / 32768MiB | 0% E. Process |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2-32GB On | 00000000:1C:00.0 Off | 0 |
| N/A 27C P0 41W / 300W | 0MiB / 32768MiB | 0% E. Process |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2-32GB On | 00000000:1D:00.0 Off | 0 |
| N/A 28C P0 42W / 300W | 0MiB / 32768MiB | 0% E. Process |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2-32GB On | 00000000:1E:00.0 Off | 0 |
| N/A 30C P0 42W / 300W | 0MiB / 32768MiB | 0% E. Process |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
MASTER_ADDR=mg089
GpuFreq=control_disabled
INFO: underlay of /etc/localtime required more than 50 (80) bind mounts
INFO: underlay of /etc/localtime required more than 50 (80) bind mounts
INFO: underlay of /etc/localtime required more than 50 (80) bind mounts
INFO: underlay of /etc/localtime required more than 50 (80) bind mounts
INFO: underlay of /usr/bin/nvidia-smi required more than 50 (251) bind mounts
INFO: underlay of /usr/bin/nvidia-smi required more than 50 (251) bind mounts
INFO: underlay of /usr/bin/nvidia-smi required more than 50 (251) bind mounts
INFO: underlay of /usr/bin/nvidia-smi required more than 50 (251) bind mounts
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Multi-processing is handled by Slurm.
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Multi-processing is handled by Slurm.
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Multi-processing is handled by Slurm.
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Multi-processing is handled by Slurm.
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:50: UserWarning: you defined a validation_step but have no val_dataloader. Skipping validation loop
warnings.warn(*args, **kwargs)
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:50: UserWarning: you defined a validation_step but have no val_dataloader. Skipping validation loop
warnings.warn(*args, **kwargs)
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/4
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:50: UserWarning: you defined a validation_step but have no val_dataloader. Skipping validation loop
warnings.warn(*args, **kwargs)
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/4
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:50: UserWarning: you defined a validation_step but have no val_dataloader. Skipping validation loop
warnings.warn(*args, **kwargs)
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
You have not specified an optimizer or scheduler within the DeepSpeed config.Using `configure_optimizers` to define optimizer and scheduler.
INFO:lightning:You have not specified an optimizer or scheduler within the DeepSpeed config.Using `configure_optimizers` to define optimizer and scheduler.
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/ui/abv/hoangsx/Chemformer/temp/train_boring.py", line 77, in <module>
run()
File "/ui/abv/hoangsx/Chemformer/temp/train_boring.py", line 72, in run
trainer.fit(model, train_data)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 511, in fit
self.pre_dispatch()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 540, in pre_dispatch
self.accelerator.pre_dispatch()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 84, in pre_dispatch
self.training_type_plugin.pre_dispatch()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 174, in pre_dispatch
self.init_deepspeed()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 193, in init_deepspeed
self._initialize_deepspeed_train(model)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 218, in _initialize_deepspeed_train
model, optimizer, _, lr_scheduler = deepspeed.initialize(
File "/opt/conda/lib/python3.8/site-packages/deepspeed/__init__.py", line 112, in initialize
engine = DeepSpeedEngine(args=args,
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 145, in __init__
self._configure_with_arguments(args, mpu)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 499, in _configure_with_arguments
args.local_rank = int(os.environ['LOCAL_RANK'])
File "/opt/conda/lib/python3.8/os.py", line 675, in __getitem__
raise KeyError(key) from None
KeyError: 'LOCAL_RANK'
N GPUS: 4
N NODES: 1
The remaining three ranks print the same optimizer/scheduler message and an identical traceback ending in KeyError: 'LOCAL_RANK'.
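The failing line (args.local_rank = int(os.environ['LOCAL_RANK'])) reads an environment variable that launchers such as torchrun or the deepspeed launcher normally export, whereas srun only sets SLURM's own per-task variables. A throwaway check of the relevant variables (my own snippet, not part of Chemformer) would look like:

import os

# Throwaway diagnostic, assuming it runs inside the same srun step as training:
# DeepSpeed reads LOCAL_RANK directly, while srun only sets SLURM_* variables
# such as SLURM_PROCID (global rank) and SLURM_LOCALID (local rank on the node).
for var in ("LOCAL_RANK", "SLURM_PROCID", "SLURM_LOCALID", "SLURM_NTASKS"):
    print(f"{var}={os.environ.get(var)}")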
My SLURM job script is as follows:
#!/bin/bash
#SBATCH --job-name=chemformer
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH -p gpu
#SBATCH --gres=gpu:4
#SBATCH --mem=64gb
#SBATCH --time=0:15:00
#SBATCH --output=slurm_out/output.%A_%a.txt
module load miniconda3
module load cuda/11.8.0
nvidia-smi
export MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
export WORLD_SIZE=$(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))
echo "WORLD_SIZE="$WORLD_SIZE
master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr
echo "MASTER_ADDR="$MASTER_ADDR
CONTAINER="singularity exec --nv $PROJ_DIR/singularity_images/chemformer.sif"
CMD="python -m molbart.fine_tune \
--dataset uspto_50 \
--data_path data/seq-to-seq_datasets/uspto_50.pickle \
--model_path models/pre-trained/combined/step=1000000.ckpt \
--task backward_prediction \
--epochs 100 \
--lr 0.001 \
--schedule cycle \
--batch_size 128 \
--acc_batches 4 \
--augment all \
--aug_prob 0.5 \
--gpus $SLURM_NTASKS_PER_NODE
"
srun ${CONTAINER} ${CMD}
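One workaround I am considering (an assumption on my side, not something documented for Chemformer) is to map SLURM's per-task variable onto the name DeepSpeed expects before the trainer starts, e.g. near the top of the training script:

import os

# Workaround sketch (assumption): srun sets SLURM_LOCALID for each task on a node,
# so expose it under the name DeepSpeed reads; fall back to 0 for single-process runs.
os.environ.setdefault("LOCAL_RANK", os.environ.get("SLURM_LOCALID", "0"))

but I have not verified whether this is the intended launch path.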
I am using the library versions specified in the original documentation. Is this behavior expected?