This repository was archived by the owner on Feb 11, 2026. It is now read-only.

KeyError: 'LOCAL_RANK' #24

@shoang22

Description

Hello,

When attempting to fine-tune in a multi-gpu SLURM environment, I get the following error:

Mon Sep 18 09:20:06 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-32GB           On  | 00000000:1A:00.0 Off |                    0 |
| N/A   31C    P0              44W / 300W |      0MiB / 32768MiB |      0%   E. Process |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2-32GB           On  | 00000000:1C:00.0 Off |                    0 |
| N/A   27C    P0              41W / 300W |      0MiB / 32768MiB |      0%   E. Process |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2-32GB           On  | 00000000:1D:00.0 Off |                    0 |
| N/A   28C    P0              42W / 300W |      0MiB / 32768MiB |      0%   E. Process |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2-32GB           On  | 00000000:1E:00.0 Off |                    0 |
| N/A   30C    P0              42W / 300W |      0MiB / 32768MiB |      0%   E. Process |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
MASTER_ADDR=mg089
GpuFreq=control_disabled
INFO:    underlay of /etc/localtime required more than 50 (80) bind mounts
INFO:    underlay of /etc/localtime required more than 50 (80) bind mounts
INFO:    underlay of /etc/localtime required more than 50 (80) bind mounts
INFO:    underlay of /etc/localtime required more than 50 (80) bind mounts
INFO:    underlay of /usr/bin/nvidia-smi required more than 50 (251) bind mounts
INFO:    underlay of /usr/bin/nvidia-smi required more than 50 (251) bind mounts
INFO:    underlay of /usr/bin/nvidia-smi required more than 50 (251) bind mounts
INFO:    underlay of /usr/bin/nvidia-smi required more than 50 (251) bind mounts
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Multi-processing is handled by Slurm.
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Multi-processing is handled by Slurm.
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Multi-processing is handled by Slurm.
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Multi-processing is handled by Slurm.
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:50: UserWarning: you defined a validation_step but have no val_dataloader. Skipping validation loop
  warnings.warn(*args, **kwargs)
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:50: UserWarning: you defined a validation_step but have no val_dataloader. Skipping validation loop
  warnings.warn(*args, **kwargs)
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/4
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:50: UserWarning: you defined a validation_step but have no val_dataloader. Skipping validation loop
  warnings.warn(*args, **kwargs)
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/4
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:50: UserWarning: you defined a validation_step but have no val_dataloader. Skipping validation loop
  warnings.warn(*args, **kwargs)
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
You have not specified an optimizer or scheduler within the DeepSpeed config.Using `configure_optimizers` to define optimizer and scheduler.
INFO:lightning:You have not specified an optimizer or scheduler within the DeepSpeed config.Using `configure_optimizers` to define optimizer and scheduler.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/ui/abv/hoangsx/Chemformer/temp/train_boring.py", line 77, in <module>
    run()
  File "/ui/abv/hoangsx/Chemformer/temp/train_boring.py", line 72, in run
    trainer.fit(model, train_data)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 511, in fit
    self.pre_dispatch()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 540, in pre_dispatch
    self.accelerator.pre_dispatch()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 84, in pre_dispatch
    self.training_type_plugin.pre_dispatch()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 174, in pre_dispatch
    self.init_deepspeed()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 193, in init_deepspeed
    self._initialize_deepspeed_train(model)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 218, in _initialize_deepspeed_train
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/__init__.py", line 112, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 145, in __init__
    self._configure_with_arguments(args, mpu)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 499, in _configure_with_arguments
    args.local_rank = int(os.environ['LOCAL_RANK'])
  File "/opt/conda/lib/python3.8/os.py", line 675, in __getitem__
    raise KeyError(key) from None
KeyError: 'LOCAL_RANK'
N GPUS: 4
N NODES: 1

[the identical traceback is raised by each of the four ranks]
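The last frame of the trace shows DeepSpeed reading `LOCAL_RANK` straight from the environment, while SLURM exposes the per-node task index as `SLURM_LOCALID`. A minimal workaround sketch (the `ensure_local_rank` helper name is my own, not part of the library) that copies one to the other before the trainer is constructed:

```python
import os

def ensure_local_rank() -> int:
    """Populate LOCAL_RANK from SLURM_LOCALID when it is missing.

    DeepSpeed calls int(os.environ['LOCAL_RANK']); under a plain srun
    launch only SLURM_LOCALID is set, so copy it across, defaulting to
    0 for single-process runs.
    """
    if "LOCAL_RANK" not in os.environ:
        os.environ["LOCAL_RANK"] = os.environ.get("SLURM_LOCALID", "0")
    return int(os.environ["LOCAL_RANK"])
```

Calling this at the top of the training script, before `trainer.fit`, should sidestep the `KeyError` without touching the installed packages.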

My SLURM job script is as follows:

#!/bin/bash
#SBATCH --job-name=chemformer
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH -p gpu
#SBATCH --gres=gpu:4
#SBATCH --mem=64gb
#SBATCH --time=0:15:00
#SBATCH --output=slurm_out/output.%A_%a.txt

module load miniconda3
module load cuda/11.8.0
nvidia-smi

export MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
export WORLD_SIZE=$(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))
echo "WORLD_SIZE="$WORLD_SIZE

master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr
echo "MASTER_ADDR="$MASTER_ADDR

CONTAINER="singularity exec --nv $PROJ_DIR/singularity_images/chemformer.sif"

CMD="python -m molbart.fine_tune \
  --dataset uspto_50 \
  --data_path data/seq-to-seq_datasets/uspto_50.pickle \
  --model_path models/pre-trained/combined/step=1000000.ckpt \
  --task backward_prediction \
  --epochs 100 \
  --lr 0.001 \
  --schedule cycle \
  --batch_size 128 \
  --acc_batches 4 \
  --augment all \
  --aug_prob 0.5 \
  --gpus $SLURM_NTASKS_PER_NODE
  "

srun ${CONTAINER} ${CMD}
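Alternatively, the mapping can be done at launch time. A sketch of a wrapper script (the `local_rank_wrapper.sh` name is hypothetical), assuming `srun` exports `SLURM_LOCALID` to each task and that Singularity passes host environment variables into the container by default:

```shell
#!/bin/bash
# local_rank_wrapper.sh: copy SLURM's per-node task index into the
# LOCAL_RANK variable DeepSpeed expects, then exec the real command.
export LOCAL_RANK="${SLURM_LOCALID:-0}"
echo "LOCAL_RANK=$LOCAL_RANK"
exec "$@"
```

It would then be invoked as `srun ./local_rank_wrapper.sh ${CONTAINER} ${CMD}`, so each task sets its own `LOCAL_RANK` before the container starts.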

I am using the library versions pinned in the original documentation. Is this behavior expected?
