
KeyError occurs when auto-scaling happens in AdaptDL scheduler #135

@yuxiangwei0808

Description


I have installed the AdaptDL scheduler and am trying to use it for distributed training.
When I test with CIFAR-10, the created AdaptDLJob fails when auto-scaling happens:

INFO:adaptdl.reducer:rank 0 of 2 connecting to 172.30.133.85 on port 47001
INFO:adaptdl.reducer:Master waiting for connections on 47001
INFO:adaptdl.torch:Initializing torch.distributed using tcp://172.30.133.85:40087?rank=0&world_size=2
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
INFO:adaptdl.torch:torch.distributed initialized
Using downloaded and verified file: ./data/cifar-10-python.tar.gz
Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified
Traceback (most recent call last):
  File "/workspace/test_adaptdl.py", line 144, in <module>
    main()
  File "/workspace/test_adaptdl.py", line 127, in main
    model = adaptdl.torch.AdaptiveDataParallel(model, optimizer)
  File "/opt/conda/lib/python3.7/site-packages/adaptdl/torch/parallel.py", line 89, in __init__
    adaptdl.checkpoint.load_state(self._state)
  File "/opt/conda/lib/python3.7/site-packages/adaptdl/checkpoint.py", line 204, in load_state
    state.load(f)
  File "/opt/conda/lib/python3.7/site-packages/adaptdl/torch/parallel.py", line 228, in load
    self.optimizer.load_state_dict(state_dicts[1])
  File "/opt/conda/lib/python3.7/site-packages/torch/optim/optimizer.py", line 214, in load_state_dict
    self.__setstate__({'state': state, 'param_groups': param_groups})
  File "/opt/conda/lib/python3.7/site-packages/torch/optim/adam.py", line 100, in __setstate__
    step_is_tensor = (len(state_values) != 0) and torch.is_tensor(state_values[0]['step'])
KeyError: 'step'

This error does not occur if I use a stateless optimizer such as SGD instead of Adam. I found that it happens because the optimizer initialized in the new pods also initializes GradientNoiseScale, which sets some default values into the optimizer's state. However, the key "step" is missing from that entry, which makes loading from the checkpoint fail. So I manually added this key and everything works fine now.

class GradientNoiseScale(object):
    def __init__(self, adp, optimizer,
                 mp_scaler=None,
                 num_replicas=None,
                 accum_scale=None):
        self._adp = adp
        self._optimizer = optimizer
        self._orig_optimizer_zero_grad = optimizer.zero_grad
        self._should_zero_grad = True
        self._mp_scaler = mp_scaler
        self._local_sqr = None
        self._num_replicas = (num_replicas if num_replicas is not None
                              else torch.distributed.get_world_size())
        self._accum_scale = accum_scale or self._num_replicas
        self._prev_grads = None

        self.reset_accumulation()

        self._optimizer.state.setdefault("gns", {
            "progress": 0.0,
            "prev_scale": 0.0,
            "sqr_avg": np.ones(len(optimizer.param_groups)),
            "var_avg": np.zeros(len(optimizer.param_groups)),
            "biased": False,
            # add this line: give the "gns" entry a tensor-valued "step"
            # so Adam's __setstate__ check passes when the checkpoint loads
            "step": torch.tensor(0.),
        })
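
For reference, the failure can be reproduced outside AdaptDL. The snippet below is a minimal sketch (not from the issue, variable names are mine): it mimics what GradientNoiseScale does by stashing an extra bookkeeping entry in optimizer.state, saves a state dict, and reloads it into a fresh Adam instance. On PyTorch versions whose Adam.__setstate__ checks state_values[0]['step'] (as in the traceback above), the load fails with KeyError: 'step' unless that entry also carries a tensor "step", which is exactly what the added line provides.

import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Mimic GradientNoiseScale: store a non-parameter bookkeeping dict
# directly in the optimizer's state before any checkpoint is taken.
optimizer.state["gns"] = {"progress": 0.0, "prev_scale": 0.0}

# The extra entry is carried along into the saved state dict...
checkpoint = optimizer.state_dict()

# ...and reloading it into a fresh Adam raises KeyError: 'step',
# because Adam.__setstate__ looks up 'step' on the first state entry.
fresh = torch.optim.Adam(model.parameters(), lr=1e-3)
try:
    fresh.load_state_dict(checkpoint)
except KeyError as err:
    print("reproduced:", err)

# With the workaround the entry also carries a tensor "step",
# so the same load succeeds.
optimizer.state["gns"]["step"] = torch.tensor(0.)
fresh.load_state_dict(optimizer.state_dict())
print("load succeeded with the added 'step' key")

The reproduction only triggers on PyTorch builds that include the tensor-step check shown in the traceback; older releases skip it and load such a state dict without complaint.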
