
Fine-tuning broken #54

@domef


I can't load last.ckpt of my fine-tuned model:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[6], line 5
      3 ckpt="logs/myexp/checkpoints/last.ckpt"
      4 config = OmegaConf.load(f"{config}")
----> 5 model = load_model_from_config(config, f"{ckpt}")
      6 sampler = DDIMSampler(model)

Cell In[4], line 38
     36 if "global_step" in pl_sd:
     37     print(f"Global Step: {pl_sd['global_step']}")
---> 38 sd = pl_sd["state_dict"]
     39 model = instantiate_from_config(config.model)
     40 m, u = model.load_state_dict(sd, strict=False)

KeyError: 'state_dict'
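
To see what the file actually contains instead of failing on the missing key, a small diagnostic along these lines might help. This is only a sketch: `summarize_checkpoint` is a hypothetical helper, not part of the repository, and it assumes the checkpoint loads as a plain dictionary, as the `pl_sd["state_dict"]` lookup in the traceback suggests.

```python
def summarize_checkpoint(pl_sd):
    """Describe the top-level layout of a loaded checkpoint object.

    Hypothetical helper for debugging; it only inspects the object and
    never modifies it.
    """
    if not isinstance(pl_sd, dict):
        # Not a dictionary at all, so there is nothing to look up by key.
        return {
            "type": type(pl_sd).__name__,
            "keys": [],
            "has_state_dict": False,
            "num_entries": 0,
        }
    state = pl_sd.get("state_dict")
    return {
        "type": "dict",
        "keys": sorted(str(k) for k in pl_sd),
        "has_state_dict": isinstance(state, dict),
        "num_entries": len(state) if isinstance(state, dict) else 0,
    }


# Usage (assumes torch is installed and the path from the notebook cell exists):
#   import torch
#   pl_sd = torch.load("logs/myexp/checkpoints/last.ckpt", map_location="cpu")
#   print(summarize_checkpoint(pl_sd))
```

If `has_state_dict` comes back `False`, the list of keys shows what was written to `last.ckpt`, which should help tell an empty or partial save apart from a file that stores the weights under a different key.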

This is probably because the model was not saved correctly. Once fine-tuning finishes, the script crashes:

Epoch 0:  10%| | 61001/616605 [5:46:21<52:34:42,  2.94it/s, loss=0.166, v_num=0, train/l
Saving latest checkpoint...

Traceback (most recent call last):
  File "main.py", line 779, in <module>
    trainer.test(model, data)
  File "/home/federico/Desktop/InST/.venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 911, in test
    return self._call_and_handle_interrupt(self._test_impl, model, dataloaders, ckpt_path, verbose, datamodule)
  File "/home/federico/Desktop/InST/.venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/federico/Desktop/InST/.venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 954, in _test_impl
    results = self._run(model, ckpt_path=self.tested_ckpt_path)
  File "/home/federico/Desktop/InST/.venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1128, in _run
    verify_loop_configurations(self)
  File "/home/federico/Desktop/InST/.venv/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py", line 42, in verify_loop_configurations
    __verify_eval_loop_configuration(trainer, model, "test")
  File "/home/federico/Desktop/InST/.venv/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py", line 186, in __verify_eval_loop_configuration
    raise MisconfigurationException(f"No `{loader_name}()` method defined to run `Trainer.{trainer_method}`.")
pytorch_lightning.utilities.exceptions.MisconfigurationException: No `test_dataloader()` method defined to run `Trainer.test`.

Edit:
Even if I comment out these lines so that no exception is raised, the checkpoint is still not saved correctly:

        if not opt.no_test and not trainer.interrupted:
            trainer.test(model, data)
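
Rather than commenting the call out, the test stage could be skipped only when no test split is configured. A rough sketch, assuming the caller already knows whether a test split exists: `maybe_run_test` and its `has_test_split` argument are hypothetical names, and this only avoids the `MisconfigurationException`; it does not address the checkpoint contents.

```python
def maybe_run_test(trainer, model, data, no_test, has_test_split):
    """Run trainer.test only when it was requested and can actually work.

    Hypothetical wrapper around the two lines above. `has_test_split` is an
    assumption: the caller decides how to detect a configured test set.
    Returns True if the test stage was run, False if it was skipped.
    """
    interrupted = getattr(trainer, "interrupted", False)
    if no_test or interrupted or not has_test_split:
        return False
    trainer.test(model, data)
    return True


# Usage inside main.py (sketch):
#   maybe_run_test(trainer, model, data, opt.no_test, has_test_split=False)
```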
