
Fix indentation in resume-training checkpoint logic#33

Open
hq-fang wants to merge 1 commit into allenai:main from hq-fang:main

Conversation


@hq-fang hq-fang commented May 15, 2025

When resuming training, the checkpoint-loading logic was incorrectly nested inside the rank-0 guard:

if get_global_rank() == 0:
    log.info("Configuration:")
    log.info(cfg)

    if cfg.allow_resume:
        # load checkpoint…

As a result, only GPU rank 0 ever attempted to load the model state. All other ranks idled at the barrier and eventually timed out.

This PR fixes the indentation so that every rank invokes the checkpoint loader. The corrected snippet looks like

if get_global_rank() == 0:
    log.info("Configuration:")
    log.info(cfg)

if cfg.allow_resume:
    # load checkpoint…

With this change, all processes load the checkpoint in parallel, and training resumes smoothly on multi-GPU setups.
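The control-flow difference can be sketched without any distributed setup. This is a minimal simulation, not the project's actual code: `rank`, `allow_resume`, and the `loaded` flag stand in for `get_global_rank()`, `cfg.allow_resume`, and the real checkpoint loader, which are only referenced here for illustration.

```python
def resume_buggy(rank, allow_resume=True):
    """Checkpoint load nested under the rank-0 guard (the bug)."""
    loaded = False
    if rank == 0:
        # rank-0-only logging would happen here
        if allow_resume:      # BUG: reachable only by rank 0
            loaded = True
    return loaded


def resume_fixed(rank, allow_resume=True):
    """Checkpoint load de-indented so every rank reaches it (the fix)."""
    loaded = False
    if rank == 0:
        pass                  # rank-0-only logging would happen here
    if allow_resume:          # every rank now takes this branch
        loaded = True
    return loaded


world_size = 4
print([resume_buggy(r) for r in range(world_size)])  # → [True, False, False, False]
print([resume_fixed(r) for r in range(world_size)])  # → [True, True, True, True]
```

In the buggy version, ranks 1..N-1 skip the loader entirely and then wait at the collective barrier for rank 0's peers, which never arrive, producing the timeout described above.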

@chrisc36 chrisc36 self-requested a review May 21, 2025 21:00
@hq-fang hq-fang requested a review from rohun-tripathi June 6, 2025 02:57
