
Fix indentation in resume-training checkpoint logic#33

Open
hq-fang wants to merge 1 commit into allenai:main from hq-fang:main

Conversation


@hq-fang hq-fang commented May 15, 2025

When resuming training, the checkpoint-loading logic was incorrectly nested inside the rank-0 guard:

if get_global_rank() == 0:
    log.info("Configuration:")
    log.info(cfg)

    if cfg.allow_resume:
        # load checkpoint…

As a result, only GPU rank 0 ever attempted to load the model state. All other ranks idled at the barrier and eventually timed out.

This PR fixes the indentation so that every rank invokes the checkpoint loader. The corrected snippet looks like

if get_global_rank() == 0:
    log.info("Configuration:")
    log.info(cfg)

if cfg.allow_resume:
    # load checkpoint…

With this change, all processes load the checkpoint in parallel, and training resumes smoothly on multi-GPU setups.
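The control-flow difference can be sketched without any distributed setup. This is a minimal simulation, not the project's actual code: `rank`, `allow_resume`, and the `loaded` flag stand in for `get_global_rank()`, `cfg.allow_resume`, and the real checkpoint loader, which are only referenced here for illustration.

```python
def resume_buggy(rank, allow_resume=True):
    """Checkpoint load nested under the rank-0 guard (the bug)."""
    loaded = False
    if rank == 0:
        # rank-0-only logging would happen here
        if allow_resume:      # BUG: reachable only by rank 0
            loaded = True
    return loaded


def resume_fixed(rank, allow_resume=True):
    """Checkpoint load de-indented so every rank reaches it (the fix)."""
    loaded = False
    if rank == 0:
        pass                  # rank-0-only logging would happen here
    if allow_resume:          # every rank now takes this branch
        loaded = True
    return loaded


world_size = 4
print([resume_buggy(r) for r in range(world_size)])  # → [True, False, False, False]
print([resume_fixed(r) for r in range(world_size)])  # → [True, True, True, True]
```

In the buggy version, ranks 1..N-1 skip the loader entirely and then wait at the collective barrier for rank 0's peers, which never arrive, producing the timeout described above.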

@chrisc36 chrisc36 self-requested a review May 21, 2025 21:00
@hq-fang hq-fang requested a review from rohun-tripathi June 6, 2025 02:57
