fix: validate checkpoint files before loading and surface actionable errors for empty/corrupt weights by khanfs · Pull Request #668 · jwohlwend/boltz

khanfs · 2026-04-02T09:43:57Z

PR: Validate checkpoint files before loading and surface actionable errors for empty/corrupt weights

Closes #664

Problem

When urllib.request.urlretrieve is interrupted mid-download (network drop, disk full, SIGINT, etc.), it can leave a zero-byte or truncated .ckpt file on disk, and nothing removes that partial file afterwards. On the next boltz predict run:

1. The file already exists → the download is skipped.
2. The corrupted path is passed to load_from_checkpoint.
3. PyTorch Lightning aborts deep in its unpickling stack with a bare Aborted! and no indication of which file is at fault or how to fix it.

Issue #664 documents this exact failure mode and explicitly requests:

size validation after download
a clear error before load_from_checkpoint
a docs note explaining the fix
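
To make the failure mode concrete, here is a minimal sketch of an existence-only cache check. The names `ensure_cached` and `interrupted_fetch` are hypothetical stand-ins, not Boltz's actual download code. The sketch only illustrates that a zero-byte file left behind by an interrupted download satisfies `exists()`, so the second run never re-fetches it.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_cached(dest: Path, fetch: Callable[[Path], None]) -> bool:
    """Fetch dest only if it is missing; return True if a fetch happened.

    Hypothetical stand-in for an existence-only cache check.
    """
    if dest.exists():
        return False  # treated as a cache hit, even if the file is empty
    fetch(dest)
    return True


def interrupted_fetch(dest: Path) -> None:
    """Simulate a download that dies right after creating the file."""
    dest.touch()  # leaves a zero-byte file behind


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "model.ckpt"
    first_run_fetched = ensure_cached(ckpt, interrupted_fetch)
    second_run_fetched = ensure_cached(ckpt, interrupted_fetch)
    size_after = ckpt.stat().st_size
```

The second call returns without fetching, and the file is still empty, which is the state that later reaches load_from_checkpoint.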

Changes

`src/boltz/main.py`

1. New constant

_MIN_CHECKPOINT_SIZE_BYTES = 1 * 1024 * 1024  # 1 MB

Real Boltz checkpoints are hundreds of MB, so a 1 MB (1,048,576-byte) floor catches the common failures (empty file, tiny partial download, accidental placeholder) with very little risk of false positives on legitimate user-supplied checkpoints. A download truncated after the first 1 MB would still pass this check.

2. New helper: validate_checkpoint(path, label)

def validate_checkpoint(path: Path, label: str = "Checkpoint") -> None:
    """Validate that a checkpoint file exists and appears intact."""
    if not path.exists():
        raise click.ClickException(
            f"{label} not found: {path}. "
            "Check the path, or rerun to trigger a fresh download."
        )
    try:
        size = path.stat().st_size
    except OSError as e:
        raise click.ClickException(
            f"{label} is not readable: {path}. Error: {e}"
        ) from e
    if size < _MIN_CHECKPOINT_SIZE_BYTES:
        raise click.ClickException(
            f"{label} appears empty or corrupted: {path} "
            f"({size:,} bytes, expected at least {_MIN_CHECKPOINT_SIZE_BYTES:,} bytes). "
            "Delete the file and rerun to trigger a fresh download."
        )

Uses click.ClickException so Click prints Error: … cleanly with no traceback, matching the style of every other user-facing error in this file.
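
One way to sanity-check the helper's branches without installing Click is sketched below. `CheckpointError`, `check_size`, and `outcome` are hypothetical names that mirror the logic above using only the standard library; the real helper raises click.ClickException instead. The sketch also exercises the boundary: a file of exactly 1,048,576 bytes passes, because the comparison is a strict less-than.

```python
import tempfile
from pathlib import Path

MIN_SIZE = 1 * 1024 * 1024  # same 1 MB floor as in the PR


class CheckpointError(Exception):
    """Hypothetical stand-in for click.ClickException."""


def check_size(path: Path, min_size: int = MIN_SIZE) -> int:
    """Return the file size, or raise if the file is missing or too small."""
    if not path.exists():
        raise CheckpointError(f"not found: {path}")
    size = path.stat().st_size
    if size < min_size:
        raise CheckpointError(f"too small: {path} ({size:,} bytes)")
    return size


def outcome(path: Path) -> str:
    """Summarize which branch of check_size a path hits."""
    try:
        check_size(path)
        return "ok"
    except CheckpointError as e:
        return str(e).split(":")[0]


with tempfile.TemporaryDirectory() as tmp:
    missing = Path(tmp) / "missing.ckpt"
    empty = Path(tmp) / "empty.ckpt"
    empty.touch()
    at_floor = Path(tmp) / "at_floor.ckpt"
    at_floor.write_bytes(b"\0" * MIN_SIZE)  # exactly at the floor
    results = [outcome(missing), outcome(empty), outcome(at_floor)]
```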

3. Post-download validation (download_boltz1, download_boltz2)

One call appended after each urlretrieve block:

validate_checkpoint(model, label="Boltz-1 checkpoint")
validate_checkpoint(model, label="Boltz-2 checkpoint")
validate_checkpoint(affinity_model, label="Boltz-2 affinity checkpoint")

This means a bad write fails at download time, not minutes later when the
model starts loading.

4. Pre-flight validation in predict() — two call sites

Before the structure model load:

# Validate checkpoint before attempting to load it, so that a
# corrupted or empty file (e.g. from an interrupted download) surfaces
# as a clear, actionable error rather than a cryptic Aborted! message.
validate_checkpoint(
    Path(checkpoint),
    label=f"{'Boltz-2' if model == 'boltz2' else 'Boltz-1'} checkpoint",
)

Before the affinity model load:

validate_checkpoint(
    Path(affinity_checkpoint),
    label="Boltz-2 affinity checkpoint",
)

This matters for --checkpoint / --affinity_checkpoint paths too:
click.Path(exists=True) only checks existence, not size.

`docs/prediction.md`

Adds a Troubleshooting section at the end of the file with:

the symptom (Aborted!)
the root cause (interrupted download)
the one-line fix (rm ~/.boltz/boltz2_conf.ckpt && boltz predict …)

Before / After

# Before (empty checkpoint file, no useful output):
$ truncate -s 0 ~/.boltz/boltz2_conf.ckpt
$ boltz predict test.yaml
...
Running structure prediction for 1 input.
Aborted!

# After:
$ truncate -s 0 ~/.boltz/boltz2_conf.ckpt
$ boltz predict test.yaml
...
Error: Boltz-2 checkpoint appears empty or corrupted:
/home/user/.boltz/boltz2_conf.ckpt (0 bytes, expected at least
1,048,576 bytes). Delete the file and rerun to trigger a fresh download.

Design notes

No torch.load probe — a full pickle-deserialization check would take the
same time as loading the model. The size floor is fast, zero-dependency, and
catches all reported cases in #664.
No change to model logic — all changes are in the CLI layer.
Backward compatible — users passing valid --checkpoint paths are
unaffected. The only new behaviour is a clear error where there was
previously a silent crash.
click.ClickException not RuntimeError — keeps the error output
consistent with the rest of the CLI and suppresses the traceback that
confused users in #664.
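
For context on the first design note, a structural probe that is cheaper than full deserialization is sketched below. It is not part of this PR. It assumes the checkpoint uses the zip-based container that recent `torch.save` versions write by default; files in the older legacy format would fail this probe even when intact. `looks_like_zip` is a hypothetical helper name, and the in-memory archive is only a stand-in for a real checkpoint.

```python
import io
import zipfile


def looks_like_zip(data: bytes) -> bool:
    """Cheap probe: True if a zip end-of-central-directory record is found."""
    return zipfile.is_zipfile(io.BytesIO(data))


# Build a small in-memory archive standing in for a checkpoint.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("data.pkl", b"\0" * 4096)
intact = buf.getvalue()

# Simulate an interrupted download by keeping only the first half.
truncated = intact[: len(intact) // 2]

intact_ok = looks_like_zip(intact)
truncated_ok = looks_like_zip(truncated)
```

Because the central directory sits at the end of a zip file, a truncated archive loses it and the probe fails. Unlike the size floor, this would also flag partial downloads larger than 1 MB.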

Testing

To reproduce the bug and verify the fix:

# Setup
pip install boltz -U
boltz predict some_input.yaml  # run once to populate cache

# Simulate a corrupt checkpoint
truncate -s 0 ~/.boltz/boltz2_conf.ckpt

# Without this PR: Aborted!
# With this PR:
boltz predict some_input.yaml
# Error: Boltz-2 checkpoint appears empty or corrupted: ...

To test the post-download guard specifically, the download URLs would need to be mocked to return an empty body - that is left as a follow-up unit test if the project adopts a test for the download path.
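
A possible shape for that follow-up test is sketched here using `unittest.mock` from the standard library. `download_and_validate`, `CheckpointError`, and `fake_urlretrieve` are hypothetical stand-ins, not the functions in src/boltz/main.py; a real test would patch urlretrieve around download_boltz1 or download_boltz2 and expect click.ClickException.

```python
import tempfile
import urllib.request
from pathlib import Path
from unittest import mock

MIN_SIZE = 1 * 1024 * 1024  # same 1 MB floor as in the PR


class CheckpointError(Exception):
    """Hypothetical stand-in for click.ClickException."""


def download_and_validate(url: str, dest: Path) -> None:
    """Hypothetical download function with a post-download size guard."""
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    size = dest.stat().st_size
    if size < MIN_SIZE:
        raise CheckpointError(f"{dest} appears empty or corrupted ({size:,} bytes)")


def fake_urlretrieve(url, filename, *args, **kwargs):
    """Mimic a server returning an empty body: create the file, write nothing."""
    Path(filename).touch()
    return filename, None


with tempfile.TemporaryDirectory() as tmp:
    dest = Path(tmp) / "model.ckpt"
    # Patch the module attribute so no network request is made.
    with mock.patch("urllib.request.urlretrieve", side_effect=fake_urlretrieve) as fake:
        try:
            download_and_validate("https://example.invalid/model.ckpt", dest)
            guard_fired = False
        except CheckpointError:
            guard_fired = True
    call_count = fake.call_count
```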

Commit 2b834d7: fix: validate checkpoint files before loading, surface actionable errors for empty/corrupt weights. Closes jwohlwend#664

khanfs wants to merge 1 commit into jwohlwend:main from khanfs:fix/validate-checkpoint-files
