fix: validate checkpoint files before loading and surface actionable errors for empty/corrupt weights #668

Open
khanfs wants to merge 1 commit into jwohlwend:main from khanfs:fix/validate-checkpoint-files

khanfs commented Apr 2, 2026

PR: Validate checkpoint files before loading and surface actionable errors for empty/corrupt weights

Closes #664


Problem

When urllib.request.urlretrieve is interrupted mid-download (network drop, disk full, SIGINT, etc.), it can leave a zero-byte or truncated .ckpt file on disk, and nothing removes that partial file afterwards. On the next boltz predict run:

  1. The file already exists → the download is skipped.
  2. The corrupted path is passed to load_from_checkpoint.
  3. PyTorch Lightning aborts deep in its unpickling stack with a bare Aborted! and no indication of which file is at fault or how to fix it.
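
The first step of this sequence can be illustrated with a small standalone sketch. It is not the code in src/boltz/main.py: `needs_download` is a hypothetical name for an existence-only cache check of the kind described above, and the file is created in a temporary directory rather than in ~/.boltz.

```python
import tempfile
from pathlib import Path


def needs_download(path: Path) -> bool:
    """Hypothetical existence-only cache check (an assumption, not Boltz code)."""
    return not path.exists()


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "boltz2_conf.ckpt"
    ckpt.touch()  # simulate an interrupted download that left 0 bytes behind

    skipped = not needs_download(ckpt)  # the file exists, so no re-download
    size = ckpt.stat().st_size          # yet it contains no weights at all
    print(f"download skipped: {skipped}, size on disk: {size} bytes")
```

Because the check looks only at existence, the empty file is treated as a valid cache entry and goes straight to the loader.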

Issue #664 documents this exact failure mode and explicitly requests:

  • size validation after download
  • a clear error before load_from_checkpoint
  • a docs note explaining the fix

Changes

src/boltz/main.py

1. New constant

_MIN_CHECKPOINT_SIZE_BYTES = 1 * 1024 * 1024  # 1 MB

Real Boltz checkpoints are hundreds of MB, so a 1 MB (1,048,576-byte) floor catches every practical failure (empty file, tiny partial download, accidental placeholder) with essentially no risk of false positives on legitimate user-supplied checkpoints.

2. New helper: validate_checkpoint(path, label)

def validate_checkpoint(path: Path, label: str = "Checkpoint") -> None:
    """Validate that a checkpoint file exists and appears intact."""
    if not path.exists():
        raise click.ClickException(
            f"{label} not found: {path}. "
            "Delete the file if it exists and rerun to trigger a fresh download."
        )
    try:
        size = path.stat().st_size
    except OSError as e:
        raise click.ClickException(
            f"{label} is not readable: {path}. Error: {e}"
        ) from e
    if size < _MIN_CHECKPOINT_SIZE_BYTES:
        raise click.ClickException(
            f"{label} appears empty or corrupted: {path} "
            f"({size:,} bytes, expected at least {_MIN_CHECKPOINT_SIZE_BYTES:,} bytes). "
            "Delete the file and rerun to trigger a fresh download."
        )

Uses click.ClickException so Click prints Error: … cleanly with no traceback, matching the style of every other user-facing error in this file.
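
The three outcomes of the helper (missing file, undersized file, plausible file) can be exercised without installing Boltz. The sketch below is a simplified copy of the size check, not the PR's exact code: `CheckpointError` is a stand-in for click.ClickException so that the example needs only the standard library, and `check_size` and `outcome` are hypothetical names.

```python
import tempfile
from pathlib import Path

MIN_BYTES = 1 * 1024 * 1024  # same floor as _MIN_CHECKPOINT_SIZE_BYTES


class CheckpointError(Exception):
    """Stand-in for click.ClickException (assumption: keeps the sketch stdlib-only)."""


def check_size(path: Path, label: str = "Checkpoint") -> None:
    """Simplified version of the existence and size checks."""
    if not path.exists():
        raise CheckpointError(f"{label} not found: {path}")
    size = path.stat().st_size
    if size < MIN_BYTES:
        raise CheckpointError(
            f"{label} appears empty or corrupted: {path} ({size:,} bytes)"
        )


def outcome(path: Path) -> str:
    """Return 'ok' or the error message the user would see."""
    try:
        check_size(path)
    except CheckpointError as exc:
        return str(exc)
    return "ok"


with tempfile.TemporaryDirectory() as tmp:
    missing = Path(tmp) / "missing.ckpt"
    empty = Path(tmp) / "empty.ckpt"
    plausible = Path(tmp) / "plausible.ckpt"
    empty.touch()
    plausible.write_bytes(b"\0" * MIN_BYTES)  # exactly at the threshold

    results = {p.name: outcome(p) for p in (missing, empty, plausible)}

for name, result in results.items():
    print(name, "->", result)
```

A file that is exactly at the threshold passes, since the comparison is a strict less-than. A size floor of this kind cannot detect a file that was truncated above the threshold; it only targets the empty and near-empty cases described in the Problem section.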

3. Post-download validation (download_boltz1, download_boltz2)

One call appended after each urlretrieve block:

validate_checkpoint(model, label="Boltz-1 checkpoint")
validate_checkpoint(model, label="Boltz-2 checkpoint")
validate_checkpoint(affinity_model, label="Boltz-2 affinity checkpoint")

This means a bad write fails at download time, not minutes later when the
model starts loading.

4. Pre-flight validation in predict() — two call sites

Before the structure model load:

# Validate checkpoint before attempting to load it, so that a
# corrupted or empty file (e.g. from an interrupted download) surfaces
# as a clear, actionable error rather than a cryptic Aborted! message.
validate_checkpoint(
    Path(checkpoint),
    label=f"{'Boltz-2' if model == 'boltz2' else 'Boltz-1'} checkpoint",
)

Before the affinity model load:

validate_checkpoint(
    Path(affinity_checkpoint),
    label="Boltz-2 affinity checkpoint",
)

This matters for --checkpoint / --affinity_checkpoint paths too:
click.Path(exists=True) only checks existence, not size.

docs/prediction.md

Adds a Troubleshooting section at the end of the file with:

  • the symptom (Aborted!)
  • the root cause (interrupted download)
  • the one-line fix (rm ~/.boltz/boltz2_conf.ckpt && boltz predict …)

Before / After

# Before (empty checkpoint file, no useful output):
$ truncate -s 0 ~/.boltz/boltz2_conf.ckpt
$ boltz predict test.yaml
...
Running structure prediction for 1 input.
Aborted!

# After:
$ truncate -s 0 ~/.boltz/boltz2_conf.ckpt
$ boltz predict test.yaml
...
Error: Boltz-2 checkpoint appears empty or corrupted:
/home/user/.boltz/boltz2_conf.ckpt (0 bytes, expected at least
1,048,576 bytes). Delete the file and rerun to trigger a fresh download.

Design notes


Testing

To reproduce the bug and verify the fix:

# Setup
pip install boltz -U
boltz predict some_input.yaml  # run once to populate cache

# Simulate a corrupt checkpoint
truncate -s 0 ~/.boltz/boltz2_conf.ckpt

# Without this PR: Aborted!
# With this PR:
boltz predict some_input.yaml
# Error: Boltz-2 checkpoint appears empty or corrupted: ...

To test the post-download guard specifically, the download URLs would need to be mocked to return an empty body. That is left as a follow-up unit test, should the project adopt tests for the download path.
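
One possible shape for that follow-up test is sketched here, assuming the download step calls urllib.request.urlretrieve through the module attribute so that unittest.mock.patch can intercept it. `download_and_validate` is a hypothetical stand-in for a download function plus the new guard, RuntimeError stands in for click.ClickException, and the URL is a placeholder that is never contacted.

```python
import tempfile
import unittest
import urllib.request
from pathlib import Path
from unittest import mock

MIN_BYTES = 1 * 1024 * 1024  # mirrors _MIN_CHECKPOINT_SIZE_BYTES


def download_and_validate(url: str, dest: Path) -> None:
    """Hypothetical stand-in: download if missing, then apply the size guard."""
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    size = dest.stat().st_size
    if size < MIN_BYTES:
        # The PR raises click.ClickException; RuntimeError is used here
        # only to keep the sketch free of third-party imports.
        raise RuntimeError(
            f"checkpoint appears empty or corrupted: {dest} ({size:,} bytes)"
        )


def fake_empty_download(url, filename, *args, **kwargs):
    """Mimic a download whose response body was empty."""
    Path(filename).write_bytes(b"")
    return filename, None


class PostDownloadGuardTest(unittest.TestCase):
    def test_empty_body_is_rejected(self):
        with tempfile.TemporaryDirectory() as tmp:
            dest = Path(tmp) / "model.ckpt"
            with mock.patch(
                "urllib.request.urlretrieve", side_effect=fake_empty_download
            ) as fake:
                with self.assertRaises(RuntimeError):
                    download_and_validate("https://example.invalid/model.ckpt", dest)
            fake.assert_called_once()


# Run the test case programmatically so the sketch works as a plain script.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(PostDownloadGuardTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

A real test would patch the name as it is imported in src/boltz/main.py and assert on click.ClickException instead.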

Development

Successfully merging this pull request may close these issues.

Validate checkpoint files after download / before load_from_checkpoint to surface corrupt or empty weights
