fix: validate checkpoint files before loading and surface actionable errors for empty/corrupt weights #668
Open
khanfs wants to merge 1 commit into jwohlwend:main from
PR: Validate checkpoint files before loading and surface actionable errors for empty/corrupt weights
Closes #664
Problem
When `urllib.request.urlretrieve` is interrupted mid-download (network drop, disk full, SIGINT, etc.), it can leave a zero-byte or truncated `.ckpt` file on disk without raising an exception. On the next `boltz predict` run, the failure surfaces in `load_from_checkpoint`, and the user is left with `Aborted!` and no indication of which file is at fault or how to fix it.
Issue #664 documents this exact failure mode and explicitly requests validation of checkpoint files after download and before `load_from_checkpoint`.
Changes
src/boltz/main.py
1. New constant
Real Boltz checkpoints are hundreds of MB. 1 MB catches every practical failure (empty file, tiny partial download, accidental placeholder) with no risk of false positives on legitimate user-supplied checkpoints.
2. New helper: `validate_checkpoint(path, label)`
Uses `click.ClickException` so Click prints `Error: …` cleanly with no traceback, matching the style of every other user-facing error in this file.
3. Post-download validation (`download_boltz1`, `download_boltz2`)
One call is appended after each `urlretrieve` block, so a bad write fails at download time, not minutes later when the model starts loading.
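The helper and the post-download call might look roughly like the sketch below. Only the names `validate_checkpoint(path, label)`, `click.ClickException` and `urlretrieve` come from this description; the constant name `MIN_CHECKPOINT_BYTES`, the reading of 1 MB as 1024 * 1024 bytes, the error wording and the `download_and_validate` wrapper are assumptions made for illustration.

```python
import urllib.request
from pathlib import Path

import click

# Hypothetical constant name; the 1 MB floor comes from the PR text and is
# taken here as 1024 * 1024 bytes.
MIN_CHECKPOINT_BYTES = 1024 * 1024


def validate_checkpoint(path, label):
    """Raise click.ClickException if the file is missing or below the size floor."""
    path = Path(path)
    size = path.stat().st_size if path.is_file() else 0
    if size < MIN_CHECKPOINT_BYTES:
        # The message wording is illustrative, not the PR's actual text.
        raise click.ClickException(
            f"{label} at {path} is only {size} bytes and looks empty or corrupt. "
            "Delete the file and re-run to download it again."
        )


def download_and_validate(url, dest, label, retrieve=urllib.request.urlretrieve):
    """Hypothetical wrapper: fetch the file, then validate it immediately."""
    retrieve(url, str(dest))
    validate_checkpoint(dest, label)
```

Taking `retrieve` as a parameter is only a convenience for the sketch: it lets the download be replaced by a fake without touching the network.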
4. Pre-flight validation in `predict()` (two call sites)
Validation runs once before the structure model load and once before the affinity model load.
This matters for `--checkpoint` / `--affinity_checkpoint` paths too: `click.Path(exists=True)` only checks existence, not size.
docs/prediction.md
Adds a Troubleshooting section at the end of the file covering the bare `Aborted!` symptom and the recovery command (`rm ~/.boltz/boltz2_conf.ckpt && boltz predict …`).
Before / After
Design notes
same time as loading the model. The size floor is fast, zero-dependency, and
catches all reported cases in #664 (Validate checkpoint files after download / before load_from_checkpoint to surface corrupt or empty weights).
`--checkpoint` paths are unaffected. The only new behaviour is a clear error where there was
previously a silent crash.
`click.ClickException`, not `RuntimeError`: keeps the error output consistent with the rest of the CLI and suppresses the traceback that
confused users in #664.
Testing
To reproduce the bug and verify the fix:
To test the post-download guard specifically, the download URLs would need to be mocked to return an empty body - that is left as a follow-up unit test if the project adopts a test for the download path.
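One possible shape for that follow-up test is sketched below, using only the standard library. `download_checkpoint`, `CheckpointError`, `MIN_CHECKPOINT_BYTES` and the example URL are hypothetical stand-ins rather than code from this PR; in particular, `CheckpointError` substitutes for `click.ClickException` so the sketch has no third-party imports.

```python
import tempfile
import urllib.request
from pathlib import Path
from unittest import mock

MIN_CHECKPOINT_BYTES = 1024 * 1024  # assumed 1 MB floor


class CheckpointError(Exception):
    """Stand-in for click.ClickException (keeps the sketch stdlib-only)."""


def download_checkpoint(url, dest):
    """Hypothetical stand-in for the download path: fetch, then check the size."""
    urllib.request.urlretrieve(url, str(dest))
    if Path(dest).stat().st_size < MIN_CHECKPOINT_BYTES:
        raise CheckpointError(f"{dest} looks empty or truncated")


def fake_urlretrieve(url, filename, *args, **kwargs):
    """Simulate an interrupted download: create the file but write no bytes."""
    Path(filename).write_bytes(b"")
    return filename, None


def check_empty_download_is_rejected():
    """Return True if an empty download is caught by the post-download guard."""
    with tempfile.TemporaryDirectory() as tmp:
        dest = Path(tmp) / "model.ckpt"
        # Patch the module attribute so no network access happens.
        with mock.patch("urllib.request.urlretrieve", fake_urlretrieve):
            try:
                download_checkpoint("https://example.invalid/model.ckpt", dest)
            except CheckpointError:
                return True
    return False
```

In a real test suite, the patch target would be wherever the project's download functions look up `urlretrieve`, and the assertion would be on `click.ClickException` instead of the stand-in.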