[codex] Add multiscale surface training and BCE+Dice ignore-label support #799
giorgioangel wants to merge 4 commits into main
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 53d8019ed2
```python
if coord % scale_factor != 0:
    raise ValueError(
        f"Cached full-resolution position {position} is not divisible by scale factor {scale_factor}"
    )
```
Avoid hard-failing on non-divisible cached coordinates
This assumes every cached full-resolution coordinate is divisible by 2**ome_zarr_resolution, but find_valid_patches explicitly falls back to full-resolution scanning when the requested valid_patch_find_resolution level is unavailable, and those fallback coordinates are not guaranteed to satisfy that divisibility. In that common fallback path, dataset startup will now raise ValueError while loading the cache; even when a coordinate happens to be divisible, the scaling is still wrong if _open_zarr also fell back to level 0. The conversion should be based on the actually resolved level (or level availability should be validated up front) rather than assuming divisibility here.
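A gentler alternative to the hard failure above is to convert cached coordinates against the level that was actually resolved, snapping down to the nearest aligned voxel instead of raising. Below is a minimal sketch of that idea; the helper name `to_level_coords` and the snap-down policy are assumptions, not the PR's code:

```python
def to_level_coords(position, resolved_level):
    """Convert cached full-resolution coords to a given OME-Zarr level.

    `resolved_level` should be the level that was actually opened,
    not the level that was merely requested.
    """
    scale = 2 ** resolved_level
    converted = []
    for coord in position:
        # Snap down instead of raising: fallback full-resolution scans
        # can yield coordinates that are not aligned to this level.
        converted.append((coord - coord % scale) // scale)
    return tuple(converted)
```

Validating level availability up front, as the review suggests, would make the snap-down branch unreachable in the common case.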
```yaml
surface:
  out_channels: 1
  valid_patch_value: 1
  activation: "sigmoid"
```
Set BCE+Dice surface configs to logit activation
BinaryBCEAndDiceLoss is logit-based (BCEWithLogitsLoss + internal sigmoid), but this config enables activation: "sigmoid". Because NetworkFromConfig applies task activations in eval mode and the validation loop computes losses under model.eval(), validation loss for these runs is computed on already-sigmoided outputs and then transformed again inside the loss, which distorts val-loss tracking and any early-stopping/checkpoint decisions tied to it. These training configs should keep activation: "none" and apply sigmoid only in inference/postprocessing.
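Concretely, the fix the review proposes is a one-line change to the quoted config (illustrative fragment; the keys mirror the snippet above):

```yaml
surface:
  out_channels: 1
  valid_patch_value: 1
  activation: "none"  # logits reach BinaryBCEAndDiceLoss; apply sigmoid only at inference
```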
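The double-transformation effect is easy to see numerically: once a sigmoided output re-enters a logit-based BCE, it is squashed a second time and the reported loss shifts. A self-contained illustration using the naive BCE formula (not the project's loss code):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_on_logit(x, y):
    # Binary cross-entropy computed from a raw logit, as
    # BCEWithLogitsLoss does internally (naive, non-stabilized form).
    p = sigmoid(x)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

logit, target = 2.0, 1.0
correct = bce_on_logit(logit, target)          # logits go straight to the loss
double = bce_on_logit(sigmoid(logit), target)  # already-activated output sigmoided again
# `double` is inflated: sigmoid(2.0) ~ 0.88 re-enters the loss as if it were a logit
```

Any early-stopping or checkpoint decision comparing such values is comparing a distorted quantity.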
@bruniss can we merge this?
What changed
This PR adds multiscale surface-training support for vesuvius and introduces a BCE+Dice training path that works with the existing ignore_label: 2 surface labels. Concretely it:
- pipes dataset_config.ome_zarr_resolution into the training config manager and validates it against valid_patch_find_resolution
- adds BinaryBCEAndDiceLoss for single-channel surface training with ignore_label: 2
- adds experiment configs for 256^3 @ bs=3 and 128^3 @ bs=28, with both MedialSurfaceRecall and BCE+Dice variants
Why
Two feature gaps were blocking the intended surface experiments:
Multiscale training was not wired correctly end-to-end.
Patch finding cached full-resolution coordinates, but the dataset could open lower OME-Zarr levels without converting those coordinates back to the training level. Cache keys also did not distinguish between scale 0 and scale 2.
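A cache key that folds in the resolution level is enough to keep scale-0 and scale-2 runs apart. A hypothetical sketch (the key fields, helper name, and hashing choice are assumptions, not the PR's implementation):

```python
import hashlib
import json

def patch_cache_key(volume_id, patch_size, ome_zarr_resolution):
    # Deterministic key over everything that affects the cached patch list,
    # including the resolution level that previously did not participate.
    payload = json.dumps(
        {
            "volume": volume_id,
            "patch": list(patch_size),
            "res": ome_zarr_resolution,
        },
        sort_keys=True,
    )
    return hashlib.sha1(payload.encode()).hexdigest()
```

Any stable serialization works; the important part is that the resolution level participates in the key.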
The existing nnUNet BCE+Dice path was not a good fit for the current scalar 0/1/2 surface labels. It expects region-based targets, while these runs need single-channel binary logits with ignore_label: 2 preserved.
Impact
Validation
Executed and verified on the remote H100 node:
- python3 -m py_compile for the modified Python files
- uv run --all-extras pytest vesuvius/tests/models/test_surface_multiscale_training.py -q
- uv run --all-extras pytest vesuvius/tests -k "ome_zarr_resolution or patch_cache or zarr_dataset or surface_multiscale_training" -q
- 256^3 @ bs=3 fits on a single H100
- 128^3 @ bs=28 fits on a single H100
- vesuvius.find_patches completed for the 8 configs, producing 4 unique cache files and 4 expected cache hits across loss variants
Notes
This PR intentionally scopes to the multiscale training fix, the BCE+Dice compatibility path, the probe helper, tests, and the new experiment configs. It does not include unrelated local or remote worktree changes.
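As a back-of-envelope companion to the Validation numbers above: the raw float32 input tensors for both probed shapes total only a couple hundred MiB, so the single-H100 fit is decided by activation and optimizer memory, which is what the probe exercises. Illustrative arithmetic only (single input channel assumed):

```python
def input_mib(patch, batch_size, channels=1, bytes_per_voxel=4):
    # Size of the raw float32 input batch in MiB (no activations, no grads).
    d, h, w = patch
    return d * h * w * channels * batch_size * bytes_per_voxel / 2**20

large = input_mib((256, 256, 256), 3)    # 192.0 MiB for 256^3 @ bs=3
small = input_mib((128, 128, 128), 28)   # 224.0 MiB for 128^3 @ bs=28
```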