
[codex] Add multiscale surface training and BCE+Dice ignore-label support #799

Open
giorgioangel wants to merge 4 commits into main from codex/vesuvius-multiscale-bce-ignore2

Conversation

@giorgioangel
Member

What changed

This PR adds multiscale surface-training support for vesuvius and introduces a BCE+Dice training path that works with the existing ignore_label: 2 surface labels.

Concretely it:

  • loads dataset_config.ome_zarr_resolution into the training config manager and validates it against valid_patch_find_resolution
  • makes patch-cache generation and lookup scale-aware by including the training resolution in the cache key
  • converts training-resolution patch sizes to full-resolution patch sizes for patch finding
  • converts cached full-resolution patch coordinates back to the selected OME-Zarr training level when reading cached patches
  • adds BinaryBCEAndDiceLoss for single-channel surface training with ignore_label: 2
  • adds a GPU fit-probe helper for patch-size and batch-size experiments
  • adds 8 surface training configs covering scale 0/2, 256^3 @ bs=3 and 128^3 @ bs=28, with both MedialSurfaceRecall and BCE+Dice variants
  • adds focused regression tests for multiscale config loading, cache separation, coordinate scaling, and the new BCE+Dice loss path
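The scale-aware caching and coordinate conversion described above can be sketched roughly as follows. The function and cache-key names here are hypothetical, not the PR's actual API; the only assumption is the standard OME-Zarr convention that level n is downsampled by 2**n per axis:

```python
import hashlib


def cache_key(volume_id: str, patch_size: tuple, ome_zarr_resolution: int) -> str:
    # Including the training resolution in the key keeps scale-0 and
    # scale-2 caches separate, avoiding cross-scale cache reuse.
    raw = f"{volume_id}:{patch_size}:res{ome_zarr_resolution}"
    return hashlib.sha1(raw.encode()).hexdigest()


def to_full_res_patch_size(patch_size: tuple, ome_zarr_resolution: int) -> tuple:
    # A training-level patch covers 2**n times as many full-resolution
    # voxels per axis, so patch finding must scale the size up.
    factor = 2 ** ome_zarr_resolution
    return tuple(p * factor for p in patch_size)


def to_training_level(position: tuple, ome_zarr_resolution: int) -> tuple:
    # Convert a cached full-resolution coordinate back down to the
    # selected OME-Zarr training level when reading cached patches.
    factor = 2 ** ome_zarr_resolution
    return tuple(p // factor for p in position)
```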

Why

Two feature gaps were blocking the intended surface experiments:

  1. Multiscale training was not wired correctly end-to-end.
    Patch finding cached full-resolution coordinates, but the dataset could open lower OME-Zarr levels without converting those coordinates back to the training level. Cache keys also did not distinguish between scale 0 and scale 2.

  2. The existing nnUNet BCE+Dice path was not a good fit for the current scalar 0/1/2 surface labels.
    It expects region-based targets, while these runs need single-channel binary logits with ignore_label: 2 preserved.
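The ignore-label semantics can be illustrated with a minimal NumPy sketch. This is not the PR's BinaryBCEAndDiceLoss (which is a PyTorch module operating on logits); it only shows how voxels labeled ignore_label: 2 are excluded from both the BCE and the Dice terms:

```python
import numpy as np


def bce_dice_with_ignore(logits, targets, ignore_label=2, eps=1e-6):
    """Single-channel BCE+Dice where voxels labeled `ignore_label`
    contribute to neither term. Illustrative only; term weighting in
    the actual loss may differ."""
    mask = targets != ignore_label          # True where the label is trusted
    probs = 1.0 / (1.0 + np.exp(-logits))   # sigmoid applied to raw logits
    t = np.where(mask, targets, 0).astype(float)

    # BCE averaged over non-ignored voxels only
    bce = -(t * np.log(probs + eps) + (1 - t) * np.log(1 - probs + eps))
    bce = bce[mask].mean()

    # Soft Dice computed on the same masked voxels
    p = probs[mask]
    inter = (p * t[mask]).sum()
    dice = 1.0 - (2 * inter + eps) / (p.sum() + t[mask].sum() + eps)
    return bce + dice
```

Because the masking happens before both reductions, changing the prediction at an ignored voxel leaves the loss unchanged, which is the property the surface labels need.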

Impact

  • Surface experiments can now target both OME-Zarr scale 0 and scale 2 correctly.
  • Cache files are isolated per training scale, avoiding cross-scale cache reuse bugs.
  • Surface BCE+Dice runs can use the same ignore-label semantics as the MedialSurfaceRecall runs.
  • The added configs make the 8-run comparison matrix reproducible.

Validation

Executed and verified on the remote H100 node:

  • python3 -m py_compile for the modified Python files
  • uv run --all-extras pytest vesuvius/tests/models/test_surface_multiscale_training.py -q
  • uv run --all-extras pytest vesuvius/tests -k "ome_zarr_resolution or patch_cache or zarr_dataset or surface_multiscale_training" -q
  • fit probe results:
    • 256^3 @ bs=3 fits on a single H100
    • 128^3 @ bs=28 fits on a single H100
  • vesuvius.find_patches completed for the 8 configs, producing 4 unique cache files and 4 expected cache hits across loss variants
  • the corresponding remote training runs were launched successfully with W&B logging enabled

Notes

This PR intentionally scopes to the multiscale training fix, the BCE+Dice compatibility path, the probe helper, tests, and the new experiment configs. It does not include unrelated local or remote worktree changes.

@vercel
vercel bot commented Mar 28, 2026

1 Skipped Deployment: scrollprize-org, Preview Ignored, updated Mar 28, 2026 1:35pm (UTC)

@giorgioangel giorgioangel marked this pull request as ready for review March 28, 2026 08:35
@giorgioangel giorgioangel requested a review from jrudolph as a code owner March 28, 2026 08:35
@giorgioangel giorgioangel requested a review from bruniss March 28, 2026 08:35

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 53d8019ed2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +338 to +340

    if coord % scale_factor != 0:
        raise ValueError(
            f"Cached full-resolution position {position} is not divisible by scale factor {scale_factor}"
        )
P1: Avoid hard-failing on non-divisible cached coordinates

This assumes every cached full-resolution coordinate is divisible by 2**ome_zarr_resolution, but find_valid_patches explicitly falls back to full-resolution scanning when the requested valid_patch_find_resolution level is unavailable, and those fallback coordinates are not guaranteed to satisfy that divisibility. In that common fallback path, dataset startup will now raise ValueError when loading cache; even when divisible by chance, scaling is still wrong if _open_zarr also fell back to level 0. The conversion should be based on the actually resolved level (or level availability should be validated up front) rather than assuming divisibility here.
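One way to implement the suggestion above (helper name hypothetical; the dataset's actual fallback logic is more involved) is to key the conversion on the level that was actually resolved and to floor-divide rather than raise:

```python
def cached_to_training_coords(position, resolved_level):
    """Convert a cached full-resolution coordinate to the OME-Zarr level
    that was actually opened. With resolved_level == 0 (the full-resolution
    fallback) this is the identity, so no divisibility issue can arise."""
    factor = 2 ** resolved_level
    # Floor-divide instead of raising: coordinates produced by the
    # full-resolution fallback scan need not be divisible by the factor.
    return tuple(int(p) // factor for p in position)
```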


    surface:
      out_channels: 1
      valid_patch_value: 1
      activation: "sigmoid"
P2: Set BCE+Dice surface configs to logit activation

BinaryBCEAndDiceLoss is logit-based (BCEWithLogitsLoss + internal sigmoid), but this config enables activation: "sigmoid". Because NetworkFromConfig applies task activations in eval mode and the validation loop computes losses under model.eval(), validation loss for these runs is computed on already-sigmoided outputs and then transformed again inside the loss, which distorts val-loss tracking and any early-stopping/checkpoint decisions tied to it. These training configs should keep activation: "none" and apply sigmoid only in inference/postprocessing.
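Applied to the quoted config, the suggested fix amounts to a one-line change (sketch; key names as in the quoted snippet):

```yaml
surface:
  out_channels: 1
  valid_patch_value: 1
  activation: "none"   # keep raw logits for BCEWithLogitsLoss; apply sigmoid only at inference
```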


@giorgioangel
Member Author

@bruniss can we merge this?
