release: v0.6.0 staging-then-prod bake/derive verification workflow (handoff) #214

@gerchowl

Purpose

Self-contained handoff for the v0.6.0 release-cut verification. The full ADR-0012 chain code has landed (#188 / #189 / #190 / #196 / #199 / #204 / #205 / #207-{208,209} / #210 / #211 / #213). We've verified end-to-end at `--limit 2` against `gerchowl/mat-vis-tst` from both macOS uv and anvil-dev (NixOS+podman dagger). What we have NOT verified: full-scale bakes (no limit), the full matrix workflow, and the open #83 chunk-split-at-1.8GB risk during derive.

This issue is the last gate before #179 (prod bake) fires. Each phase has an explicit go/no-go criterion. If anything fails, file a fix and re-run from the failing phase. Stop before phase 7 unless every prior phase is green.

Prerequisites

  • `HF_TOKEN` with write access to both `gerchowl/mat-vis-tst` AND `gerchowl/mat-vis` (the latter only consumed in phase 7)
  • `mat-vis-tst` is at clean baseline: `refs` shows only `main`, `audit-orphans` reports `total_lfs == 0`. Reset commands:
    uv run --with huggingface_hub python -c "from huggingface_hub import HfApi; from pathlib import Path; api=HfApi(token=Path('~/.cache/huggingface/token').expanduser().read_text().strip()); [api.delete_branch(repo_id='gerchowl/mat-vis-tst', repo_type='dataset', branch=b.name) for b in api.list_repo_refs(repo_id='gerchowl/mat-vis-tst', repo_type='dataset').branches if b.name != 'main']"
    MAT_VIS_AUDIT_FORCE=1 uv run --with-editable . --extra baker mat-vis-baker audit-orphans --repo gerchowl/mat-vis-tst --delete
  • Local `dev` HEAD includes #213 (`docs(baker/ambientcg)`: note non-uniform channel sets are upstream truth). Verify: `git log --oneline -1 origin/dev` should show `6b6bb0e` or later.
  • anvil-dev's repo + dagger CLI are ready (already set up in this session; see the notes in the #210 chat, "test: tighten mock contract + expand e2e.yml so future #207-class bugs cannot ship").
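
The clean-baseline prerequisite can be written down as a small predicate. This is a minimal sketch, assuming the branch names and the `total_lfs` count have already been collected by hand (for example from the refs listing and the `audit-orphans` report). `is_clean_baseline` is a hypothetical helper, not part of `mat-vis-baker`.

```python
def is_clean_baseline(branch_names, total_lfs):
    """Return True when only `main` exists and no LFS blobs remain.

    Hypothetical helper: both inputs are assumed to be gathered manually
    from the refs listing and the audit-orphans report.
    """
    return set(branch_names) == {"main"} and total_lfs == 0


# A repo with only `main` and zero LFS blobs counts as clean.
print(is_clean_baseline(["main"], 0))
# A leftover staging branch means the reset commands still need to run.
print(is_clean_baseline(["main", "v0.0.0-phase1"], 0))
```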

Phase 1 — single-material baseline (sanity, ~3 min)

Goal: confirm nothing regressed since the last verification.

# Bake one material per source, single tier
TAG=v0.0.0-phase1
for src in polyhaven ambientcg gpuopen; do
  uv run --with-editable . --extra baker python -c "
import tempfile; from pathlib import Path
from mat_vis_baker.hf_bake import bake_one
tok = Path('~/.cache/huggingface/token').expanduser().read_text().strip()
with tempfile.TemporaryDirectory() as td:
  print('$src:', bake_one(source='$src', tier='1k', release_tag='$TAG',
    work_dir=Path(td), repo_id='gerchowl/mat-vis-tst', hf_token=tok,
    limit=1, batch_size=1))
"
done

Go criterion: all 3 bakes return `ok=1, failed=0`. Manifest at `/resolve/v0.0.0-phase1/release-manifest.json` lists all 3 sources under `tiers.1k`.

Teardown: `api.delete_branch(branch='v0.0.0-phase1')` + audit-orphans `--delete`.


Phase 2 — per-source full bake (per-source scale, ~30-60 min each)

Goal: exercise full upstream catalog size against HF rate limits, one source at a time. This is where #83 (chunk-split) bites if it's still latent.

For each source: trigger `bake.yml` workflow against `mat-vis-tst` with NO `--limit`. Each source runs serially (don't fan out yet).

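TAG=v0.0.0-phase2
for src in polyhaven ambientcg gpuopen; do
  gh workflow run bake.yml \
    -f source=$src -f tier=1k -f release-tag=$TAG \
    -f repo-id=gerchowl/mat-vis-tst -f dry-run=false
  # Give the dispatched run a moment to register, then look up its ID (best effort)
  sleep 10
  run_id=$(gh run list --workflow=bake.yml --limit 1 --json databaseId --jq '.[0].databaseId')
  # Block until this run finishes; this preserves "single source at a time"
  gh run watch "$run_id" --exit-status || break
done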

Go criterion per source:

  • Workflow exits 0
  • `/1k/.tier_complete` sentinel landed
  • `.json` catalog lists every material that was meant to bake
  • Bake duration < 90 min (HF commit pacing healthy)
  • No batch with >25k LFS files (HF cap) or >1 GB payload (commit cap)
  • `mat-vis-baker audit-orphans --repo gerchowl/mat-vis-tst --revision $TAG` reports `orphans <= total_lfs * 0.05` (95% of blobs referenced — accounts for transient mid-batch pre-commit blobs only)
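
The numeric thresholds in the list above can be checked mechanically once the numbers are in hand. The sketch below is a hedged illustration: the function names are hypothetical, "1 GB" is read as 10^9 bytes (an assumption), and the inputs are assumed to be copied from the workflow logs and the `audit-orphans` report.

```python
HF_MAX_LFS_FILES = 25_000       # per-batch LFS file cap quoted in the criteria
HF_MAX_PAYLOAD_BYTES = 10**9    # "1 GB" read as 10^9 bytes (assumption)


def batch_within_caps(n_lfs_files, payload_bytes):
    """True when a batch stays at or under both caps."""
    return n_lfs_files <= HF_MAX_LFS_FILES and payload_bytes <= HF_MAX_PAYLOAD_BYTES


def orphans_within_budget(orphans, total_lfs):
    """True when orphans <= 5% of total_lfs.

    Integer arithmetic (orphans * 20 <= total_lfs) avoids float rounding.
    """
    return orphans * 20 <= total_lfs


print(batch_within_caps(12_000, 800_000_000))
print(orphans_within_budget(orphans=40, total_lfs=1_000))
```

Note that with `total_lfs == 0` the budget only passes when `orphans` is also 0, which matches the clean-baseline expectation.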

No-go signals:

  • HF 429 storm (rate limit hit) → file follow-up to add backoff in commit pacing
  • A material fails with `error` (preflight resume should let us re-run)
  • Sentinel missing → bake didn't reach completion, re-run with same tag (preflight should resume)
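
If a 429 storm does show up, the follow-up would probably add something like exponential backoff with jitter around each commit. The following is only a sketch of that general technique, not the baker's code: `RateLimited` is a stand-in exception for an HTTP 429, and `with_backoff` and its parameters are hypothetical names.

```python
import random
import time


class RateLimited(Exception):
    """Stand-in for an HTTP 429 response (hypothetical)."""


def with_backoff(commit, max_attempts=5, base_delay=1.0, max_delay=60.0,
                 sleep=time.sleep):
    """Call `commit()`, retrying on RateLimited with capped, jittered delays."""
    for attempt in range(max_attempts):
        try:
            return commit()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # budget spent: surface the failure to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))  # "full jitter" spreads out retries


# Demo: a commit that is rate-limited twice, then succeeds.
attempts = {"n": 0}

def flaky_commit():
    attempts["n"] += 1
    if attempts["n"] <= 2:
        raise RateLimited()
    return "committed"

# A no-op sleep keeps the demo instant.
print(with_backoff(flaky_commit, sleep=lambda seconds: None))
```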

Teardown: keep tag through phase 4 — derives need it.


Phase 3 — full bake matrix (parallel, ~30-60 min wallclock)

Goal: exercise CAS retry under real matrix concurrency.

TAG=v0.0.0-phase3
# Reset state first — this is the matrix dress rehearsal, fresh tag
# Trigger all sources in parallel via the matrix workflow
gh workflow run bake.yml \
  -f source=__all__ -f tier=1k -f release-tag=$TAG \
  -f repo-id=gerchowl/mat-vis-tst -f dry-run=false

If `bake.yml` doesn't yet support `__all__`, fan out manually:

for src in polyhaven ambientcg gpuopen; do
  gh workflow run bake.yml -f source=$src -f tier=1k \
    -f release-tag=$TAG -f repo-id=gerchowl/mat-vis-tst -f dry-run=false &
done
# `wait` only waits for the three dispatch calls, not for the runs themselves
wait

Go criterion:

  • All 3 jobs exit 0
  • `release-manifest.json` accumulated correctly: `sources` keys = {polyhaven, ambientcg, gpuopen}, every entry's `tiers.1k.complete = true`
  • CAS retry log lines (`grep "manifest CAS retry"` in workflow logs) exist for at least one job — proves the lock fired under contention
  • `audit-orphans` clean post-run
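
The manifest criterion above lends itself to a scripted check. This sketch assumes the manifest layout implied by that bullet (`sources.<name>.tiers.<tier>.complete`); the actual schema of `release-manifest.json` may differ, and `manifest_problems` is a hypothetical helper.

```python
import json

EXPECTED_SOURCES = {"polyhaven", "ambientcg", "gpuopen"}


def manifest_problems(manifest, tier="1k"):
    """Return a list of human-readable problems; empty means the check passes."""
    problems = []
    sources = manifest.get("sources", {})
    for name in sorted(EXPECTED_SOURCES - set(sources)):
        problems.append(f"missing source: {name}")
    for name in sorted(set(sources) - EXPECTED_SOURCES):
        problems.append(f"unexpected source: {name}")
    for name in sorted(EXPECTED_SOURCES & set(sources)):
        entry = sources[name].get("tiers", {}).get(tier, {})
        if entry.get("complete") is not True:
            problems.append(f"{name}: tiers.{tier}.complete is not true")
    return problems


# Illustrative manifest only; field names follow the assumed layout.
sample = json.loads("""
{"sources": {
  "polyhaven": {"tiers": {"1k": {"complete": true}}},
  "ambientcg": {"tiers": {"1k": {"complete": true}}},
  "gpuopen":   {"tiers": {"1k": {"complete": false}}}
}}
""")
print(manifest_problems(sample))
```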

No-go:

Teardown: keep through phase 4.


Phase 4 — derive matrix (1k → 512, ~15-30 min)

Goal: exercise the per-file derive pipeline at full scale.

TAG=v0.0.0-phase3  # reuse phase 3's tag
for src in polyhaven ambientcg gpuopen; do
  gh workflow run derive.yml \
    -f source=$src -f source-tier=1k -f target-tier=512 \
    -f release-tag=$TAG -f repo-id=gerchowl/mat-vis-tst &
done
wait

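Go criterion: all three derive runs complete and the manifest accumulates the derived tiers (see the checklist below).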


Phase 5 — multi-language client verification

Goal: all four reference clients fetch the same bytes from the staged tag.

TAG=v0.0.0-phase3
# Each client run — tagless invocation should pick up MAT_VIS_E2E_TAG
MAT_VIS_E2E=1 MAT_VIS_E2E_TAG=$TAG node --test clients/js/test_client.mjs
MAT_VIS_E2E=1 MAT_VIS_E2E_TAG=$TAG bash clients/test_client.sh
(cd clients/rust && MAT_VIS_E2E=1 MAT_VIS_E2E_TAG=$TAG cargo test)
MAT_VIS_E2E=1 MAT_VIS_LIVE_TAG=$TAG uv run --with-editable . --extra baker --with clients/python --with pytest pytest tests/e2e/test_per_file_roundtrip.py::TestPythonClientFetchTextureRoundTrip -v

Go criterion: each suite reports its full pass count (JS 8, shell 13, Rust 16, Python E2E 2), and the fetched bytes start with the PNG or KTX2 magic number.
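
For spot-checking a fetched file by hand, the magic-number test is short enough to sketch. The byte signatures below are the standard PNG signature and the KTX 2.0 file identifier; `sniff_texture` is a hypothetical helper and is not taken from any of the client test suites.

```python
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"            # 8-byte PNG signature
KTX2_MAGIC = b"\xabKTX 20\xbb\r\n\x1a\n"    # 12-byte KTX 2.0 identifier


def sniff_texture(data):
    """Return "png", "ktx2", or None based on the leading bytes."""
    if data.startswith(PNG_MAGIC):
        return "png"
    if data.startswith(KTX2_MAGIC):
        return "ktx2"
    return None


# Synthetic payloads: a valid header followed by arbitrary bytes.
print(sniff_texture(PNG_MAGIC + b"rest-of-file"))
print(sniff_texture(KTX2_MAGIC + b"rest-of-file"))
print(sniff_texture(b"<html>error page</html>"))
```

A `None` result on a fetched texture usually means the client received something other than the asset, such as an error page or an LFS pointer file.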


Phase 6 — opt-in concurrency stress (optional but recommended)

Goal: pressure the CAS retry budget so that any move to Option B is triggered by data rather than guesswork.

MAT_VIS_E2E=1 MAT_VIS_E2E_CONCURRENCY=3 \
  uv run --with-editable . --extra baker --with clients/python --with pytest \
    pytest tests/e2e/test_per_file_roundtrip.py::TestConcurrentBakesShareTag -v

Go criterion: test passes; final manifest contains every source. If retry exhaustion happens, the test fails — that's the trigger to escalate to fragments+merge (Option B per #210).
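
To make "retry exhaustion" concrete, here is a toy compare-and-swap loop with a bounded retry budget. It is an in-memory illustration of the general pattern only: `VersionedStore`, `add_source`, and the `before_swap` hook are hypothetical, and the baker's real manifest update against the Hub may work differently.

```python
class VersionedStore:
    """In-memory stand-in for a manifest with an optimistic-lock version."""

    def __init__(self):
        self.value, self.version = {}, 0

    def read(self):
        return self.version, dict(self.value)

    def compare_and_swap(self, expected_version, new_value):
        if expected_version != self.version:
            return False  # someone else committed first
        self.value, self.version = new_value, self.version + 1
        return True


class RetryBudgetExhausted(RuntimeError):
    pass


def add_source(store, source, max_retries=3, before_swap=lambda: None):
    """Add `source` to the manifest; return the number of attempts used."""
    for attempt in range(1, max_retries + 2):
        version, manifest = store.read()
        manifest[source] = {"complete": True}
        before_swap()  # test hook: lets a rival writer sneak in here
        if store.compare_and_swap(version, manifest):
            return attempt
    raise RetryBudgetExhausted(source)


store = VersionedStore()
print(add_source(store, "polyhaven"), sorted(store.value))
```

When every attempt loses the race, `RetryBudgetExhausted` is raised; in this toy model that corresponds to the failure mode the stress test is meant to surface.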


Phase 7 — handoff to #179 (prod bake)

Only fire if phases 1-5 (and ideally 6) are all green.

This is where #179 takes over. Operator runs the same workflows but with `repo-id=gerchowl/mat-vis` and the real CalVer (e.g. `v2026.04.2`):

gh workflow run bake.yml -f release-tag=v2026.04.2 -f repo-id=gerchowl/mat-vis ...
gh workflow run derive.yml ... -f repo-id=gerchowl/mat-vis ...

Post-prod:

  1. Re-run phase 5 (clients) against the prod tag — should pass without code change
  2. Drop the `MAT_VIS_LIVE_TESTS=1` skip on the package live tests (they were gated off until prod is per-file; #196, "feat(client/python): migrate fetch_texture to per-file plain GET (#186)", commented this)
  3. Trigger M1: Single-material smoke test (ambientcg, 1 material, 1K) #5 — PyPI cut (`just release-check && twine upload …`)

Cleanup at any failure point

# Drop all v0.0.0-phase* branches
uv run --with huggingface_hub python -c "
from huggingface_hub import HfApi; from pathlib import Path
api=HfApi(token=Path('~/.cache/huggingface/token').expanduser().read_text().strip())
for b in api.list_repo_refs(repo_id='gerchowl/mat-vis-tst', repo_type='dataset').branches:
  if b.name.startswith('v0.0.0-phase'):
    api.delete_branch(repo_id='gerchowl/mat-vis-tst', repo_type='dataset', branch=b.name)
    print('deleted', b.name)
"
MAT_VIS_AUDIT_FORCE=1 uv run --with-editable . --extra baker mat-vis-baker audit-orphans --repo gerchowl/mat-vis-tst --delete

"We're ready for v0.6.0 release workflow" checklist

  • Phase 1 green (3 sources, 1 material each)
  • Phase 2 green (3 sources, full catalog each, sequential)
  • Phase 3 green (matrix bake with CAS retry observed in logs)
  • Phase 4 green (derive matrix, manifest accumulates derived tiers)
  • Phase 5 green (all four clients pass against staged tag)
  • Phase 6 green or skipped with reason
  • `mat-vis-tst` cleaned post-staging (only `main`, 0 LFS)
  • No new follow-up issues with severity > P2 opened during staging

When every box is ticked, advance #179 to "ready to merge dev → main" and proceed with phase 7 (prod bake).
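
The gate logic of the checklist can be summarized in a few lines. This is a sketch under assumptions: the status dictionary, its keys, and `ready_for_prod` are hypothetical bookkeeping, not an existing tool in the repo.

```python
def ready_for_prod(status):
    """Apply the checklist: phases 1-5 green, phase 6 green or skipped
    with a reason, staging repo cleaned, no follow-up above P2."""
    if any(status.get(f"phase{i}") != "green" for i in range(1, 6)):
        return False
    phase6 = status.get("phase6")
    skipped_with_reason = phase6 == "skipped" and bool(status.get("phase6_reason"))
    if phase6 != "green" and not skipped_with_reason:
        return False
    return status.get("tst_clean") is True and status.get("issues_above_p2", 0) == 0


status = {f"phase{i}": "green" for i in range(1, 6)}
status.update(phase6="skipped", phase6_reason="runner quota", tst_clean=True)
print(ready_for_prod(status))
```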

Notes for handoff after compact
