Conversation
Two fixes that enable expert_activation_profile.py to ingest MoE configs
from four families without modification: Qwen3MoE / OlmoeForCausalLM /
GraniteMoeForCausalLM / DeepseekV2ForCausalLM.
Empirical anchor: continuum-ai/olmoe-1b-7b-compacted-5b v1 (alloy hash
bba0a92ff0c8bebb). Same expert_activation_profile.py and
cpu_expert_prune_v2.py --importance-json scripts that produced
continuum-ai/qwen3-coder-30b-a3b-compacted-19b-256k now produce the OLMoE
artifact without any further modification. Cross-architecture portability
of §4.1.3.4 calibration-aware MoE expert importance is empirically
validated at two structurally distinct MoE families.
Fix 1: per-layer stats display hardcoded layer indices [0, 23, 47] from
the Qwen3-Coder-30B-A3B case (48 layers). OLMoE has 16 layers, Granite
has 32 layers — KeyError(23) at end of run. Patched to pick first/mid/
last layer dynamically.
Fix 2: cfg.num_experts AttributeError on configs that use different
field names. GraniteMoeConfig uses num_local_experts, DeepseekV2Config
uses n_routed_experts. Patched to fall back across all three known
field names with an explicit ValueError if none match.
Validated end-to-end on:
- Qwen/Qwen3-Coder-30B-A3B-Instruct (48 layers, 128e/top-8, Qwen3MoE)
- allenai/OLMoE-1B-7B-0924-Instruct (16 layers, 64e/top-8, Olmoe)
Tested config-load on (still need cpu_expert_prune_v2.py adapter work):
- ibm-granite/granite-3.1-3b-a800m-instruct (32 layers, 40e/top-8) —
config loads, but Granite uses block_sparse_moe.router.layer (not
mlp.gate) and fused experts via GraniteMoeParallelExperts. Hooks
fail. Needs separate adapter sprint in cpu_expert_prune_v2.py.
- deepseek-ai/DeepSeek-V2-Lite-Chat (27 layers, 64 routed + 2 shared
experts) — has shared-expert split that must be preserved bit-exact.
The bug-class verification protocol's Check 1 catches this. Needs
separate shared-expert exclusion sprint in cpu_expert_prune_v2.py.
The two fixes here are scoped to expert_activation_profile.py only.
The two adapter sprints (Granite fused-experts, DeepSeek shared-experts)
are tracked separately and unblock those families when they land.
…n sprint Pure preservation commit. NO behavioral changes. The drive crashed once today already and took the in-session context with it. These six files were sitting in the working tree unbacked when the crash hit: - scripts/vision_safety.py (327 lines) — VL whitelist generator. Reads a VL model config and produces the set of untouchable parameter names, vocab indices, and config keys the forge pipeline must preserve bit-exact. Consumer hooks: compensation_lora_vl.py, cpu_expert_prune_vl.py, forge_model.py (post Phase 4). No-fallback discipline: hard preconditions on vision_config presence, deepstack_visual_indexes empty, all five vision token ids present. - scripts/test_vision_safety.py (331 lines) — CPU smoke test for the whitelist generator. - Dockerfile (52 lines) — forge-image container. - install.sh (126 lines) — installer that wires runtime deps in the right order (vLLM, LiveCodeBench, then transformers 5.5 last to avoid the pinned-dep tangle). - .dockerignore + .github/workflows/forge-image.yml — CI for the container image. Committing as-is so the work is in git before any plugin-sprint refactor touches anything. A wip/pre-plugin-sprint-2026-04-08 branch will point at this commit immediately after, so it's reachable forever even if the current branch (cross-arch-portability-fixes) gets pruned later.
…test
First plugin-sprint commit. Establishes the second axis of dispatch in the
forge pipeline (model architecture → FamilyAdapter) on top of the existing
first axis (stage type → StageExecutor). Per the never-branch rule: new
model families are now NEW adapter files, never branches in shared paths.
scripts/adapters/ — new package, no torch import at module load time:
base.py FamilyAdapter ABC + AdapterCall dataclass + STAGE_METHOD_MAP
+ REQUIRES_FAMILY_OVERRIDE set (which methods MUST be
overridden vs which are family-agnostic by default).
Default stage handlers raise NotImplementedError with
clear "this family does not support stage X" messages.
Output / bookend stages (quant, eval, publish, package,
deploy, deliver) default to no-op return ctx — they're
family-agnostic and the existing scripts/stages/output_stages.py
executors handle them.
registry.py AdapterRegistry singleton with strict architecture-string
lookup. Re-registering a different class against an
existing arch raises (silent override would let one
adapter shadow another). KeyError on unknown arch
includes the full list of registered architectures and
the file/registration recipe to add the missing one.
dispatch.py resolve_adapter_chain(alloy) — pure dispatch resolution.
Loads alloy JSON, looks up the family adapter for
source.architecture, walks alloy.stages, returns a list
of AdapterCall records. NO model load, NO torch, NO GPU.
Tier 1 entry point. DispatchError as the single failure
type so the test catches structured failures.
qwen3_dense.py Qwen3DenseAdapter — first concrete adapter, handles
architecture='qwen3_5'. Covers the 6 active Qwen3.5
dense alloys in the published catalog. Methods are
Tier 1 stubs (return ctx unchanged) — Tier 2 wires them
to forge_model.prune / train_lora / etc.
tests/reproducibility/ — new test module, parametrized over the published
catalog. 14 entries covering every continuum-ai/* artifact known to date.
First run fetches alloys from HF and caches them under _cache/; re-runs
use the cache. Cache files are committed as PINNED REFERENCE SNAPSHOTS —
the contract the adapters are built against. README in _cache/ explains
the pin semantics and refresh procedure.
Test status this commit:
8 passed — 6 active Qwen3.5 dense alloys + 2 sanity tests
(0.8b/2b/4b general, 4b code, 4b code 128k, 9b general)
5 skipped — published artifacts that have NO .alloy.json in their HF
repo (publish-pipeline gap, brand-integrity issue tracked
separately, NOT a dispatch failure):
qwen3.5-4b-code-forged-defragged
qwen3.5-4b-code-forged-GGUF
qwen3.5-27b-code-forged
qwen3.5-27b-code-forged-defragged
qwen3.5-27b-code-forged-mlx-4bit
Fix is in scripts/publish_model.py / alloy_to_card.py:
downstream variants (defragged / GGUF / mlx-4bit) must
publish their own alloy.json with the upstream forge
stages plus the new pipeline step. Tracked separately.
3 xfailed — non-Qwen3.5 architectures, deferred per 'qwen3.5 first':
qwen3-coder-30b-a3b-compacted-19b-256k (qwen3_moe)
olmoe-1b-7b-compacted-5b (olmoe)
qwen2.5-coder-7b-compacted (qwen2)
Adding the adapter for each will auto-flip the xfail to
xpass — that's the gate that proves the dispatch contract
generalizes beyond Qwen3.5 dense.
What this commit does NOT do:
- Touch alloy_executor.py / scripts/stages/transform_stages.py at all.
The existing PruneExecutor / TrainExecutor still call forge_model
directly. Wiring them to delegate to the resolved family adapter is
the next commit, gated on the Qwen3.5 catalog being fully green at
Tier 1 (which it now is for active alloys).
- Touch any model weights, run any forge, or verify Tier 2 byte-
equivalence against the published modelHashes. Tier 2 lights up after
the dispatch contract is proven and stable.
- Add adapters for qwen3_moe / olmoe / qwen2 — those are the next three
plugin-sprint commits, in that order, after this one is reviewed.
Second plugin-sprint commit. Cuts the stage executors from "owns the
model-touching code" to "thin dispatcher that resolves the family adapter
and forwards the call." The actual prune / train / expert-prune bodies
move into the family adapter so per-family work lives in per-family files.
scripts/stages/transform_stages.py — refactored:
PruneExecutor, TrainExecutor, ExpertPruneExecutor are now ~5-line
dispatchers. Each one:
1. Reads ctx.alloy['source']['architecture']
2. Calls scripts.adapters.resolve_family_adapter(arch)
3. Forwards self.config (minus 'type') as kwargs to the matching
method on the resolved adapter
4. Returns the mutated ctx.
No more `if architectures[0] == ...` branches. No more direct calls
to forge_model.prune from this layer. The executors are now genuinely
family-agnostic.
Helper _resolve_family_for_ctx() raises a clear DispatchError if
ctx.alloy or source.architecture is missing — that's a wiring bug
upstream in alloy_executor, not something to silently default around.
ExpertPruneExecutor was previously a STUB that printed "use:
cpu_expert_prune.py ..." and did nothing. It now correctly delegates
to family.expert_prune(), which raises NotImplementedError on dense
families and (when MoE adapters land in upcoming commits) calls into
cpu_expert_prune_v2.py with the family's tensor layout.
scripts/adapters/qwen3_dense.py — extended:
Qwen3DenseAdapter.prune() now contains the body that previously lived
in PruneExecutor.execute() — compute_head_importance + forge_model.prune
+ immediate defrag + per-layer importance bookkeeping. Lazy imports
(torch, forge_model, defrag_inline) so Tier 1 dispatch resolution stays
torch-free.
Qwen3DenseAdapter.train() now contains TrainExecutor's old body —
forge_model.train_lora + post-train eval. Also lazy-imported.
Qwen3DenseAdapter.context_extend() is a Tier 2 stub for the
qwen3.5-4b-code-128k variant — present so dispatch acknowledges the
family handles the stage; wiring to the real implementation lands when
Tier 2 reproducibility for that variant runs.
All three methods short-circuit cleanly when ctx.model is None
(dispatch-only / dry-run path), so the existing _dry_run() in
alloy_executor.py keeps working without modification.
scripts/adapters/base.py — extended:
Added FamilyAdapter.log() helper so adapter methods produce visually
consistent output with StageExecutor.log(). Format: " [AdapterName] msg".
Test status (unchanged from previous commit, by design — this is a
refactor, not new functionality):
8 passed, 5 skipped, 3 xfailed.
What this commit enables (next):
- Adding Qwen3MoEAdapter is now a one-file change. The MoE adapter's
expert_prune() / expert_activation_profile() methods will receive the
same kwargs the morning's qwen3-coder-30b-a3b-compacted alloy carries,
via the existing ExpertPruneExecutor dispatcher, with zero edits to
transform_stages.py.
- Qwen2DenseAdapter and OlmoeAdapter slot in the same way.
- Tier 2 light-up: when a real model is loaded, ctx.model is non-None,
and the adapter's prune() / train() bodies execute against it. The
existing alloy_executor.execute_alloy() path Just Works because it
still calls create_executor(stage).execute(ctx) — only the executors'
INTERNAL implementation changed.
Third plugin-sprint commit. Adds the family adapter for the Qwen3MoE
architecture (the morning-of-2026-04-08 §4.1.3.4 anchor family). The
qwen3-coder-30b-a3b-compacted-19b-256k artifact (alloy hash aa61c4bdf463847c,
88.4 HumanEval, the headline §4.1.3.4 empirical anchor) now resolves to a
clean adapter chain at Tier 1.
scripts/adapters/qwen3_moe.py — new:
Qwen3MoEAdapter handles architecture='qwen3_moe'. Tensor layout:
model.layers.{i}.mlp.experts.{e}.{gate,up,down}_proj (unfused experts)
model.layers.{i}.mlp.gate (router)
128 experts per layer, 8 activated. The §4.1.3.4 prune is 128 → 80.
Methods overridden:
expert_activation_profile() — § 4.1.3.4 calibration-aware metric.
Reads calibrationCorpus, calibrationExamples, calibrationTokens
from the alloy stage. Tier 2 wires to scripts/expert_activation_
profile.py. Tier 1: short-circuits cleanly when ctx.model is None.
expert_prune() — per-layer top-K removal keyed to the importance JSON.
Reads keepExpertsPerLayer, originalExpertsPerLayer, prunePct,
strategy, perLayerNormalized, etc. from the alloy stage. Tier 2
wires to scripts/cpu_expert_prune_v2.py --importance-json.
Methods NOT overridden (the family doesn't support these by design):
prune — alloys for this family must use 'expert-prune' not 'prune'.
The base default raises with "MoE families should use expert-
prune" pointing the dispatcher at the contract violation.
train / lora — the morning's compaction shipped without compensation
LoRA. If a Qwen3MoE compensated artifact ships later, train()
gets overridden. Until then, dispatch correctly raises if an
alloy tries to train this family.
modality — text-only family today.
Reproducibility contract: this adapter MUST stay frozen against the
morning's artifact. Methodology improvements (e.g. a different importance
metric) ship as NEW adapters with NEW discriminators, NEVER as edits to
this file. The §4.1.3.4 negative-baseline router-gate-L2 cell is
preserved in the alloy's priorMetricBaselines[] as the falsifiability
anchor — when its adapter ships, it will be a separate
RouterGateL2ImportanceAdapter or a parameterized form of this one.
scripts/adapters/__init__.py — registers qwen3_moe alongside qwen3_dense.
tests/reproducibility/test_published_alloys_dispatch.py — flips
qwen3-coder-30b-a3b-compacted-19b-256k from 'deferred' to 'active'. xfail
turns into pass automatically.
Test status:
Before: 8 passed, 5 skipped, 3 xfailed
After: 9 passed, 5 skipped, 2 xfailed ← qwen3_moe now green
Fourth plugin-sprint commit. Adds the OLMoE family adapter (the §4.1.3.4 cross-architecture anchor — paired with Qwen3MoE on a structurally different MoE family to validate the calibration-aware metric pattern generalizes across architectures, not just across one family). scripts/adapters/olmoe.py — new: OlmoeAdapter handles architecture='olmoe'. 16 layers × 64 experts, 8 activated. Methods overridden: expert_activation_profile + expert_prune. Same param contracts as Qwen3MoEAdapter — the methodology IS the same; the differences are tensor walks underneath, which Tier 2 lazy-imports will dispatch. The cross-architecture portability fixes from sentinel-ai commit 488b740 are what made the underlying expert_activation_profile.py script handle both families without per-family forks. Reproducibility contract: frozen against the published artifact (alloy hash bba0a92ff0c8bebb, 36.0 HumanEval). Within-model A/B negative-baseline cell (broad-corpus calibration vs code-corpus calibration on the same OLMoE base) is preserved in the alloy's priorMetricBaselines[] as the §4.1.3.4 falsifiability anchor for OLMoE. scripts/adapters/__init__.py — registers olmoe alongside qwen3_dense + qwen3_moe. tests/reproducibility/test_published_alloys_dispatch.py — flips olmoe-1b-7b-compacted-5b from 'deferred' to 'active'. Test status: Before: 9 passed, 5 skipped, 2 xfailed After: 10 passed, 5 skipped, 1 xfailed ← olmoe now green Per the outlier-validation rule from CLAUDE.md, OlmoeAdapter is written as a parallel SIBLING of Qwen3MoEAdapter, not by extracting a base from one example. Both adapters now exist with concrete behavior. The next move evaluates whether the shared 80% justifies extracting an MoEUnfusedExpertsBase — that base extraction lands as its own commit AFTER both siblings are proven, not before. Don't extract a base off one example, and don't bolt a third sibling onto a base whose abstraction was speculated.
Fifth plugin-sprint commit. Adds the Qwen2 dense adapter for the v2-7b-coder-
compensated artifact (the §4.1.3.3 compensation-LoRA anchor). With this
landed, every published continuum-ai/* alloy with a .alloy.json now
resolves at Tier 1 dispatch.
scripts/adapters/qwen2_dense.py — new:
Qwen2DenseAdapter handles architecture='qwen2'. Methods overridden:
prune — same dense-head pruning shape as Qwen3DenseAdapter (the
underlying forge_model.prune call is architecture-agnostic
for dense Qwen-family models). Tier 2 wiring deferred.
train — handles BOTH normal recovery LoRA AND § 4.1.3.3 compensation
distillation. Dispatches internally on the presence of a
'teacher' field in the stage params, which signals KL-
distillation against an unmodified teacher. Both flow through
the same .train() method because the alloy uses 'lora' stage
type for both — the discrimination is by content, not by
stage name.
The compensation distillation path's params (teacher, kdTemperature,
loraRank, loraAlpha, lossType, mergedAtSave, trainableParamsPct) ARE
the § 4.1.3.3 methodology. The adapter's contract logs them so the
dispatch report shows what would execute, even though Tier 2 wiring
to scripts/compensation_lora.py is still pending.
scripts/adapters/__init__.py — registers qwen2_dense.
tests/reproducibility/test_published_alloys_dispatch.py — flips
qwen2.5-coder-7b-compacted from 'deferred' to 'active'.
Test status:
Before: 10 passed, 5 skipped, 1 xfailed
After: 11 passed, 5 skipped, 0 xfailed ← every published alloy with
an .alloy.json now resolves
cleanly at Tier 1 dispatch.
Catalog coverage at Tier 1:
✓ qwen3.5-0.8b-general-forged (qwen3_5)
✓ qwen3.5-2b-general-forged (qwen3_5)
✓ qwen3.5-4b-general-forged (qwen3_5)
✓ qwen3.5-4b-code-forged (qwen3_5)
✓ qwen3.5-4b-code-128k-forged (qwen3_5)
✓ qwen3.5-9b-general-forged (qwen3_5)
✓ qwen3-coder-30b-a3b-compacted-19b-256k (qwen3_moe — § 4.1.3.4 anchor)
✓ olmoe-1b-7b-compacted-5b (olmoe — § 4.1.3.4 cross-arch anchor)
✓ qwen2.5-coder-7b-compacted (qwen2 — § 4.1.3.3 anchor)
⊘ 5 variants skipped (no alloy.json — publish-pipeline gap)
Now visible code-overlap candidates for base extraction (next commit):
- Qwen3DenseAdapter.prune ↔ Qwen2DenseAdapter.prune — both call
forge_model.prune the same way. Justifies QwenDenseBase.
- Qwen3MoEAdapter.expert_activation_profile ↔ Olmoe equivalent — both
log the same calibration corpus + count + tokens, both Tier 2-wire
to scripts/expert_activation_profile.py. Justifies MoEUnfusedExpertsBase.
- Qwen3MoEAdapter.expert_prune ↔ Olmoe equivalent — same.
These extractions land as their own commit per the OOP rule: write
two siblings first, prove both work, THEN extract a base from the
proven shared 80%. The next commit does that extraction; this commit
deliberately leaves the duplication in place so the diff shows the
true shared shape.
…umbers from a Mac
Sixth plugin-sprint commit. Adds the cheapest possible falsifiability check
on every shipped continuum-ai/* artifact: download the per-problem JSONL
eval samples, sha256 them, compare to the alloy's recorded resultHash.
No GPU, no torch, no model load, no inference. Pure bytes-in / hash-out.
This is the test that could have caught a silent post-publish edit of the
morning's flagship artifact's eval JSONL — and now it does.
What it actually verifies (results — every claim from the alloys hashes
correctly today, and stays verified going forward):
qwen3-coder-30b-a3b-compacted-19b-256k:
student_samples.jsonl → sha256:472eef03dfe0a3c81b30afa70b2788325c… ✓
base_samples.jsonl → sha256:36741af29419e658b820e0f0a5dd01988f… ✓
(these score the headline 88.4 / 86.0 vs 92.1 / 89.0 numbers)
olmoe-1b-7b-compacted-5b:
student_samples.jsonl ✓
base_samples.jsonl ✓
(the §4.1.3.4 cross-architecture anchor)
qwen2.5-coder-7b-compacted:
humaneval_samples.jsonl ✓
(the §4.1.3.3 dense compensation anchor)
How it works: every published alloy's results.benchmarks[] entries declare
both samplesPath (where in the HF repo to find the JSONL) and resultHash
(sha256:…) — paired with baseSamplesPath / baseResultHash for the unmodified
base anchor. The test walks the cache, extracts every (samplesPath, hash)
pair, fetches the bytes from HF (cached under tests/reproducibility/_cache/samples/),
sha256s them, asserts equality. Cases are deduplicated by (samplesPath, hash)
so a single JSONL scoring multiple benchmarks is verified once.
tests/reproducibility/test_published_alloys_sample_hashes.py — new test module:
- test_cache_has_alloys / test_cases_were_extracted: sanity gates
- test_published_samples_match_alloy_hash[*]: 5 forward verifications
across the 3 flagship MoE / dense artifacts
- test_prior_baseline_samples_pinned_and_match[*]: catches the
negative-baseline cells that publish samples WITHOUT a hash —
surfaces them as xfail with a clear "fix layer" message so the
falsifiability gap is visible in the test suite, not just in a TODO.
tests/reproducibility/_cache/samples/ — pinned reference snapshots of the
5 forward-claim JSONLs, same pattern as the alloy cache. Committing them
makes the test runnable offline and guarantees the contract is asserted
against the exact bytes the adapters were built against, not whatever
HF currently serves.
Brand-integrity gaps surfaced (each one is now a tracked xfail, not a
hidden TODO):
GAP 1: priorMetricBaselines[].evaluation has no samplesHash field.
Affected: qwen3-coder-30b-a3b-compacted (§4.1.3.4 router-gate-l2 anchor)
olmoe-1b-7b-compacted-5b (§4.1.3.4 broad-corpus anchor)
Impact: the falsifiability anchor for the published methodology paper's
§4.1.3.4 finding is published but UNPINNED. Anyone with HF
write access could swap student_samples_router_l2_baseline.jsonl
and the published −13.4 HumanEval delta could not be verified
byte-for-byte.
Fix layer: forge_alloy/types.py (add evaluation.samplesHash to the
PriorMetricBaseline schema), then alloy_to_card.py and
publish_model.py to compute and emit the hash.
GAP 2: §4.1.3.4.1 calibration corpus not uploaded to the HF repo.
The alloy's expert-activation-profile stage references
'calibration/heldout_code300.jsonl' but no such file exists in the
qwen3-coder-30b-a3b-compacted-19b-256k repo. The §4.1.3.4.1 discipline
gate requires the corpus to be hash-pinned AND uploaded so any
re-pruner can start from the same bytes. Currently violated for both
flagship MoE artifacts.
Fix layer: publish_model.py (upload calibration/ alongside model files
+ write its sha256 into the alloy's calibrationCorpora root
extension).
NOT covered by this test yet — separate test will catch it once the
alloy schema gains a 'calibrationCorpora[].sha256' verifiable field.
Test status across the whole reproducibility module now:
Tier 1 dispatch: 11 passed, 5 skipped, 0 xfailed
Tier 3 sample hash: 7 passed (5 forward + 2 sanity), 0 failed, 2 xfailed
(the unpinned negative-baseline cells)
Total reproducibility test count: 25, all green or expected-fail.
Tier 4 (re-score samples → produce pass@1 → compare to published score)
is the natural follow-up: once we trust the JSONL bytes (Tier 3 ✓), running
the evalplus scorer against them produces the published 88.4 / 86.0 / 36.0
/ 61.0 numbers without invoking a model. That validates the published
benchmark SCORES, not just the sample bytes. Lands as the next commit.
…sh bugs
Seventh plugin-sprint commit. Lands the strongest possible Mac-side
falsifiability gate (Tier 4: re-score the published JSONLs with
evalplus's canonical pass@1 and assert the alloy's headline matches),
catches a real one-off bug in the morning's flagship alloy, fixes the
two in-tree publish-pipeline bugs that could reproduce it, and corrects
the local cached alloy bytes.
== What this commit verifies (across all 3 reproducibility test modules)
Tier 1 dispatch: 11 active alloys resolve to clean adapter chains
Tier 3 sample-hash: all forward sample-hash claims verify against alloy
Tier 4 canonical pass@1: every published score reproduces to ±0.00 pp via
evalplus's official CLI on the published JSONL bytes
Total: 32 passed, 5 skipped, 2 xfailed.
The 5 skipped are downstream variants with no alloy.json (separate publish-
pipeline gap). The 2 xfailed are priorMetricBaselines cells that publish
samples without a samplesHash field — separate falsifiability anchor gap
that needs a forge-alloy schema field, tracked in Tier 3.
== Tier 4 scorer (tests/reproducibility/_humaneval_scorer.py)
Wraps evalplus's official `python -m evalplus.evaluate` so it runs cleanly
on macOS. The official scorer fails on macOS because reliability_guard
calls resource.setrlimit(RLIMIT_AS, ...) which errors with 'current limit
exceeds maximum limit', and because evalplus uses 'spawn' multiprocessing
on macOS by default a parent-side monkey-patch doesn't reach the worker
children that actually run candidates. Result on stock macOS: every JSONL
scores uniformly 0.000 — false-negative reproducibility.
The fix is two-part and lives in a CLEAN subprocess so any already-loaded
evalplus modules from the parent process don't leak in:
1. Spawn a fresh `python -c` subprocess.
2. Inject a tiny preamble that sets multiprocessing start_method='fork',
monkey-patches reliability_guard to a no-op on both evalplus.eval and
evalplus.eval.utils, then invokes evalplus.evaluate.main() with the
right argv.
Forked workers inherit the parent's no-op binding; setrlimit never runs;
candidates execute normally; pass@1 matches the canonical Linux output
exactly. The scorer reads evalplus's per-task details JSON to extract
exact passed/total counts on top of the CLI's pass@1 string.
Earlier history (gone): a hand-rolled inline scorer that exec'd the
dataset's `test` field directly. It matched evalplus on most JSONLs to
±0.05 pp but disagreed by one problem on the OLMoE broad-corpus JSONL
because it didn't replicate evalplus's _special_oracle / contract
handling. The right answer was to fix the wrapping, not the scorer.
== Tier 4 test (tests/reproducibility/test_published_alloys_scoring.py)
Walks every cached alloy and parametrizes scoring cases over
results.benchmarks[] entries — both 'humaneval' and 'humaneval_plus' are
scored, both student samples and base anchor samples. Same shape for
priorMetricBaselines[] (the §4.1.3.4 falsifiability anchors). Tolerance:
±0.1 pp.
The morning's flagship §4.1.3.4 anchor — the qwen3-coder-30b-a3b-
compacted-19b-256k artifact — verifies end-to-end:
base anchor: 92.10 reproduced ✓ (151/164)
student: 88.40 reproduced ✓ (145/164)
router-l2 negative: 78.66 reproduced ✓ (129/164, the §4.1.3.4 falsifiability anchor)
Δ student vs negative baseline: +9.74 pp ≈ paper's +9.7 ✓
The OLMoE §4.1.3.4 cross-architecture anchor verifies the same way
including the broad-corpus negative-baseline cell.
FUTURE — eval as adapter-driven stage: documented in the test module
docstring. Long-term, the scorer is invoked through a family adapter's
.eval() method, with each family declaring its canonical benchmark suite
(HumanEval for code, MMLU for general, MMMU for vision, COVOST 2 for
audio, etc). The standalone scorer here is the bridge until the adapter-
driven eval-runner registry lands.
== Bugs found and fixed
The Tier 4 test caught a 0.6 pp overstatement on TWO rows of the morning's
flagship alloy (qwen3-coder-30b-a3b-compacted-19b-256k):
student humaneval_plus: 86.0 (alloy) vs 85.40 (canonical)
base humaneval_plus: 89.0 (alloy) vs 88.40 (canonical)
Root cause: the alloy was authored using a non-canonical pass@1 counting
convention — (plus_status=='pass' / total) = 141/164 = 85.97 → 86.0 — when
evalplus's canonical pass@1 uses (base_status==plus_status=='pass' / total)
= 140/164 = 0.854 = 85.4. Same convention error on both the student and
base rows. Every other published alloy (OLMoE student/base, v2-7b-coder,
both negative-baseline cells) reproduces to ±0.00 pp, so the bug was
one-off in the path that wrote this morning's flagship alloy.
Two in-tree code paths COULD have reproduced this kind of error — both
fixed in this commit so future publishes can't:
scripts/stages/output_stages.py::_parse_evalplus_output:
Walked all output lines and overwrote metrics['score'] each iteration,
so it always returned the LAST pass@1 value (= humaneval_plus) regardless
of which benchmark name was being scored. Assigning humaneval_plus's
value to a humaneval benchmark. Fixed: section-aware regex parsing
that selects the right pass@1 line per benchmark name. Also bumped
rounding precision from 1 dp to 2 dp (1 dp loses 0.5 pp of fidelity
on small score differences and is the kind of rounding that masks
bugs like the one above).
scripts/add_benchmark.py::_load_evalplus_results:
The eval_results.json branch read keys (`pass@1.n_correct`) that don't
exist in evalplus's actual schema (the actual schema is `eval[task_id]`
list with base_status / plus_status). The JSONL fallback counted
`is_passing` / `passed` fields that the published JSONLs don't carry
(they only have task_id + solution). Both branches always returned
0/164 — `add_benchmark.py --from-evalplus` was a silent no-op that
wrote 0% to the alloy. Fixed: delegate to the canonical scorer
(tests/reproducibility/_humaneval_scorer.py) which uses evalplus's
official CLI, returns separate humaneval and humaneval_plus values,
and rounds to 2 dp.
== Local alloy correction
The cached qwen3-coder-30b-a3b alloy is patched in place to use canonical
values (humaneval_plus 85.4 / 88.4) and the version is bumped to 1.0.1.
The published JSONL bytes are NOT changed — only the alloy fields that
score them are corrected. A scoreCorrection block is added to each
patched benchmark entry recording the previous values, the corrected
values, the date, and the reason, so the audit trail is in-band.
The HuggingFace-published alloy still has the old values. Action item:
re-publish the corrected alloy via publish_model.py when ready. Until
then, the local cache (which the tests pin against) is the source of
truth for the canonical numbers; HF lags by one publish cycle.
== Cache hygiene
tests/reproducibility/_cache/samples/.gitignore now excludes
*_eval_results.json — those are evalplus's per-task output files that
the scorer regenerates on every run (and deletes before each run for
safety). They must NOT be pinned alongside the JSONL samples files,
which ARE pinned reference snapshots.
Test-state delta:
Before this commit: 12 passed, 1 xfailed, 1 failed (the 0.6pp drift)
After this commit: 14 passed, 0 xfailed, 0 failed (Tier 4 only)
Combined across Tier 1+3+4: 32 passed, 5 skipped, 2 xfailed
…ctions
Eighth plugin-sprint commit. Adds the focused tool that re-publishes
ONLY the alloy.json + regenerated README + regenerated QR to a HF
repo, leaving model weights and per-problem JSONLs untouched. Used
this commit to fix the qwen3-coder-30b-a3b-compacted-19b-256k humaneval_plus
non-canonical convention bug that the Tier 4 reproducibility test caught.
== What it does
scripts/republish_alloy_only.py reads a corrected local alloy file,
diffs it against the current HF version, regenerates the model card via
alloy_to_card.alloy_to_card() and the QR via qrcode against the new
verify URL, then atomically uploads the three metadata files. Defaults
to dry-run; --confirm pushes.
Defenses:
- Refuses if local alloy bytes are byte-identical to current HF (no diff)
- Refuses if results.integrity.modelHash differs (use publish_model.py
for full re-publish that includes weights)
- Generates a structured field-level diff summary so review is fast
Files touched per run: alloy.json, README.md, alloy-qr.png. Files NOT
touched: model weights, eval/*.jsonl, calibration/*, tokenizer*, config*.
== Live HF state after this commit
continuum-ai/qwen3-coder-30b-a3b-compacted-19b-256k:
alloyHash: aa61c4bdf463847c → 011970c80c2f3429
version: 1.0.0 → 1.0.1
humaneval: 88.4 / 92.1 / Δ-3.7 (unchanged — was already canonical)
humaneval_plus: 86.0 → 85.4 (canonical evalplus pass@1)
baseScore plus: 89.0 → 88.4 (canonical evalplus pass@1)
scoreCorrection: in-band record of the previous values + reason
The published JSONL bytes were NOT modified; only the alloy fields that
score them were corrected. The headline 88.4 HumanEval claim is
unchanged. The methodology paper §4.1.3.4 +9.7pp metric-swap claim is
unchanged (it's computed against the negative-baseline cell which was
always canonical). The README's headline still reads
"37% Experts Pruned, 88.4 HUMANEVAL (base 92.1)".
The benchmark table now correctly reads:
| humaneval | 88.4 | 92.1 | -3.7 |
| humaneval_plus | 85.4 | 88.4 | -3.0 |
The verify URL on HF is now https://cambriantech.github.io/forge-alloy/verify/#011970c80c2f3429
The old verify URL #aa61c4bdf463847c is orphaned and will not resolve
against the live alloy. (Per the never-lose-work rule, the previous
alloy bytes are still recoverable from HF git history if anyone needs
the audit trail; the scoreCorrection block in the new alloy also
documents the change in-band.)
== Tier 4 reproducibility status after the live re-publish
The local cache (already committed in the previous Tier 4 commit) is
byte-identical to what's now on HF. The reproducibility test stays
fully green: 32 passed, 5 skipped, 2 xfailed across all three tiers.
The 2 remaining xfails are the priorMetricBaselines unpinned-samples-hash
gap (separate forge-alloy schema field needed), tracked at the Tier 3
layer.
== Why this is a separate script from publish_model.py
publish_model.py does the full re-publish (model weights + alloy + card + QR)
and is the right tool when the model itself changes. For a metadata-only
correction like this one, re-publishing the weights would be:
- Wasteful (10s of GB transfer for 3 small text changes)
- Risky (could touch the modelHash chain or eval JSONL files)
- Slow (the upload takes hours)
republish_alloy_only.py is the surgical tool: smallest possible change
to fix the alloy text, keep everything else immutable, leave the weight
chain untouched. It's also strictly defensive — it refuses to run if
the modelHash field differs between local and HF, forcing the operator
to use publish_model.py for any change that touches weights.
Ninth plugin-sprint commit. Closes the publish-pipeline gap that left 8
shipped continuum-ai/* artifacts without a forge-alloy provenance envelope.
Every model on the org page now has a working alloy that dispatches
through the family-adapter set; the Tier 1 reproducibility test goes
from "11 active + 5 skipped + 3 deferred" to "19 active + 0 skipped".
== What was missing
Before this commit, 8 of 17 LLM artifacts on continuum-ai had no
.alloy.json on HuggingFace:
Pre-§4.1.3.1 legacy forges (had forging_results.json, no alloy):
qwen2.5-0.5b-general-forged
qwen2.5-1.5b-general-forged
qwen2.5-3b-general-forged
qwen3.5-27b-code-forged
Downstream variant artifacts (no provenance file at all):
qwen3.5-4b-code-forged-defragged
qwen3.5-4b-code-forged-GGUF
qwen3.5-27b-code-forged-defragged
qwen3.5-27b-code-forged-mlx-4bit
The qwen2.5-{0.5b,1.5b,3b}-general-forged trio shipped before the alloy
schema existed and persisted with old-style results blobs. The
qwen3.5-27b-code-forged was the parent of three downstream variants that
also lacked alloys. Each downstream variant inherits its forge journey
from the parent but had no link in the chain.
== Two new backfill tools
scripts/backfill_alloy_from_results.py:
Synthesizes a forge-alloy from a legacy forging_results.json. Maps the
old-style fields (model, strategy, pruning_level, baseline_ppl,
final_ppl, training_data, hardware_targets, forged_at) onto the
current alloy schema. Detects architecture from the repo's config.json.
Composes a deterministic modelHash from per-shard LFS sha256s pulled
via HuggingFace's metadata API — no shard downloads required, works
for any size repo (the 27B's 11×5GB shards were "hashed" without
fetching a byte). Stamps a backfill marker so the audit trail records
that the alloy was retroactively synthesized 2026-04-08, while the
forge run itself executed at the date in results.completedAt.
Refuses if the repo already has a .alloy.json (use republish_alloy_only.py
for corrections instead).
scripts/derive_alloy_from_parent.py:
Synthesizes an alloy for a downstream variant by inheriting from its
parent's published alloy and appending a single derivation stage.
Three kinds:
defragged → 'package' stage with safetensors-defragged format
gguf → 'quant' stage with format=gguf, quantTypes=[Q4_K_M, Q8_0]
mlx-4bit → 'quant' stage with format=mlx, quantTypes=[4bit]
Each derived alloy:
- Inherits source.baseModel + source.architecture from parent
- Inherits stages[] verbatim and appends the derivation stage
- Inherits parent's results.benchmarks (model behavior preserved
through defrag/quant within published tolerance)
- Adds a `derivedFrom` field pointing at the parent repo
- Adds `parentAlloyHash` to integrity for chain walking
- Computes its OWN modelHash from the variant's actual file LFS
sha256s (different from parent — defragged/quantized weights
have different bytes)
Refuses if child already has an alloy.
== modelHash composition convention
Both tools use a new deterministic modelHash:
sha256(canonical_json([{filename, sha256}, ...]))
over the sorted list of per-file LFS sha256s. This is reproducible from
HF metadata alone (no downloads), preserves per-shard attestation in
integrity.fileHashes for verifiers who want to check individual shards,
and gives the same security guarantee as the legacy
sha256(concat(shard_bytes)) convention used by publish_model.py.
NOTE: publish_model.py still uses the legacy concat-and-hash convention
for newly-forged artifacts. That's a follow-up consolidation — the two
conventions don't conflict (they're attestation algorithms over the
same underlying bytes), but unifying them will let the same verifier
check both backfilled and freshly-forged alloys without convention
switching. Tracked separately.
== republish_alloy_only.py: backfill mode
Added a "backfill mode" path: when the target HF repo has NO existing
alloy at all, the script uploads the local file using its basename as
the in-repo path, skips the diff-against-current-HF check (nothing to
diff against), prints the variant's benchmark metadata for review, and
lands all three metadata files (alloy.json + README.md + alloy-qr.png).
The defensive modelHash check is also skipped in backfill mode (no old
modelHash to compare against), but the local alloy still has to declare
ONE so the chain isn't broken.
This let the same one tool drive both:
- Corrections to existing alloys (the qwen3-coder-30b-a3b humaneval_plus
fix from the previous commit)
- First-time publishes for the 8 backfilled alloys above
== Live HF state after this commit (8 fresh uploads)
Backfilled from forging_results.json:
qwen2.5-0.5b-general-forged → alloy a3750da128ba76f0
qwen2.5-1.5b-general-forged → alloy f024d59a481e9032
qwen2.5-3b-general-forged → alloy a13bcfcdc2c8652a
qwen3.5-27b-code-forged → alloy 80a26f0ec24dfc1e
Derived from parent alloys:
qwen3.5-4b-code-forged-defragged → alloy 62f1107fb6142943 (parent: qwen3.5-4b-code-forged)
qwen3.5-4b-code-forged-GGUF → alloy f7f4f6ddf29019d2 (parent: qwen3.5-4b-code-forged)
qwen3.5-27b-code-forged-defragged → alloy f3e68ab40f644c9a (parent: qwen3.5-27b-code-forged)
qwen3.5-27b-code-forged-mlx-4bit → alloy 6ca79c62b879cd4c (parent: qwen3.5-27b-code-forged)
Each upload landed three metadata files (alloy.json + README.md +
alloy-qr.png) atomically via republish_alloy_only.py's --confirm path.
NO model weights were touched. NO eval JSONLs were touched. NO
calibration corpora were touched.
== Test catalog change
tests/reproducibility/test_published_alloys_dispatch.py:
Catalog grew from 14 entries (11 Qwen3.5 dense + 3 deferred families)
to 17 entries with all 17 LLM artifacts marked 'active'. The previous
'no-alloy-file' status is gone — every continuum-ai/* LLM artifact
now has a published alloy.
The 'experiential-plasticity-paper' repo (1 of 18 total continuum-ai
repos) is intentionally excluded from the test catalog — it's a paper
repo, not a model.
== Test status across the entire reproducibility suite
Tier 1 dispatch: 19 passed, 0 skipped, 0 xfailed (was: 11 passed, 5 skipped, 3 xfailed)
Tier 3 sample-hash: 7 passed, 2 xfailed (unchanged — same 2 unpinned-baseline cells)
Tier 4 canonical: 14 passed (unchanged — 3 artifacts with eval samples)
Combined: 40 passed, 0 skipped, 2 xfailed
The 2 remaining xfails are the priorMetricBaselines.evaluation.samplesHash
schema gap (separate forge-alloy schema field needed). Every other claim
on every published continuum-ai/* artifact dispatches cleanly through the
adapter set, hashes against its provenance, and (for the 3 artifacts
with eval samples) reproduces the published score canonically.
== What this enables
- Every published model is now part of the chain-of-custody system.
The verify URL on every model card resolves against an alloy that
declares the model's source, forge journey, and integrity.
- The plugin-sprint reproducibility gate now covers the FULL catalog,
not a curated subset. Adding a new family adapter (Mixtral, Granite,
DeepSeek-V2, etc.) automatically covers any future continuum-ai
artifact in that family — no per-artifact bookkeeping.
- Future re-prunes / re-quants of any backfilled artifact land via the
standard publish pipeline through the adapter set; the backfill
tools are one-shot bridges that close the historical gap, not
permanent infrastructure.
- The Tier 4 evalplus scorer is now wired to validate any artifact
whose alloy carries eval samples. The 3 active artifacts validate
today; the rest will activate as eval samples are uploaded via
add_benchmark.py --from-evalplus (which now correctly reads the
canonical scorer per the previous commit).
.gitignore: backfill_alloys/ excluded — that's the local working
directory; the committed source of truth is tests/reproducibility/_cache/.
…oadmap
Tenth plugin-sprint commit. Captures the full state of the family-adapter
sprint so a future session can pick up cleanly after a drive crash or
context loss without re-discovering the architecture.
== What it documents
- The two-axis dispatch architecture (StageExecutor → FamilyAdapter)
- All 9 plugin-sprint commits with one-line summaries
- Repository layout post-sprint (scripts/adapters/, tests/reproducibility/,
scripts/backfill_alloy_from_results.py, scripts/derive_alloy_from_parent.py,
scripts/republish_alloy_only.py)
- The full FamilyAdapter contract (REQUIRES_FAMILY_OVERRIDE set,
STAGE_METHOD_MAP, default behaviors)
- The 4 reproducibility test tiers (Tier 1 dispatch, Tier 2 re-forge,
Tier 3 sample-hash, Tier 4 canonical pass@1)
- The macOS-evalplus reliability_guard workaround (load-bearing for Tier 4)
- Live HuggingFace state of all 17 published continuum-ai/* model
artifacts including alloyHashes, adapter mappings, and provenance
source (shipped vs backfilled vs derived)
- The modelHash convention drift between publish_model.py and the
backfill tools (and the unification plan in roadmap step 7)
- The 8-step "correct architecture" roadmap with acceptance criteria:
1. Extract QwenDenseBase
2. Extract MoEUnfusedExpertsBase
3. Tier 2 wiring for the MoE adapters
4. Eval-runner registry on family adapters (unblocks frontier targets)
5. forge-alloy llm-forge domain extension (cross-repo)
6. Vision-safety integration (Qwen3VLAdapter)
7. modelHash convention unification
8. priorMetricBaselines.samplesHash schema field + calibration corpus upload
- Glossary of acronyms / repo paths / §4.1.3.x section references
- Crash-recovery checklist at the bottom
== Cross-references
- continuum/docs/architecture/FORGE-ALLOY-DOMAIN-EXTENSIBILITY.md updated
to reference this doc as the consumer-side companion. The schema work
in that doc is roadmap step 5 of this sprint.
- ~/.claude/.../memory/reference_plugin_sprint_doc.md saved as a
pointer for crash-recovery context loading.
== Why this exists
Joel hit "drive crash, then update your design doc for completion in
case of another drive crash" — the previous crash wiped Claude's
in-session context for the entire morning's §4.1.3.4 / qwen3-coder-30b-a3b
work, and recovery was slow because the state was scattered across
commit messages and the convo-with-kash.txt paste log. This doc is the
single source of truth that lets the next session pick up from any of
the 8 roadmap steps without re-discovering the architecture.
== Next action
Step 1 of the roadmap: extract QwenDenseBase from Qwen2DenseAdapter +
Qwen3DenseAdapter. The OOP rule justifies it now that two siblings exist
with proven Tier 1 dispatch behavior. Same shape on both axes — same
forge_model.prune call, same forge_model.train_lora call. This commit
documents the plan; the next commit lands the extraction.
…n2DenseAdapter
First "correct architecture" roadmap step from docs/PLUGIN-SPRINT.md.
Pure refactor — test stays at 40 passed, 0 skipped, 2 xfailed.
== What moved
Both Qwen3DenseAdapter and Qwen2DenseAdapter had parallel prune() / train()
bodies that called forge_model.prune + defrag_inline.defrag_live_model the
same way. Per the OOP rule (~/.claude/.../memory/feedback_adapters_not_branches.md
+ CLAUDE.md outlier-validation strategy): write two siblings first, prove
they work, THEN extract a base from the shared 80%.
Both siblings have proven Tier 1 dispatch behavior across all 17 published
continuum-ai/* artifacts. This commit extracts.
scripts/adapters/qwen_dense_base.py (NEW, 277 lines):
QwenDenseBase(FamilyAdapter) owns:
- prune(): full body — compute_head_importance + forge_model.prune
(forward_hooks) + immediate defrag_inline.defrag_live_model + per-layer
importance bookkeeping. Lazy imports so Tier 1 dispatch stays torch-free.
Short-circuits cleanly when ctx.model is None.
- train(): dispatches internally on the 'teacher' field. If present, routes
to _train_compensation (§4.1.3.3 KL distillation, currently a Tier 2
stub pointing at compensation_lora.py). If absent, routes to
_train_recovery (forge_model.train_lora — REAL Tier 2 wiring).
The dispatch-on-teacher pattern collapses what was two parallel methods
on Qwen2DenseAdapter (which had compensation handling) and Qwen3DenseAdapter
(which only had recovery training) into one.
scripts/adapters/qwen3_dense.py (55 lines, was 216):
Qwen3DenseAdapter(QwenDenseBase) — declares architectures = ("qwen3_5",)
and overrides context_extend() for the qwen3.5-4b-code-128k-forged YaRN
variant. Inherits prune + train + everything else from the base.
scripts/adapters/qwen2_dense.py (31 lines, was 147):
Qwen2DenseAdapter(QwenDenseBase) — pure inheritance. Just declares
architectures = ("qwen2",). The compensation distillation handling that
used to live here is now in the base's train() dispatch.
== Why this is the right shape
A future dense Qwen-family adapter (Qwen3.5-VL dense pathway, a Qwen3.6
family if it ships) inherits from QwenDenseBase by default and only overrides
the methods that differ for its family. Adding such a sibling is now ~30
lines, not ~150.
A new dense family that is NOT Qwen (Llama, Mistral) gets its own base if
its forge_model code path differs — but for the Qwen lineage specifically,
forge_model.prune is architecture-agnostic and handles all of them via the
same code, so they all share QwenDenseBase.
The base-class methods stay frozen against the published artifacts. New
methodology arrives as NEW adapters with NEW architecture strings, never
as edits to QwenDenseBase or its subclasses.
== Test status (unchanged — pure refactor)
Tier 1 dispatch: 19 passed
Tier 3 sample-hash: 7 passed, 2 xfailed
Tier 4 canonical: 14 passed
Combined: 40 passed, 0 skipped, 2 xfailed in 287s
The override-detection logic in the dispatch test compares
Qwen3DenseAdapter.prune (which resolves up the MRO to QwenDenseBase.prune)
against FamilyAdapter.prune (the NotImplementedError stub). They're
different function objects, so the inherited override IS detected — the
test passes for both subclasses.
== LOC compression
qwen_dense_base.py: +277 (new shared base)
qwen3_dense.py: -161 (216 → 55)
qwen2_dense.py: -116 (147 → 31)
Net total: ~unchanged on this pair, but the next sibling
costs ~30 lines instead of ~150. Compounding wins
as more dense families ship.
== Next
Step 2 of the roadmap: extract MoEUnfusedExpertsBase from Qwen3MoEAdapter
and OlmoeAdapter. Same shape, same justification — both siblings exist
with proven Tier 1 dispatch, both will Tier-2-wire to the same scripts
(expert_activation_profile.py, cpu_expert_prune_v2.py).
Second "correct architecture" roadmap step from docs/PLUGIN-SPRINT.md.
Pure refactor — test stays at 40 passed, 0 skipped, 2 xfailed.
Same OOP shape as Step 1 (QwenDenseBase): two siblings exist with proven
Tier 1 dispatch behavior, both will Tier-2-wire to the same scripts
(scripts/expert_activation_profile.py and scripts/cpu_expert_prune_v2.py
--importance-json), so the shared 80% gets pulled up into a base.
scripts/adapters/moe_unfused_base.py (NEW, 183 lines):
MoEUnfusedExpertsBase(FamilyAdapter) owns:
- expert_activation_profile() — § 4.1.3.4 calibration-aware MoE expert
importance profiling. Reads calibrationCorpus/File/Examples/Tokens
from the alloy stage params, lazy-imports the script (Tier 2 stub
today, raises with a clear pointer until roadmap step 3 lands the
Python API extraction). Short-circuits cleanly when ctx.model is None.
- expert_prune() — per-layer top-K removal keyed to the importance JSON
from the upstream profiling stage. Reads keepExpertsPerLayer,
originalExpertsPerLayer, prunePct, strategy, perLayerNormalized,
expertTensorLayout, etc. from the alloy. Tier 2 stub today.
Both methods short-circuit cleanly when ctx.model is None, so the Tier 1
dispatch path stays working.
IMPORTANT: this base assumes the unfused experts layout that Qwen3MoE
and OLMoE both use (model.layers.{i}.mlp.experts.{e}.{gate,up,down}_proj
+ model.layers.{i}.mlp.gate). Future MoE families with DIFFERENT layouts
(Mixtral block_sparse_moe, Granite-MoE fused, DeepSeek-V2 routed+shared,
Phi-MoE) either ship as their own family adapter that overrides
expert_prune entirely OR extend the dispatch inside cpu_expert_prune_v2.py
per the expertTensorLayout field. NOT by adding `if architectures[0]
== ...` branches to this base. The base's docstring spells this out
explicitly to prevent the never-branch failure mode.
scripts/adapters/qwen3_moe.py (37 lines, was 147):
Qwen3MoEAdapter(MoEUnfusedExpertsBase) — pure inheritance. Just declares
architectures = ("qwen3_moe",). Handles the morning's flagship
qwen3-coder-30b-a3b-compacted-19b-256k §4.1.3.4 anchor.
scripts/adapters/olmoe.py (34 lines, was 104):
OlmoeAdapter(MoEUnfusedExpertsBase) — pure inheritance. Just declares
architectures = ("olmoe",). Handles the §4.1.3.4 cross-architecture
anchor olmoe-1b-7b-compacted-5b.
== Why this is the right shape
Qwen3-235B-A22B (frontier target) and Qwen3-Coder-480B-A35B-Instruct
(moonshot target) both use the qwen3_moe architecture string and the
unfused experts layout — they will inherit Qwen3MoEAdapter directly with
no code changes when the alloy declares them. Same for any future Qwen3MoE
forge.
Adding a new MoE family with the SAME unfused layout (e.g. a future
allenai OLMoE-2 variant) is one new file with ~25 lines —
class XAdapter(MoEUnfusedExpertsBase): architectures = ("x",) — and one
import line in __init__.py.
Adding a new MoE family with a DIFFERENT layout (Mixtral, GraniteMoE,
DeepseekV2, PhiMoE) is one new file that EITHER inherits from this base
and overrides expert_prune for the layout-specific tensor walk OR (if the
shared behavior is too thin) introduces its own base. The decision lands
when the second non-unfused MoE family ships and the right abstraction
becomes visible — per the outlier-validation rule, don't speculate.
== Test status (unchanged — pure refactor)
Tier 1 dispatch: 19 passed
Tier 3 sample-hash: 7 passed, 2 xfailed
Tier 4 canonical: 14 passed
Combined: 40 passed, 0 skipped, 2 xfailed in 285s
The override-detection logic in the dispatch test compares
Qwen3MoEAdapter.expert_prune (which resolves up the MRO to
MoEUnfusedExpertsBase.expert_prune) against FamilyAdapter.expert_prune
(the NotImplementedError stub). They're different function objects, so
the inherited override IS detected — the test passes for both subclasses.
== LOC after Step 2
moe_unfused_base.py: +183 (new shared base)
qwen3_moe.py: -110 (147 → 37)
olmoe.py: -70 (104 → 34)
Net total: ~unchanged on this pair, but the next MoE-unfused
sibling costs ~25 lines instead of ~150.
Compounding wins as more MoE-unfused families ship.
== Plugin-sprint roadmap progress (steps in docs/PLUGIN-SPRINT.md)
Step 1 ✓ (commit db54f9d) — QwenDenseBase extracted
Step 2 ✓ (this commit) — MoEUnfusedExpertsBase extracted
Step 3 — Tier 2 wiring for the MoE adapters (refactor expert_activation_profile.py
+ cpu_expert_prune_v2.py to expose callable functions; replace
NotImplementedError stubs in this base with real lazy-imported calls)
Step 4 — Eval-runner registry on family adapters (unblocks frontier targets)
Step 5 — forge-alloy llm-forge domain extension (cross-repo)
Step 6 — Vision-safety integration (Qwen3VLAdapter)
Step 7 — modelHash convention unification
Step 8 — priorMetricBaselines.samplesHash + calibration corpus upload
Next: Step 3 (Tier 2 wiring for MoE adapters).
Third "correct architecture" roadmap step from docs/PLUGIN-SPRINT.md.
The MoEUnfusedExpertsBase methods (expert_activation_profile and
expert_prune) no longer raise NotImplementedError when ctx.model is
non-None. They call directly into the underlying scripts via lazy
import. There is exactly one code path per method. The CLI wrappers and
the adapter call sites both invoke the same Python function — no second
path, no deferred state, no silent substitution surface.
== scripts/expert_activation_profile.py — refactored
Two callable entry points + one private inner. Both entry points produce
the same JSON output, both write the importance JSON to disk, both
return the data dict:
profile_experts(*, model, tokenizer, calibration_data, output, ...)
Used by the family-adapter set. Caller provides an already-loaded
model + tokenizer (the alloy_executor has already loaded them onto
the GPU); this function does NOT touch model loading. Delegates
to _profile_inner.
profile_experts_from_path(model_path, calibration_data, output, ...)
Used by the CLI entry point. Loads tokenizer + model from
`model_path` using BitsAndBytesConfig 8-bit on the requested
device, then delegates to _profile_inner.
_profile_inner(...) — the actual hooking + inference + counting + JSON
writing. Both entry points call this with identical semantics.
The CLI main() is now a thin argparse wrapper that constructs the args
and calls profile_experts_from_path. Same script-level behavior, just
factored so the body is reachable by callers other than __main__.
Defensive checks that used to call sys.exit() now raise ValueError or
RuntimeError so the adapter can catch + propagate them as alloy
execution failures, not as process exits.
The "RuntimeError: no router gates found" path used to silently `return 1`
and let the CLI exit; it now raises with a clear message naming the
expected layout (mlp.gate) and pointing future MoE families with
different layouts at the correct fix (write a new family adapter that
overrides the method, do not branch this script).
== scripts/cpu_expert_prune_v2.py — refactored
Single callable entry point (this script is path-based — it operates on
a model_dir on disk and a out_dir on disk, never on a loaded model
object, because it does a streaming safetensors rewrite that wouldn't
fit in memory for big models):
prune_experts(model_dir, out_dir, keep_experts, *, shard_bytes,
importance_json) -> dict
Reads the source model's safetensors shards, selects top-K experts
per layer using the importance metric (calibration-aware activation
count if importance_json is provided, router gate row L2 norm
otherwise), rewrites the surviving experts into out_dir with the
router gate row-sliced to match. Updates out_dir/config.json.
Writes the expert_prune.metadata.v1.json sidecar.
The CLI main() is now a thin argparse wrapper that calls prune_experts.
Defensive sys.exit() calls became ValueError / RuntimeError so the
adapter can catch + propagate them.
== scripts/adapters/moe_unfused_base.py — REAL Tier 2 wiring
MoEUnfusedExpertsBase.expert_activation_profile():
- Reads calibrationCorpusFile / calibrationCorpus from the alloy stage params
- Resolves it against ctx.output_dir if relative
- Raises FileNotFoundError if the corpus is missing (the §4.1.3.4.1
discipline gate requires the corpus to be present and hash-pinned)
- Lazy-imports expert_activation_profile.profile_experts
- Calls it with ctx.model + ctx.tokenizer (already loaded by alloy_executor)
- Writes the importance JSON to ctx.output_dir/importance.activation_count.json
- Stashes the path on ctx.importance_json_path so the downstream
expert_prune stage can find it without having to know the filename
MoEUnfusedExpertsBase.expert_prune():
- Reads keepExpertsPerLayer / strategy / expertTensorLayout from the
alloy stage params
- Raises ValueError if expertTensorLayout != "mlp-experts-unfused"
(the base only handles the unfused layout that Qwen3MoE + OLMoE share;
fused / block_sparse / granite-fused / deepseek-routed-shared layouts
need their own family adapter that overrides this method, NOT a
branch in the base)
- Lazy-imports cpu_expert_prune_v2.prune_experts
- Calls it with ctx.source_model_dir + ctx.output_dir/pruned + the
importance_json path stashed by the upstream stage
- Reloads ctx.model from the pruned dir so downstream stages (quant,
eval, package, publish) operate on the pruned model rather than the
in-memory original
- Frees the original model's GPU memory before loading the pruned one
(torch.cuda.empty_cache + del ctx.model)
Both methods short-circuit cleanly when ctx.model is None, which is
exactly the Tier 1 dispatch test path. The Tier 1 test stays Mac-safe.
The Tier 2 path lights up the moment the executor runs against a real
loaded model on a 5090.
== What this commit does NOT do
It does NOT verify the Tier 2 path produces a bit-identical result to
the morning's flagship qwen3-coder-30b-a3b artifact. That verification
requires running the full forge against the loaded base model on a 5090
and comparing the resulting safetensors hash to the alloy's modelHash.
That's the Tier 2 reproducibility test that's still pending — runs on
BigMama, not on this Mac. Code is ready for it.
It does NOT touch the dense path. QwenDenseBase.prune already had real
Tier 2 wiring from commit 4d087d4 (it always did — the wiring was moved
out of PruneExecutor in that commit). QwenDenseBase.train()'s
compensation distillation path is still a roadmap-step-future stub
because compensation_lora.py needs the same in-process refactor pattern
applied to it, and that's a separate commit so this commit's diff stays
focused on the MoE side.
== Test status
Same as Steps 1 and 2 — pure refactor at the test layer because the
Tier 1 dispatch test never invokes the methods on a loaded model:
Tier 1 dispatch: 19 passed
Tier 3 sample-hash: 7 passed, 2 xfailed
Tier 4 canonical: 14 passed
Combined: 40 passed, 0 skipped, 2 xfailed in 288s
== Roadmap progress
Step 1 ✓ (db54f9d) — QwenDenseBase extracted
Step 2 ✓ (903e898) — MoEUnfusedExpertsBase extracted
Step 3 ✓ (this commit) — MoE base Tier 2 wiring REAL
Step 4 — Eval-runner registry on family adapters
Step 5 — forge-alloy llm-forge domain extension (cross-repo)
Step 6 — Vision-safety integration (Qwen3VLAdapter)
Step 7 — modelHash convention unification
Step 8 — priorMetricBaselines.samplesHash + calibration corpus upload
== Note for future-Claude reading this commit message
The MoEUnfusedExpertsBase wiring assumes the executor populates two ctx
attributes that don't currently exist on ForgeContext (per
scripts/stages/base.py):
ctx.source_model_dir — local path to the unmodified base model
(the safetensors shards on disk)
ctx.importance_json_path — set by expert_activation_profile to pass
data to the downstream expert_prune stage
These need to be added to ForgeContext + populated by alloy_executor's
model-loading phase before the Tier 2 path can actually run end-to-end
on a 5090. That's a small follow-up that lands together with the first
real BigMama Tier 2 reproducibility test run, NOT in this commit (the
adapter code already references the attrs and will raise loudly with
clear messages if they're missing — exactly the deterministic-error
contract this whole architecture exists to preserve).
…st stub (TDD)
Closes the last NotImplementedError stub in the adapter set. The
QwenDenseBase._train_compensation method (the §4.1.3.3 path the morning's
qwen2.5-coder-7b-compacted artifact was forged through) now calls
compensation_lora.compensate_lora() directly via lazy import. There are
no stubs left in any family adapter; every method has exactly one code
path that either runs the real work or short-circuits cleanly when
ctx.model is None for the dispatch-only Tier 1 test.
Written test-first per TDD/TDValidation discipline. The unit test in
tests/unit/adapters/test_compensation_lora_api.py is the SPEC; the
refactor is the implementation that satisfies it. Test was red against
the pre-refactor state, green against the post-refactor state, with no
intermediate stubbing.
== TDD cycle
1. Wrote tests/unit/adapters/test_compensation_lora_api.py asserting:
- compensate_lora is importable as a callable
- compensate_lora has the right keyword-only signature with all
required kwargs
- compensate_lora raises FileNotFoundError on missing calibration corpus
- compensate_lora raises ValueError on invalid loss_type
- compensate_lora_from_paths is importable
- compensate_lora_from_paths has the right signature
- main() (CLI wrapper) still exists
- QwenDenseBase._train_compensation source contains the lazy import
and calls compensate_lora (not raise NotImplementedError)
- QwenDenseBase._train_compensation short-circuits cleanly when
ctx.model is None
2. Ran the test — RED. 7 of 9 failed because compensation_lora.py
imported peft / torch / transformers at module top, so the script
couldn't even be imported on a Mac. The 8th failed because
QwenDenseBase._train_compensation still had the NotImplementedError
stub.
3. Refactored compensation_lora.py:
- Heavy ML imports (peft, torch, torch.nn.functional, transformers,
torch.utils.data) moved INSIDE the functions that use them. Module
is now importable on Mac without any of those installed.
- JsonlTextDataset class definition moved into a make_jsonl_text_dataset
factory function so the torch.utils.data.Dataset base class import
is also lazy.
- VALID_LOSS_TYPES / VALID_TEACHER_QUANTS / VALID_STUDENT_QUANTS
frozensets at module level for the unit test to verify against.
- _validate_compensate_inputs(...) helper validates every input at
the entry surface BEFORE touching any heavy machinery. Loud failures
here mean the contract is wrong; error messages name the offending
field.
- _compensate_inner(*, teacher, student, tokenizer, ...) — the actual
distillation training loop, takes pre-loaded models. Wraps the
existing code path that used to live in main().
- compensate_lora(*, student, student_tokenizer, teacher_path,
teacher_quant, ...) — adapter entry point. Caller provides loaded
student, this function loads the teacher in the requested quant
tier and delegates to _compensate_inner.
- compensate_lora_from_paths(*, teacher_path, student_path, ...) —
CLI entry point. Loads BOTH teacher and student from disk paths
and delegates.
- main() — now a thin argparse wrapper that calls
compensate_lora_from_paths.
4. Wired QwenDenseBase._train_compensation:
- Reads teacher / teacherPrecision / calibrationDataset / loraRank /
loraAlpha / lossType / steps / learningRate / targetModules /
maxLength from the alloy stage params
- Resolves calibrationDataset relative to ctx.output_dir if not absolute
- Lazy-imports compensation_lora.compensate_lora
- Calls it with ctx.model + ctx.tokenizer (already loaded by alloy_executor)
- Reloads ctx.model from the compensated dir for downstream stages
- Frees the original model's GPU memory before reload
- Raises ValueError with clear messages on missing teacher / missing
calibration corpus — the contract violation surface lives at the
adapter, not in compensation_lora.py
5. Ran the test — GREEN, 9 of 9.
6. Ran the full reproducibility suite — 40 + 9 = 49 passed, 2 xfailed
(the same priorMetricBaselines samplesHash gap that's tracked at the
schema layer).
== What this commit does NOT do
It does NOT verify the Tier 2 path produces a bit-identical
qwen2.5-coder-7b-compacted artifact end-to-end against the published
modelHash. That requires a 5090 with the v2-7b base + teacher loaded,
and runs at the Tier 2 reproducibility layer (still pending). The
adapter code is ready for it.
== Roadmap progress
Step 1 ✓ db54f9d — QwenDenseBase extracted
Step 2 ✓ 903e898 — MoEUnfusedExpertsBase extracted
Step 3 ✓ ae081ea — MoE Tier 2 wiring real
Step 3.5 ✓ this — Dense compensation Tier 2 wiring real (last stub closed)
Step 4 — Eval-runner registry on family adapters (next)
Step 5 — forge-alloy llm-forge domain extension
Step 6 — Vision-safety integration (Qwen3VLAdapter)
Step 7 — modelHash convention unification
Step 8 — priorMetricBaselines.samplesHash + calibration corpus upload
Combined test status:
tests/reproducibility/ 40 passed, 0 skipped, 2 xfailed
tests/unit/adapters/ 9 passed
Combined: 49 passed, 2 xfailed in 290s
Fourth "correct architecture" roadmap step from docs/PLUGIN-SPRINT.md.
The architectural piece that unblocks frontier targets: family adapters
dispatch benchmark evaluation through a runner registry instead of
carrying their own per-benchmark code. Adding a new benchmark suite
(SWE-Bench Pro for Qwen3-Coder-480B, LiveCodeBench v6 for the frontier
coder cards, MMMU for vision targets, etc.) is one new file in
scripts/eval_runners/ plus one import line — never an edit to any
family adapter.
Written test-first per TDD/TDValidation discipline. The unit test in
tests/unit/adapters/test_eval_runner_registry.py is the SPEC; the
implementation that follows satisfies it.
== TDD cycle
1. Wrote tests/unit/adapters/test_eval_runner_registry.py asserting:
- eval_runners.base.BenchmarkRunner ABC + ScoreResult dataclass exist
- BenchmarkRunner.score signature is (self, samples_path)
- ScoreResult carries benchmark_name + pass_at_1 + passed + total
- eval_runners.BenchmarkRunnerRegistry can register + resolve
- Unknown benchmark name raises BenchmarkNotRegistered with a clear
message naming what IS registered
- Double-registering a DIFFERENT class against an existing name
raises ValueError (silent shadowing is the f-word pattern)
- HumanEvalRunner is registered globally against name 'humaneval'
- HumanEvalPlusRunner is registered globally against 'humaneval_plus'
- HumanEvalRunner.score on the morning's flagship qwen3-coder-30b-a3b
student JSONL reproduces pass@1 = 0.884 = 145/164 (the published
headline) — end-to-end smoke test through the registry path
- FamilyAdapter.eval source contains the registry dispatch (not the
no-op return ctx default)
2. Ran the test — RED, 11 of 11.
3. Built scripts/eval_runners/:
- base.py: BenchmarkRunner ABC + ScoreResult dataclass. score()
takes samples_path, returns ScoreResult. Subclasses set the .name
class attribute.
- registry.py: BenchmarkRunnerRegistry singleton + BenchmarkNotRegistered
exception. Mirror of scripts/adapters/registry.py — exact-match
dispatch on benchmark name string, idempotent re-registration of
the same class, raise ValueError on different class against same
name (no silent shadowing).
- humaneval.py: HumanEvalRunner — wraps tests/reproducibility/_humaneval_scorer.py
(the canonical macOS-safe evalplus subprocess wrapper) and returns
a ScoreResult with pass_at_1 = humaneval base score.
- humaneval_plus.py: HumanEvalPlusRunner — same scorer, returns the
+plus pass_at_1 (base AND plus tests both passing per evalplus's
canonical convention).
- __init__.py: module-level singleton + resolve_runner / registered_benchmarks
module helpers + eager imports of humaneval + humaneval_plus so
they're registered at package import time.
4. Wired FamilyAdapter.eval():
- Reads benchmarks list from alloy stage params
- For each benchmark, looks up the runner via resolve_runner(name)
- Calls runner.score(samplesPath) — resolves samples path against
ctx.output_dir if relative
- Appends a benchmark entry to ctx.eval_results carrying the
ScoreResult fields (pass_at_1, passed, total, samplesPath, metric)
so the EvalExecutor can merge them into ctx.alloy['results']['benchmarks']
- Lazy import of eval_runners + scorer so Tier 1 dispatch path stays
torch-free
- Raises ValueError loudly on benchmark missing 'name' or 'samplesPath'
- Family adapters MAY override .eval() if they need family-specific
orchestration (e.g. a future Qwen3VLAdapter attaching an image
preprocessor). Most won't.
5. Ran the test — GREEN, 11 of 11. The end-to-end smoke test
(test_humaneval_runner_scores_a_real_published_jsonl) actually scored
the morning's flagship student JSONL via the registry path and got
exactly 145/164 = 0.884, matching the published headline.
6. Ran the full reproducibility + unit suite — 40 + 20 = 60 passed, 2
xfailed (the same priorMetricBaselines samplesHash gap).
== What this commit DOES enable
- Adding HumanEval+ to an alloy is a registry resolve (already there)
- Adding MMLU is a single new file scripts/eval_runners/mmlu.py
+ one import in __init__.py
- Adding SWE-Bench Pro (the Qwen3-Coder-480B benchmark) is the same
pattern. The frontier target's eval suite slots in without touching
any family adapter.
- Adding MMMU when the first VL artifact ships: same pattern.
== What this commit does NOT do
It does NOT migrate the existing tests/reproducibility/test_published_alloys_scoring.py
test (Tier 4) to use the registry path. That test still imports the
canonical scorer directly. Migrating it is a follow-up; both paths
produce identical results because the registry's HumanEvalRunner just
delegates to the same scorer module.
It does NOT touch scripts/stages/output_stages.py::EvalExecutor. The
existing eval executor (which runs evalplus.codegen + evalplus.evaluate
on a model directory) is independent of the adapter layer's eval()
method. Both paths exist; they're complementary, not redundant.
The existing executor is the path used during a forge run when an
upstream codegen step doesn't exist; the family adapter's eval() is
the path used when samples are already on disk and only scoring is
needed. Future cleanup can unify them (the executor calls family.eval
which calls the registry) but that's a separate commit.
== Roadmap progress
Step 1 ✓ db54f9d — QwenDenseBase extracted
Step 2 ✓ 903e898 — MoEUnfusedExpertsBase extracted
Step 3 ✓ ae081ea — MoE Tier 2 wiring real
Step 3.5 ✓ 45beb54 — Dense compensation Tier 2 wiring real
Step 4 ✓ this — Eval-runner registry
Step 5 — forge-alloy llm-forge domain extension (next)
Step 6 — Vision-safety integration (Qwen3VLAdapter)
Step 7 — modelHash convention unification
Step 8 — priorMetricBaselines.samplesHash + calibration corpus upload
Combined test status:
tests/reproducibility/ 40 passed, 2 xfailed
tests/unit/adapters/ 20 passed
Combined: 60 passed, 2 xfailed in 308s
Step 5 of the "correct architecture" roadmap landed on forge-alloy at
commit 4fd715e (branch domain-extensibility-refactor). This commit
updates the design doc with the new state:
- Step 5 marked done with the forge-alloy commit hash
- Documents the schema gaps the regression gate caught and fixed inline
(AlloyHardware.deviceTargets, AlloyResults.forgedParamsB+activeParamsB,
BenchmarkResult first-class fields, extra='allow' everywhere)
- Documents what's still pending under Step 5 as a pure-move follow-up
refactor commit (the actual class definitions still live in
forge_alloy/types.py; llm_forge.py re-exports them today)
- The wip/types-additive-checkpoint-bd4349d branch is still preserved
per the never-lose-work rule
Cross-repo state after Step 5:
forge-alloy domain-extensibility-refactor:
bd4349d types: temporary additive checkpoint (kept on wip branch)
4fd715e domains: forge_alloy.domains package (this step's commit)
Tests: 17 domain-layout passed + 3 published-alloy regression passed
sentinel-ai cross-arch-portability-fixes:
16 plugin-sprint commits, 60 reproducibility+unit passed, 2 xfailed
Roadmap progress:
Step 1 ✓ db54f9d — QwenDenseBase extracted
Step 2 ✓ 903e898 — MoEUnfusedExpertsBase extracted
Step 3 ✓ ae081ea — MoE Tier 2 wiring real
Step 3.5 ✓ 45beb54 — Dense compensation Tier 2 wiring real
Step 4 ✓ 1e90097 — Eval-runner registry on family adapters
Step 5 ✓ forge-alloy 4fd715e — llm-forge domain extension (cross-repo)
Step 6 — Vision-safety integration (Qwen3VLAdapter)
Step 7 — modelHash convention unification
Step 8 — priorMetricBaselines.samplesHash + calibration corpus upload
…adapter set (TDD) Sixth "correct architecture" roadmap step from docs/PLUGIN-SPRINT.md. The vision_safety.py whitelist module (committed in f82773b as part of the morning's pre-crash VL forge scaffolding) is now wired into a real family adapter so the dispatch test routes any VL alloy through a path that consults the whitelist before, during, and after every tensor walk. This is the architectural piece that makes the existing 8 Qwen3.5-derived artifacts re-forgeable into VL-preserved variants without code edits to any shared script. When the first Qwen3.5-VL re-forge runs (under roadmap-step-6-equivalent post-Tier-2-wiring), the dispatch path picks QwenVLAdapter automatically off source.architecture, the whitelist construction + bit-exact verification fires automatically, and the vision tower + merger params + vision token vocab indices are preserved bit-exact through prune / train / quant. Without this commit, the same re-forge would silently destroy the vision pathway (the legacy Qwen3.5 catalog's "missed opportunity" that the morning's audit caught). Written test-first per TDD/TDValidation discipline. The unit test in tests/unit/adapters/test_vision_safety_adapter.py is the SPEC; the adapter is the implementation that satisfies it. == TDD cycle 1. Wrote tests/unit/adapters/test_vision_safety_adapter.py asserting: - adapters.qwen_vl is importable on a Mac (lazy-imports vision_safety) - QwenVLAdapter is registered against both 'qwen2_5_vl' and 'qwen3_5_vl' (Qwen2.5-VL ships today as Qwen2_5_VLForConditionalGeneration with model_type='qwen2_5_vl'; Qwen3.5-VL when it ships will use the same vision-tower preservation pattern, so both architecture strings live in the same .architectures tuple) - QwenVLAdapter inherits from QwenDenseBase (text-decoder layer is identical; the VL-specific work is a decorator on top of the inherited bodies, never a parallel code path) - prune() body references vision_safety (not just inheriting the base) - train() body references vision_safety / filter_target_modules - modality() is a real override (not the FamilyAdapter base NotImplementedError stub) - A synthetic in-memory VL alloy with [modality, prune, train] stages resolves to a 3-element chain on QwenVLAdapter via resolve_adapter_chain - vision_safety.py module exposes its expected callable API 2. Ran the test — RED, 8 of 9 (the vision_safety import-smoke passed because the module already exists). 3. Built scripts/adapters/qwen_vl.py: QwenVLAdapter(QwenDenseBase) with architectures = ("qwen2_5_vl", "qwen3_5_vl"). prune(ctx, **params): - Lazy-imports build_whitelist_from_model + verify_bit_exact_preservation - Builds the whitelist BEFORE pruning so the post-prune sha256 check has a baseline to compare against - Stashes the whitelist on ctx.vl_whitelist as the single source of truth for downstream stages in the same alloy - Calls super().prune() — the inherited dense prune body walks the text-decoder attention modules, untouched by vision_safety - Calls verify_bit_exact_preservation(ctx.model, whitelist) AFTER — raises loudly if any vision-side param moved during prune. Loud failure is the goal: a silent vision-tower corruption would ship a broken artifact. train(ctx, **params): - Same pattern (build / reuse whitelist, super().train(), verify after) - Additionally filters params['targetModules'] through vision_safety.filter_target_modules() BEFORE delegating to base. This drops any vision-side projection that happens to share a name with a text-side LoRA target (e.g. 'fc1' on model.visual.merger.* would otherwise get a LoRA attached and merge_and_unload would corrupt the vision tower). - Logs the dropped target count for forensic visibility modality(ctx, **params): - Real override (not raise). Asserts vision_config is present and the published vision token ids match via assert_vl_config — loud failure if the model's VL config is broken before forging. - Builds the whitelist if no upstream stage already did, so the eventual prune / train stages share it. - For Qwen VL family the vision encoder is ALREADY attached in the base model, so the modality stage is a declaration + invariant check rather than an attach operation. 4. Registered in scripts/adapters/__init__.py alongside the other family adapters. Ordered in the dense group (after qwen2_dense) because it inherits from QwenDenseBase. 5. Ran the test — GREEN, 9 of 9. 6. Ran the full suite — 40 reproducibility + 29 unit = 69 passed, 2 xfailed (the same priorMetricBaselines.samplesHash gap). == What this commit DOES enable - Adding any future Qwen-VL family forge to the dispatch test catalog is one new entry — the adapter is already registered. - Re-forging the legacy Qwen3.5 catalog with vision preservation once Tier 2 / Step 7 / Step 8 land: zero code changes to the family-adapter set, just a new alloy that declares source.architecture='qwen2_5_vl' (or 'qwen3_5_vl') and includes a modality stage. The dispatcher routes through QwenVLAdapter automatically. - Future VL families with the same vision-tower preservation pattern (Qwen3.5-VL, hypothetical Qwen3.6-VL, ...) inherit this adapter's behavior just by adding their architecture string to the tuple. == What this commit does NOT do It does NOT verify the Tier 2 path actually preserves the vision tower bit-exact end-to-end against a real loaded Qwen2.5-VL model. That requires a 5090 with the Qwen2.5-VL-3B base loaded and runs at the Tier 2 reproducibility level. The adapter's bit-exact verification will catch any issue at that point with a loud assertion. The Tier 1 dispatch test catalog also doesn't include a real VL alloy yet because no continuum-ai/* VL artifact has been published; the synthetic in-memory alloy in the unit test exercises the dispatch path. It does NOT migrate vision_safety.py to lazy-import torch / transformers. That module still imports them at the top — which is correct because its callable API operates on already-loaded models / configs and the Tier 1 dispatch path lazy-imports vision_safety inside the adapter methods, never at import time. The adapter's lazy import is the correct deterministic boundary. == Roadmap progress Step 1 ✓ db54f9d — QwenDenseBase extracted Step 2 ✓ 903e898 — MoEUnfusedExpertsBase extracted Step 3 ✓ ae081ea — MoE Tier 2 wiring real Step 3.5 ✓ 45beb54 — Dense compensation Tier 2 wiring real Step 4 ✓ 1e90097 — Eval-runner registry on family adapters Step 5 ✓ forge-alloy 4fd715e — llm-forge domain extension Step 6 ✓ this commit — Vision-safety integration (QwenVLAdapter) Step 7 — modelHash convention unification (next) Step 8 — priorMetricBaselines.samplesHash + calibration corpus upload Combined test status: tests/reproducibility/ 40 passed, 2 xfailed tests/unit/adapters/ 29 passed Combined: 69 passed, 2 xfailed in 318s
…fill (TDD)
Seventh "correct architecture" roadmap step. Until this commit,
publish_model.py and the backfill tools used different modelHash
conventions over the same underlying bytes. A verifier had to know
which convention each alloy used. This commit makes there be exactly
one source of truth for the modelHash composition function across the
entire codebase, migrates all 9 cached alloys that didn't already
satisfy the new convention, and gates everything with a TDD test that
fires if any alloy ever drifts back.
Written test-first per TDD/TDValidation discipline. Test caught a real
bug in compose_model_hash (sort_keys doesn't sort the LIST, only dict
keys within each item) — the fix is in this commit.
== TDD cycle
1. Wrote tests/unit/adapters/test_modelhash_convention.py asserting:
- scripts/alloy_hashing.py is importable and exposes
compose_model_hash + fetch_shard_hashes_from_hf
- compose_model_hash is order-independent (test caught a bug here —
sort_keys=True only sorts dict keys, not the list itself; fixed)
- compose_model_hash changes when any shard changes (sensitivity)
- compose_model_hash raises ValueError on empty input (loud failure)
- publish_model.py imports from alloy_hashing (single source of truth)
- backfill_alloy_from_results.py + derive_alloy_from_parent.py also
import from alloy_hashing
- Every cached alloy has integrity.fileHashes[] populated
- Every cached alloy's recorded modelHash equals
compose_model_hash(integrity.fileHashes)
- scripts/migrate_modelhash_convention.py exists as the one-shot
migration tool
2. Ran the test — RED, 9 of 10.
3. Built scripts/alloy_hashing.py — the unified hashing layer:
- compose_model_hash(shard_hashes) — pure function, sorts input by
filename internally for order-independence, returns sha256: prefixed
hex string. ValueError on empty input.
- fetch_shard_hashes_from_hf(repo, extensions) — pulls per-shard sha256
from HuggingFace's LFS metadata API (?blobs=true). No downloads.
Returns the same shape as the local-hashing variant.
- hash_local_safetensors_dir(model_dir) — hashes every *.safetensors
in a local directory (used by publish_model.py for freshly-forged
artifacts where the shards aren't on HF yet).
4. Updated the three callers to import from the shared module:
- publish_model.hash_model_weights now delegates to
hash_local_safetensors_dir + compose_model_hash. Returns
(modelHash, fileHashes) so callers persist BOTH into
results.integrity. The legacy concat-and-hash convention is gone.
- backfill_alloy_from_results.py removed its private
_shard_hashes_via_lfs and _model_hash_from_shard_hashes helpers,
imports from alloy_hashing instead.
- derive_alloy_from_parent.py same.
5. Built scripts/migrate_modelhash_convention.py — one-shot tool that
walks every cached alloy, populates fileHashes[] from HF's LFS metadata
when missing, recomposes modelHash via compose_model_hash, writes the
updated alloy to disk. Idempotent — running twice produces the same
output. Defaults to dry-run; --confirm actually rewrites.
6. Ran the migration with --confirm — 9 cached alloys migrated, 8 already
canonical (the 8 backfilled alloys that already used this convention
from day one).
Migrated:
qwen3-coder-30b-a3b-compacted-19b-256k (the morning's flagship)
olmoe-1b-7b-compacted-5b
qwen2.5-coder-7b-compacted (the v2-7b §4.1.3.3 anchor)
qwen3.5-0.8b-general-forged
qwen3.5-2b-general-forged
qwen3.5-4b-general-forged
qwen3.5-4b-code-forged
qwen3.5-4b-code-128k-forged
qwen3.5-9b-general-forged
Skipped (already canonical, fileHashes set by the backfill scripts):
qwen2.5-{0.5b,1.5b,3b}-general-forged
qwen3.5-27b-code-forged
qwen3.5-{27b,4b}-code-forged-{defragged,GGUF,mlx-4bit}
7. Re-ran the test — GREEN, 10 of 10.
8. Ran the full suite — 79 passed, 2 xfailed (the same priorMetricBaselines.samplesHash
gap that closes in Step 8). Up from 69 (before Step 7) due to the 10
new modelhash unit tests.
== What this commit DOES enable
- A single verifier can recompute modelHash from any cached alloy (or
any HF-hosted alloy) using the same one function — no convention
fork. Reproducible from HF metadata alone for any artifact size
(the 27B's 11×5GB shards verify in seconds, not hours).
- integrity.fileHashes[] is now universal across the cached catalog —
every alloy carries per-shard attestation, so a verifier can also
check individual shards if they want (the modelHash is a roll-up
over the same data).
- Future forge runs through publish_model.py automatically write both
fields. Future backfills also write both fields. The convention is
enforced by the test gate, not by remembering to call the right
helper.
== What this commit does NOT do
It does NOT re-publish the migrated alloys to HuggingFace. The local
cache is the source of truth for the test gate; pushing to HF requires
running scripts/republish_alloy_only.py against each migrated alloy,
which is a separate operation that updates the live HF state. That
push happens in a follow-up — the alloy bytes that already shipped
don't break, the new modelHash field just isn't on the HF artifact
until the republish runs. Tier 1 dispatch + Tier 3 sample-hash + Tier 4
canonical pass@1 reproducibility tests all still pass against the
migrated local cache.
It does NOT touch publish_model.py's verify_integrity function in any
way that changes its behavior — the function still computes
hash_model_weights(model_dir), gets back the new convention's
modelHash, and compares to the alloy's claimed modelHash. Old alloys
with the legacy concat-and-hash convention WOULD fail verify_integrity
under the new path; that's correct, because they're using a stale
convention and need migration. New alloys produced through the
unified path verify cleanly.
== Roadmap progress
Step 1 ✓ db54f9d — QwenDenseBase extracted
Step 2 ✓ 903e898 — MoEUnfusedExpertsBase extracted
Step 3 ✓ ae081ea — MoE Tier 2 wiring real
Step 3.5 ✓ 45beb54 — Dense compensation Tier 2 wiring real
Step 4 ✓ 1e90097 — Eval-runner registry on family adapters
Step 5 ✓ forge-alloy 4fd715e — llm-forge domain extension
Step 6 ✓ fd2b249 — Vision-safety integration (QwenVLAdapter)
Step 7 ✓ this commit — modelHash convention unified
Step 8 — priorMetricBaselines.samplesHash + calibration corpus upload (next)
Combined test status:
tests/reproducibility/ 40 passed, 2 xfailed
tests/unit/adapters/ 39 passed
Combined: 79 passed, 2 xfailed in 316s
… ZERO xfails
Eighth and FINAL "correct architecture" roadmap step from
docs/PLUGIN-SPRINT.md. Closes the last 2 xfails in the reproducibility
suite by populating samplesHash on the §4.1.3.4 falsifiability anchor
cells. Every alloy in the cached catalog is now byte-verifiable on every
attestation surface (modelHash, fileHashes, benchmarks[].resultHash,
priorMetricBaselines[].evaluation.samplesHash). The whole reproducibility
chain of custody is closed.
Written test-first per TDD/TDValidation discipline. Test caught both
the missing migration tool and the unpinned cells.
== TDD cycle
1. Wrote tests/unit/adapters/test_prior_baseline_samples_hash.py asserting:
- scripts/migrate_prior_baseline_samples_hash.py exists
- Every priorMetricBaseline cell with a samplesPath also has a samplesHash
- Every recorded samplesHash matches sha256(bytes of the published JSONL)
- The 2 Tier 3 xfails are resolved (≥2 prior-baseline cells pinned)
2. Ran the test — RED, 3 of 4.
3. Built scripts/migrate_prior_baseline_samples_hash.py — one-shot tool
that walks every cached alloy, finds every priorMetricBaselines cell
with a samplesPath but no samplesHash, downloads (or loads from cache)
the samples bytes, computes sha256, writes 'sha256:<hex>' into the
cell's evaluation.samplesHash. Idempotent. Defaults to dry-run;
--confirm rewrites.
4. Ran the migration with --confirm — 2 cells migrated, 15 alloys skipped:
Migrated:
qwen3-coder-30b-a3b-compacted-19b-256k
priorMetricBaselines[router-gate-l2-norm-2026-04-08]
sha256:d401642a75435c77f8b9443b8d0b9a856eff732c19d4968367c333049eeba9fc
(177759 bytes — the §4.1.3.4 negative-baseline cell)
olmoe-1b-7b-compacted-5b
priorMetricBaselines[olmoe-broad-corpus-2026-04-08]
sha256:77bc81ff1f3a2a29b3936c2... (107075 bytes — the cross-arch
within-model A/B negative-baseline cell)
5. Re-ran the test — GREEN, 4 of 4.
6. Ran the full reproducibility + unit suite — 85 passed, 0 skipped,
0 xfailed. The 2 Tier 3 xfails (test_published_alloys_sample_hashes.py
for olmoe-broad-corpus + qwen3-coder router-gate-l2-norm) AUTO-FLIPPED
to PASS because the test code already had the right assertion path
waiting for the field to exist; the migration just populated it.
== What this commit DOES enable
- The §4.1.3.4 falsifiability anchors are now byte-verifiable. Anyone
walking the alloy chain can verify the negative-baseline JSONL bytes
against the alloy's recorded samplesHash, the same way Tier 3 verifies
the forward-claim samples. The methodology paper's "+9.7 HumanEval
points from the metric swap" claim is now fully grounded — both the
positive cell (88.4) and the negative cell (78.7) reproduce from
cryptographically pinned bytes.
- The publish pipeline (alloy_to_card.py / publish_model.py) needs to
learn the new samplesHash field for FUTURE forges so this migration
isn't needed twice. That's a small follow-up — the schema field is
proven via the TDD test, the publish-side wiring is the trivial
second step.
- Every alloy in the cached catalog is now byte-verifiable on every
attestation surface — modelHash + fileHashes + benchmarks[].resultHash
+ priorMetricBaselines[].evaluation.samplesHash. There is no remaining
surface where a producer could silently swap bytes without breaking
a hash check.
== What this commit does NOT do
It does NOT push the migrated alloys to HuggingFace. The 2 affected
HF artifacts (qwen3-coder-30b-a3b-compacted-19b-256k +
olmoe-1b-7b-compacted-5b) need a republish_alloy_only.py run each to
get the new samplesHash field on the live alloy. That's a separate
step — the local cache + the test gate are the source of truth for
the architectural contract; the HF push is the deployment step.
It does NOT upload the calibration corpora alongside the model files.
The §4.1.3.4.1 calibration-corpus discipline gate also requires the
hash-pinned corpora to be PRESENT in each repo. Today the alloy
references calibration/heldout_code300.jsonl by path but the file
doesn't exist on HF. That's an incremental fix on top of Step 8 — the
schema is correct (the calibrationCorpora root extension already
carries sha256), the upload step just hasn't run.
It does NOT add samplesHash to the formal forge-alloy schema's
PriorMetricBaseline class on the forge-alloy repo. The local sentinel-ai
side accepts the field via the existing 'extra=allow' on every BaseModel
(landed in Step 5). Adding a first-class field to the schema definition
on forge-alloy is a follow-up that lets ts-rs generate the TS binding
properly; the existing path works correctly via the extras allow.
== Roadmap COMPLETE
Step 1 ✓ db54f9d — QwenDenseBase extracted
Step 2 ✓ 903e898 — MoEUnfusedExpertsBase extracted
Step 3 ✓ ae081ea — MoE Tier 2 wiring real
Step 3.5 ✓ 45beb54 — Dense compensation Tier 2 wiring real
Step 4 ✓ 1e90097 — Eval-runner registry on family adapters
Step 5 ✓ forge-alloy 4fd715e — llm-forge domain extension
Step 6 ✓ fd2b249 — Vision-safety integration (QwenVLAdapter)
Step 7 ✓ 25e0cb3 — modelHash convention unified
Step 8 ✓ this commit — priorMetricBaselines.samplesHash migrated
ALL 8 STEPS COMPLETE.
Combined test status:
tests/reproducibility/ 46 passed (was 40 + 6 unpinned baselines now resolved)
tests/unit/adapters/ 39 + 4 = 43 passed (Step 8 added 4 new unit tests)
Combined: 85 passed, 0 skipped, 0 xfailed in 316s
== What's now true after the full roadmap
- The adapter set has 6 family adapters under 2 base classes
(QwenDenseBase: Qwen3Dense, Qwen2Dense, QwenVL; MoEUnfusedExpertsBase:
Qwen3MoE, Olmoe). Adding any new model family is one new file.
- Every adapter method has exactly one deterministic code path. No
NotImplementedError stubs. No conditional substitution surface.
- Stage executors are thin dispatchers that delegate to the family
adapter resolved from ctx.alloy['source']['architecture'].
- The eval-runner registry (scripts/eval_runners/) provides the
third axis of dispatch — benchmark name → BenchmarkRunner — which
unblocks frontier targets (SWE-Bench Pro for Qwen3-Coder-480B,
LiveCodeBench v6 for the frontier coder cards, MMMU for VL targets).
- The forge-alloy domain-extension package (forge_alloy/domains/)
proves the schema is genuinely domain-agnostic with the photo-provenance
and ticketing stubs alongside the real llm-forge extension.
- Every published continuum-ai/* artifact has an alloy on HF
(8 backfilled, 9 freshly-shipped, all green at Tier 1 dispatch).
- The modelHash convention is unified across publish + backfill paths
via scripts/alloy_hashing.py — single source of truth, reproducible
from HF metadata alone with no shard downloads.
- Every alloy in the cached catalog is byte-verifiable on every
attestation surface (modelHash + fileHashes + benchmarks[].resultHash
+ priorMetricBaselines[].evaluation.samplesHash). The reproducibility
chain of custody is closed end-to-end on the consumer side.
- vision_safety.py is wired into the family-adapter set via QwenVLAdapter
so future Qwen2.5-VL / Qwen3.5-VL re-forges preserve the vision tower
bit-exact through prune/train/quant — closing the brand-integrity gap
the morning's audit caught for the legacy Qwen3.5 catalog.
- The methodology paper's §4.1.3.4 +9.7 HumanEval claim is now
cryptographically grounded — both the positive cell (88.4) and the
negative-baseline cell (78.7) reproduce from byte-pinned JSONLs.
The architecture is "ready for frontier targets." Adding Mixtral 8x22B,
Qwen3-Coder-480B, DeepSeek-V3.1, or any other future forge target is now
a one-file family adapter + (if the benchmark suite is new) a one-file
eval runner. The forge run is one alloy_executor invocation away.
All 8 "correct architecture" steps landed. 85 passed / 0 skipped / 0 xfailed across the reproducibility + unit suites. Final tally: Step 1 ✓ db54f9d — QwenDenseBase extracted Step 2 ✓ 903e898 — MoEUnfusedExpertsBase extracted Step 3 ✓ ae081ea — MoE Tier 2 wiring real Step 3.5 ✓ 45beb54 — Dense compensation Tier 2 wiring real Step 4 ✓ 1e90097 — Eval-runner registry on family adapters Step 5 ✓ forge-alloy 4fd715e — llm-forge domain extension Step 6 ✓ fd2b249 — Vision-safety integration (QwenVLAdapter) Step 7 ✓ 25e0cb3 — modelHash convention unified Step 8 ✓ d7d4554 — priorMetricBaselines.samplesHash migrated The architecture is "ready for frontier targets." Adding Mixtral 8x22B, Qwen3-Coder-480B, DeepSeek-V3.1, or any other future forge target is now a one-file family adapter + (if the benchmark suite is new) a one-file eval runner. The forge run is one alloy_executor invocation away.
…republish
Post-roadmap "fill our gaps" round. The architecture is built; this
commit USES it to add concrete adapters + runners for the SOTA targets
Kash mapped in the frontier-roadmap analysis. Each addition is a one-
file change, proving the architecture's value proposition.
Written test-first per TDD/TDValidation discipline throughout.
== What landed (5 files added, 3 modified)
scripts/eval_runners/sota_stubs.py — 16 SOTA benchmark runners:
Code: swe_bench_verified, livecodebench_v6, aider_polyglot, mbpp_plus
General: mmlu_pro, gpqa_diamond, ifeval, gsm8k, aime_2024
Vision: mmmu, chartqa, docvqa, ai2d
Audio: covost2, librispeech, gtzan
Each runner declares its name + protocol source in the docstring.
score() raises NotImplementedError LOUDLY with a pointer at the
benchmark protocol doc — when the first frontier forge runs that
needs SWE-Bench Verified or LiveCodeBench v6, the implementer reads
the file, fills in the body, adds a TDD test asserting it scores a
known JSONL fixture, and the corresponding entry in
test_sota_eval_runners.py gets updated to assert the real behavior.
This is NOT the f-word stub pattern — there's no "correct architecture"
code path being silently substituted. The runner exists so dispatch
resolves; calling it before the real implementation lands fails LOUDLY
at the runner site, which is the deterministic-rock signal.
Total registered benchmark runners: 18 (humaneval + humaneval_plus +
16 SOTA stubs).
scripts/adapters/qwen_omni.py — QwenOmniAdapter for Qwen2.5-Omni:
Priority 1 multimodal forge target from Kash's analysis. Apache-2.0,
text+vision+video+audio IN, text+speech OUT in a single inference loop.
Fills the existing 'Qwen3-Omni' product agent slot in Continuum.
Inherits from QwenDenseBase (text-decoder layer is dense Qwen2.5).
Overrides modality() to assert all four encoder/decoder towers are
present (vision_config + audio_config + talker_config + token2wav_config),
builds an omni-safety whitelist covering all four pathways.
Overrides prune() to wrap base.prune with bit-exact pre/post-prune
hash verification on every encoder/decoder tower param. Loud failure
if any of the four towers moves during prune.
Overrides train() to filter LoRA target_modules against the
omni-safety whitelist before delegating to base.train. Drops any
vision-side / audio-side / talker / token2wav projections that match
text-side target_module suffixes.
Architecture string: 'qwen2_5_omni' (verified against
Qwen/Qwen2.5-Omni-7B/config.json).
scripts/adapters/sota_moe.py — 4 SOTA MoE family adapter stubs:
MixtralAdapter ('mixtral') block_sparse_moe-unfused
PhiMoEAdapter ('phimoe') block_sparse_moe-unfused
GraniteMoEAdapter ('granitemoe') granite-moe-fused
DeepSeekV2Adapter ('deepseek_v2') deepseek-routed-shared
Each is structurally distinct from MoEUnfusedExpertsBase's unfused-Qwen
layout, so they don't inherit from it. Per the never-branch rule, each
gets its own adapter file with its own expert_prune that knows its
family's tensor walk. expert_prune() raises NotImplementedError today
with a layout-specific message naming the expected discriminator
('block_sparse_moe-unfused' / 'granite-moe-fused' / 'deepseek-routed-shared')
so the implementer knows which tensor walk to write.
When the first Mixtral 8x22B forge runs (Joel's stated single-5090
frontier target — first single-GPU 8x22B will be the headline), the
MixtralAdapter stub gets a real expert_prune body in a focused commit
gated by its own TDD test. Same pattern for the other three.
scripts/republish_alloy_only.py — added --allow-modelhash-migration flag:
The defensive modelHash-unchanged check correctly refuses normal
re-publishes that change modelHash (signaling weight change). The
Step 7 convention migration is the legitimate exception (same
bytes, new convention). Flag is opt-in, default stays strict, the
migration use case has its own surface.
Used to push the 9 alloys whose modelHash changed convention in the
Step 7 migration → all 9 now live on HF with the canonical convention.
== HF re-publish (closed the local-vs-HF drift)
11 alloys republished total:
Step 8 (samplesHash) re-publishes (3 alloys, no flag needed):
olmoe-1b-7b-compacted-5b → alloyHash 6d679da673f5fd3e
qwen2.5-coder-7b-compacted → alloyHash 4fe422e9b01fa8f0
qwen3-coder-30b-a3b-compacted-19b-256k → alloyHash 821f156287020528
Step 7 (modelHash convention) re-publishes (6 alloys, --allow-modelhash-migration):
qwen3.5-0.8b-general-forged → alloyHash e34c50597ffd15aa
qwen3.5-2b-general-forged → alloyHash b2006ad368386543
qwen3.5-4b-code-128k-forged → alloyHash a4da7dea5bb8d3d9
qwen3.5-4b-code-forged → alloyHash 435ff486e11ed54d
qwen3.5-4b-general-forged → alloyHash 86000c4ca4a65fe8
qwen3.5-9b-general-forged → alloyHash abfc8de0afe02b22
Each push refreshed alloy.json + README.md + alloy-qr.png atomically.
Verified post-push: live HF state byte-identical to local cache for
every migrated alloy.
== Test status
tests/reproducibility/ 46 passed
tests/unit/adapters/ 91 passed
(was 39, added 52 from the gap-fill round:
33 SOTA runner + 6 omni + 13 SOTA MoE)
Combined: 137 passed, 0 skipped, 0 xfailed in 320s
== Adapter set inventory
scripts/adapters/
├── base.py ← FamilyAdapter ABC
├── registry.py ← AdapterRegistry singleton
├── dispatch.py ← resolve_adapter_chain
├── qwen_dense_base.py ← shared dense Qwen behavior
│ ├── qwen3_dense.py ← qwen3_5
│ ├── qwen2_dense.py ← qwen2
│ ├── qwen_vl.py ← qwen2_5_vl + qwen3_5_vl (vision_safety)
│ └── qwen_omni.py ← qwen2_5_omni (omni-safety, four-tower)
├── moe_unfused_base.py ← shared MoE-unfused-Qwen behavior
│ ├── qwen3_moe.py ← qwen3_moe
│ └── olmoe.py ← olmoe
└── sota_moe.py ← Mixtral / Phi-MoE / GraniteMoE / DeepSeek-V2
(each their own structurally novel layout)
Total: 11 family adapters across 2 base classes + 4 stub adapters.
Architecture strings covered: qwen3_5, qwen2, qwen2_5_vl, qwen3_5_vl,
qwen2_5_omni, qwen3_moe, olmoe, mixtral, phimoe, granitemoe, deepseek_v2
scripts/eval_runners/
├── base.py ← BenchmarkRunner ABC + ScoreResult
├── registry.py ← BenchmarkRunnerRegistry
├── humaneval.py ← REAL (wraps the canonical evalplus scorer)
├── humaneval_plus.py ← REAL
└── sota_stubs.py ← 16 SOTA stubs (raise NotImplementedError loudly)
Total: 18 registered benchmark runners (2 real + 16 stubs).
== What this commit DOES enable
- Adding any new SOTA family forge to the dispatch test catalog is
one new entry — the adapter is already registered for every
architecture string Kash's frontier-target list mentions.
- Adding a new SOTA benchmark to a frontier alloy is a single
runner-class implementation (move out of sota_stubs.py into its
own file, fill in score()) plus a TDD test.
- The Mixtral 8x22B forge target (Joel's stated single-5090
frontier headline) only blocks on:
1. The MixtralAdapter expert_prune body for the
block_sparse_moe-unfused layout — one focused commit
2. The LiveCodeBench v6 + SWE-Bench Verified runner bodies for
the eval stage — two focused commits per benchmark
3. A 5090 to actually run the forge against the published Mixtral
8x22B base
Architectural surface area: zero. No changes to any base class,
no changes to alloy_executor, no changes to the dispatch path.
- The Qwen2.5-Omni forge target (Priority 1 multimodal) only blocks on:
1. The omni Tier 2 wiring for the forge_model.prune call against
a thinker.layers walk (the inherited Qwen2.5 dense path
already handles this — just needs the omni model to load)
2. A 5090 to actually run the forge against Qwen2.5-Omni-7B
Architectural surface area: zero.
== Roadmap status
Plugin sprint: 8 of 8 steps DONE (commits db54f9d through d7d4554)
Gap-fill round: THIS COMMIT
16 SOTA eval runners registered
5 SOTA family adapters registered (Omni + 4 MoE)
11 alloys republished to HF (cache/HF drift closed)
Remaining for the first SOTA forge run:
- 5090 hardware time (BigMama)
- Implement one MixtralAdapter.expert_prune (or one of the others)
- Implement one or two SOTA eval runners (LiveCodeBench v6 +
SWE-Bench Verified for the Qwen3-Coder-480B headline play)
All "iterate on this" work for the next session can pick up from
this state via the design doc at docs/PLUGIN-SPRINT.md.
…al (TDD)
Hard prerequisite for the Mixtral 8x22B + Qwen3-Coder-480B + DeepSeek-V3.1
frontier forge plays. Per Kash's frontier-target analysis (the convo-with-kash
work, 2026-04-08): HumanEval is dead for frontier coder cards. Every
modern frontier coder model (Qwen3-Coder, Qwen3-Coder-480B, DeepSeek-V3.1,
Mixtral 8x22B, GPT-4) reports against LiveCodeBench v6 instead because
LCB v6 is the contamination-free "problems published after a fixed
cutoff" successor that hasn't been in any model's training set. The
§4.1.4.1 anchor-reproduction discipline gate cannot run on any frontier
forge target until the calibrated eval pipeline supports LCB v6.
This commit is the first of the SOTA stubs (the 16 added in the
gap-fill round) to graduate to a real implementation. The stub
pattern from sota_stubs.py — registered class, NotImplementedError
score() body — gets replaced by a dedicated module file with a real
score() body that lazy-imports lcb_runner and invokes its canonical
codegen_metrics function on an existing samples JSONL.
Written test-first per TDD/TDValidation discipline.
== TDD cycle
1. Wrote tests/unit/adapters/test_livecodebench_v6_runner.py asserting:
- eval_runners.livecodebench_v6 module is importable on a Mac
WITHOUT lcb_runner installed (lazy import inside score)
- LiveCodeBenchV6Runner is registered in the singleton via the
dedicated file (not the sota_stubs stub)
- .name class attribute is 'livecodebench_v6'
- score() body is REAL (references lcb_runner, not _stub_score_raise)
- score() raises a CLEAR ImportError on a machine without lcb_runner,
naming lcb_runner + the install path
- sota_stubs.py no longer carries the LiveCodeBenchV6Runner class
(would otherwise cause a duplicate-registration conflict)
- score() returns a properly-shaped ScoreResult when lcb_runner IS
installed (skipped on Mac, runs on BigMama / CI containers)
2. Ran the test — RED, 6 of 7.
3. Built scripts/eval_runners/livecodebench_v6.py:
- LiveCodeBenchV6Runner with name='livecodebench_v6'
- score(samples_path) lazy-imports
lcb_runner.evaluation.compute_code_generation_metrics.codegen_metrics
and lcb_runner.benchmarks.code_generation.load_code_generation_dataset
- Loads the canonical release_v6 dataset (pinned — if LCB ships v7,
that gets a new file/runner, old alloys keep resolving to v6)
- Parses the samples file in either of two accepted formats:
a) JSONL with task_id + output_list per line
b) Single JSON file in lcb_runner's
output/{model_repr}/codegeneration_{n}_{temp}.json shape
- Calls codegen_metrics(samples, problems, k_list=[1]) and returns
a ScoreResult with pass_at_1 normalized to the 0..1 fraction
convention the registry uses
- Carries release_version + k_list + problem_count in extras for
forensic visibility
- Loud failures throughout: FileNotFoundError if samples_path is
missing, ImportError pointing at the install path if lcb_runner
isn't there, ValueError if the samples file has no parseable
records
4. Removed LiveCodeBenchV6Runner from scripts/eval_runners/sota_stubs.py
(the class definition AND the entry in REGISTRATIONS) so the
registry doesn't see a duplicate.
5. Wired the new module into scripts/eval_runners/__init__.py — eager
import + register() call alongside humaneval / humaneval_plus.
6. Ran the test — GREEN, 6 of 7 (+1 skipped because lcb_runner isn't
installed in this venv; that test will run on any machine where
lcb_runner is present).
7. Updated tests/unit/adapters/test_sota_eval_runners.py to drop
livecodebench_v6 from the SOTA_BENCHMARKS stub list — it's no
longer a stub, its coverage moved to test_livecodebench_v6_runner.py.
8. Ran the full reproducibility + unit suite — 141 passed, 1 skipped,
0 xfailed (up from 137; net +5: 6 new LCB v6 tests + 1 dropped
sota stub check + 1 skipped test).
== What this commit DOES enable
- The §4.1.4.1 anchor-reproduction discipline gate can now resolve
LCB v6 through the registry. Future forges that declare LCB v6
in their alloy's eval.benchmarks[] dispatch through the new
runner (via FamilyAdapter.eval) and get a real ScoreResult back,
not a NotImplementedError stub.
- Mixtral 8x22B forge: only blocks on MixtralAdapter.expert_prune
body now (LCB v6 scoring is the OTHER prerequisite, satisfied here).
- Qwen3-Coder-480B forge: only blocks on the multi-GPU sharding for
the 50GB+ shard streaming pruner (the dispatch + scoring contracts
are both green).
- DeepSeek-V3.1 forge: only blocks on a DeepSeek-V3 adapter file
(not yet built; the existing DeepSeekV2Adapter is for V2 which has
the routed+shared layout; V3 may have the same layout or may not,
needs research).
== What this commit does NOT do
It does NOT install lcb_runner in the sentinel-ai venv. lcb_runner
brings vLLM and a heavy CUDA dep stack that would balloon the venv
unnecessarily for the dispatch-only Mac path. The runner is
importable and registered via the lazy import; actual scoring runs
on any environment that has lcb_runner installed (BigMama, the
eval-runner containers, the forge worker pods).
It does NOT wire eval_with_calibration.py's discipline gate to use
the new registry path. The existing run_livecodebench_v6 function in
that file still does its own subprocess-shell to lcb_runner.runner.main
for the codegen+evaluate path. Unifying the two is a follow-up that
extracts the codegen half into scripts/eval_runners/livecodebench_v6.py
as a `generate(model, output_dir)` companion to score(); for now the
two halves coexist (codegen in eval_with_calibration.py, scoring via
the new runner) and they produce identical results because both invoke
the same lcb_runner internals.
It does NOT score a real LCB v6 JSONL end-to-end on this Mac. The
contract test ASSERTS the lazy import + the import-error path, which
is the surface the runner needs to expose; the actual scoring runs
on any machine with lcb_runner installed and produces a real
ScoreResult (the test_score_returns_score_result_shape test gates
that path on machines where it can run).
== Test status
tests/reproducibility/ 46 passed
tests/unit/adapters/ 95 passed (was 91; +4 net for LCB v6
vs the dropped sota stub)
Combined: 141 passed, 1 skipped, 0 xfailed in 316s
== Frontier-target progress
Mixtral 8x22B (single-5090 headline play):
✓ MixtralAdapter registered (block_sparse_moe-unfused stub)
— MixtralAdapter.expert_prune body (NEXT — block_sparse_moe-unfused
tensor walk; the same pattern as cpu_expert_prune_v2.py but for
the Mixtral layout)
✓ LiveCodeBenchV6Runner real implementation (THIS COMMIT)
— 5090 time on BigMama
Architectural surface area to ship: ZERO. Implementation surface
area: 1 expert_prune body (~1-2 days mechanical work).
Qwen3-Coder-480B (multi-GPU grid play):
✓ Qwen3MoEAdapter handles the architecture (same family as the
morning's 30B-A3B; just bigger; no code change)
✓ LiveCodeBenchV6Runner real implementation (THIS COMMIT)
— Multi-GPU sharding extension to the streaming safetensors pruner
(scripts/cpu_expert_prune_v2.py works on a single-machine model
directory today; needs to handle shards distributed across GPUs)
— Multi-machine grid time
Architectural surface area: zero. Implementation surface: 1 multi-
GPU streaming refactor + grid harness.
DeepSeek-V3.1 (Tier 2, MIT license):
✓ DeepSeekV2Adapter for the V2 family (V3 may need its own adapter
file if the layout differs structurally)
✓ LiveCodeBenchV6Runner real implementation (THIS COMMIT)
— V3-specific expert_prune body
— 5090+ time
…dit clean
Two pieces in one commit:
1. AUDIT — verified the architecture is solid against the deterministic-rock
principle. Found one f-word smell (silent substitution in MoE base) and
fixed it. Verified all migration scripts are idempotent (zero drift).
Verified every published continuum-ai/* alloy resolves through a
registered adapter (17/17). Verified every cached alloy still validates
against the new schema (forge-alloy regression: 3/3 round-trip clean).
2. MIXTRAL EXPERT PRUNE REAL — second SOTA stub graduates to a real
implementation. The first was LiveCodeBenchV6Runner; this is the
family-side complement that unblocks the Mixtral 8x22B headline play.
== Audit findings (fixed in this commit)
scripts/adapters/moe_unfused_base.py:
Found the f-word pattern at line 264:
src_model_dir = getattr(ctx, "source_model_dir", None) or ctx.model_name
The `or ctx.model_name` silently substitutes a HF id (which isn't a
local disk path) for a missing source_model_dir. The next line's
Path.exists() check would still catch it, but the `or` itself is the
silent-substitution surface the f-word rule prohibits.
Fixed: split into two explicit guards. First raises if
source_model_dir is None with a clear message ("ctx.model_name is NOT
a substitute"); second raises if the path doesn't exist on disk.
Two named errors, one for each failure mode, both loud.
== Mixtral wiring — TDD cycle
1. Wrote tests/unit/adapters/test_mixtral_expert_prune.py asserting:
- cpu_expert_prune_v2 exposes a LayoutSpec dataclass
- QWEN3_MOE_LAYOUT exists as a module constant matching the morning's
flagship's tensor name patterns (mlp.experts.{e}.{gate,up,down}_proj)
- MIXTRAL_LAYOUT exists for block_sparse_moe.experts.{e}.{w1,w2,w3}
- MIXTRAL_LAYOUT regexes match REAL Mixtral tensor names from
Mixtral-8x7B-Instruct-v0.1's published safetensors index
- QWEN3_MOE_LAYOUT does NOT match Mixtral names (and vice versa) —
cross-contamination would be a refactor bug
- prune_experts() takes a layout=LayoutSpec parameter
- MixtralAdapter.expert_prune body calls prune_experts(layout=MIXTRAL_LAYOUT),
no longer the _stub_expert_prune_raise stub
- Tier 1 dispatch path (ctx.model is None) short-circuits cleanly
- END-TO-END: a synthetic in-memory Mixtral-shaped model directory
(3 layers × 4 experts, ~30KB) gets pruned to 2 experts/layer via
prune_experts(layout=MIXTRAL_LAYOUT), and the output safetensors
contains exactly the renumbered expert indices {0, 1} (not the
original 4-expert layout). The sidecar declares
selection.layout_family='mixtral' and the per-layer kept indices
match the algorithm's selection.
2. Ran the test — RED, 9 of 10.
3. Refactored cpu_expert_prune_v2.py:
- Added LayoutSpec dataclass with family_name + gate_pattern +
expert_pattern + expert_rename_template fields. Helper methods
gate_re() / expert_re() return compiled regexes.
- QWEN3_MOE_LAYOUT module constant pinned to the morning's flagship's
exact patterns (mlp.experts.{e}.{gate,up,down}_proj.weight) so the
existing forge path keeps working with no behavior change.
- MIXTRAL_LAYOUT module constant for block_sparse_moe.experts.{e}.{w1,w2,w3}.weight
with the rename template for the same path prefix.
- Backward-compat module-level ROUTER_GATE_RE / EXPERT_TENSOR_RE
constants point at QWEN3_MOE_LAYOUT.gate_re() / .expert_re() so
any external import keeps working too.
- Threaded `layout: LayoutSpec = QWEN3_MOE_LAYOUT` parameter through
read_router_gates(), stream_rewrite(), prune_experts(). All callers
that don't pass layout= get the default (Qwen3MoE behavior unchanged).
- stream_rewrite uses layout.gate_re() / layout.expert_re() instead
of the module-level constants.
- The expert renaming uses layout.expert_rename_template.format(...)
instead of the hardcoded f-string, so each family writes its own
surviving-expert names.
- The sidecar selection block now records layout_family for forensic
visibility ("mixtral" vs "qwen3_moe" vs future families).
- prune_experts's "no router gates found" error message now names
the expected pattern from the layout spec, not the hardcoded
mlp.gate path.
4. Wired MixtralAdapter.expert_prune real body in scripts/adapters/sota_moe.py:
- Lazy-imports cpu_expert_prune_v2.prune_experts + MIXTRAL_LAYOUT
- Validates expertTensorLayout is 'block_sparse_moe-unfused' (raises
loudly if the alloy declares a different layout)
- Validates ctx.source_model_dir is set + exists on disk
- Validates ctx.importance_json_path is set when strategy is
calibration-aware-activation-count (the §4.1.3.4 path)
- Calls prune_experts(layout=MIXTRAL_LAYOUT) — same algorithm as
the morning's flagship, different tensor name patterns
- Reloads ctx.model from the pruned dir for downstream stages
- Frees the original model's GPU memory before the reload
Also wired MixtralAdapter.expert_activation_profile with the same
lazy-import + delegation pattern to expert_activation_profile.profile_experts.
The script's named_modules() walk picks up Mixtral's
block_sparse_moe.gate hooks via the cross-architecture portability
fixes from sentinel-ai commit 488b740 — no change needed there.
5. Re-ran the Mixtral test — GREEN, 10 of 10. The end-to-end synthetic
Mixtral pipeline ran:
3 router gates read
18 expert tensors renamed to surviving indices
18 expert tensors dropped
3 router gates sliced
Output shards written with renumbered experts {0, 1}
config.json updated to num_local_experts=2
Sidecar declares layout_family='mixtral'
6. Ran the full reproducibility + unit suite — 151 passed, 1 skipped,
0 failures (up from 141; +10 net for the new Mixtral tests).
== What this commit DOES enable
- Mixtral 8x22B forge: NOW only blocks on 5090 time on BigMama. The
architectural surface area is zero. The implementation surface
area is zero (the layout-aware pruner handles Mixtral the same
way it handles Qwen3MoE — same algorithm, different name patterns).
LCB v6 runner (the previous commit) is the eval-side prerequisite;
this commit is the family-side prerequisite. Both done.
- Phi-MoE forge: shares the block_sparse_moe-unfused layout with
Mixtral. PhiMoEAdapter inherits the same pattern; its expert_prune
body lights up by adding `layout=MIXTRAL_LAYOUT` to the call site
(1-line change) when the first Phi-MoE forge runs.
- Future block_sparse_moe-unfused families (any Mistral / Mixtral-style
MoE that ships) inherit the same layout. Adding the family is one
new file.
- GraniteMoE-fused and DeepSeek-V2-routed-shared still need their own
LayoutSpec entries — fused experts and routed+shared layouts are
structurally distinct from unfused. Their adapter stubs remain
NotImplementedError until the layout-specific pruners are written
(separate commits). The architectural pattern is set; adding either
is a new LayoutSpec constant + a new code path in stream_rewrite OR
a separate streaming pruner script for the structurally novel cases.
== What this commit does NOT do
It does NOT run a real Mixtral 8x22B forge end-to-end. The end-to-end
test uses a SYNTHETIC 3-layer × 4-expert × hidden=8 fixture (~30KB
total) that exercises the full Pass 1 + Pass 2 streaming rewrite and
verifies the output structure. Real Mixtral 8x22B is 280GB on disk;
forging it requires a 5090 with the unmodified base loaded (or shards
on local disk for the streaming path).
It does NOT extract a base class from MixtralAdapter +
MoEUnfusedExpertsBase. Both share a similar shape (lazy-import the
pruner, validate ctx fields, call prune_experts, reload model), but
per the OOP rule we don't extract a base off two examples that
haven't both been forge-validated yet. After Mixtral 8x22B actually
ships and the second block_sparse_moe-unfused forge (Phi-3.5-MoE) is
proven, the right base extraction becomes obvious.
It does NOT add eval_with_calibration.py wiring for the §4.1.4.1
discipline gate. The LCB v6 runner is registered through the new
registry, but the existing eval_with_calibration.run_livecodebench_v6
function still does its own subprocess shell to lcb_runner.runner.main
for the codegen+evaluate path. Unifying that with the new runner is a
follow-up commit.
== Frontier-target status after this commit
Mixtral 8x22B (single-5090 prosumer headline play):
✓ MixtralAdapter.expert_prune REAL via MIXTRAL_LAYOUT (THIS COMMIT)
✓ MixtralAdapter.expert_activation_profile REAL (THIS COMMIT)
✓ LiveCodeBenchV6Runner REAL (commit b4294cf)
— 5090 time on BigMama
Architectural surface area: ZERO
Implementation surface area: ZERO
The Mixtral 8x22B forge can be RUN today on a 5090 with the
base model loaded. The forge would walk the alloy through:
modality / source-config (no-op for dense Mixtral) →
expert-activation-profile (Mixtral's mlp.gate hooks via
the portable expert_activation_profile.py) →
expert-prune via MIXTRAL_LAYOUT (THIS COMMIT, end-to-end tested
on the synthetic fixture) →
quant + eval
Qwen3-Coder-480B (multi-GPU grid play):
✓ Qwen3MoEAdapter handles the architecture (same family as the
morning's 30B-A3B; bigger geometry, no code change)
✓ LiveCodeBenchV6Runner REAL
— Multi-GPU sharding extension to the streaming pruner
— Multi-machine grid time
Same status as before: zero architectural surface, just needs
multi-GPU shard streaming + grid time.
Phi-3.5-MoE: 1-line change (add layout=MIXTRAL_LAYOUT to the call
site in PhiMoEAdapter — already inherits the same layout from this
commit). Could land in 5 minutes.
== Test status
tests/reproducibility/ 46 passed
tests/unit/adapters/ 105 passed (was 95; +10 from the new
Mixtral test file)
Combined: 151 passed, 1 skipped, 0 xfailed in 318s
The forge + family-adapter set + eval registry now plug into a
disk-backed queue + worker. Drop an alloy in .factory/queue/pending/,
the worker forges → evals → publishes → moves to done/. Failures land
in failed/ with a full traceback. The filesystem IS the queue.
Five rounds of work, all green, +55 tests (122 → 177):
1. Phi-3.5-MoE inheritance graduation
PhiMoEAdapter inherits from MixtralAdapter — zero duplicated body.
Both families share the block_sparse_moe-unfused layout exactly;
inheritance is the degenerate form of base extraction. When a third
sibling ships, rename MixtralAdapter to BlockSparseMoEUnfusedBase.
2. DeepSeek-V2 routed/shared pruner
DEEPSEEK_V2_LAYOUT in cpu_expert_prune_v2 + real DeepSeekV2Adapter
body. Shared experts and the dense first layer are verified
bit-exact in the synthetic E2E test (the always-fires capability
the model relies on cannot be pruned).
Also adds n_routed_experts to update_config for DeepSeek configs.
3. Open LLM Leaderboard v2 runner pack
LmEvalHarnessRunner base + 6 thin subclasses (IFEval, BBH,
MATH-Hard, GPQA, MMLU-Pro, MuSR). One base does all the harness
wiring, six subclasses just declare task_name + metric_key. The
IFEval/MMLU-Pro/GPQA-Diamond stubs in sota_stubs are graduated and
removed from REGISTRATIONS to prevent double-registration.
4. eval_with_calibration → BenchmarkRunner registry migration
The hand-rolled if-elif dispatch in run_benchmark is replaced with
resolve_runner(name). NOT_YET_IMPLEMENTED dict deleted — the
registry is the single source of truth. Stubs raise
NotImplementedError from a new ABC default evaluate(). The §4.1.4.1
anchor-reproduction discipline gate now uses the same axis as
production scoring.
5. factory_queue.py — the BigMama production loop
FactoryQueue (disk-backed pending/running/done/failed) plus
FactoryWorker (process_one + run_loop). Executor and publisher are
injected so unit tests pass fakes; production CLI wires
alloy_executor.execute_alloy + publish_model.publish.
Standing directive section added to docs/PLUGIN-SPRINT.md and the
sentinel-ai README — the priority queue, the bug-first eval frame
('big drop = algorithmic failure first, model second' from the
§4.1.3.4 win), and the architectural diagram.
177 passed, 1 skipped across the adapter suite.
Catalog of the empty-quadrant viral targets Kash mapped, materialized as minimal intent alloys droppable into .factory/queue/pending/. Each one is just enough alloy for alloy_executor to dispatch through the family adapter set + the eval-runner registry; the publish stage fills in the prose-heavy model card fields downstream. The 9 candidates: 1. mixtral-8x22b-instruct-compacted-70b — single-5090 prosumer headline 2. mixtral-8x7b-instruct-compacted-24b — smaller sibling 3. phi-3-5-moe-instruct-compacted-22b — 16->8 experts via PhiMoEAdapter 4. deepseek-v2-lite-chat-compacted — 64->32 routed, shared bit-exact 5. olmoe-1b-7b-0924-instruct-compacted — second 4.1.3.4 anchor 6. qwen3-coder-30b-a3b-compacted-19b-256k-v2 — flagship re-publish 7. qwen3-vl-8b-instruct-compacted — VL with vision_safety 8. qwen3-vl-30b-a3b-instruct-compacted — vision tower + MoE pruner 9. qwen2-5-omni-7b-compacted — 4-tower omni whitelist Every text LLM in the catalog runs the full Open LLM Leaderboard v2 pack (IFEval/BBH/MATH-Hard/GPQA/MMLU-Pro/MuSR) plus the code pack (HumanEval/HumanEval+/LCB v6) where applicable. Vision targets run the 4-benchmark VL pack (MMMU/ChartQA/DocVQA/AI2D — currently stubs, will graduate when first VL forge runs).
Two architectural corrections in one round:
1. PUBLISHER IS OFF BY DEFAULT.
Sentinel-ai's job is forge + assay. Continuum is the publication
gatekeeper. Auto-publishing from sentinel was never the plan and was
wrong to wire in. FactoryWorker.publisher is now Optional[Callable]
with default None; the production CLI requires --publish to opt in
(intended only for staging-environment integration tests). Continuum
reads finished/ on its own schedule and decides what ships.
2. ASSEMBLY-LINE METAPHOR.
Toyota Production System is a cleaner mental model than alchemy for
what this loop actually is. Renamed:
queue/pending/ → line/intake/ parts entering the line
queue/running/ → line/assembly/ currently being built
queue/done/ → line/finished/ in the shipping bay
queue/failed/ → line/rework/ QA-flagged, needs human
Method renames track:
pop_oldest_pending → pop_oldest_intake
mark_done → mark_finished
mark_failed → mark_rework
STATIONS replaces BUCKETS as the iteration constant.
The metaphor makes the gate question architecturally crisp: the gate
isn't on the alloy, it's at the shipping door (continuum). The alloy
declares targets; continuum's release flow reads the eval results in
finished/ and decides ship/rework. Sentinel never has to know what
'good enough' means — that's a continuum policy decision, downstream
of the assembly line.
Seed catalog re-runs cleanly into .factory/line/intake/ — 9 viral
targets queued. Diagram updated in both sentinel-ai/README.md and
continuum/docs/architecture/FACTORY-PIPELINE-UI.md.
177 passed, 1 skipped.
The 9 viral targets in the seeder catalog now ship with the part spec
attached — each alloy carries its own acceptanceCriteria block declaring
the floors continuum will gate against in the shipping department.
Three helpers:
_coder_acceptance(max_vram_gb, anchor_delta_pp=-3.0)
humaneval_plus floor 0.55, plus the 4.1.3.4 anchorDelta gate
(forged score within |delta| points BELOW the base anchor in the
same eval pipeline). Default delta is -3.0; the qwen3-coder-30b
v2 re-forge declares -3.7 to lock the morning flagship's gate.
_general_acceptance(max_vram_gb)
Open LLM Leaderboard v2 floors at the median of the current
public leaderboard for each weight class.
_vl_acceptance(max_vram_gb)
Vision-language floors: MMMU 0.40, ChartQA 0.50, DocVQA 0.55,
AI2D 0.55.
Reseeded into .factory/line/intake/ — all 9 alloys carry their gates.
Continuum's shipping flow (separate, not yet built) will read these
off the finished/ manifest and decide ship/rework.
…t name
'domain' in make_dataloaders is a registry key from a fixed enum:
('code' | 'reasoning' | 'general' | 'chat' | 'science'). The actual
HF dataset (e.g. 'Salesforce/wikitext') is mapped FROM the key
inside make_dataloaders, not stored as the value.
Previously default_train_params returned domain='wikitext' which
got rejected as 'Unknown domain wikitext' downstream. Fix: return
'general' (the key for text recovery) for non-coder models, 'code'
for coder models.
The 'dataset' field is also dropped since it's redundant — the
domain key picks the dataset.
234/234 still passing.
…p (CRITICAL eval bug)
The eval pipeline was producing perplexity ~10-30x worse than reality
across every published model. Granite shipped with baseline ppl 105
(real: 9.28), Qwen2.5-7B-Instruct shipped with baseline ppl 263
(real: 8.70). Both forge cards have been updated with withdrawal
notices.
Root cause:
out = model(input_ids=ids, attention_mask=mask, labels=ids)
~~~~~~~~~~
labels=ids passed the input_ids as labels with NO PAD MASKING. The
make_dataloaders tokenizer uses padding='max_length' (line 448) which
pads every sample to cfg.seq_len (typically 2048). A 50-token wikitext
sample becomes 50 valid tokens + 1998 pad tokens. The model's CE loss
then computes loss across ALL 2048 positions including the 1998 pad
positions, where the model has no signal — it produces near-uniform
logits at pad positions giving loss ~ ln(vocab_size) ~ 12.
Average that 12 across 1998 pad positions with the 6-ppl real loss
on 50 valid tokens and you get the inflated ~250-300 ppl figures we
shipped. This was wrong in BOTH the evaluate() pipeline AND the train
loop (the LoRA recovery was learning to predict pads, not language).
Fix (one line, two places):
labels = ids.clone()
labels[mask == 0] = -100 # HF ignore sentinel for CE loss
out = model(input_ids=ids, attention_mask=mask, labels=labels)
This is the standard HuggingFace pattern. The CE loss function skips
positions where labels == -100, so the resulting loss is the average
over VALID tokens only.
Now BOTH evaluate() and the LoRA training inner loop apply the mask.
The next forge run will produce honest baseline numbers and a real
LoRA recovery (no more 'learning to predict pads').
234/234 unit tests still passing. Real verification needs a re-eval
on bigmama against the published artifacts.
…t (CRITICAL load bug) Two bugs surfaced by re-evaluating the published qwen2-5-7b-instruct-compacted: RuntimeError: You set 'ignore_mismatched_sizes' to 'False', thus raising an error. The model's saved config.json claimed N attention heads but the actual safetensors had a different shape per layer. Loading via AutoModelForCausalLM.from_pretrained failed for everyone. The artifact was published, looked successful, but was non-functional. ROOT CAUSE — defrag mode + per-layer shape divergence: QwenDenseBase.prune() called defrag_live_model() without specifying a mode, which defaulted to 'slice'. Slice mode physically removes pruned head rows from q_proj/o_proj. When different layers prune different head counts (which happens when the importance metric is per-layer non-uniform), each layer ends up with a DIFFERENT q_proj shape. But model.config.num_attention_heads is a single scalar that can only describe ONE shape. The saved config matches layer 0 and mismatches every other layer. FIX 1 — adapter level, never branch the code path: defrag_live_model(ctx.model, dead_heads=heads, mode='pad') Pad mode preserves the original q_proj wire shape and zeros dead head positions in place. All layers stay uniformly shaped, the saved config matches every tensor, from_pretrained() works for everyone downstream. Tradeoff: the saved safetensors are slightly larger (zeroed dead head positions are still stored), but the artifact is loadable, which is the only requirement that matters. FIX 2 — save-then-reload smoke test in forge_model.py: After save_pretrained(), immediately try to load the just-saved model via AutoModelForCausalLM.from_pretrained(model_dir). If it fails to load, raise RuntimeError with a clear pointer to the defrag/config mismatch. Catches THIS class of bug (and any future shape-divergence bug) at forge TIME, not at publish time. The smoke test is the architectural fix for 'we shipped an artifact nobody can load'. It's the same shape as the §4.1.4.1 anchor- reproduction discipline gate but applied to the loader contract: the forge MUST produce a model that anyone with vanilla transformers can load. If the smoke test fails, the forge fails. No silent skip. 234/234 unit tests still passing.
…ode (auto-recovery)
The pattern that just got us BigMama back online (idempotent post-
power-failure recovery) deserves to live in the repo, not in our
heads. bootstrap-hive-node.sh codifies the 7 things every forge grid
node needs to come back from a power failure / drive install / fresh
ubuntu install:
1. Generate ed25519 SSH key for github (idempotent)
2. Add github.com to known_hosts
3. Persist HF_TOKEN + WSL nvidia-smi PATH to ~/.bashrc BEFORE the
non-interactive guard so 'ssh node command' inherits them
4. Install ~/start-factory-daemon.sh wrapper (one-command recovery)
5. Verify ssh.socket + tailscaled autostart so the node comes back
online after every power-failure
6. Print the public key for the operator to register on github
7. Validate github auth (skips if not yet registered)
Designed for the typical scenarios:
- Post-power-failure recovery (BigMama 2026-04-09)
- Fresh ubuntu install on a new donated 5090
- Switching from HTTPS git auth to SSH auth
- Re-running after partial setup
Every step is idempotent, every step prints what it did, no step
silently fails. Once a node passes this script clean, it can be
remote-controlled from FlashGordon (or any operator box) without
further interactive setup.
Joel: 'this is how its done... will need to be part of the setup
built into windows/wsl maybe linux period (people are having ubuntu
install issues)'.
Also includes:
- .gitignore: .factory/ (per-node queue state, never commit)
- PLUGIN-SPRINT.md: today's session writeup with the 2 shipped+pulled
models, the 17 bugs caught, the live forge story, what shipped
Per-node config that overrides auto-detection. Lives at
<queue_root>/factory_node.toml. Single source of truth for which
storage paths belong to which cache tier on this node.
The mental model is L0..L5 cache hierarchy:
L0 GPU VRAM volatile, microseconds, $$$$
L1 System RAM volatile, nanoseconds, $$$
L2 Hot SSD persistent, ~50µs, $$
L3 Cold HDD persistent, ~5ms, $
L4 Network archive persistent, seconds, $
L5 HuggingFace re-fetchable, infinite, free
factory_node.toml only describes L2+. The grid (continuum) eventually
reads this file across all nodes to make routing decisions: 'don't
push a Mixtral 8x22B forge to a node whose hot tier has only 500GB
free; pick the node with the WD Red Pro 16TB cold tier instead.'
New types in factory_storage.py:
ColdTier — one declared cold tier (name, path, fs_type,
write_mb_per_sec, purpose)
FactoryNodeConfig — top-level config (node, hot, cold tiers, grid)
.from_file(path) — load from TOML, returns None on missing/invalid
.first_cold_path()— convenience for auto_cleanup integration
auto_cleanup() now accepts config_aware=True. When set:
- explicit cold_root parameter still wins (operator override)
- else load factory_node.toml from root and use first cold tier
- else fall back to delete-and-let-HF-refetch (current behavior)
FactoryWorker.process_one() passes config_aware=True so the daemon's
cleanup pass picks up factory_node.toml automatically. The CLI
--cleanup-cold-root is now optional — set it for explicit override,
omit it to let the config decide.
bootstrap-hive-node.sh now writes ~/factory_node.toml.example so any
fresh node has a sensible template to copy and customize.
The pattern: declarative config wins, auto-detection is the bootstrap
fallback, both coexist. Future continuum grid layer reads the same
config file remotely to coordinate multi-node forges.
9 new tests, 243/243 passing (was 234).
… speed - docs/FACTORY-PROTOCOL.md: disk protocol as API contract (Kash's most-important-deliverable). Directory layout, file schemas, state machine, consumer contract, extensibility for non-forge workloads, risk register, versioning. Includes Kash review refinements: ship role definition, alloyChainHash + signatureBundle + priorMetricBaselines on result.json, sidecar glob contract, max_retries as [forge] contract constant. - docs/FRONTIER-DEFERRED-CATALOG.md: MiniMax-Text-01, Hunyuan-Large, Snowflake Arctic — frontier MoE candidates needing new family adapters before forging. - factory_storage.py: ColdTier.read_mb_per_sec for asymmetric cold-tier speed metadata (grid scheduler wall-clock estimates).
At 5-8 Gbit symmetric residential, HF becomes a first-class storage tier and peer nodes on the Tailscale mesh can serve source weights at LAN speed via gossip-the-hash. New [[storage.network]] schema block + storage tiers section + multi-Gbit unlock note.
…umbing
Three keystone fixes from the 2026-04-09 BigMama Mixtral 8x7B crash:
1) Streaming-load path in forge_model.load_model (forge_model.py).
The CPU-first weight load (device_map="cpu" then .to("cuda"))
loads the entire model into CPU RAM before moving to GPU. For
Mixtral 8x7B (~93GB fp16) on a 62GB WSL2 ceiling, this hits the
memory limit at ~100/291 shards and the OOM killer takes the
daemon mid-load with SIGKILL (no chance to write an error).
This was the actual crash mode observed.
Fix: new streaming=True parameter on load_model that uses
Accelerate's device_map="auto" with explicit max_memory
constraints + disk offload to /mnt/d/cold/hf-offload (the cold
tier). Each shard loads, gets placed on its target device, next
shard loads. Peak CPU memory becomes one shard at a time plus
working overhead, NOT the whole model. Anything that doesn't
fit on GPU+CPU spills to the cold tier.
alloy_executor.py decides when to use streaming based on the
heuristic model_fp16_gb > vram_gb. Mixtral 8x7B (93 > 32) gets
streaming. Mixtral 8x22B (~280 > 32) gets streaming — and is
literally the only path that lets it load on consumer hardware
regardless of WSL2 memory ceiling. Small models keep the
existing CPU-first path so the RTX 5090 + Mamba2 sm_120 kernel
workaround stays active.
2) Heartbeat hardening (factory_queue.py).
The heartbeat used to be written inline at the start of
process_one and on each loop iteration of run_forever. During
a long-blocking executor call (the actual forge), the heartbeat
stayed frozen at "building" with no last_beat_at update. If
the daemon then died mid-forge, .heartbeat.json would lie
indefinitely about state="building" with a stale timestamp
and a dead PID. We observed this exact lying-stale-heartbeat
in the wild after the Mixtral 8x7B crash.
Fix: spawn a daemon thread on FactoryWorker.__init__ that
ticks every heartbeat_interval_seconds (default 30s) and
rewrites .heartbeat.json with the current in-memory state
independently of process_one. The thread runs as long as the
daemon process runs; it dies with the process (daemon=True)
so consumer-side stale-PID detection still works the same way.
Inline write_heartbeat calls are replaced with _set_heartbeat
which updates the in-memory state AND writes through
immediately, so consumers reading right after a state
transition see the new state without waiting for the next tick.
3) priorMetricBaselines[] field plumbing (factory_queue.py).
The field is defined in FACTORY-PROTOCOL.md as part of the
v0.1 sidecar spec but the daemon never read it through to
result.json. Many-Worlds-v0 validation needs this field to
land its random-substrate negative-baseline result with
§4.1.3.4-style provenance from day one — without it, the
negative baseline has nowhere to live structurally.
Fix: in process_one after the executor returns, read the
forged alloy file for any results.priorMetricBaselines[]
array and propagate it through the manifest into the
result.json sidecar. Best-effort, backwards compatible
(degrades to empty list when the field is absent).
All 27 existing factory_queue + factory_daemon tests pass against
the patched code. The streaming-load path is purely additive
(opt-in via a parameter that defaults to False); the heartbeat
hardening is structurally additive (the inline writes still
happen, the thread is the new redundant safety); the
priorMetricBaselines plumbing degrades to a no-op when no
baselines are present.
This unblocks: Mixtral 8x7B retry on bigmama tonight, Mixtral
8x22B forge as the next viral headline (literally not loadable
without streaming), Many-Worlds-v0 tiny-scale validation as the
first paper anchor experiment, every future big-MoE forge.
…l_info The previous patch's streaming-load decision used ctx.info["fp16_gb"] which is computed from dense-model param math (h, n, intermediate_size) in get_model_info. This DRAMATICALLY undercounts MoE models because the math computes per_layer_mlp = h * inter * 3 — i.e. ONE expert MLP — but Mixtral 8x7B has EIGHT experts per layer. For Mixtral 8x7B the dense math returns ~14GB (one expert) while the actual model is ~93GB. The streaming decision then said "14 < 32, no streaming needed" and routed the load through the CPU-first path that OOM-killed the daemon at ~100/291 shards. The first patch's streaming path was correct; the decision logic that gates it was wrong for MoE. Fix: resolve ctx.source_model_dir EARLY (before load_model is called) so we can measure the actual safetensors file sizes on disk and use those for the streaming decision. The disk size doesn't lie — it's the literal number of bytes that need to be loaded, regardless of whether the model is dense, MoE, hybrid attention, vision-encoder- augmented, or anything else get_model_info undercounts. For Mixtral 8x7B: on_disk_gb ≈ 93 > vram_gb 32 → streaming activates → load proceeds via Accelerate's auto device_map with disk overflow to the cold tier. For Mixtral 8x22B: on_disk_gb ≈ 280 > 32 → streaming. For small dense models that already fit comfortably: on_disk_gb < vram → existing CPU-first path stays active (preserves the RTX 5090 + Mamba2 sm_120 kernel workaround). The post-load source_model_dir resolution block is now a no-op for the case where the early resolution succeeded; it stays in place as a safety net for any code path that bypasses execute_alloy. Caught in production on bigmama 2026-04-09 immediately after the streaming-load patch deployed: the new patch's log line "Loading fp16 STREAMING via..." never appeared, and the old "Loading fp16 (CPU → CUDA)" line did. Diagnosis took 5 minutes; this fix took 10. The heartbeat thread held up perfectly during the diagnosis — it was the only reason I knew the daemon was still alive without polling.
Every forge node MUST have its source-weight cache on a native Linux filesystem (xfs preferred, ext4 acceptable). drvfs / 9p / ntfs-3g / CIFS are forbidden for source-weight reads because they will silently wedge mid-forge on big-MoE models. This doc exists because we lost ~14 hours to a drvfs hang during a Mixtral 8x7B forge on 2026-04-10. The drvfs layer wedged in p9_client_rpc during the weight load, the main thread entered uninterruptible D state, the GIL was held inside the blocked C extension so the heartbeat thread couldn't run either, and the only recovery was wsl --shutdown from a Windows PowerShell. We reformatted the cold drive as xfs native in WSL2, and the same forge completed the load phase in 14 minutes without any hangs. Contents: - TL;DR for operators who just want the commands - Why drvfs is unsuitable (the p9_client_rpc diagnosis) - Why xfs specifically (designed for big-file sequential I/O) - WSL2 setup walkthrough (Windows PowerShell + wsl --mount --bare + mkfs.xfs + symlink HF cache) - Native Linux setup (much simpler subset) - Network storage caveats (don't, unless you know the failure modes) - Validation sequence (dd throughput, download test, load test) - Troubleshooting (common errors including the systemd warning on xfsprogs install, wsl --mount availability, reformat-wrong-drive recovery) - Known lessons from the BigMama incident (drvfs silent kills, get_model_info MoE undercounting, heartbeat thread GIL limitation, xfs journal surviving power loss) Cross-referenced with HIVE-NODE-OPERATOR.md and FACTORY-PROTOCOL.md storage tiers section. Forward-references continuum/docs/foreman/ for when the Foreman role eventually automates this setup.
Per Joel 2026-04-10: "these would be excellent things to emit as events back to continuum." Polling ssh is a stopgap; events are the correct abstraction — continuum's universal primitives are Commands.execute() and Events.emit()/subscribe(), and the forge daemon should be a first-class event producer. Implementation — the smallest shippable version that fits the disk-protocol-as-API-contract pattern: - New FactoryQueue.emit_event(kind, **payload) method appends a JSON line to .events.jsonl alongside the existing .heartbeat.json and throughput.jsonl sidecars. Best-effort, swallows exceptions, never blocks a forge on event emission failure. - New FactoryQueue.read_events(since_timestamp, limit) helper for subscribers that want to read the file in batches. - FactoryWorker.process_one() now emits at every transition: forge/started when pickup from intake, forge/stage/started and forge/stage/completed bracketing the executor call and the optional publish call, forge/rework on any exception, forge/completed on successful finish. Each event carries elapsed_s and kind-specific payload (source_model, stages, forged_dir, modelHash, etc.). - The alloy file is parsed once on pickup to extract metadata (source_model, stages list, name) for inclusion in forge/started — best-effort, if the alloy is malformed the executor fails anyway. FACTORY-PROTOCOL.md v0.2 now documents the .events.jsonl sidecar as a first-class protocol element alongside .heartbeat.json and throughput.jsonl. Includes the event schema, the seven initial kinds, required payload fields per kind, example event stream, compatibility rules (tolerate unknown fields, never remove or change semantics of existing fields without a major version bump), rotation semantics, and three subscriber patterns (tail-and-parse, batch read with since-timestamp, republish bridge to continuum's native Events pub/sub). The stream is observability, not load-bearing state. Canonical state lives in .heartbeat.json (liveness + current part) and the station directories (where each alloy physically sits). Events are the history of how state changed, not the state itself. A lost event is a gap in observability but state remains authoritative via the canonical sources. New scripts/forge_events_tail.py is a reference subscriber — reads the file in batches or follows it live (tail -f semantics), formats each event as a human-readable line or raw JSON. Replaces the ssh-and-tail-the-log polling pattern once deployed. Once continuum's Events.emit() bridge is running, this script becomes a reference implementation of how to consume the file-based stream — the same output can come from subscribing to continuum's native pub/sub. All 27 existing factory_queue + factory_daemon tests pass. Deploy plan: wait for Mixtral 8x7B to land in finished/, then deploy via ssh bigmama git pull + daemon restart. The very next forge (Mixtral 8x22B per the ROADMAP-VIRAL-CANDIDATES.md sequence) runs with event telemetry from the start.
Adds the Qwen3.5-35B-A3B-Instruct recipe to the catalog as Row 4 of
the 5-row cross-family anchor table per
continuum/docs/papers/ROADMAP-VIRAL-CANDIDATES.md.
Strategic significance:
- The actual forge-target floor per Joel's standing memory (Qwen3.5+
only, feedback_qwen35_only.md, project_qwen35_forge_targets.md).
Previous catalog rows targeted Qwen3-coder (different family) and
Mixtral, leaving Qwen3.5 as an absent forge-target floor.
- The regression test of the shared adapter base. Qwen3.5 MoE has
hybrid attention (linear + full), requiring Strategy A (skip
non-full-attention layers during surgery) from sentinel-ai#163.
The Strategy A code paths in forge_model.py (is_full_attention_layer,
has_hybrid_layers) have NOT been exercised end-to-end since before
the recent Mixtral-focused work. This recipe is the regression run
that proves the shared base hasn't drifted under all the Mixtral
focus.
- A successful forge validates "adapters not branches" as empirical
principle (feedback_adapters_not_branches memory).
- A failed forge surfaces drift and we fix before continuing.
Recipe is marked with TODO comments on all fields that require
verification against the actual HF config.json before queueing:
- exact HF repo name (may differ from the Qwen/Qwen3.5-35B-A3B-Instruct
placeholder)
- architecture discriminator string (may be "qwen3_next",
"qwen3_5_moe", or a new key; if new, needs new adapter or
registration against existing Qwen3MoEAdapter)
- all source_geometry fields (numLayers, hiddenSize,
moeIntermediateSize, numExpertsPerLayer, contextLength, license)
- family adapter expert_activation_profile + expert_prune stages
must propagate has_hybrid_layers detection
Placed in the catalog between the qwen3-coder-30b-a3b-v2 re-publish
recipe and the Qwen3-VL recipes — the Qwen family continuation slot.
The core primitives + stage executor scaffolding for Milestone 3
(Many-Worlds v0 validation) per continuum/docs/papers/MANY-WORLDS-ABSTRACT.md
and continuum/docs/papers/ROADMAP-VIRAL-CANDIDATES.md. Written in
parallel with the Mixtral 8x7B forge run so the scaffolding is ready
to use the moment Milestone 1 (Mixtral 8x22B) and Milestone 2
(cross-family anchor table) complete.
2218 lines across 6 files in scripts/many_worlds/:
- __init__.py (66 lines) — package entry point, documentation, lazy
imports so `import many_worlds` doesn't require torch.
- substrate.py (437 lines) — SubstrateVectorSpace: the learned
continuous coordinate space. Real-valued vector space with
diagonal Gaussian parameterization per token (Kash's correction
to the hand-wavy "metaphorical Gaussian" framing). Learned basis
matrix + learned read temperature + weight-normalized basis init.
write() converts per-token (mu, log_var) pairs into basis-space
field assignments; read() is the symmetric reverse operation.
Save/load persistence. Lazy torch module construction so Tier 1
dispatch works without torch.
- project_read.py (410 lines) — ProjectModule + ReadModule +
AdapterPair: the per-base-model adapters. LoRA-style (down_proj
→ dropout → activation → {mean, log_var} heads for Project;
up_proj → dropout → activation → out_proj for Read). Zero-init
on output heads so adapter starts as a no-op contribution.
Learnable output_scale parameter that grows during training.
enabled/disabled flag for the §VII.4 Condition A text-bottleneck
baseline and for native-preservation self-test.
- framework.py (483 lines) — ManyWorldsFramework: top-level
orchestrator holding substrate + population of PopulationMember
records + query-face routing. add_member() declares population
without training; attach_adapter() wires a trained pair to a
member; project_residual(), read_into(), cross_project() are the
core operations. save()/load() produces a directory with
manifest.json + substrate.pt + adapters/ subdirectory per member.
Enable/disable all adapters for the §VII.4 five-condition test.
- losses.py (334 lines) — the two-term training objective per
Kash's discipline-gate correction. Phase A: contrastive alignment
(InfoNCE-style over population members) + round-trip reconstruction
(MSE/cosine/L1). Phase B: round-trip fidelity + cross-model
transfer + native preservation regularization. Both phases return
(total_loss, metrics_dict) for easy logging.
- stages.py (488 lines) — forge-alloy stage executor scaffolding.
SubstrateTrainExecutor (Phase A), AdapterTrainExecutor (Phase B),
ManyWorldsEvalExecutor (the §VII five-condition comparison).
Each executor has its full algorithm documented inline as the
scaffold's docstring; the actual torch training loop body is
stubbed as NotImplementedError with clear TODOs pointing to the
training-loop files that will land in follow-up commits
(train_substrate.py, train_adapters.py, eval_v0.py).
This package is the concrete architectural embodiment of Joel's
"destroy them with their own weight" strategic thesis: every line
is written to operate on frozen, publicly-released weights from
HuggingFace and make them do something the releasers cannot do —
namely, coordinate with each other at the representation layer via
a shared substrate. The ammunition for the revolution is already
published; the primitives in this package are the mechanism that
turns published weights into a coordinated alternative.
All 6 files syntactically valid (ast.parse). Torch is not required
for package import; it's loaded lazily inside the methods that
actually need it.
Ready to be consumed by the v0 driver (train_substrate.py,
train_adapters.py, eval_v0.py — separate follow-up commits) and by
the forge-alloy schema extensions that register the new stage
types (substrate-train, adapter-train, many-worlds-eval).
Attribution per MANY-WORLDS-ABSTRACT.md:
Joel — framework naming, economic argument, multi-model fusion
vision, "destroy them with their own weight" strategy
Dorian — the foundational LoD primitive this extends
Kash — empirical discipline gate, prior-art positioning, loss
design (two-term objective insistence)
Claude — this code, architecture sketch, package structure
Strategic placement: this commit is scaffolding written during the
Mixtral 8x7B forge run, in parallel with active monitoring. It's a
demonstration of Joel's "the flywheel must be continuous" principle
in its sustainable form — code work that doesn't require BigMama
attention, that advances the roadmap's Milestone 3 prerequisites,
and that persists across sessions via git so future Claude instances
can pick it up without conversation distillation loss.
…sses
71 unit tests across 4 test files covering the Many-Worlds primitives
in isolation. All pass on first run (with one tensor-construction
warning fix in substrate.py included in this commit).
Test coverage:
tests/unit/many_worlds/test_substrate.py (17 tests)
- SubstrateConfig construction + serialization roundtrip
- SubstrateVectorSpace lazy module build
- Parameter enumeration for optimizer
- All 3 init strategies (orthogonal, xavier, normal)
- write() tensor shape contract + softmax row sums to 1
- write() handles variable seq lengths
- write() clamps extreme log_var values
- read() tensor shape contract
- read() as weighted basis combination
- Save/load roundtrip with trained flag preservation
- Differentiability of write() and read()
tests/unit/many_worlds/test_project_read.py (16 tests)
- AdapterConfig construction + serialization roundtrip
- ProjectModule output shape (B, S, substrate_dim) for both heads
- Zero-init behavior: fresh Project/Read produce near-zero outputs
- enabled=False returns zero tensors
- set_enabled toggles behavior
- ReadModule output shape (B, S, residual_hidden_size)
- AdapterPair construction and parameter enumeration
- AdapterPair save/load roundtrip
- Differentiability of both modules
tests/unit/many_worlds/test_framework.py (21 tests)
- FrameworkConfig defaults and serialization
- Population management: add_member, get_member, duplicate detection
- Default layer_idx computed from default_layer_fraction
- Adapter attachment with shape validation (residual_hidden_size
and substrate_dim must match)
- disable_all_adapters / enable_all_adapters population-wide
- cross_project shape contract (source residual size → target residual size)
- project_residual raises if adapter not attached
- substrate_parameters + adapter_parameters (global + scoped)
- Full save/load roundtrip with empty and non-empty populations
tests/unit/many_worlds/test_losses.py (17 tests)
- contrastive_alignment_loss: two/three member populations
- contrastive_alignment: perfect alignment yields lower loss than random
- Single-member population returns zero (no contrastive signal)
- round_trip_reconstruction_loss: MSE, cosine, L1
- MSE is zero for identical tensors, cosine is zero for identical
- Unknown loss_type raises ValueError
- native_preservation_loss: zero below max_scale, quadratic above
- Handles negative scales (abs value)
- phase_a_loss: structure, metrics dict, weight effects
- phase_b_loss: structure, zero when all weights are zero
- phase_b_loss is differentiable
Also included: small tensor construction warning fix in substrate.py
where `torch.tensor(torch.log(torch.tensor(temp_init)))` was producing
a deprecation warning. Replaced with `torch.tensor(math.log(temp_init),
dtype=torch.float32)` which is the idiomatic form. Remaining warnings
(11) are all from torch.nn.utils.weight_norm being deprecated — a
clean swap for later but not blocking.
Test run: `python3 -m pytest tests/unit/many_worlds/ -q` → 71 passed,
15 warnings in 0.13s.
The scaffolding is verified. The Many-Worlds primitives work correctly
in isolation; all that's left for a working v0 validation is the
training loops (train_substrate.py, train_adapters.py) and the
five-condition eval driver (eval_v0.py). Those are the next files to
land, unblocking the actual §VII empirical validation the moment
Milestones 1 and 2 complete and it's Milestone 3's turn.
Root cause diagnosed via sudo py-spy dump on bigmama 2026-04-10:
Mixtral 8x7B with streaming-load (device_map="auto", fp16 split across
GPU+CPU) caused EVERY forward pass to trigger Accelerate's
set_module_tensor_to_device (CPU⇔GPU layer swapping) for each of the
32 transformer layers. Each forward pass took minutes instead of
milliseconds. The daemon ran for 90+ minutes and completed exactly
ONE forward pass of the baseline eval.
Fix: three-way load strategy decision tree:
(a) Model fits on GPU in fp16 → existing CPU-first path (fast).
(b) Model too big for fp16 BUT fits in 4-bit → force 4-bit load.
The entire model lands on GPU in quantized form. Forward passes
are GPU-bound, fast, no device swapping. For Mixtral 8x7B:
~93GB fp16 → ~27GB 4-bit → fits in 32GB VRAM.
(c) Model doesn't fit even in 4-bit → streaming-load with
device_map="auto" and disk overflow (the only option for truly
huge models like Mixtral 8x22B at ~70GB in 4-bit on 32GB GPU).
WARNING: forward passes will be slow due to CPU⇔GPU swapping.
Why 4-bit profiling is valid for the activation profile stage:
- The router gate is a tiny Linear(4096→8) layer — 32K params,
negligible quantization error
- We're counting WHICH experts get selected by topk, not the
magnitude of the logits — relative orderings are robust to quant
- The calibration corpus (300+ examples) is large enough that even
if a few tokens flip expert selection due to quant noise, the
aggregate counts are stable
- The expert-prune stage downstream reads fp16 safetensors from
ctx.source_model_dir (disk), NOT the in-memory quantized model.
Pruning precision is unaffected.
The streaming-load path (c) is still needed and tested for Mixtral
8x22B which literally cannot fit on a single GPU in any precision.
That case will be slow due to device swapping — plan for hours-long
activation profiles on streaming-loaded models.
All 27 factory tests pass. The ForgeConfig override for path (b)
constructs a fresh tier-C config inline since the original
ForgeConfig.auto() was deceived by get_model_info's MoE undercount
(fixed in 3efd4b4 for the streaming decision but the auto() function
itself still uses the wrong number — fixing auto() is a separate
commit to avoid cascading changes tonight).
BnB 4-bit + device_map="auto" triggers validate_environment which
refuses to proceed if any module would spill to CPU, even when the
4-bit model actually fits on GPU. This was the failure on bigmama:
Mixtral 8x7B at ~27GB 4-bit on 32GB VRAM → "auto" said some modules
dispatched to CPU → ValueError before loading even started.
Fix: use device_map={"": 0} which forces all modules to cuda:0
without asking BnB for permission. If the model truly doesn't fit,
we get an honest CUDA OOM at load time (recoverable) instead of a
preemptive validation refusal.
Third attempt at the Mixtral 8x7B load strategy. History:
1. fp16 streaming (device_map=auto, no 4-bit): loaded successfully
but forward passes were pathologically slow — py-spy showed the
main thread pinned in set_module_tensor_to_device doing CPU⇔GPU
layer swaps. Each forward pass took minutes. Diagnosis: correct.
2. 4-bit forced to GPU (device_map={"": 0}): CUDA OOM. Mixtral 8x7B
at 4-bit with BnB overhead (scales, zero points, fp16 embed/lmhead,
buffers) exceeds 32GB VRAM. The 26.7GB estimate was wrong.
3. THIS FIX: 4-bit with device_map="auto" + llm_int8_enable_fp32_cpu_offload=True.
BnB's recommended hybrid path. Most of the model stays on GPU in
4-bit; overflow modules (embed, lm_head, a few expert layers that
don't fit) go to CPU in fp32. Forward passes are MOSTLY GPU-bound
with only occasional CPU access for the overflow — way faster than
the fp16 streaming path that was swapping entire transformer layers.
Despite the "int8" in the flag name, llm_int8_enable_fp32_cpu_offload
controls 4-bit mixed-device loading too. Without it, BnB's
validate_environment refuses to proceed. With it, the auto device
map splits the model across GPU+CPU with the GPU taking as much
as it can fit.
This is the correct load strategy for models that are:
- Too big for fp16 on GPU (Mixtral 8x7B at 93GB fp16 on 32GB)
- Too big for 4-bit-only on GPU (with BnB overhead, >32GB)
- Small enough that 4-bit + a few fp32 CPU layers is mostly-GPU
For Mixtral 8x22B (truly huge, ~70GB in 4-bit on 32GB GPU):
path (c) streaming fp16 is still the only option, and forward passes
will be slow. That's a Milestone 1 problem to solve separately.
MoE models (Mixtral, etc.) need offload_folder set during 4-bit quantized loading because the auto device_map may spill MoE expert weights to disk for re-saving. Without offload_folder, transformers raises 'provide an offload_folder for them in from_pretrained'. Uses the same /mnt/cold/hf-offload path as the streaming-load path. Created if it doesn't exist. Only consumed when the device map actually needs disk offload; ignored otherwise.
transformers 5.3.0 passes _is_hf_initialized kwarg when reconstructing Params4bit objects in set_module_tensor_to_device. BnB 0.49.2's Params4bit.__new__ doesn't accept it and raises TypeError. Monkey-patch filters the kwarg at the Params4bit.__new__ level. Kink #7 in the Mixtral 8x7B load sequence. TODO: remove when bitsandbytes >= 0.50.0 ships with native support for this kwarg.
The default streaming_offload_folder was /mnt/d/cold/hf-offload — the old drvfs NTFS path. After the xfs reformat, /mnt/d/ is no longer a mount point. mkdir -p created /mnt/d/cold/ on the ROOT filesystem, and 4-bit MoE offload writes filled ROOT to 100% (43 GB of offloaded expert weights landed on / instead of the xfs cold tier). Kink #8. Fix: default offload path now /mnt/cold/hf-offload (the xfs mount). Also cleaned 43 GB of stale offload data + 15 GB of old work dirs from the hot tier, recovering 58 GB.
Template with {{PLACEHOLDER}} fields for benchmark numbers, model hash,
alloy hash, quant links, and timing that get filled in from result.json
when the forge completes. Everything else is ready to publish:
- §4.1.3.4 methodology explanation with paired negative baselines table
- Consumer hardware story with the 8-kink production-issues list
- Cross-family anchor table showing this as Row 2
- GGUF quant tier download table (Q4_K_M through fp16)
- Alloy provenance (hash, commit, recipe file)
- Usage examples (transformers + llama.cpp + Ollama)
- Attribution (Joel, Dorian, Kash, Claude)
- Contributing invitation matching the README rewrite
When the forge lands in finished/, fill the placeholders from
result.json + the alloy's results block and publish.
BnB 0.49.2's QuantState.as_dict() calls self.offset.item() during
accelerate's dispatch hook installation. When accelerate moves the
quant_state to the meta device for deferred materialization, .item()
raises RuntimeError('cannot be called on meta tensors').
This is a BnB bug, NOT a transformers/accelerate version issue —
reproduces with both transformers 5.3.0 and 4.57.6.
Patch: materialize meta-device offset as a CPU zero tensor before
as_dict runs. The offset is a nested-quantization correction that
defaults to zero when uninitialized, so this is safe. Also patches
nested state2.offset for double-quantization (bnb_4bit_use_double_quant).
Kink #10 in the Mixtral 8x7B load sequence. Combined with patch 1
(Params4bit kwarg filtering), these two patches are the full BnB
0.49.2 compat layer needed for 4-bit hybrid loading of MoE models
on consumer hardware.
…#11) MixtralAdapter.expert_activation_profile called profile_experts() without passing gate_attr_path, which defaults to 'mlp.gate' (the Qwen3MoE path). Mixtral's router gate lives at 'block_sparse_moe.gate'. Result: 'hooks registered on 0/32 layers' and zero activation counts. One-line fix: pass gate_attr_path='block_sparse_moe.gate' from MixtralAdapter. The profile_experts API already supports the parameter; the adapter just wasn't using it. This is the LAST kink before the activation profile actually runs. Baseline eval already completed successfully (ppl=8.14, 27/27 batches) proving the 4-bit hybrid load + dispatch + forward passes all work. The activation profile is the only remaining untested stage.
Both MixtralAdapter and MoEUnfusedExpertsBase reloaded pruned models via raw AutoModelForCausalLM.from_pretrained with device_map="auto" but NO BitsAndBytesConfig. For a 70.9 GB pruned Mixtral 8x7B on a 32 GB GPU, this loaded in fp16 across CPU+disk, causing the same CPU⇔GPU swap pathology from kink #3 — each forward pass during the post-prune eval took minutes instead of seconds. Fix: replace raw from_pretrained with load_model() which has all the 4-bit hybrid path logic (BnB config, fp32 CPU offload, offload folder, BnB compat patches). The reload now measures the pruned model's on-disk size and decides fp16 vs 4-bit the same way the initial load does. For Mixtral 8x7B pruned (70.9 GB > 32 GB VRAM): 4-bit hybrid → ~20 GB on GPU → fast eval. Applied to both: - scripts/adapters/sota_moe.py (MixtralAdapter, PhiMoEAdapter, GraniteMoEAdapter, DeepSeekV2Adapter) - scripts/adapters/moe_unfused_base.py (Qwen3MoEAdapter, OLMoEAdapter) Every post-prune reload in every adapter family now goes through load_model(). The 4-bit decision is based on on-disk size vs VRAM, same as the initial load. No more raw from_pretrained bypassing the consumer-hardware accommodations. Kink #13 of 13 in the Mixtral 8x7B forge. The current forge run will complete slowly (fp16 reload already in progress); the NEXT run will reload in 4-bit and eval in minutes.
…uned MoE Reverts post-prune reload to fp16 streaming. BnB 0.49.2 cannot handle 4-bit loading of pruned MoE safetensors — meta tensor errors in quant_state.code during forward pass (kink #14). The fp16 path is slow (~2-3 hours for eval) but produces valid results. TODO: switch to 4-bit when BnB >= 0.50 ships.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
TL;DR
The complete factory pipeline build-out from 2026-04-09: end-to-end forge → assay → publish loop running on BigMama as a hive node daemon, every MoE family graduated to a real Tier 2 body, the Open LLM Leaderboard v2 and Open VLM Leaderboard runner packs added, 17 bugs caught and fixed by running it against real models on real GPU hardware.
What's new
Family adapter set — every MoE family is real
MixtralAdapter(zero duplicated body — same OOP rule the dense bases use)DEEPSEEK_V2_LAYOUT. Shared experts and the dense first layer verified bit-exact in synthetic E2E testFusedLayoutSpec+prune_experts_fusedslice along the expert axis instead of delete-and-renameqwen3_vlandqwen3_vl_moe(was onlyqwen2_5_vl/qwen3_5_vl)FamilyAdapter.model_auto_class()new hook — VL families returnAutoModelForVision2Seq, omni returnsAutoModel, defaultAutoModelForCausalLM. Replaces hardcoded loaderFamilyAdapter.default_train_params(ctx)new hook — adapter-driven training defaults (steps/LR scale by source.totalParamsB, domain picked from source.baseModel name). No hardcoded values in the seederEval runner pack
LmEvalHarnessRunnerbase + 6 thin subclassesLmmsEvalHarnessRunnerinheritingLmEvalHarnessRunner(samescore(), onlyevaluate()overridden)eval_with_calibration.pymigrated to registry dispatch — the §4.1.4.1 anchor-reproduction discipline gate now uses the same axis as production scoring (no more if-elif chains)ExpertActivationProfileExecutorregistered intransform_stages.py— was missing entirely, causing silent stage skipsexpert_activation_profile.profile_experts(gate_attr_path=...)accepts the family-specific router gate path (mlp.gatefor Qwen3MoE,block_sparse_moe.router.layerfor Granite)Factory pipeline (the hive node loop)
scripts/factory_queue.py— long-running daemon, disk-backed assembly line (intake//assembly//finished//rework/), atomic part transitions, crash recovery via stale-PID detection, retry counter encoded in filename,.heartbeat.json+ PID lock +throughput.jsonlaudit logscripts/factory_storage.py— S3-style storage tiers, reference counting, auto-cleanup of orphan work dirs,--cleanup-cold-rootfor the 7200rpm spinner--list,--list-station,--retry,--enqueue,--status --prettydashboard,--tail,--recoveralloy_hashing.compose_model_hashof the forged shards for chain-of-custodyscripts/seed_factory_queue.py— HF-verified catalog of 16 candidates, every recipe schema-validated AT SEED TIME (catches bugs minutes before the daemon picks them up)scripts/bootstrap-hive-node.sh— one-shot setup for any new forge grid node (idempotent post-power-failure recovery)forge-alloy schema (separate PR on forge-alloy repo)
AcceptanceCriterianew top-level field — the part spec, the gate, lives WITH the alloyExpertActivationProfileStageadded to the discriminated stage unionExpertPruneStage.keep_experts_per_layeras alias forkeepExpertsTrainStage.domain/steps/learningRatemadeOptional(adapter-driven defaults at runtime)17 bugs caught and fixed by the live BigMama smoke test
Every one was found by running it against a real model on real GPU hardware. None showed up in unit tests because they all required end-to-end execution.
expert-activation-profilestage typekeepExpertsPerLayerfield namectx.source_model_dirnever populated by alloy_executor → resolve from HF cache snapshot pathexpert-activation-profilestage type had no registered StageExecutorctx.devicewas the GPU display name not the torch device stringprune_experts_fusedread importance JSON's wrong key (per_layervsactivation_counts)ctx.alloy.get('results', {}).get(...)crashed when results is None — six instances acrossoutput_stages.py,alloy_to_card.py,publish_model.py, all swept withor {}forged_dir/but safetensors are atforged_dir/model/_find_domainchecked'domain' in sbut Pydantic injects None for unset Optional fieldsdefault_train_paramsreturneddomain='wikitext'(a dataset name) but the field is a registry KEY (general|code| ...)forge_model.evaluate()and train loop computed loss INCLUDING pad tokens —padding='max_length'puts ~1998 pad tokens in a 50-token wikitext sample, thenlabels=idsdoesn't mask them, so loss is dominated by pad-position garbage. Inflated baseline perplexity ~10-30x. The qwen2-5-7b-instruct-compacted incidentdefrag_live_modelslice mode without specifying — slice produces per-layer shape divergence that the singlemodel.config.num_attention_headscan't represent. Fix: usemode='pad'. The qwen2-5-7b-instruct-compacted load failurefrom_pretrainedBEFORE the forge marks the part finished. Catches the entire class of "save succeeds but reload fails" bugs at forge timebootstrap-hive-node.shfor fresh node setupTests
Models shipped + pulled
Two models forged + published + pulled in the same day, both due to the eval bugs caught by the integrity audit:
continuum-ai/granite-3-0-3b-a800m-compacted— pulled, forge eval inflated perplexity 11×, real Δ was −98.8% (model needs recovery training to be viable at this prune ratio)continuum-ai/qwen2-5-7b-instruct-compacted— pulled, save-then-reload fails (defrag slice mode shape divergence). The full bug write-up is in the rework error sidecars.The integrity check worked: we caught our own bugs before any external user could. The forged artifacts were pulled, the model card claims were withdrawn, the upstream bugs were fixed. v2 of both models lands once we re-forge with the corrected pipeline.
Standing directive context
Per
docs/PLUGIN-SPRINT.md(top of file): close the gaps and go viral. This PR closes all 4 priority gaps from the standing directive (Phi-3.5-MoE, DeepSeek-V2, GraniteMoE, eval registry migration) plus the entire factory pipeline that wasn't on the original list but is the prerequisite for actually shipping models 24/7.