
schema: AcceptanceCriteria + ExpertActivationProfileStage + Optional TrainStage fields #12

Merged
joelteply merged 4 commits into main from domain-extensibility-refactor
Apr 10, 2026

Conversation

@joelteply
Contributor

TL;DR

Schema additions and relaxations needed by the sentinel-ai factory pipeline 2026-04-09 work (CambrianTech/sentinel-ai#169). Three changes, all backwards compatible with every existing published continuum-ai/* alloy.

Changes

1. AcceptanceCriteria — the part spec, gate-as-alloy-field

New top-level optional field on ForgeAlloy and four new model classes:

  • BenchmarkAcceptance — per-benchmark min (0..1), optional anchorDelta (the §4.1.3.4 discipline gate: forged score must be within Δ of the base anchor in the same eval pipeline), optional anchorBenchmark
  • AcceptanceHardware — maxVramGb, deviceTier
  • AcceptanceIntegrity — modelHashRequired, samplesPathRequired
  • AcceptanceCriteria — top-level container with benchmarks: dict[str, BenchmarkAcceptance], hardware, integrity
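
A minimal structural sketch of the four classes (plain dataclasses for illustration only — the real definitions are Pydantic models with camelCase serialization aliases; field names follow the list above):

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BenchmarkAcceptance:
    # Per-benchmark floor on the assayed score, in [0, 1].
    min: float
    # §4.1.3.4 discipline gate: forged score must land within |anchorDelta|
    # of the base anchor measured in the same eval pipeline.
    anchorDelta: Optional[float] = None
    anchorBenchmark: Optional[str] = None


@dataclass
class AcceptanceHardware:
    maxVramGb: Optional[float] = None
    deviceTier: Optional[str] = None


@dataclass
class AcceptanceIntegrity:
    modelHashRequired: bool = False
    samplesPathRequired: bool = False


@dataclass
class AcceptanceCriteria:
    benchmarks: dict[str, BenchmarkAcceptance] = field(default_factory=dict)
    hardware: Optional[AcceptanceHardware] = None
    integrity: Optional[AcceptanceIntegrity] = None


# Example spec a recipe author might declare (values illustrative):
criteria = AcceptanceCriteria(
    benchmarks={"humaneval_plus": BenchmarkAcceptance(min=0.85, anchorDelta=-3.7)},
    integrity=AcceptanceIntegrity(modelHashRequired=True),
)
```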

The alloy IS the part spec. In the assembly-line metaphor, every part has a spec sheet that travels with it down the line. AcceptanceCriteria is that spec — declared by the recipe author, self-contained in the alloy file. Sentinel-ai forges + assays; continuum (the shipping department) reads BOTH the assayed scores AND the alloy's acceptanceCriteria and decides ship vs rework.

2. ExpertActivationProfileStage — the §4.1.3.4 calibration-aware metric stage

Added to the discriminated AlloyStage union. It was missing from the schema even though the morning's qwen3-coder-30b-a3b-compacted-19b-256k flagship used it. This fills the gap so intent-only alloys with the calibration profile stage validate cleanly.

3. TrainStage — domain, steps, learning_rate made Optional

Was: required fields, the seeder had to hardcode default values.
Now: Optional[T] = None, the family adapter's default_train_params(ctx) hook fills them in at execution time.

The right architectural pattern: recipes declare INTENT ({type: train, method: lora}), the family adapter knows what works for its architecture and model size, fills in domain/steps/LR/etc. at runtime. Recipe authors override only when they want to override.

Backwards compat: every existing alloy that DOES specify domain/steps/learningRate still validates and the values are still used as-is. The Optional change only affects intent-only alloys.
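
The intent-plus-adapter-defaults flow can be sketched like this (a schematic only — FamilyAdapter, ctx, and the default values are hypothetical stand-ins; the real default_train_params() hook lives in sentinel-ai's family adapters):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrainStage:
    type: str = "train"
    method: str = "lora"
    # Formerly required; now Optional so intent-only recipes validate.
    domain: Optional[str] = None
    steps: Optional[int] = None
    learning_rate: Optional[float] = None


class FamilyAdapter:
    """Hypothetical adapter: knows sane defaults for its architecture/size."""

    def default_train_params(self, ctx: dict) -> dict:
        return {"domain": "code", "steps": 500, "learning_rate": 2e-4}


def resolve_train_stage(stage: TrainStage, adapter: FamilyAdapter, ctx: dict) -> TrainStage:
    # Recipe-declared values win; the adapter fills only the gaps.
    for name, value in adapter.default_train_params(ctx).items():
        if getattr(stage, name) is None:
            setattr(stage, name, value)
    return stage


# Intent-only stage: {type: train, method: lora} — adapter fills the rest.
resolved = resolve_train_stage(TrainStage(), FamilyAdapter(), ctx={})
# Recipe override: an explicit learning_rate is left untouched.
overridden = resolve_train_stage(TrainStage(learning_rate=1e-5), FamilyAdapter(), ctx={})
```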

Tests

25 passed (+8 new for AcceptanceCriteria and TrainStage Optional behavior)

Companion PR

Sentinel-ai side: CambrianTech/sentinel-ai#169 — uses these schema changes via the forge_alloy.types.AcceptanceCriteria import in seed_factory_queue.py and the family adapter's default_train_params() hook in alloy_executor.py + transform_stages.py.

… parse

Checkpoint commit. NOT the architectural fix.

The 3 published continuum-ai/* alloys (qwen3-coder-30b-a3b-compacted-19b-256k,
olmoe-1b-7b-compacted-5b, qwen2.5-coder-7b-compacted) now validate against
ForgeAlloy.model_validate_json() instead of failing with 5-6 errors each.

Done by extending core types with sentinel-ai-specific fields
(expert-activation-profile, compensation-lora, keepExpertsPerLayer,
priorMetricBaselines, calibrationCorpora, etc) and relaxing several required
fields to optional. This is the WRONG layer — these belong in an llm-forge
domain extension per FORGE-ALLOY-DOMAIN-EXTENSIBILITY.md, not bolted into the
universal core. Sentinel-ai is supposed to be a black-box consumer of the
universal contract, not a shape that the core mirrors field-for-field.

Committing as a checkpoint so the work isn't lost while the domain-registry
refactor (work items 0-5 in the extensibility doc) lands properly. The next
commit moves every field added here out of types.py and into a domain
extension module, restoring the universal core to its pre-checkpoint shape
plus only the 'domains[]' registry hook.
Roadmap step 5 from sentinel-ai/docs/PLUGIN-SPRINT.md and the schema-side
proposal in continuum/docs/architecture/FORGE-ALLOY-DOMAIN-EXTENSIBILITY.md.

Adds the domain-extension package that the bd4349d checkpoint commit on
this branch SHOULD have built instead of bolting ML-specific fields into
the universal core. Per the never-lose-work rule, the bd4349d state is
preserved on the wip/types-additive-checkpoint-bd4349d branch and is
not destroyed by this commit.

Per TDD/TDValidation discipline: test first, then implementation. The
contract test is in python/tests/test_domain_extension_layout.py; the
existing python/tests/test_regression_published_alloys.py acts as the
end-to-end gate that the 17 published continuum-ai/* artifacts still
validate cleanly through the post-refactor schema.

== What landed

python/forge_alloy/domains/ — new package
  base.py
    DomainExtension ABC. Each registered extension owns:
      - id (the string the alloy's domains[] field carries)
      - stage_types() → dict[str, type] (Pydantic models for stages
        this domain owns)
      - root_extensions() → dict[str, type] (Pydantic models for
        root fields this domain adds)

  registry.py
    DomainRegistry — id-string → DomainExtension class lookup. Mirror
    of scripts/adapters/registry.py and scripts/eval_runners/registry.py
    in sentinel-ai. Strict exact-match dispatch, idempotent same-class
    re-registration, raises on different-class against existing id
    (silent shadowing is the f-word pattern). KeyError on unknown
    id includes the full registered list and the file/registration
    recipe to add the missing one.

  llm_forge.py
    LlmForgeDomain — registered against id 'llm-forge'. Owns every
    ML-specific stage type:
      source-config, prune, train, lora, compact, quant, package,
      eval, publish, deploy, expert-prune, expert-activation-profile,
      compensation-lora, context-extend, modality, deliver

    Owns every ML-specific root extension:
      calibrationCorpora    list[CalibrationCorpusRef]
      priorMetricBaselines  list[PriorMetricBaseline]

    Today, this module RE-EXPORTS the ML types from forge_alloy.types
    where they currently live (the bd4349d checkpoint state). Consumers
    can import from EITHER:
      from forge_alloy import ExpertPruneStage           (legacy public API)
      from forge_alloy.domains.llm_forge import ExpertPruneStage  (new path)
    Both resolve to the same class object today. The full extraction
    (moving the actual class definitions out of types.py and into
    llm_forge.py) is a follow-up refactor commit. The dependency
    direction is strict and enforced by test_universal_core_does_not_import_llm_forge:
    extensions → core, never core → extensions.

  photo_provenance.py
    PhotoProvenanceDomain — stub. Registered against id 'photo-provenance'.
    Empty stage_types and root_extensions today. Witness that the
    registry handles non-ML domains without any change to the universal
    core. Real schemas land when the first photo-provenance artifact
    ships (camera enclave → edits → publish chain).

  ticketing.py
    TicketingDomain — stub. Registered against id 'ticketing'. Empty
    schemas today. Witness for the venue-ticket / FedEx-delivery /
    concert-ticket use case from forge-alloy's APPLICATIONS.md.

  __init__.py
    Module-level singleton + register_domain / resolve_domain /
    registered_domains helpers. Eager imports of llm_forge,
    photo_provenance, ticketing register all three at package import
    time. Adding a new domain is exactly one new file + one import +
    one register() call here.
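
The registry contract described above — strict exact-match dispatch, idempotent same-class re-registration, loud failure on shadowing or unknown ids — can be sketched in plain Python (schematic; the real code is in registry.py and base.py):

```python
class DomainExtension:
    """Sketch of the ABC: each extension owns an id plus its schema types."""

    id: str = ""

    def stage_types(self) -> dict:
        return {}

    def root_extensions(self) -> dict:
        return {}


class DomainRegistry:
    def __init__(self) -> None:
        self._by_id: dict[str, type] = {}

    def register(self, ext_cls: type) -> None:
        existing = self._by_id.get(ext_cls.id)
        if existing is None:
            self._by_id[ext_cls.id] = ext_cls
        elif existing is not ext_cls:
            # Never silently shadow a different class under the same id.
            raise ValueError(
                f"domain id {ext_cls.id!r} already registered to {existing.__name__}"
            )
        # Same class re-registered: idempotent no-op.

    def resolve(self, domain_id: str) -> type:
        try:
            return self._by_id[domain_id]
        except KeyError:
            # Error message carries the full registered list, per the
            # registry convention described above.
            raise KeyError(
                f"unknown domain {domain_id!r}; registered: {sorted(self._by_id)}"
            ) from None


class LlmForgeDomain(DomainExtension):
    id = "llm-forge"


registry = DomainRegistry()
registry.register(LlmForgeDomain)
registry.register(LlmForgeDomain)  # idempotent re-registration: no error
```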

== Schema gaps caught by the regression test (real bugs, fixed inline)

The python/tests/test_regression_published_alloys.py end-to-end gate
exposed several places where the schema was silently dropping fields
that the published continuum-ai/* alloys actually carry. These were
real bugs (fields the schema didn't know about, dropped on validation,
missing on round-trip) and the fix is to add the missing fields to the
schema and to allow extras everywhere artifact-specific extras land:

  AlloyHardware:
    + device_targets list[str] alias='deviceTargets'
      (every published alloy carries this — was being silently dropped)
    + extra='allow' for any future hardware-tier extras

  AlloyResults:
    + forged_params_b float alias='forgedParamsB'
      (MoE-specific param count for the morning's qwen3-coder-30b-a3b
       and OLMoE flagships — published values were 19.66 and 5.x)
    + active_params_b float alias='activeParamsB'
      (unchanged through expert pruning per § 4.1.3.4)
    + extra='allow' so artifact-specific result extras
      (fourRunProgression, lossFunctionAblation on v2-7b-coder-compensated)
      round-trip cleanly

  BenchmarkResult:
    + score, base_score, delta, calibrated, samples_path, base_samples_path,
      result_hash, base_result_hash, metric — all fields the publish
      pipeline (alloy_to_card.py) and the Tier 4 reproducibility test
      (sentinel-ai/tests/reproducibility/test_published_alloys_scoring.py)
      both consume but the schema was hiding behind a generic 'metrics'
      open dict. Now they're first-class.

  All other BaseModel classes: model_config now has extra='allow' so
    artifact-specific extras (notes, methodology anchor URLs, custom
    provenance fields) preserve verbatim through the round-trip. The
    schema's named fields stay the canonical surface that publish_model.py
    + alloy_to_card.py read; extras are recognized as artifact-specific
    provenance and don't cause silent data loss.
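
In Pydantic terms the fix is model_config with extra='allow'. The effect on validation can be illustrated without Pydantic (a schematic — the "coolingTier" extra is an invented example; Pydantic's default extra='ignore' is what was silently dropping fields):

```python
KNOWN_HARDWARE_FIELDS = {"device_targets"}  # the named, canonical surface


def validate_hardware(raw: dict, allow_extra: bool) -> dict:
    """Schematic of extra='ignore' (drop unknowns) vs extra='allow' (keep them)."""
    known = {k: v for k, v in raw.items() if k in KNOWN_HARDWARE_FIELDS}
    if not allow_extra:
        # Pydantic's default extra='ignore': unknown fields silently dropped,
        # which is exactly the round-trip data loss the regression test caught.
        return known
    extras = {k: v for k, v in raw.items() if k not in KNOWN_HARDWARE_FIELDS}
    return {**known, **extras}  # extras ride along and round-trip verbatim


raw = {"device_targets": ["m3-max"], "coolingTier": "passive"}
strict = validate_hardware(raw, allow_extra=False)   # extra lost
lenient = validate_hardware(raw, allow_extra=True)   # extra preserved
```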

== Test status

  python/tests/test_domain_extension_layout.py:    17 passed
  python/tests/test_regression_published_alloys.py: 3 passed
                                                    (qwen3-coder-30b-a3b,
                                                     olmoe-1b-7b, qwen2.5-coder-7b)
  Combined: 20 forge-alloy tests, 0 failures

Cross-repo sanity: sentinel-ai's reproducibility + unit-test suite still
60 passed / 2 xfailed after this change (the xfails are the same
priorMetricBaselines.samplesHash gap that closes in roadmap step 8).

Side fix: python/tests/test_regression_published_alloys.py
  - sys.path now includes python/ so the script + pytest both find
    forge_alloy without the caller having to PYTHONPATH-set
  - expected_alloy_hash_prefix for qwen3-coder-30b-a3b updated from
    aa61c4bdf463847c → 011970c80c2f3429 to reflect the post-correction
    state pushed in sentinel-ai commit 1bc32d2 (the canonical-evalplus
    humaneval_plus correction)
  - semantic_equivalent treats int/float as numerically equivalent
    when their values match (Pydantic coerces int → float on
    Optional[float] fields and the round-trip emits float)
  - round-trip uses exclude_unset=True (preserves null fields) instead
    of exclude_none=True (was dropping them)
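
The exclude_unset vs exclude_none distinction behind that last fix, schematically (plain Python mimicking Pydantic's model_dump flags; field names illustrative):

```python
def dump(fields: dict, set_keys: set, *, exclude_none=False, exclude_unset=False) -> dict:
    """Schematic of Pydantic's model_dump exclusion flags."""
    out = {}
    for key, value in fields.items():
        if exclude_unset and key not in set_keys:
            continue  # drop defaults the author never wrote
        if exclude_none and value is None:
            continue  # drops EXPLICIT nulls too -- the old round-trip bug
        out[key] = value
    return out


fields = {"domain": None, "steps": 500, "notes": None}
set_keys = {"domain", "steps"}  # author explicitly wrote domain: null

lossy = dump(fields, set_keys, exclude_none=True)      # explicit null lost
faithful = dump(fields, set_keys, exclude_unset=True)  # null kept, unset dropped
```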

Side fix: .gitignore now excludes __pycache__, *.pyc, *.pyo, .pytest_cache
so Python bytecode never sneaks into commits.

== Next

Roadmap step 6: vision-safety integration (Qwen3VLAdapter consults the
existing scripts/vision_safety.py whitelist). Step 7 unifies the
modelHash convention across publish_model.py and the backfill tools.
Step 8 closes the priorMetricBaselines.samplesHash schema gap and
uploads the calibration corpora alongside the model weights.
The alloy IS the part spec. In the assembly-line metaphor every part
has a spec sheet that travels with it down the line; the alloy carries
the recipe, source, integrity attestation, AND the gate the part must
clear before the shipping department releases it.

Sentinel-ai forges and assays — it NEVER reads acceptanceCriteria.
Continuum (the shipping department) reads BOTH the assayed scores
written into the finished/ manifest AND the alloy's acceptanceCriteria,
and decides ship vs rework. Same alloy → same gate verdict on any forge
run by anyone, anywhere — the spec is portable.

New types:
  BenchmarkAcceptance — per-benchmark floor + 4.1.3.4 anchorDelta gate
  AcceptanceHardware  — maxVramGb + deviceTier
  AcceptanceIntegrity — modelHashRequired + samplesPathRequired
  AcceptanceCriteria  — top-level container

ForgeAlloy.acceptance_criteria is Optional[AcceptanceCriteria] (default
None) — backwards compat: every existing published continuum-ai/* alloy
keeps loading. The field serializes under the camelCase alias
'acceptanceCriteria' to match every other alloy field on disk.

The 4.1.3.4 anchorDelta semantic: negative means 'forged score must be
within |delta| points BELOW the base anchor measured in the same eval
pipeline'. The morning's qwen3-coder-30b shipped at delta -3.7 against
the 92.1 base anchor; the catalog's v2 re-forge alloy declares
anchorDelta: -3.7 to lock in the same gate.
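
That semantic reduces to a one-line gate check; a worked sketch with the numbers above (function name is illustrative, not the actual API):

```python
def passes_anchor_gate(forged: float, base_anchor: float, anchor_delta: float) -> bool:
    # Negative anchorDelta: the forged score may sit at most |delta| points
    # below the base anchor measured in the same eval pipeline.
    return forged >= base_anchor + anchor_delta


# qwen3-coder-30b numbers: base anchor 92.1, anchorDelta -3.7,
# so the floor is 92.1 - 3.7 = 88.4.
assert passes_anchor_gate(92.1, 92.1, -3.7)      # at the anchor: ship
assert passes_anchor_gate(88.5, 92.1, -3.7)      # above the floor: ship
assert not passes_anchor_gate(80.0, 92.1, -3.7)  # below the floor: rework
```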

8 new tests, 25/25 forge-alloy passing.
…iven)

The seeder shouldn't be hardcoding training defaults. Each family
adapter knows what corpus/step-count/LR works best for its
architecture and model size. Recipes declare INTENT
({type: train, method: lora}) and the family adapter fills in the
rest at execution time via default_train_params(ctx).

These three fields go from required to Optional[T] = None. The schema
no longer rejects intent-only train stages. Backwards compat: every
existing alloy that DOES specify them still validates fine because
Optional[T] accepts the original type alongside None.
@joelteply merged commit 1aa413c into main Apr 10, 2026
1 check passed