
schema: AcceptanceCriteria + ExpertActivationProfileStage + Optional TrainStage fields #12

Merged
joelteply merged 4 commits into main from domain-extensibility-refactor
Apr 10, 2026

Conversation

@joelteply
Contributor

TL;DR

Schema additions and relaxations needed by the sentinel-ai factory pipeline 2026-04-09 work (CambrianTech/sentinel-ai#169). Three changes, all backwards compatible with every existing published continuum-ai/* alloy.

Changes

1. AcceptanceCriteria — the part spec, gate-as-alloy-field

New top-level optional field on ForgeAlloy and four new model classes:

  • BenchmarkAcceptance — per-benchmark min (0..1), optional anchorDelta (the §4.1.3.4 discipline gate: forged score must be within Δ of the base anchor in the same eval pipeline), optional anchorBenchmark
  • AcceptanceHardware — maxVramGb, deviceTier
  • AcceptanceIntegrity — modelHashRequired, samplesPathRequired
  • AcceptanceCriteria — top-level container with benchmarks: dict[str, BenchmarkAcceptance], hardware, integrity
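
A minimal structural sketch of the four classes (plain dataclasses for illustration only — the real definitions are Pydantic models with camelCase serialization aliases; field names follow the list above):

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BenchmarkAcceptance:
    # Per-benchmark floor on the assayed score, in [0, 1].
    min: float
    # §4.1.3.4 discipline gate: forged score must land within |anchorDelta|
    # of the base anchor measured in the same eval pipeline.
    anchorDelta: Optional[float] = None
    anchorBenchmark: Optional[str] = None


@dataclass
class AcceptanceHardware:
    maxVramGb: Optional[float] = None
    deviceTier: Optional[str] = None


@dataclass
class AcceptanceIntegrity:
    modelHashRequired: bool = False
    samplesPathRequired: bool = False


@dataclass
class AcceptanceCriteria:
    benchmarks: dict[str, BenchmarkAcceptance] = field(default_factory=dict)
    hardware: Optional[AcceptanceHardware] = None
    integrity: Optional[AcceptanceIntegrity] = None


# Example spec a recipe author might declare (values illustrative):
criteria = AcceptanceCriteria(
    benchmarks={"humaneval_plus": BenchmarkAcceptance(min=0.85, anchorDelta=-3.7)},
    integrity=AcceptanceIntegrity(modelHashRequired=True),
)
```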

The alloy IS the part spec. In the assembly-line metaphor, every part has a spec sheet that travels with it down the line. AcceptanceCriteria is that spec — declared by the recipe author, self-contained in the alloy file. Sentinel-ai forges + assays; continuum (the shipping department) reads BOTH the assayed scores AND the alloy's acceptanceCriteria and decides ship vs rework.

2. ExpertActivationProfileStage — the §4.1.3.4 calibration-aware metric stage

Added to the discriminated AlloyStage union. It was missing from the schema even though the morning's qwen3-coder-30b-a3b-compacted-19b-256k flagship used it. This fills the gap so intent-only alloys with the calibration profile stage validate cleanly.

3. TrainStage — domain, steps, learning_rate made Optional

Was: required fields, the seeder had to hardcode default values.
Now: Optional[T] = None, the family adapter's default_train_params(ctx) hook fills them in at execution time.

The right architectural pattern: recipes declare INTENT ({type: train, method: lora}), the family adapter knows what works for its architecture and model size, fills in domain/steps/LR/etc. at runtime. Recipe authors override only when they want to override.

Backwards compat: every existing alloy that DOES specify domain/steps/learningRate still validates and the values are still used as-is. The Optional change only affects intent-only alloys.
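
The intent-plus-adapter-defaults flow can be sketched like this (a schematic only — FamilyAdapter, ctx, and the default values are hypothetical stand-ins; the real default_train_params() hook lives in sentinel-ai's family adapters):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrainStage:
    type: str = "train"
    method: str = "lora"
    # Formerly required; now Optional so intent-only recipes validate.
    domain: Optional[str] = None
    steps: Optional[int] = None
    learning_rate: Optional[float] = None


class FamilyAdapter:
    """Hypothetical adapter: knows sane defaults for its architecture/size."""

    def default_train_params(self, ctx: dict) -> dict:
        return {"domain": "code", "steps": 500, "learning_rate": 2e-4}


def resolve_train_stage(stage: TrainStage, adapter: FamilyAdapter, ctx: dict) -> TrainStage:
    # Recipe-declared values win; the adapter fills only the gaps.
    for name, value in adapter.default_train_params(ctx).items():
        if getattr(stage, name) is None:
            setattr(stage, name, value)
    return stage


# Intent-only stage: {type: train, method: lora} — adapter fills the rest.
resolved = resolve_train_stage(TrainStage(), FamilyAdapter(), ctx={})
# Recipe override: an explicit learning_rate is left untouched.
overridden = resolve_train_stage(TrainStage(learning_rate=1e-5), FamilyAdapter(), ctx={})
```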

Tests

25 passed (+8 new for AcceptanceCriteria and TrainStage Optional behavior)

Companion PR

Sentinel-ai side: CambrianTech/sentinel-ai#169 — uses these schema changes via the forge_alloy.types.AcceptanceCriteria import in seed_factory_queue.py and the family adapter's default_train_params() hook in alloy_executor.py + transform_stages.py.

… parse

Checkpoint commit. NOT the architectural fix.

The 3 published continuum-ai/* alloys (qwen3-coder-30b-a3b-compacted-19b-256k,
olmoe-1b-7b-compacted-5b, qwen2.5-coder-7b-compacted) now validate against
ForgeAlloy.model_validate_json() instead of failing with 5-6 errors each.

Done by extending core types with sentinel-ai-specific fields
(expert-activation-profile, compensation-lora, keepExpertsPerLayer,
priorMetricBaselines, calibrationCorpora, etc) and relaxing several required
fields to optional. This is the WRONG layer — these belong in an llm-forge
domain extension per FORGE-ALLOY-DOMAIN-EXTENSIBILITY.md, not bolted into the
universal core. Sentinel-ai is supposed to be a black-box consumer of the
universal contract, not a shape that the core mirrors field-for-field.

Committing as a checkpoint so the work isn't lost while the domain-registry
refactor (work items 0-5 in the extensibility doc) lands properly. The next
commit moves every field added here out of types.py and into a domain
extension module, restoring the universal core to its pre-checkpoint shape
plus only the 'domains[]' registry hook.
Roadmap step 5 from sentinel-ai/docs/PLUGIN-SPRINT.md and the schema-side
proposal in continuum/docs/architecture/FORGE-ALLOY-DOMAIN-EXTENSIBILITY.md.

Adds the domain-extension package that the bd4349d checkpoint commit on
this branch SHOULD have built instead of bolting ML-specific fields into
the universal core. Per the never-lose-work rule, the bd4349d state is
preserved on the wip/types-additive-checkpoint-bd4349d branch and is
not destroyed by this commit.

Per TDD/TDValidation discipline: test first, then implementation. The
contract test is in python/tests/test_domain_extension_layout.py; the
existing python/tests/test_regression_published_alloys.py acts as the
end-to-end gate that the 17 published continuum-ai/* artifacts still
validate cleanly through the post-refactor schema.

== What landed

python/forge_alloy/domains/ — new package
  base.py
    DomainExtension ABC. Each registered extension owns:
      - id (the string the alloy's domains[] field carries)
      - stage_types() → dict[str, type] (Pydantic models for stages
        this domain owns)
      - root_extensions() → dict[str, type] (Pydantic models for
        root fields this domain adds)

  registry.py
    DomainRegistry — id-string → DomainExtension class lookup. Mirror
    of scripts/adapters/registry.py and scripts/eval_runners/registry.py
    in sentinel-ai. Strict exact-match dispatch, idempotent same-class
    re-registration, raises on different-class against existing id
    (silent shadowing is the f-word pattern). KeyError on unknown
    id includes the full registered list and the file/registration
    recipe to add the missing one.

  llm_forge.py
    LlmForgeDomain — registered against id 'llm-forge'. Owns every
    ML-specific stage type:
      source-config, prune, train, lora, compact, quant, package,
      eval, publish, deploy, expert-prune, expert-activation-profile,
      compensation-lora, context-extend, modality, deliver

    Owns every ML-specific root extension:
      calibrationCorpora    list[CalibrationCorpusRef]
      priorMetricBaselines  list[PriorMetricBaseline]

    Today, this module RE-EXPORTS the ML types from forge_alloy.types
    where they currently live (the bd4349d checkpoint state). Consumers
    can import from EITHER:
      from forge_alloy import ExpertPruneStage           (legacy public API)
      from forge_alloy.domains.llm_forge import ExpertPruneStage  (new path)
    Both resolve to the same class object today. The full extraction
    (moving the actual class definitions out of types.py and into
    llm_forge.py) is a follow-up refactor commit. The dependency
    direction is strict and enforced by test_universal_core_does_not_import_llm_forge:
    extensions → core, never core → extensions.

  photo_provenance.py
    PhotoProvenanceDomain — stub. Registered against id 'photo-provenance'.
    Empty stage_types and root_extensions today. Witness that the
    registry handles non-ML domains without any change to the universal
    core. Real schemas land when the first photo-provenance artifact
    ships (camera enclave → edits → publish chain).

  ticketing.py
    TicketingDomain — stub. Registered against id 'ticketing'. Empty
    schemas today. Witness for the venue-ticket / FedEx-delivery /
    concert-ticket use case from forge-alloy's APPLICATIONS.md.

  __init__.py
    Module-level singleton + register_domain / resolve_domain /
    registered_domains helpers. Eager imports of llm_forge,
    photo_provenance, ticketing register all three at package import
    time. Adding a new domain is exactly one new file + one import +
    one register() call here.
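
The registry contract described above — strict exact-match dispatch, idempotent same-class re-registration, loud failure on shadowing or unknown ids — can be sketched in plain Python (schematic; the real code is in registry.py and base.py):

```python
class DomainExtension:
    """Sketch of the ABC: each extension owns an id plus its schema types."""

    id: str = ""

    def stage_types(self) -> dict:
        return {}

    def root_extensions(self) -> dict:
        return {}


class DomainRegistry:
    def __init__(self) -> None:
        self._by_id: dict[str, type] = {}

    def register(self, ext_cls: type) -> None:
        existing = self._by_id.get(ext_cls.id)
        if existing is None:
            self._by_id[ext_cls.id] = ext_cls
        elif existing is not ext_cls:
            # Never silently shadow a different class under the same id.
            raise ValueError(
                f"domain id {ext_cls.id!r} already registered to {existing.__name__}"
            )
        # Same class re-registered: idempotent no-op.

    def resolve(self, domain_id: str) -> type:
        try:
            return self._by_id[domain_id]
        except KeyError:
            # Error message carries the full registered list, per the
            # registry convention described above.
            raise KeyError(
                f"unknown domain {domain_id!r}; registered: {sorted(self._by_id)}"
            ) from None


class LlmForgeDomain(DomainExtension):
    id = "llm-forge"


registry = DomainRegistry()
registry.register(LlmForgeDomain)
registry.register(LlmForgeDomain)  # idempotent re-registration: no error
```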

== Schema gaps caught by the regression test (real bugs, fixed inline)

The python/tests/test_regression_published_alloys.py end-to-end gate
exposed several places where the schema was silently dropping fields
that the published continuum-ai/* alloys actually carry. These were
real bugs (fields the schema didn't know about, dropped on validation,
missing on round-trip) and the fix is to add the missing fields to the
schema and to allow extras everywhere artifact-specific extras land:

  AlloyHardware:
    + device_targets list[str] alias='deviceTargets'
      (every published alloy carries this — was being silently dropped)
    + extra='allow' for any future hardware-tier extras

  AlloyResults:
    + forged_params_b float alias='forgedParamsB'
      (MoE-specific param count for the morning's qwen3-coder-30b-a3b
       and OLMoE flagships — published values were 19.66 and 5.x)
    + active_params_b float alias='activeParamsB'
      (unchanged through expert pruning per § 4.1.3.4)
    + extra='allow' so artifact-specific result extras
      (fourRunProgression, lossFunctionAblation on v2-7b-coder-compensated)
      round-trip cleanly

  BenchmarkResult:
    + score, base_score, delta, calibrated, samples_path, base_samples_path,
      result_hash, base_result_hash, metric — all fields the publish
      pipeline (alloy_to_card.py) and the Tier 4 reproducibility test
      (sentinel-ai/tests/reproducibility/test_published_alloys_scoring.py)
      both consume but the schema was hiding behind a generic 'metrics'
      open dict. Now they're first-class.

  All other BaseModel classes: model_config now has extra='allow' so
    artifact-specific extras (notes, methodology anchor URLs, custom
    provenance fields) preserve verbatim through the round-trip. The
    schema's named fields stay the canonical surface that publish_model.py
    + alloy_to_card.py read; extras are recognized as artifact-specific
    provenance and don't cause silent data loss.
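
In Pydantic terms the fix is model_config with extra='allow'. The effect on validation can be illustrated without Pydantic (a schematic — the "coolingTier" extra is an invented example; Pydantic's default extra='ignore' is what was silently dropping fields):

```python
KNOWN_HARDWARE_FIELDS = {"device_targets"}  # the named, canonical surface


def validate_hardware(raw: dict, allow_extra: bool) -> dict:
    """Schematic of extra='ignore' (drop unknowns) vs extra='allow' (keep them)."""
    known = {k: v for k, v in raw.items() if k in KNOWN_HARDWARE_FIELDS}
    if not allow_extra:
        # Pydantic's default extra='ignore': unknown fields silently dropped,
        # which is exactly the round-trip data loss the regression test caught.
        return known
    extras = {k: v for k, v in raw.items() if k not in KNOWN_HARDWARE_FIELDS}
    return {**known, **extras}  # extras ride along and round-trip verbatim


raw = {"device_targets": ["m3-max"], "coolingTier": "passive"}
strict = validate_hardware(raw, allow_extra=False)   # extra lost
lenient = validate_hardware(raw, allow_extra=True)   # extra preserved
```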

== Test status

  python/tests/test_domain_extension_layout.py:    17 passed
  python/tests/test_regression_published_alloys.py: 3 passed
                                                    (qwen3-coder-30b-a3b,
                                                     olmoe-1b-7b, qwen2.5-coder-7b)
  Combined: 20 forge-alloy tests, 0 failures

Cross-repo sanity: sentinel-ai's reproducibility + unit-test suite still
60 passed / 2 xfailed after this change (the xfails are the same
priorMetricBaselines.samplesHash gap that closes in roadmap step 8).

Side fix: python/tests/test_regression_published_alloys.py
  - sys.path now includes python/ so the script + pytest both find
    forge_alloy without the caller having to PYTHONPATH-set
  - expected_alloy_hash_prefix for qwen3-coder-30b-a3b updated from
    aa61c4bdf463847c → 011970c80c2f3429 to reflect the post-correction
    state pushed in sentinel-ai commit 1bc32d2 (the canonical-evalplus
    humaneval_plus correction)
  - semantic_equivalent treats int/float as numerically equivalent
    when their values match (Pydantic coerces int → float on
    Optional[float] fields and the round-trip emits float)
  - round-trip uses exclude_unset=True (preserves null fields) instead
    of exclude_none=True (was dropping them)
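
The exclude_unset vs exclude_none distinction behind that last fix, schematically (plain Python mimicking Pydantic's model_dump flags; field names illustrative):

```python
def dump(fields: dict, set_keys: set, *, exclude_none=False, exclude_unset=False) -> dict:
    """Schematic of Pydantic's model_dump exclusion flags."""
    out = {}
    for key, value in fields.items():
        if exclude_unset and key not in set_keys:
            continue  # drop defaults the author never wrote
        if exclude_none and value is None:
            continue  # drops EXPLICIT nulls too -- the old round-trip bug
        out[key] = value
    return out


fields = {"domain": None, "steps": 500, "notes": None}
set_keys = {"domain", "steps"}  # author explicitly wrote domain: null

lossy = dump(fields, set_keys, exclude_none=True)      # explicit null lost
faithful = dump(fields, set_keys, exclude_unset=True)  # null kept, unset dropped
```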

Side fix: .gitignore now excludes __pycache__, *.pyc, *.pyo, .pytest_cache
so Python bytecode never sneaks into commits.

== Next

Roadmap step 6: vision-safety integration (Qwen3VLAdapter consults the
existing scripts/vision_safety.py whitelist). Step 7 unifies the
modelHash convention across publish_model.py and the backfill tools.
Step 8 closes the priorMetricBaselines.samplesHash schema gap and
uploads the calibration corpora alongside the model weights.
The alloy IS the part spec. In the assembly-line metaphor every part
has a spec sheet that travels with it down the line; the alloy carries
the recipe, source, integrity attestation, AND the gate the part must
clear before the shipping department releases it.

Sentinel-ai forges and assays — it NEVER reads acceptanceCriteria.
Continuum (the shipping department) reads BOTH the assayed scores
written into the finished/ manifest AND the alloy's acceptanceCriteria,
and decides ship vs rework. Same alloy → same gate verdict on any forge
run by anyone, anywhere — the spec is portable.

New types:
  BenchmarkAcceptance — per-benchmark floor + 4.1.3.4 anchorDelta gate
  AcceptanceHardware  — maxVramGb + deviceTier
  AcceptanceIntegrity — modelHashRequired + samplesPathRequired
  AcceptanceCriteria  — top-level container

ForgeAlloy.acceptance_criteria is Optional[AcceptanceCriteria] (default
None) — backwards compat: every existing published continuum-ai/* alloy
keeps loading. The field serializes under the camelCase alias
'acceptanceCriteria' to match every other alloy field on disk.

The 4.1.3.4 anchorDelta semantic: negative means 'forged score must be
within |delta| points BELOW the base anchor measured in the same eval
pipeline'. The morning's qwen3-coder-30b shipped at delta -3.7 against
the 92.1 base anchor; the catalog's v2 re-forge alloy declares
anchorDelta: -3.7 to lock in the same gate.
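
That semantic reduces to a one-line gate check; a worked sketch with the numbers above (function name is illustrative, not the actual API):

```python
def passes_anchor_gate(forged: float, base_anchor: float, anchor_delta: float) -> bool:
    # Negative anchorDelta: the forged score may sit at most |delta| points
    # below the base anchor measured in the same eval pipeline.
    return forged >= base_anchor + anchor_delta


# qwen3-coder-30b numbers: base anchor 92.1, anchorDelta -3.7,
# so the floor is 92.1 - 3.7 = 88.4.
assert passes_anchor_gate(92.1, 92.1, -3.7)      # at the anchor: ship
assert passes_anchor_gate(88.5, 92.1, -3.7)      # above the floor: ship
assert not passes_anchor_gate(80.0, 92.1, -3.7)  # below the floor: rework
```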

8 new tests, 25/25 forge-alloy passing.
…iven)

The seeder shouldn't be hardcoding training defaults. Each family
adapter knows what corpus/step-count/LR works best for its
architecture and model size. Recipes declare INTENT
({type: train, method: lora}) and the family adapter fills in the
rest at execution time via default_train_params(ctx).

These three fields go from required to Optional[T] = None. The schema
no longer rejects intent-only train stages. Backwards compat: every
existing alloy that DOES specify them still validates fine because
Optional[T] accepts the original type alongside None.
@joelteply merged commit 1aa413c into main Apr 10, 2026
1 check passed