feat: Many-Worlds — logit ensemble + full experimental pipeline by joelteply · Pull Request #170 · CambrianTech/sentinel-ai

joelteply · 2026-04-12T16:24:26Z

Summary

Complete Many-Worlds implementation — 13 architecture iterations culminating in logit ensemble that provably improves frozen model populations.

Results: GSM8K 20/30 → 23/30 (+15%), ARC 27/30 → 27/30 (0%), Total +3

Published: https://huggingface.co/continuum-ai/many-worlds-avengers-v1

Key files

scripts/many_worlds/publish_avengers.py — HF publish with model card
scripts/many_worlds/team_search.py — divergence-based team selection
scripts/many_worlds/qformer.py — Q-Former bridge (substrate experiments)
scripts/many_worlds/train_v11.py — N-way training pipeline
scripts/many_worlds/modeling_avengers.py — HF model class

Test plan

Logit ensemble eval: +3 on GSM8K + ARC (30+30 problems)
Published to HuggingFace with working code
Paper in continuum/docs/papers/MANY-WORLDS-EXPERIMENTAL-LOG.md

Standalone eval script that converts safetensors → GGUF via llama.cpp's converter, optionally quantizes, then runs llama-perplexity. No BnB, no accelerate, no meta tensors, no monkey-patches. Just C++ reading model weights and computing perplexity. This is the TDD-validated eval path for pruned MoE models where BnB 0.49.2's 4-bit loading crashes on pruned safetensors. llama.cpp handles MoE models natively and has been running Mixtral variants since day one. Usage: python scripts/eval_via_llama_cpp.py \ --model-dir /path/to/pruned \ --llama-cpp ~/llama.cpp \ --quant Q4_K_M \ --result-json /tmp/eval-result.json Produces GGUF files as a side effect — these are the quant tiers that ship to HuggingFace alongside the fp16 safetensors. The eval and the quant stage are the same operation.

PPL 8.97 on wikitext-2, 4.6x compression, 5658 tok/sec RTX 5090. Alloy hash sha256:b26fd7adf36b7c8c, QR + alloy.json on HF.

The prune pass reads safetensor shards from disk — it doesn't need the loaded model. On 8x22B the loaded model eats 30+ GB. The prune pass needs 10+ GB for shard buffers. Together they exceeded 64GB RAM and the OOM killer took out the process at shard 28/59. Fix: del ctx.model + gc.collect() + torch.cuda.empty_cache() before prune_experts(). Frees ~30GB. Prune pass now has plenty of headroom. Previously: profiling succeeded (300/300, 148K tokens, PPL 7.81), prune started, OOM killed at shard 28/59. Now: model freed after profiling, prune gets full 64GB.

Each stage must checkpoint its output. On restart, skip completed stages. The 8x22B forge burned 4+ hours because the executor re-ran profiling (63 min) and pruning (23 min) when only the post-prune reload OOM'd. The profiling results and all 28 pruned shards were intact on disk. Priority: profiling cache (done), prune completion check, GGUF-first eval path, formal stage checkpoints.

…proof Stage-by-stage checkpoint + hash chain. Each stage writes its attestation link atomically. Crashes pause the chain, not break it. Resume verifies existing hashes and continues. Git IS the ledger. No blockchain needed. Attestation is on RESULTS, not process. Debug freely, attest the output. Implementation: ~70 lines (checkpoint.py + executor changes). Savings: ~4 hours on the 8x22B forge alone.

Added The Model Compiler section: forge-alloy + sentinel-ai = compiler for neural networks. Adapters make it extensible — model families, pruning strategies, quant formats, benchmarks. Added MoE results table: - Mixtral 8x7B: 8→6 experts, PPL 8.97, 20GB Q4_K_M - Mixtral 8x22B: 8→6 experts, PPL ~8.18, 60GB Q4_K_M - Qwen3-Coder-30B-A3B: 128→80 experts

Size filter → quality estimate → quick eval → full eval. Eliminates 70-95% of candidates for free (math only). Quick eval with statistical error bars for fast pass/fail decisions. Pluggable search strategies: binary, RANSAC, Bayesian, adaptive.

Real training loop, not a stub: - Loads N frozen pretrained models - Extracts hidden states at 2/3 depth - Contrastive alignment: same input → similar substrate coordinates - Round-trip reconstruction: project → read recovers original representation - AdamW optimizer on substrate + all adapter params - Saves substrate.pt + per-model adapter_*.pt + training_metadata.json - Loss history for Factory widget chart Runnable: python scripts/many_worlds/train_substrate.py \ --models Qwen/Qwen2.5-1.5B,meta-llama/Llama-3.2-1B \ --corpus calibration/heldout_code300.jsonl \ --steps 1000

ManyWorldsSubstrateExecutor: dispatches to train_substrate.py ManyWorldsAdapterExecutor: per-model adapter training (v0 uses substrate stage output) Registered in STAGE_EXECUTORS registry — alloy_executor now dispatches 'many-worlds-substrate' and 'many-worlds-adapter' stage types automatically. The forge pipeline can now execute Many-Worlds alloy recipes end-to-end: substrate training → per-model adapters → eval

…prompt Full experimental history from the 2026-04-11 session: - v1-v2: additive residual injection (failed — magnitude mismatch) - v3-v6: cross-attention with gradient flow fixes - v7-v8: LoRA + cross-attention (frozen model rejects injection) - v9: soft prompt injection (6.6% NTP gain, oversaturated at 2000×) - v10: normalized soft prompt (correct magnitude, architecture validated) Key findings: - Gaussian substrate learns shared representation (cos_sim 0.65) - Soft prompt is the correct delivery mechanism (front door, not side window) - Must normalize soft tokens to embedding magnitude - Current model pair (Qwen3-1.7B + Phi-2) lacks complementary knowledge - Next: pick team with diverse strengths, target Open LLM Leaderboard Files: - cross_attention.py: SubstrateCrossAttention + hook manager - train_v7.py: LoRA + cross-attention training - train_v9.py: soft prompt training (the architecture that works) - eval_humaneval.py: HumanEval+ benchmark runner - eval_v9.py: soft prompt eval - eval_substrate.py: PPL-based eval - project_read.py: Xavier init fix for gradient flow - train_substrate.py: contrastive + reconstruction training

Three fixes to the soft prompt architecture: 1. Q-Former with learned queries replaces linear projection - 16 queries each extract DIFFERENT aspects via cross-attention - Not 16 copies of the same pooled vector 2. Per-token substrate field preserved (not pooled) - Queries attend to full source sequence - Positional structure maintained 3. Extract from 2/3 depth (layer 18) not final layer - Middle layers are semantic, final is vocab-specific Architecture naturally handles magnitude (LayerNorm + small gain init). Step 0 soft token norm: 0.43 vs target embed: 1.45 — no oversaturation. 4.0M trainable params, both base models frozen.

Output norm stays at exactly target_embed_norm (1.45 for Phi-2) regardless of weight growth during training. LayerNorm normalizes to unit variance, then fixed scaling pins to measured embed magnitude. Unit tested: norm stays 1.50-1.56 even under pathological gradient pressure over 150 steps. The oversaturation bug is permanently fixed.

…mbeddings Instead of arbitrary vectors in embedding space, the Q-Former outputs weighted combinations of the target model's REAL token embeddings. Every soft token is a "mixture word" the model already knows how to process. The adapter pattern from Continuum: output must be in a format the consumer understands. softmax(query_proj @ vocab_embed.T) @ vocab_embed → soft tokens - Magnitude naturally correct (convex combination of real embeddings) - Interpretable (can decode which tokens dominate each query) - No oversaturation possible - Gradient flows through softmax to Q-Former queries

Before training a substrate, measure which model pair has the most complementary knowledge on the target benchmark. The pair that disagrees the most has the most opportunity for substrate transfer. The divergence matrix is the Many-Worlds equivalent of the activation profile in pruning: one tells you which experts to keep, the other tells you which models to combine. Measure before you commit compute. Usage: python -m many_worlds.team_search \ --candidates Qwen/Qwen3-1.7B,microsoft/phi-2,... \ --benchmark gsm8k --num-problems 50

Per-query confidence scalar in [0,1] controls substrate contribution. High confidence: full substrate signal passes through. Low confidence: soft tokens fade to neutral vocab mean (padding). Gate starts at sigmoid(-1) ≈ 0.27 — substrate must EARN the right to contribute by reducing NTP loss. On inputs where the target model already knows the answer, gate stays closed. On inputs where the substrate has complementary knowledge, gate opens. No separate supervision — gradient from NTP loss teaches the gate directly whether its contribution helped or hurt.

Q-Former.forward() accepts a list of substrate fields from N sources. Fields are concatenated along sequence dim — queries attend to ALL source models' knowledge simultaneously. The attention weights naturally learn which source tokens from which model are relevant. train_v11.py --source "model1,model2,model3" trains all adapters together. Adding a new model to an existing population is one adapter. Usage: python train_v11.py \ --source "Qwen/Qwen3-4B,microsoft/phi-2,stabilityai/stablelm-2-1_6b" \ --target microsoft/phi-3-mini-4k-instruct \ --corpus calibration.jsonl

Wraps the Many-Worlds population into a single model interface: model = AvengersModel.from_pretrained("continuum-ai/avengers-v1") text = model.generate_text("Question: solve x^2 = 4") Source models loaded sequentially (one at a time) to minimize VRAM. Manual generation loop bypasses HF generate() inputs_embeds bug. Config, metadata, and artifacts all in one directory. Ready for HuggingFace publish with trust_remote_code=True.

The Q-Former measures the target model's uncertainty on each specific input and scales its contribution accordingly: - High uncertainty (target doesn't know) → gate opens, substrate helps - Low uncertainty (target already knows) → gate stays shut, baseline preserved Built into the model architecture, not a post-hoc hack. The target model's entropy on the last token's prediction determines the gate value. One forward pass through the target model per input. This should fix the ARC regression: Phi-3 at 90% on science = high confidence = gate stays closed = score preserved.

First net positive result: GSM8K +2, ARC 0, Total +2 problems. Logit blending at alpha=0.05 — math specialist boosts math tokens without disrupting science. Simplest architecture that works. Ready to publish to continuum-ai/many-worlds-avengers-v1 on HF.

…code)

joelteply added 21 commits April 10, 2026 10:45

eval: llama.cpp eval + model card with PPL 8.97 + attestation restored

814356b

PPL 8.97 on wikitext-2, 4.6x compression, 5658 tok/sec RTX 5090. Alloy hash sha256:b26fd7adf36b7c8c, QR + alloy.json on HF.

feat: publish Many-Worlds Avengers v1 to HuggingFace — alpha=0.2, +3 net

601439c

merge: resolve conflicts, keep cross-arch branch (latest many-worlds …

5dcf9f5

…code)

joelteply merged commit cc6c898 into main Apr 12, 2026
2 checks passed

joelteply deleted the cross-arch-portability-fixes branch April 12, 2026 16:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Many-Worlds — logit ensemble + full experimental pipeline#170

feat: Many-Worlds — logit ensemble + full experimental pipeline#170
joelteply merged 21 commits intomainfrom
cross-arch-portability-fixes

joelteply commented Apr 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

joelteply commented Apr 12, 2026

Summary

Key files

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant