Merged
Update vendor/llama.cpp to c08d28d08 (post-April 4) picking up all Gemma 4 PRs: core model support (#21309), template fixes (#21326), tokenizer fix (#21343), logit softcapping (#21390), newline split (#21406), and dedicated tool-call parser (#21418). Add ModelFormat::Gemma4 with detection, stop sequences, and chat templates. Add ModelChoice variants for E2B, E4B, and 26B-A4B with full registry wiring. Add arkavo_chat_parse FFI exposing llama.cpp's native PEG output parser for Gemma 4 tool calls. Provider tries native parser before fallback chain.

Tool bench results (Q4_K_M, Apple Silicon):
- Gemma-4-E2B (2.3B active): 8/8, 2,229ms
- Gemma-4-26B-A4B (4B active MoE): 8/8, 7,410ms
- Gemma-4-E4B (4.5B active): 1/8 — blocked on non-lazy grammar sampler

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Spec Coverage Delta
Newly Covered
Spec Coverage Report
Quality Gate
WIP Scenarios (24) — tracked via issues
Uncovered Scenarios (478)
Addresses three critical bugs from RimWorld Gemma 4 testing: A2A messages bypassing MCP tool pipeline, conversation context resetting every cycle, and fragmented event processing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
send_advisory_task confirmed self-contained (mesh_state + protocol only). BPE merge table sourced from Llama 3.1 tokenizer.json (Apache 2.0). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
12 tasks across 3 phases. Phase 1: standalone modules (agent_event, token_estimator, conversation_window). Phase 2: event loop integration with overnight test gate. Phase 3: ToolMemory cleanup. Updated spec: LlamaTokenEstimator wraps loaded model's tokenizer instead of vendoring BPE merge table. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…dingMessage Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…sationWindow Add two new Router methods for Phase 2 agent loop context budgeting:
- min_feasible_context_size(): iterates loaded models via model_registry.model_names(), returns minimum trained context size (default 4096 when no models loaded)
- any_loaded_model(): returns Arc<LlamaModel> from first loaded model for token estimation

Also re-exports LlamaModel from arkavo-llm so router can reference it without a direct arkavo-llama-cpp dependency. Both methods are feature-gated behind llama-cpp with a fallback returning 4096 for non-llama builds.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
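The context-budgeting helper can be sketched as follows. This is a minimal stand-in with a simplified `ModelInfo` registry type invented for illustration; the real method iterates `model_registry.model_names()` and is feature-gated behind llama-cpp.

```rust
// Hypothetical simplified registry entry (the real registry holds Arc<LlamaModel>).
struct ModelInfo {
    name: String,
    trained_context: usize,
}

const DEFAULT_CONTEXT: usize = 4096; // fallback when no models are loaded

// Minimum trained context size across loaded models, as described in the commit.
fn min_feasible_context_size(loaded: &[ModelInfo]) -> usize {
    loaded
        .iter()
        .map(|m| m.trained_context)
        .min()
        .unwrap_or(DEFAULT_CONTEXT)
}

fn main() {
    // No models loaded yet: fall back to 4096.
    assert_eq!(min_feasible_context_size(&[]), 4096);

    let models = vec![
        ModelInfo { name: "ministral-3b".into(), trained_context: 8192 },
        ModelInfo { name: "gemma-4-26b-a4b".into(), trained_context: 16384 },
    ];
    // Budget against the smallest loaded context window.
    assert_eq!(min_feasible_context_size(&models), 8192);
}
```

Taking the minimum rather than the maximum is the conservative choice: any prompt budgeted this way fits every loaded model.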
… history Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Gemma/Llama chat templates require alternating user/assistant roles. When error cycles push user messages without assistant responses, consecutive user messages break the template. build_messages() now merges consecutive same-role messages before returning. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
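The merge step can be sketched like this. The `Role`/`Message` types and `merge_consecutive` name are simplified stand-ins for the real arkavo-llm types, not the actual API.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Role { User, Assistant }

#[derive(Clone, Debug)]
struct Message { role: Role, content: String }

// Merge consecutive same-role messages so strict alternating-role chat
// templates (Gemma/Llama) don't reject the history after error cycles
// push several user messages with no assistant reply in between.
fn merge_consecutive(messages: Vec<Message>) -> Vec<Message> {
    let mut out: Vec<Message> = Vec::new();
    for msg in messages {
        let same_role = out.last().map(|p| p.role) == Some(msg.role);
        if same_role {
            // Append to the previous message instead of pushing a second
            // entry with the same role.
            let prev = out.last_mut().unwrap();
            prev.content.push_str("\n\n");
            prev.content.push_str(&msg.content);
        } else {
            out.push(msg);
        }
    }
    out
}

fn main() {
    let merged = merge_consecutive(vec![
        Message { role: Role::User, content: "error cycle 1".into() },
        Message { role: Role::User, content: "error cycle 2".into() },
        Message { role: Role::Assistant, content: "ok".into() },
    ]);
    assert_eq!(merged.len(), 2);
    assert_eq!(merged[0].content, "error cycle 1\n\nerror cycle 2");
}
```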
Narrow pub to pub(super) for ConversationWindow and MockEstimator. Remove ToolMemory.pending_instructions (never written to or read). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two fixes found during playtest:
- Conductor: augment last user message with learning guidance when existing_messages is provided (was computing but discarding it)
- Agent loop: pass purpose as system_prompt for classification_content hint (was None, losing domain context for the classifier)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The classifier's task.len() > 300 rule falsely classified every orchestrator cycle as complex (cycle prompts include ToolMemory output), causing 7+ minute startup from unnecessary task decomposition. Replace with a 0.8B model call via route_chat (chat_semaphore, won't block main inference). The model classifies SINGLE vs MULTI in ~50ms. Falls back to the heuristic classifier on timeout or model error. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
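The fallback shape can be sketched as below. The model call is stubbed as a closure and the `Complexity`/`classify` names are illustrative; the real path goes through route_chat under the chat semaphore.

```rust
#[derive(Debug, PartialEq)]
enum Complexity { Single, Multi }

// Stand-in for the old heuristic, kept only as a fallback. Note it does NOT
// use task length: long cycle prompts (which include ToolMemory output) were
// the false-positive source the commit removes.
fn heuristic_classify(task: &str) -> Complexity {
    if task.contains(" then ") { Complexity::Multi } else { Complexity::Single }
}

// Ask the fast model for SINGLE vs MULTI; on any error fall back to the
// heuristic rather than blocking the cycle.
fn classify(task: &str, model: impl Fn(&str) -> Result<String, ()>) -> Complexity {
    match model(task) {
        Ok(ans) if ans.trim().eq_ignore_ascii_case("MULTI") => Complexity::Multi,
        Ok(_) => Complexity::Single,
        Err(()) => heuristic_classify(task),
    }
}

fn main() {
    // Model answers directly.
    assert_eq!(classify("do x", |_| Ok("SINGLE".into())), Complexity::Single);
    // Model errors out: heuristic fallback decides.
    assert_eq!(classify("do x then y", |_| Err(())), Complexity::Multi);
}
```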
Clarify that sequential workflows (register→observe→act) are SINGLE. Default to SINGLE on error/timeout — false negatives are cheap, false positives cost 80+ seconds of decomposition overhead. Bump logging to INFO for production visibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The RPC handler was constructed before start_orchestrator_loop set the event sender, so it cloned None and never received A2A messages through the event channel. A2A messages kept spawning separate conductor calls, racing the orchestrator for the GPU. Now A2aRpcImpl holds Arc<Mutex<Option<Sender>>> (shared reference) instead of Option<Sender> (snapshot). The handler locks the mutex at call time to get the live sender. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
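The difference between the snapshot and the shared cell can be sketched with synchronous std types (the real handler is async and uses the project's own event type, so names here are simplified stand-ins):

```rust
use std::sync::mpsc::{channel, Sender};
use std::sync::{Arc, Mutex};

// The handler holds a shared cell, not a snapshot, so a sender installed
// after construction is still visible at call time.
struct A2aRpcImpl {
    event_tx: Arc<Mutex<Option<Sender<String>>>>,
}

impl A2aRpcImpl {
    fn handle_message(&self, msg: String) -> bool {
        // Lock at call time to read the *live* sender.
        match self.event_tx.lock().unwrap().as_ref() {
            Some(tx) => tx.send(msg).is_ok(),
            None => false, // orchestrator loop not started yet
        }
    }
}

fn main() {
    let shared = Arc::new(Mutex::new(None));
    let rpc = A2aRpcImpl { event_tx: Arc::clone(&shared) };

    // Before the orchestrator installs a sender, delivery fails visibly.
    assert!(!rpc.handle_message("early".into()));

    // start_orchestrator_loop installs the sender later...
    let (tx, rx) = channel();
    *shared.lock().unwrap() = Some(tx);

    // ...and the same handler instance now routes through the event channel.
    assert!(rpc.handle_message("late".into()));
    assert_eq!(rx.recv().unwrap(), "late");
}
```

With the old `Option<Sender>` field, the first assertion would describe the handler's behavior forever: it cloned `None` at construction and never saw the sender installed afterwards.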
Both stdio and HTTP MCP clients only checked JSON-RPC level errors
(response.error) and ignored the MCP-spec isError field in
CallToolResult. Tool errors like "Serialization error: missing field
AgentType" were returned as Ok({"result": "error text"}), making
the executor and ToolMemory treat them as successful calls.
Now returns Err when isError is true, so the executor records it
as a failure and the model gets error feedback for retry.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
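The fix reduces to one branch. The struct below is a toy stand-in (the real clients deserialize the full MCP CallToolResult); only the `isError` semantics are taken from the commit.

```rust
// Toy stand-in for MCP's CallToolResult; `is_error` mirrors the spec's
// `isError` field, which lives *inside* a successful JSON-RPC response.
#[derive(Debug)]
struct CallToolResult {
    is_error: bool,
    content: String,
}

// A tool-level error must surface as Err even though the JSON-RPC envelope
// (response.error) reported no error.
fn into_tool_outcome(result: CallToolResult) -> Result<String, String> {
    if result.is_error {
        // Previously returned as Ok(...), so the executor and ToolMemory
        // recorded the failed call as a success and the model never got
        // error feedback for retry.
        Err(result.content)
    } else {
        Ok(result.content)
    }
}

fn main() {
    let outcome = into_tool_outcome(CallToolResult {
        is_error: true,
        content: "Serialization error: missing field AgentType".into(),
    });
    assert!(outcome.is_err());

    let ok = into_tool_outcome(CallToolResult { is_error: false, content: "done".into() });
    assert_eq!(ok.unwrap(), "done");
}
```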
Gemma-4-E2B (2.3B active, 8/8 tool accuracy) replaces Ministral-3B as the preferred fast model for judge, synthesis, and classification tasks. Falls back to Ministral-3B if Gemma 4 is not cached. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Gemma-4-E2B, E4B, and 26B-A4B were missing from feasible_models(), so Thompson Sampling never considered them as candidates. The models loaded but only the hardcoded Ministral/Qwen variants were eligible. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Gemma-4-E4B produces non-lazy GBNF grammar that our standalone sampler can't handle, resulting in 1/8 tool calling accuracy. Thompson Sampling wastes 30+ seconds on validation retries before learning to avoid it. Exclude until PEG output parser is ready. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace format_for_prompt() with format_control_signals() -> Option<String>. ToolMemory now emits only derived signals the model can't see in raw conversation history: setup state, duplicate warnings, action variety, and error pattern escalation. Silent (None) when everything is fine. ConversationWindow carries the raw history. Control signals go through system_suffix — separate token budget, not in the cycle prompt. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Case-insensitive model name matching in from_name()
- Strip "call:" prefix from Gemma 4 curly-brace tool call format
- Planner waits for executor/judge feedback before next round

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Specialists commented out in launch script until they have proactive mesh tool usage. Commander AGENTS.md updated for Gemma 4. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
On single GPU, the judge's route_fast() call contended with the planner for Metal compute, adding 3-8s per tool result to the feedback loop. Replace with condense_tool_result() which extracts Delta sections and truncates — zero GPU time. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Orchestrator cycle prompts and notification events are always single tasks — the LLM complexity assessment (route_chat GPU call) was pure overhead. New skip_complexity parameter bypasses it for these callers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Try structural condensation first (Delta extraction, free). If it meaningfully reduces size (>50% reduction), use it. Otherwise fall back to LLM distillation for unstructured text. Feedback budget increased from 200 to 800 chars so the planner sees useful data. Works for any MCP server output, not just game-rl JSON. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
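The condense-first strategy can be sketched as below. The Delta extraction is stubbed as a line filter and the LLM distiller as a closure; both are illustrative stand-ins, not the real helpers.

```rust
const FEEDBACK_BUDGET: usize = 800; // raised from 200 chars per the commit

// Stand-in for Delta-section extraction: keep only lines mentioning "Delta".
// Costs no GPU time.
fn structural_condense(raw: &str) -> String {
    raw.lines()
        .filter(|l| l.contains("Delta"))
        .collect::<Vec<_>>()
        .join("\n")
}

// Try the free structural pass first; keep it only if it cuts the payload
// by more than half, otherwise fall back to LLM distillation for
// unstructured text. Finally clamp to the feedback budget.
fn condense_tool_result(raw: &str, llm_distill: impl Fn(&str) -> String) -> String {
    let condensed = structural_condense(raw);
    let out = if !condensed.is_empty() && condensed.len() * 2 < raw.len() {
        condensed
    } else {
        llm_distill(raw)
    };
    out.chars().take(FEEDBACK_BUDGET).collect()
}

fn main() {
    let raw = "Delta: wood +5\nfiller filler filler filler filler filler";
    // Structural pass wins: >50% reduction, no LLM call needed.
    let out = condense_tool_result(raw, |s| s.to_string());
    assert_eq!(out, "Delta: wood +5");
}
```

Because the fallback keys on reduction ratio rather than input format, the same path works for any MCP server output, not just game-rl JSON.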
The raw ModelRegistry path hung for 15+ min on 26B MoE due to Metal shader compilation. The Router path matches what `arkavo chat` uses — same model loading, context pool, and inference semaphore. Verified: Ministral-3B completes 3 scenarios in 5.4s via Router path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Round 0 (planning) keeps full budget: thinking on, 16K tokens, full schema. Round 1+ (execution) switches to execution profile: temp 0.1, thinking off, max 200 tokens. Execution mode now respects model hints from AGENTS.md instead of always falling back to the fastest local model. This fixes the 26B MoE generating 13.5K tokens on round 1 when it only needed ~200 for a tool call. Expected round 1 time: ~3-4s instead of 7min. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Registration responses carry 5KB+ ActionSpace/ObservationSpace schemas
that blow up context windows (14K tokens after Jinja expansion). The
condenser now replaces large arrays (>5 items) and objects (>5 fields,
>500 chars) with count summaries like "[30 items]" or "{6 fields}".
This fixes the GPU fault at position 16384 caused by context overflow
when the 26B model processes registerAgent results.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
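The count-summary rule can be sketched with a toy JSON value type (the real condenser works on parsed JSON; only the >5-item and >5-field thresholds and the summary strings are taken from the commit, and the >500-char object check is omitted for brevity):

```rust
// Toy JSON value type standing in for a real parsed JSON tree.
enum Val {
    Str(String),
    Arr(Vec<Val>),
    Obj(Vec<(String, Val)>),
}

// Replace large arrays/objects with count summaries so 5KB+ schema blobs
// can't blow up the context window after Jinja expansion.
fn summarize(v: &Val) -> String {
    match v {
        Val::Str(s) => s.clone(),
        Val::Arr(items) if items.len() > 5 => format!("[{} items]", items.len()),
        Val::Arr(items) => {
            let parts: Vec<String> = items.iter().map(summarize).collect();
            format!("[{}]", parts.join(", "))
        }
        Val::Obj(fields) if fields.len() > 5 => format!("{{{} fields}}", fields.len()),
        Val::Obj(fields) => {
            let parts: Vec<String> = fields
                .iter()
                .map(|(k, v)| format!("{}: {}", k, summarize(v)))
                .collect();
            format!("{{{}}}", parts.join(", "))
        }
    }
}

fn main() {
    let big = Val::Arr((0..30).map(|i| Val::Str(i.to_string())).collect());
    assert_eq!(summarize(&big), "[30 items]");

    let small = Val::Arr(vec![Val::Str("a".into()), Val::Str("b".into())]);
    assert_eq!(summarize(&small), "[a, b]");
}
```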
…flow Round 1 passes None for tool_registry so the Jinja template doesn't inject 8 tool schemas (which expand 567 content tokens to 16K+ actual tokens, causing GPU faults at position 16384). The model already has tool schemas from round 0's conversation history. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…esult roles RLM context size check now uses the actual model hint instead of defaulting to 8K, preventing false RLM activation on models with larger context windows (e.g., gemma-4-26b-a4b at 16K). Also eliminates a redundant RlmBridge instance that was created just for system prompt generation.

Schema stripping in condense_tool_result now uses a generic is_schema_shaped() heuristic that detects JSON Schema patterns (arrays of objects with type+description fields) instead of stripping any large object. Observation data (colonists, resources, alerts) now survives condensation (791 chars vs 92 chars previously), giving the planner enough context to formulate actions.

Parallel planner tool results now use Message::tool_result() with proper call_id and tool_name instead of Message::user(). Jinja templates (especially Gemma 4) render correct <|tool_response> tokens, fixing garbled model output on round 1+ caused by missing conversation context.

Commander AGENTS.md updated to use gemma-4-26b-a4b model with colony-lost reset policy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
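The schema-shape heuristic can be sketched with flat key-value pairs standing in for JSON objects (the real code inspects a parsed JSON tree; only the "array of objects each carrying type+description" rule comes from the commit):

```rust
// A JSON Schema fragment looks like an array of objects that each carry
// `type` and `description` fields; observation data does not.
fn is_schema_shaped(objects: &[Vec<(&str, &str)>]) -> bool {
    !objects.is_empty()
        && objects.iter().all(|fields| {
            let has = |k: &str| fields.iter().any(|(key, _)| *key == k);
            has("type") && has("description")
        })
}

fn main() {
    let schema = vec![
        vec![("type", "string"), ("description", "agent name")],
        vec![("type", "integer"), ("description", "count")],
    ];
    // Schema fragments get stripped during condensation.
    assert!(is_schema_shaped(&schema));

    // Observation data (colonists, resources, alerts) no longer matches,
    // so it survives condensation and reaches the planner.
    let observation = vec![vec![("colonists", "3"), ("wood", "120")]];
    assert!(!is_schema_shaped(&observation));
}
```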
Tested llama.cpp PR ggml-org/llama.cpp#21760. Cherry-picked all 5 commits onto our vendored llama.cpp.
Test results
Our fix needed alongside
Recommendation: Approve the upstream PR. The PEG parser fixes are correct and working.
The Err branch of execute_tool_calls was missing add_fast_lesson, so schema violations (e.g. missing "Type" field) were never persisted as corrective lessons. The model saw the error within one tool loop iteration but lost it when the conversation cleared between agent cycles, repeating the same mistake indefinitely. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…cdc11a Learning system:
- Add fast-path lesson for MCP transport errors (Err branch was missing add_fast_lesson, so schema violations were lost between agent cycles)
- Add route_synthesis: plain completion with largest loaded model for lesson synthesis (0.8B/3B models can't produce structured JSON reliably)
- Restructure synthesis prompt to focus on action→outcome pairs, strip read-only tool calls from episode data before synthesis
- Add is_procedural_lesson filter to reject lessons about observation sequencing — the agent loop handles that, lessons should be strategic
- Derive Default for Message and Role to reduce struct literal boilerplate

Model support:
- Add Gemma 4 31B dense model variant (LocalGemma4_31B) across router, selector, quality, tool extraction, architect executor, and UI
- Categorize 31B as XLarge tier (15-45s inference, best for background synthesis tasks, too slow for real-time agent loops)

llama.cpp:
- Update to e21cdc11a (merged PR #21760: Gemma 4 parsing edge cases)
- Update all CI workflow files to pin new commit
- Fix mtmd_decode_use_non_causal API change (now takes chunk parameter)

RimWorld agent mesh:
- Update commander AGENTS.md with intent-based spatial actions (PlaceBuildingNear, EstablishFarm, etc.) and alert→action mappings
- Fix crop name (RawPotatoes → Potato), anchor (Stockpile → MapCenter)
- Add combat response mapping for raids

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…p test
- test_feasible_models_gemini_only: GeminiPro was removed from feasible set in d422770 but test still asserted its presence
- test_min_feasible_context_size_default: gracefully skip when llama-cpp feature is not enabled (CI runs with --no-default-features)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
With GeminiPro removed from feasible set, gemini-only gives a single model (GeminiFlash) which takes the single-model shortcut, bypassing Thompson Sampling entirely. Test now enables both Gemini and Anthropic to ensure multiple feasible models. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- prefer_capable parameter unused when llama-cpp feature is off
- ui.rs match missing Gemma 4 E2B/E4B/26B/31B variants

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Clippy correctly flags public functions that dereference raw pointers without being marked unsafe. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
rand unsoundness requires a custom logger that accesses ThreadRng inside the log handler — not applicable to our usage. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
LlamaModel is only exported from arkavo_llm on non-musl targets, but any_loaded_model was gated on just llama-cpp feature without the musl exclusion. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
model_registry.get() returns Arc<LlamaModel> which is () on musl. Both the implementation and fallback cfg gates need to account for the musl case where llama-cpp feature is on but LlamaModel is not. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Newer clippy on CI (1.94) catches these:
- Default::default() → Map::default() for clarity
- raw pointer as-cast → .cast() for constness safety
- usize as i32 → i32::try_from().unwrap_or() for wrapping safety

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
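Two of those patterns side by side, as a minimal sketch (the `Map::default()` case is omitted since it depends on the crate's own map type):

```rust
fn main() {
    // usize -> i32 without silent wrapping: try_from surfaces the overflow
    // case explicitly instead of truncating via `as`.
    let n: usize = 5;
    let n32 = i32::try_from(n).unwrap_or(i32::MAX);
    assert_eq!(n32, 5);

    let huge: usize = usize::MAX;
    assert_eq!(i32::try_from(huge).unwrap_or(i32::MAX), i32::MAX);

    // Raw-pointer conversion via .cast() instead of `as`: .cast() cannot
    // accidentally change constness, only the pointee type.
    let x = 7u32;
    let p: *const u32 = &x;
    let pb: *const u8 = p.cast();
    assert!(!pb.is_null());
}
```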
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…teln, pub visibility Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
llama.cpp update to e21cdc11a added Gemma 4 and DeepSeek v3.2 parsers, pushing the release binary from 59MB to 61MB. Bump limit from 60MB to 65MB to accommodate upstream growth. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- ModelFormat::Gemma4 with detection, stop sequences, chat templates across all crates
- ModelChoice variants for Gemma-4-E2B, E4B, and 26B-A4B with full registry wiring (repo IDs, GGUF filenames, size estimates, escalation paths, detail levels)
- arkavo_chat_parse FFI exposing llama.cpp's native PEG output parser — provider tries native parser before our fallback chain

Tool Bench Results (Q4_K_M, Apple Silicon)
E4B requires non-lazy grammar sampler integration (generation_prompt prefill) which our standalone sampler doesn't support yet. Commented out from bench discovery.
Test plan
- cargo build -q compiles cleanly
- cargo clippy -- -D warnings passes on changed crates
- cargo test -p arkavo-llama-cpp — 18 tests pass (including new Gemma 4 format detection)
- cargo test -p arkavo-llm --lib — 209 tests pass
- cargo test -p arkavo-torg --lib — 14 tests pass
- arkavo tool-bench --model gemma-4-e2b — 8/8
- arkavo tool-bench --model gemma-4-26b-a4b — 8/8

🤖 Generated with Claude Code