
feat: integrate router policy v1.0.0 with tests and CI (#341) #342

Open
chf3198 wants to merge 8 commits into main from feat/341-router-policy-integration

Conversation

chf3198 (Owner) commented Apr 21, 2026

This PR integrates the router policy spec with comprehensive testing and CI validation.

Changes

  • Router Policy v1.0.0: Added versioned policy spec to router.agent.md with keyword-based lane classification and escalation rules
  • Unit Tests: Created tests/router-policy.test.js with 5 test cases covering all lanes
  • CI Integration: Added test job to .github/workflows/lint.yml to run router policy tests on PR/push
  • Dashboard Improvements: Enhanced GitHub panel with server-side retry/backoff and client-side error handling
  • Version Bump: Updated to v3.1.0 with changelog entry

Validation

Closes #341
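The keyword-based lane classification described above could be sketched roughly as follows. This is a hedged illustration only: the actual policy and keyword lists live in `router.agent.md` and `tests/router-policy.test.js`; `LANES`, `classifyLane`, and the keywords here are invented for the example.

```javascript
// Illustrative shape of keyword-based lane classification with a
// default/escalation fallback. Names and keywords are hypothetical,
// not the repo's actual policy spec.
const LANES = {
  premium: ['architecture', 'security review', 'refactor'],
  standard: ['bug', 'test', 'docs'],
  fast: ['typo', 'rename', 'format'],
};

function classifyLane(prompt) {
  const text = prompt.toLowerCase();
  // First lane whose keyword appears in the prompt wins.
  for (const [lane, keywords] of Object.entries(LANES)) {
    if (keywords.some(k => text.includes(k))) return lane;
  }
  return 'standard'; // fallback when no keyword matches
}
```

A test suite over a function of this shape is what the five cases in `tests/router-policy.test.js` presumably exercise, one per lane plus escalation paths.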

Curtis Franks added 7 commits April 20, 2026 20:25
- Merge empirical results INTO Dynamic Tracking Table (single unified view)
  - Add Empirical Composite column to all rows
  - Expand 6 aggregate free-tier rows → 17 individual per-model rows
  - Copilot Pro rows retain full analytical rubric; free-tier rows show empirical scores
  - Remove disconnected Section B (EMPIRICAL-RESULTS-START block)
- Fix OpenClaw endpoint: port 4000 (LiteLLM proxy), OPENCLAW_DEVICE_PASSWORD auth
  - Correct model names: ollama/mistral, ollama/qwen2.5:7b-instruct
  - OpenClaw rows marked fleet offline (LiteLLM service not listening on port 4000)
- Split providers file: fleet providers → ai-matrix-providers-fleet.js (lint compliance)
- Update methodology section to document dual-rubric approach

Closes #339
…wup)

- Fix provider config: direct Ollama port 11434 (no LiteLLM proxy), stream:true
  for fleet models, max_tokens:300 cap, auth: Bearer ollama
- Add phi3:mini as third fleet provider (3.8B, fastest at ~33s/300tok)
- Fix updater: AbortController timeout (360s fleet / 45s cloud), SSE streaming
  reassembly for fleet responses
- Live empirical scores:
  - openclaw_mistral: 5.8 composite (slow ~105s/100tok, strong accuracy)
  - openclaw_phi3:   6.8 composite (33s/300tok, best speed)
  - openclaw_qwen:   8.0 composite (best quality, ~7B on CPU)
- Update matrix: replace fleet-offline markers with live scores
- Fix inventory/services.json: correct OpenClaw entry (Ollama direct, no LiteLLM)
updateMatrix() now strips any EMPIRICAL-RESULTS-START block and updates
the Empirical column directly in the Dynamic Tracking Table rows.
Section B is permanently retired — the unified table is the source of truth.
Removed Prop Score, Variability, and Emergent Property columns which were
only meaningful for Copilot Pro rows. All models now scored on identical
columns: Arch | Sec | UX/UI | Read. | Empirical | Best Use Case.
Rows sorted by empirical score (high to low) within each tier.
- Add versioned router policy spec to router.agent.md
- Create router-policy.test.js with 5 test cases
- Add test:router script to package.json
- Add test job to CI workflow for router policy validation
- Improve dashboard GitHub panel with retry/backoff and error UI
- Update ticket statuses in event log
Copilot AI review requested due to automatic review settings April 21, 2026 04:05
- Close #98: Added research/copilot-chat-hooks-research.md with adapter designs
- Close #118: Confirmed model in research/ticket-status-model.md and prototype in scripts/
- Updated ticket statuses in event log

chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0c09525307


// Match rows containing the label text, update the empirical column (second-to-last cell before Best Use Case)
const escaped = meta.label.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
content = content.replace(
new RegExp(`(\\|[^|]*${escaped}[^|]*\\|(?:[^|]*\\|){7})([^|]*)(\\|[^|]*\\|)`, 'm'),


P1: Fix row-matching regex so empirical scores are updated

updateMatrix never updates the matrix because the replacement regex does not match the current table shape in model-compare/design-analysis/LLM-EVALUATION-MATRIX.md (8 columns). The pattern at this line consumes too many |...| groups before trying to replace the empirical cell, so content.replace(...) is a no-op for every provider. Running ai-matrix-build-final.js after new eval runs will therefore leave stale empirical values in the matrix.

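A more robust way to target the Empirical cell than counting `|` groups in a single regex is to split each table row on `|` and index from the end. A minimal sketch, with the caveat that `updateEmpiricalCell` is a hypothetical helper, not the repo's `ai-matrix-build-final.js` code, and it assumes rows have leading and trailing pipes with the Empirical column second-to-last before Best Use Case:

```javascript
// Update the second-to-last real column (Empirical) of any markdown
// table row containing `label`. Splitting on '|' avoids the fragile
// "consume N groups" regex that the review flags as a no-op.
function updateEmpiricalCell(content, label, score) {
  return content.split('\n').map(line => {
    if (!line.includes(label) || !line.trim().startsWith('|')) return line;
    const cells = line.split('|'); // edge pipes yield '' at both ends
    if (cells.length < 4) return line;
    cells[cells.length - 3] = ` ${score} `; // second-to-last real column
    return cells.join('|');
  }).join('\n');
}
```

Indexing from the end also survives adding or removing leading columns, which a fixed `(?:[^|]*\|){7}` count does not.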

openrouter_qwen3coder: { label: 'OpenRouter — qwen/qwen3-coder:free', cost: '0x / free-cloud', tier: 'OpenRouter' },
openrouter_nemotron: { label: 'OpenRouter — nvidia/nemotron-super-120b:free', cost: '0x / free-cloud', tier: 'OpenRouter' },
openrouter_llama70b: { label: 'OpenRouter — llama-3.3-70b-instruct:free', cost: '0x / free-cloud', tier: 'OpenRouter' },
openrouter_hermes405b: { label: 'OpenRouter — hermes-3-llama-3.1-405b:free', cost: '0x / free-cloud', tier: 'OpenRouter' },


P2: Align provider label with matrix row for hermes key

The label for openrouter_hermes405b does not correspond to any row in the matrix (and does not match the model configured for that key in scripts/ai-matrix-providers.js). Because updateMatrix matches rows by exact label text, this provider's empirical score can never be written back to LLM-EVALUATION-MATRIX.md, so one provider is always skipped during final matrix generation.



Copilot AI left a comment


Pull request overview

Integrates a versioned router policy (v1.0.0) into the router agent docs and adds lightweight CI validation, alongside dashboard GitHub panel resilience improvements and new matrix-evaluation tooling/docs.

Changes:

  • Added router policy spec + a Node-based router policy test, and wired it into GitHub Actions.
  • Improved dashboard GitHub summary fetching with server-side retry/backoff and client-side timeout + error UI.
  • Added ai-matrix evaluation runner/scoring/provider configs and several new research/design-analysis artifacts.

Reviewed changes

Copilot reviewed 28 out of 28 changed files in this pull request and generated 12 comments.

Summary per file:

| File | Description |
|---|---|
| agents/router.agent.md | Documents router policy spec + embeds versioned policy JSON. |
| tests/router-policy.test.js | Adds a simple CLI-style test suite for lane classification. |
| .github/workflows/lint.yml | Adds a CI job to run router policy tests. |
| package.json | Bumps version and adds test:router; adds node-fetch dependency. |
| scripts/dashboard-server.js | Adds retry/backoff behavior for /api/github/summary. |
| dashboard/js/github-monitor.js | Adds client timeout + error UI for GitHub panel. |
| scripts/copilot-chat-adapter.js | Adds a Copilot Chat adapter skeleton. |
| scripts/ticket-assignment-suggester.js | Adds prototype role-suggestion logic for ticket metadata. |
| scripts/ai-matrix-updater.js | Adds controlled LLM evaluation runner and output persistence. |
| scripts/ai-matrix-scorer.js | Adds heuristic scoring for evaluation responses. |
| scripts/ai-matrix-providers.js | Adds provider configs and merges fleet providers. |
| scripts/ai-matrix-providers-fleet.js | Adds local fleet provider configs (OpenClaw/Ollama). |
| scripts/ai-matrix-matrix-writer.js | Writes empirical run results into the evaluation matrix markdown. |
| scripts/ai-matrix-build-final.js | Aggregates best-of-run results and updates the matrix. |
| model-compare/design-analysis/LLM-EVALUATION-MATRIX.md | Adds/initializes the evaluation matrix content. |
| model-compare/design-analysis/eval-prompt.md | Adds the evaluation prompt used for scoring. |
| model-compare/design-analysis/raptor-mini-design-analysis.md | Adds a design-analysis artifact. |
| model-compare/design-analysis/gpt-5-4-mini-design-analysis.md | Adds a design-analysis artifact (currently over lint line limit). |
| model-compare/design-analysis/fleet-resource-table-design-analysis-gpt-5.3-codex-2026-04-20.md | Adds a design-analysis artifact (currently over lint line limit). |
| model-compare/design-analysis/fleet-resource-table-design-analysis-claude-sonnet-4.6-2026-04-20.md | Adds a design-analysis artifact (currently over lint line limit). |
| research/router-matrix-router-analysis.md | Adds a router/matrix analysis research doc. |
| inventory/services.json | Reformats/expands service entries. |
| CHANGELOG.md | Adds 3.1.0 release notes (currently over lint line limit). |
| tickets/98-copilot-chat-hooks-research.md | Adds ticket tracking doc. |
| tickets/100-visual-qa-audit.md | Adds ticket tracking doc. |
| tickets/118-ticket-status-assignment-research.md | Adds ticket tracking doc. |
| tickets/119-unknown.md | Adds ticket tracking doc. |

Comment thread agents/router.agent.md
Comment on lines +129 to +130
"backend": "sonnet",
"recommendedModel": "Claude Sonnet 4.6",

Copilot AI Apr 21, 2026


The new “premium” lane recommends “Claude Sonnet 4.6”, but the router’s Tier 1 handoff is labeled as “Deep Work (Opus)” and routes to the architect agent. This creates conflicting guidance in the same doc (premium/“deep work” could mean Sonnet or Opus). Please reconcile the lane→model mapping with the tier/handoff definitions (e.g., make premium map to Opus/architect, or update the Tier 1 label/model guidance accordingly).

Suggested change
- "backend": "sonnet",
- "recommendedModel": "Claude Sonnet 4.6",
+ "backend": "opus",
+ "recommendedModel": "Claude Opus",

Copilot uses AI. Check for mistakes.
Comment thread CHANGELOG.md
Comment on lines 1 to +10
# Changelog

## [3.0.1] - 2026-04-14
## [3.1.0] - 2026-04-20

### Added — Router Policy Integration (#341)
- **Versioned router policy** (v1.0.0) in `router.agent.md` with keyword lanes and escalation
- **Router policy tests** (`tests/router-policy.test.js`): 5 test cases validating classification
- **CI test job** in `.github/workflows/lint.yml` for router policy validation
- **Dashboard GitHub panel improvements**: Server retry/backoff (3 attempts), client timeout (6s), error UI with retry button
- **Ticket status tracking**: Event log updates for ticket lifecycle management

Copilot AI Apr 21, 2026


This file now exceeds the repo’s ≤100-line limit enforced by scripts/lint.js (it currently has >100 lines), so npm run lint will fail in CI. Please split the changelog into smaller files/sections (or move older entries to an archive file) so each checked file stays ≤100 lines.

Comment on lines +10 to +42
- Reviewed local Karpathy LLM Wiki / governance material using repo search:
- `instructions/wiki-knowledge.instructions.md`
- `skills/llm-wiki-ops-portable/SKILL.md`
- `WIKI.md`
- `wiki/index.md`
- `wiki/concepts/wiki-pattern.md`
- `wiki/sources/karpathy-llm-wiki-pattern.md`
- `research/adr/007-llm-wiki-adoption.md`
- `raw/articles/karpathy-llm-wiki-pattern.md`
- Cross-checked existing design-analysis notes in sibling files:
- `model-compare/design-analysis/gpt-5-mini-design-analysis.md`
- `model-compare/design-analysis/grok-code-fast-1-design-analysis.md`
- `model-compare copy/design-analysis/gemini-3-flash-preview-design-analysis.md`
- Fetched current external references on structured-output, Vault, and GitHub Actions triggers.

## Web research summary
- OpenAI structured-output / function-calling docs redirected in the fetch tool, but the core guidance remains clear: prefer schema-constrained structured output, treat JSON as an enforced contract, and validate before consumption.
- HashiCorp Vault docs emphasize managed static secrets, identities/authentication, dynamic credentials, sensitive-data handling, and auditability. Recent 2026 HashiCorp blog posts highlight agentic runtime security, secret inventory reporting, and workload-identity federation.
- GitHub Actions workflow docs confirm the repo can use `schedule`, `issues`, `issue_comment`, `workflow_dispatch`, `workflow_run`, and `repository_dispatch` triggers. This supports an activity-triggered primary workflow plus a privileged follow-up workflow if needed.
- ArXiv search via the tool was noisy and the query endpoint returned 400, so I relied on repo-held synthesis files and the current research notes already embedded in the workspace for the consensus/validation strategy.

## Critical analysis — what is strong
- The earlier proposal is directionally right about separating concerns:
- deterministic fetch/extraction first,
- LLM normalization second,
- validation third,
- review/commit last.
- Using Karpathy Wiki material as the prompt seed is the right governance move because the repo already documents the wiki pattern as the compiled knowledge layer.
- Vault as the primary secrets store is the right security choice; it matches the repo’s security posture and keeps secrets out of Git and LLM prompts.
- A lock/unlock table with per-resource edit modals is a good UX pattern because it makes secret mutation explicit and auditable.

## Critical analysis — what is weak or risky
- **Too much trust in LLM consensus for facts.** Multiple LLMs can agree on the same wrong value. Consensus should only decide between candidate normalizations, not replace authoritative source checks.

Copilot AI Apr 21, 2026


This new markdown artifact is >100 lines (lint-enforced limit for .md files), which will cause npm run lint to fail. Split this analysis into multiple ≤100-line markdown files (e.g., part-1/part-2) or otherwise reduce length.

Suggested change
- Reviewed local Karpathy LLM Wiki / governance material using repo search:
- `instructions/wiki-knowledge.instructions.md`
- `skills/llm-wiki-ops-portable/SKILL.md`
- `WIKI.md`
- `wiki/index.md`
- `wiki/concepts/wiki-pattern.md`
- `wiki/sources/karpathy-llm-wiki-pattern.md`
- `research/adr/007-llm-wiki-adoption.md`
- `raw/articles/karpathy-llm-wiki-pattern.md`
- Cross-checked existing design-analysis notes in sibling files:
- `model-compare/design-analysis/gpt-5-mini-design-analysis.md`
- `model-compare/design-analysis/grok-code-fast-1-design-analysis.md`
- `model-compare copy/design-analysis/gemini-3-flash-preview-design-analysis.md`
- Fetched current external references on structured-output, Vault, and GitHub Actions triggers.
## Web research summary
- OpenAI structured-output / function-calling docs redirected in the fetch tool, but the core guidance remains clear: prefer schema-constrained structured output, treat JSON as an enforced contract, and validate before consumption.
- HashiCorp Vault docs emphasize managed static secrets, identities/authentication, dynamic credentials, sensitive-data handling, and auditability. Recent 2026 HashiCorp blog posts highlight agentic runtime security, secret inventory reporting, and workload-identity federation.
- GitHub Actions workflow docs confirm the repo can use `schedule`, `issues`, `issue_comment`, `workflow_dispatch`, `workflow_run`, and `repository_dispatch` triggers. This supports an activity-triggered primary workflow plus a privileged follow-up workflow if needed.
- ArXiv search via the tool was noisy and the query endpoint returned 400, so I relied on repo-held synthesis files and the current research notes already embedded in the workspace for the consensus/validation strategy.
## Critical analysis — what is strong
- The earlier proposal is directionally right about separating concerns:
- deterministic fetch/extraction first,
- LLM normalization second,
- validation third,
- review/commit last.
- Using Karpathy Wiki material as the prompt seed is the right governance move because the repo already documents the wiki pattern as the compiled knowledge layer.
- Vault as the primary secrets store is the right security choice; it matches the repo’s security posture and keeps secrets out of Git and LLM prompts.
- A lock/unlock table with per-resource edit modals is a good UX pattern because it makes secret mutation explicit and auditable.
## Critical analysis — what is weak or risky
- **Too much trust in LLM consensus for facts.** Multiple LLMs can agree on the same wrong value. Consensus should only decide between candidate normalizations, not replace authoritative source checks.
Suggested replacement:
- Reviewed local wiki/governance sources: `instructions/wiki-knowledge.instructions.md`, `skills/llm-wiki-ops-portable/SKILL.md`, `WIKI.md`, `wiki/index.md`, `wiki/concepts/wiki-pattern.md`, `wiki/sources/karpathy-llm-wiki-pattern.md`, `research/adr/007-llm-wiki-adoption.md`, and `raw/articles/karpathy-llm-wiki-pattern.md`.
- Cross-checked sibling analyses: `model-compare/design-analysis/gpt-5-mini-design-analysis.md`, `model-compare/design-analysis/grok-code-fast-1-design-analysis.md`, and `model-compare copy/design-analysis/gemini-3-flash-preview-design-analysis.md`.
- Reviewed current external guidance on structured output, Vault, and GitHub Actions triggers.
## Web research summary
- Structured output guidance still favors schema-constrained JSON with validation before use.
- Vault guidance emphasizes secret isolation, identity-based access, dynamic credentials, and auditability.
- GitHub Actions supports `schedule`, `issues`, `issue_comment`, `workflow_dispatch`, `workflow_run`, and `repository_dispatch`, enabling staged automation.
- ArXiv search was noisy, so repository-held synthesis was used for the consensus/validation portion.
## Critical analysis — strengths
- The proposed pipeline order is sound: deterministic fetch/extraction, LLM normalization, validation, then review/commit.
- Using Karpathy Wiki material as prompt seed aligns with the repo’s documented governance model.
- Vault is the correct primary secrets store because it keeps secrets out of Git and prompts.
- A lock/unlock table with edit modals is a strong UX choice for explicit, auditable secret changes.
## Critical analysis — risks
- **LLM consensus is not proof of truth.** Consensus should rank candidate normalizations, not replace authoritative source verification.

Comment on lines +88 to +103
const { getSummary } = require('./github-api');
// Retry on transient failures (gh CLI sometimes returns null)
const maxAttempts = 3;
let summary = null;
for (let attempt = 1; attempt <= maxAttempts; attempt++) {
  try {
    summary = getSummary();
  } catch (e) {
    summary = null;
  }
  if (summary && Object.keys(summary).length) break;
  // exponential-ish backoff
  await new Promise(r => setTimeout(r, 150 * attempt));
}
if (!summary) return jsonRes(res, 503, { error: 'github_unavailable', message: 'GitHub API unavailable or unauthenticated (gh CLI).' });
return jsonRes(res, 200, summary);

Copilot AI Apr 21, 2026


The retry/backoff loop as written won’t actually detect the “gh CLI returned null/unauthenticated” case because getSummary() always returns an object with keys, so Object.keys(summary).length is always truthy and the loop breaks on the first attempt. Consider changing getSummary() to return null (or throw) when core calls fail (e.g., repo info is null), or update the retry condition to check a specific signal (like repo.default_branch present) before breaking / deciding between 200 vs 503.
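The "check a specific signal" suggestion could look roughly like this. A hedged sketch only: it assumes `getSummary()` returns an object whose `repo.default_branch` is absent when the gh CLI is unauthenticated; `fetchSummaryWithRetry` is an illustrative name, not code from `scripts/dashboard-server.js`.

```javascript
// Retry helper that breaks only on a concrete success signal instead of
// "object has any keys", which is what made the original loop a no-op.
async function fetchSummaryWithRetry(getSummary, maxAttempts = 3) {
  let summary = null;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      summary = getSummary();
    } catch (e) {
      summary = null;
    }
    // Success signal (assumed shape): repo info with a default branch.
    if (summary && summary.repo && summary.repo.default_branch) return summary;
    if (attempt < maxAttempts) {
      await new Promise(r => setTimeout(r, 150 * attempt)); // backoff
    }
  }
  return null; // caller maps this to a 503
}
```

The caller can then return 503 whenever this resolves to null, and 200 otherwise, without the always-truthy `Object.keys` check.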

Comment on lines +13 to +19
const r = await fetch('/api/github/summary', { signal: controller.signal });
clearTimeout(to);
if (!r.ok) {
  const body = await r.json().catch(()=>({}));
  _ghCache = { error: body.error || 'github_error', message: body.message || `status:${r.status}` };
} else {
  _ghCache = await r.json();

Copilot AI Apr 21, 2026


The fetch timeout is only cleared on the success path; if fetch() throws/rejects (including AbortError), the setTimeout callback may still fire later. Wrap the fetch in a try/finally (or clear the timeout in a finally) so the timer is always cleaned up.

Suggested change
- const r = await fetch('/api/github/summary', { signal: controller.signal });
- clearTimeout(to);
- if (!r.ok) {
-   const body = await r.json().catch(()=>({}));
-   _ghCache = { error: body.error || 'github_error', message: body.message || `status:${r.status}` };
- } else {
-   _ghCache = await r.json();
+ try {
+   const r = await fetch('/api/github/summary', { signal: controller.signal });
+   if (!r.ok) {
+     const body = await r.json().catch(()=>({}));
+     _ghCache = { error: body.error || 'github_error', message: body.message || `status:${r.status}` };
+   } else {
+     _ghCache = await r.json();
+   }
+ } finally {
+   clearTimeout(to);

Comment on lines +6 to +48
## Summary Table
| Area | Prior suggestion | Critical assessment | Recommendation |
|---|---|---|---|
| Table schema | Wide, feature-rich table | Risk of horizontal overflow and low scannability | Use a strict core-column set + expandable row details |
| Auto-update orchestration | Multi-LLM + reviewer + auto-commit | Strong quality potential but high complexity/cost/failure surface | Phase rollout: single strict pipeline first, then add consensus tier |
| Prompt-builder | Code-generated prompts | Correct approach; must be versioned and testable | Prompt templates with schema-version pin + golden tests |
| JSON validation + resubmit | Retry loop per model | Correct; can loop on semantic errors | Two-stage validation: schema validation + semantic validators |
| Review layer | Third LLM reviewer | Good for confidence, weak if no deterministic gates | Reviewer is advisory; deterministic policy engine remains final gate |
| Secret editing UX | Lock/unlock + per-resource modal | Good governance; poor if session lock state leaks | Default lock on every render + short unlock TTL + audit log |
| Credential storage | Vault preferred | Correct industry standard | Vault KV v2 with least-privilege token + rotation workflow |
| Commit policy | Automated commits | Useful but risky for structural changes | Auto-commit low-risk deltas; PR-required for schema/table-structure changes |

## Actions Performed
1. Reviewed repository research/doc instructions and wiki operation conventions.
2. Reviewed local Karpathy wiki content for routing/context/governance alignment.
3. Performed external web research on:
- Structured outputs + strict schema controls
- Tool-use consistency and output-guardrail patterns
- Enterprise secret management baseline
4. Synthesized critical design tradeoffs and failure modes.

## Internal Karpathy Wiki Utilization
Used internal wiki pages as grounding context before conclusions:
- `wiki/concepts/model-routing.md`
- `wiki/concepts/context-flow.md`
- `wiki/sources/karpathy-llm-wiki-pattern.md`
- `WIKI.md` schema conventions and ingest/query/lint flow

## External Research (Cutting-edge references)
### Structured output and deterministic schema conformance
- OpenAI: Structured Outputs + strict JSON schema + refusal handling + schema subset constraints.
- https://openai.com/index/introducing-structured-outputs-in-the-api/
- https://developers.openai.com/api/docs/guides/structured-outputs
- https://developers.openai.com/api/docs/guides/function-calling
- Google Gemini: response_json_schema + schema subset + semantic validation caveat.
- https://ai.google.dev/gemini-api/docs/structured-output
- Vertex AI MaaS: structured output support for open models and JSON-mode caveat.
- https://docs.cloud.google.com/vertex-ai/generative-ai/docs/maas/capabilities/structured-output

### Consistency and guardrail patterns
- Anthropic: strict tool use and output consistency guardrail patterns.
- https://platform.claude.com/docs/en/docs/agents-and-tools/tool-use/overview
- https://platform.claude.com/docs/en/docs/test-and-evaluate/strengthen-guardrails/increase-consistency

Copilot AI Apr 21, 2026


This new markdown file is >100 lines (lint-enforced limit for .md files) and will fail npm run lint in CI. Please split into multiple ≤100-line files or move the long-form content into an archived format that’s compliant with the lint rule.

Suggested change
## Summary Table
| Area | Prior suggestion | Critical assessment | Recommendation |
|---|---|---|---|
| Table schema | Wide, feature-rich table | Risk of horizontal overflow and low scannability | Use a strict core-column set + expandable row details |
| Auto-update orchestration | Multi-LLM + reviewer + auto-commit | Strong quality potential but high complexity/cost/failure surface | Phase rollout: single strict pipeline first, then add consensus tier |
| Prompt-builder | Code-generated prompts | Correct approach; must be versioned and testable | Prompt templates with schema-version pin + golden tests |
| JSON validation + resubmit | Retry loop per model | Correct; can loop on semantic errors | Two-stage validation: schema validation + semantic validators |
| Review layer | Third LLM reviewer | Good for confidence, weak if no deterministic gates | Reviewer is advisory; deterministic policy engine remains final gate |
| Secret editing UX | Lock/unlock + per-resource modal | Good governance; poor if session lock state leaks | Default lock on every render + short unlock TTL + audit log |
| Credential storage | Vault preferred | Correct industry standard | Vault KV v2 with least-privilege token + rotation workflow |
| Commit policy | Automated commits | Useful but risky for structural changes | Auto-commit low-risk deltas; PR-required for schema/table-structure changes |
## Actions Performed
1. Reviewed repository research/doc instructions and wiki operation conventions.
2. Reviewed local Karpathy wiki content for routing/context/governance alignment.
3. Performed external web research on:
- Structured outputs + strict schema controls
- Tool-use consistency and output-guardrail patterns
- Enterprise secret management baseline
4. Synthesized critical design tradeoffs and failure modes.
## Internal Karpathy Wiki Utilization
Used internal wiki pages as grounding context before conclusions:
- `wiki/concepts/model-routing.md`
- `wiki/concepts/context-flow.md`
- `wiki/sources/karpathy-llm-wiki-pattern.md`
- `WIKI.md` schema conventions and ingest/query/lint flow
## External Research (Cutting-edge references)
### Structured output and deterministic schema conformance
- OpenAI: Structured Outputs + strict JSON schema + refusal handling + schema subset constraints.
- https://openai.com/index/introducing-structured-outputs-in-the-api/
- https://developers.openai.com/api/docs/guides/structured-outputs
- https://developers.openai.com/api/docs/guides/function-calling
- Google Gemini: response_json_schema + schema subset + semantic validation caveat.
- https://ai.google.dev/gemini-api/docs/structured-output
- Vertex AI MaaS: structured output support for open models and JSON-mode caveat.
- https://docs.cloud.google.com/vertex-ai/generative-ai/docs/maas/capabilities/structured-output
### Consistency and guardrail patterns
- Anthropic: strict tool use and output consistency guardrail patterns.
- https://platform.claude.com/docs/en/docs/agents-and-tools/tool-use/overview
- https://platform.claude.com/docs/en/docs/test-and-evaluate/strengthen-guardrails/increase-consistency
Suggested replacement:
> Condensed to satisfy the repository Markdown lint limit (`≤100` lines per file).
## Summary Table
| Area | Assessment | Recommendation |
|---|---|---|
| Table schema | Wide tables reduce scanability and can overflow | Keep core columns only; move extra fields to expandable row details |
| Auto-update orchestration | Multi-LLM review improves quality but adds cost and failure modes | Start with one strict pipeline; add consensus/reviewer stages later |
| Prompt builder | Generated prompts are appropriate if versioned and testable | Pin schema versions and add golden tests |
| Validation | Schema-only retries miss semantic errors | Use schema validation plus semantic validators |
| Review layer | Reviewer-only gating is insufficient | Keep deterministic policy checks as final gate |
| Secret editing UX | Unlock flows can leak risky session state | Default to locked state, short TTL, and audit logging |
| Credential storage | Vault is the right baseline | Use Vault KV v2 with least-privilege access and rotation |
| Commit policy | Full automation is risky for structural changes | Auto-commit low-risk deltas; require PRs for schema/table changes |
## Actions Performed
1. Reviewed repo documentation and wiki conventions.
2. Reviewed internal routing/context/governance references.
3. Compared current design against structured-output, guardrail, and secret-management guidance.
4. Synthesized failure modes and rollout recommendations.
## Internal References
- `wiki/concepts/model-routing.md`
- `wiki/concepts/context-flow.md`
- `wiki/sources/karpathy-llm-wiki-pattern.md`
- `WIKI.md`
## External Research
### Structured output
- OpenAI: https://openai.com/index/introducing-structured-outputs-in-the-api/
- OpenAI docs: https://developers.openai.com/api/docs/guides/structured-outputs
- OpenAI function calling: https://developers.openai.com/api/docs/guides/function-calling
- Gemini structured output: https://ai.google.dev/gemini-api/docs/structured-output
- Vertex AI MaaS structured output: https://docs.cloud.google.com/vertex-ai/generative-ai/docs/maas/capabilities/structured-output
### Guardrails and consistency
- Anthropic tool use: https://platform.claude.com/docs/en/docs/agents-and-tools/tool-use/overview
- Anthropic consistency guardrails: https://platform.claude.com/docs/en/docs/test-and-evaluate/strengthen-guardrails/increase-consistency

Comment on lines +1 to +8
# Critical Analysis — Fleet Resource Table Design
**Model**: Claude Sonnet 4.6 | **Date**: 2026-04-20
**Ref**: GPT-5.3-Codex analysis (same folder), GH Epic #331

---

## Grounding Sources


Copilot AI Apr 21, 2026


This new markdown file is >100 lines, but the repo’s lint script enforces a hard ≤100-line limit for .md files. Split this document into smaller parts so CI lint passes.

Comment on lines +89 to +105
// Retry on transient failures (gh CLI sometimes returns null)
const maxAttempts = 3;
let summary = null;
for (let attempt = 1; attempt <= maxAttempts; attempt++) {
  try {
    summary = getSummary();
  } catch (e) {
    summary = null;
  }
  if (summary && Object.keys(summary).length) break;
  // exponential-ish backoff
  await new Promise(r => setTimeout(r, 150 * attempt));
}
if (!summary) return jsonRes(res, 503, { error: 'github_unavailable', message: 'GitHub API unavailable or unauthenticated (gh CLI).' });
return jsonRes(res, 200, summary);
} catch (e) {
  return jsonRes(res, 500, { error: e.message });
Copilot AI Apr 21, 2026


This retry can significantly increase event-loop blocking because getSummary() uses multiple execSync calls, so retrying up to three times can leave the dashboard server unresponsive to other requests for much longer under failure conditions. Prefer making the github-api calls async (spawn/exec with callbacks), or at least skip retries when the failure mode is likely persistent (e.g., missing auth), log the failure, and return 503 quickly.

Suggested change

```diff
-// Retry on transient failures (gh CLI sometimes returns null)
-const maxAttempts = 3;
-let summary = null;
-for (let attempt = 1; attempt <= maxAttempts; attempt++) {
-  try {
-    summary = getSummary();
-  } catch (e) {
-    summary = null;
-  }
-  if (summary && Object.keys(summary).length) break;
-  // exponential-ish backoff
-  await new Promise(r => setTimeout(r, 150 * attempt));
-}
-if (!summary) return jsonRes(res, 503, { error: 'github_unavailable', message: 'GitHub API unavailable or unauthenticated (gh CLI).' });
-return jsonRes(res, 200, summary);
-} catch (e) {
-  return jsonRes(res, 500, { error: e.message });
+const summary = getSummary();
+if (!summary || !Object.keys(summary).length) {
+  return jsonRes(res, 503, { error: 'github_unavailable', message: 'GitHub API unavailable or unauthenticated (gh CLI).' });
+}
+return jsonRes(res, 200, summary);
+} catch (e) {
+  console.warn('GitHub summary unavailable:', e.message);
+  return jsonRes(res, 503, { error: 'github_unavailable', message: 'GitHub API unavailable or unauthenticated (gh CLI).' });
```
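
If retries are kept, a non-blocking variant can be sketched with an async helper. `retryAsync` is a hypothetical name; in the dashboard it would wrap a promisified `child_process.execFile('gh', ...)` call rather than the execSync-based getSummary():

```javascript
// Hypothetical sketch: retry an *async* operation with backoff so the
// event loop stays free between attempts, instead of blocking on execSync.
async function retryAsync(fn, { attempts = 3, baseDelayMs = 150 } = {}) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      const result = await fn();
      if (result && Object.keys(result).length) return result; // success
    } catch (e) {
      // swallow and fall through to backoff; a persistent failure mode
      // (e.g., missing auth) should ideally short-circuit here instead
    }
    if (attempt < attempts) {
      await new Promise(r => setTimeout(r, baseDelayMs * attempt));
    }
  }
  return null; // caller maps this to a 503
}
```

Because each attempt awaits rather than blocks, other dashboard requests keep being served during the backoff windows.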

Comment on lines +40 to +45
```js
if (gh && gh.error) return `<div class="gh-error">
<span style="font-size:1.5rem">⚠️</span>
<p>GitHub data unavailable</p>
<p style="font-size:0.72rem;color:var(--text-muted)">${esc(gh.message || gh.error)}</p>
<button class="gh-retry">Retry</button>
</div>`;
```
Copilot AI Apr 21, 2026


The error-state HTML renders a “Retry” button (.gh-retry), but there is no click handler anywhere in the dashboard JS, so the button does nothing. Either wire up an event handler (e.g., clear _ghCache and force a refresh cycle on click) or remove the button to avoid a dead control.
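
One way to wire the button, sketched with hypothetical names (`makeRetryHandler`, `state.ghCache`, `refresh`) so the cache-clearing logic stays testable outside a browser:

```javascript
// Hypothetical sketch: a click handler for the .gh-retry button that
// drops the cached GitHub summary and forces an immediate refresh.
// In the dashboard it would be attached via event delegation, e.g.:
//   document.addEventListener('click', e => {
//     if (e.target.closest('.gh-retry')) handler();
//   });
function makeRetryHandler(state, refresh) {
  return function onRetryClick() {
    state.ghCache = null; // drop the stale error payload
    refresh();            // kick off a refresh cycle right away
  };
}
```

Delegating from `document` also survives the panel being re-rendered, since the button element itself is replaced on every render.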

```js
openrouter_gemma4b: chat('https://openrouter.ai/api/v1/chat/completions', process.env.OPENROUTER_API_KEY, 'google/gemma-3-4b-it:free'),
openrouter_nemotron: chat('https://openrouter.ai/api/v1/chat/completions', process.env.OPENROUTER_API_KEY, 'nvidia/nemotron-3-super-120b-a12b:free'),
openrouter_llama70b: chat('https://openrouter.ai/api/v1/chat/completions', process.env.OPENROUTER_API_KEY, 'meta-llama/llama-3.3-70b-instruct:free'),
openrouter_hermes405b: chat('https://openrouter.ai/api/v1/chat/completions', process.env.OPENROUTER_API_KEY, 'nvidia/nemotron-3-nano-30b-a3b:free'),
```
Copilot AI Apr 21, 2026


Provider key openrouter_hermes405b is configured with model nvidia/nemotron-3-nano-30b-a3b:free, which doesn’t match the key/label expectations elsewhere (e.g., matrix labels treat this as a Hermes 405B entry). This mismatch will make run results hard to interpret and can cause the “best-of” aggregator to attribute scores to the wrong model. Please align the provider key, displayed label, and actual model id (either rename the key or change the model).

Suggested change

```diff
-openrouter_hermes405b: chat('https://openrouter.ai/api/v1/chat/completions', process.env.OPENROUTER_API_KEY, 'nvidia/nemotron-3-nano-30b-a3b:free'),
+openrouter_hermes405b: chat('https://openrouter.ai/api/v1/chat/completions', process.env.OPENROUTER_API_KEY, 'nousresearch/hermes-3-llama-3.1-405b:free'),
```
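
A lightweight consistency check could catch this kind of drift before a run. The heuristic below (match the key's trailing family/size hint against the model id) is illustrative, not an existing script in this repo:

```javascript
// Hypothetical sketch: flag provider keys whose family/size hint
// (e.g. "hermes405b") does not appear in the configured model id.
function mismatchedKeys(providers) {
  return Object.entries(providers)
    .filter(([key, model]) => {
      const m = key.match(/_([a-z]+)(\d+b)?$/); // "_hermes405b" -> "hermes", "405b"
      if (!m) return false;
      const [, family, size] = m;
      const id = model.toLowerCase().replace(/-/g, '');
      return !id.includes(family) || (size !== undefined && !id.includes(size));
    })
    .map(([key]) => key);
}
```

Running this against the provider map at startup (or in the unit tests) would flag `openrouter_hermes405b` as soon as its model id stops matching its label.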

Successfully merging this pull request may close: Document router policy spec from matrix analysis (#341).