
feat: integrate router policy v1.0.0 with tests and CI (#341) #342

Open
chf3198 wants to merge 8 commits into main from feat/341-router-policy-integration

Conversation

chf3198 (Owner) commented Apr 21, 2026

This PR integrates the router policy spec with comprehensive testing and CI validation.

Changes

  • Router Policy v1.0.0: Added versioned policy spec to router.agent.md with keyword-based lane classification and escalation rules
  • Unit Tests: Created tests/router-policy.test.js with 5 test cases covering all lanes
  • CI Integration: Added test job to .github/workflows/lint.yml to run router policy tests on PR/push
  • Dashboard Improvements: Enhanced GitHub panel with server-side retry/backoff and client-side error handling
  • Version Bump: Updated to v3.1.0 with changelog entry

Validation

Closes #341
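The keyword-based lane classification described above could be sketched roughly as follows. This is a hedged illustration only: the actual policy and keyword lists live in `router.agent.md` and `tests/router-policy.test.js`; `LANES`, `classifyLane`, and the keywords here are invented for the example.

```javascript
// Illustrative shape of keyword-based lane classification with a
// default/escalation fallback. Names and keywords are hypothetical,
// not the repo's actual policy spec.
const LANES = {
  premium: ['architecture', 'security review', 'refactor'],
  standard: ['bug', 'test', 'docs'],
  fast: ['typo', 'rename', 'format'],
};

function classifyLane(prompt) {
  const text = prompt.toLowerCase();
  // First lane whose keyword appears in the prompt wins.
  for (const [lane, keywords] of Object.entries(LANES)) {
    if (keywords.some(k => text.includes(k))) return lane;
  }
  return 'standard'; // fallback when no keyword matches
}
```

A test suite over a function of this shape is what the five cases in `tests/router-policy.test.js` presumably exercise, one per lane plus escalation paths.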

Curtis Franks added 7 commits April 20, 2026 20:25
- Merge empirical results INTO Dynamic Tracking Table (single unified view)
  - Add Empirical Composite column to all rows
  - Expand 6 aggregate free-tier rows → 17 individual per-model rows
  - Copilot Pro rows retain full analytical rubric; free-tier rows show empirical scores
  - Remove disconnected Section B (EMPIRICAL-RESULTS-START block)
- Fix OpenClaw endpoint: port 4000 (LiteLLM proxy), OPENCLAW_DEVICE_PASSWORD auth
  - Correct model names: ollama/mistral, ollama/qwen2.5:7b-instruct
  - OpenClaw rows marked fleet offline (LiteLLM service not listening on port 4000)
- Split providers file: fleet providers → ai-matrix-providers-fleet.js (lint compliance)
- Update methodology section to document dual-rubric approach

Closes #339
…wup)

- Fix provider config: direct Ollama port 11434 (no LiteLLM proxy), stream:true
  for fleet models, max_tokens:300 cap, auth: Bearer ollama
- Add phi3:mini as third fleet provider (3.8B, fastest at ~33s/300tok)
- Fix updater: AbortController timeout (360s fleet / 45s cloud), SSE streaming
  reassembly for fleet responses
- Live empirical scores:
  - openclaw_mistral: 5.8 composite (slow ~105s/100tok, strong accuracy)
  - openclaw_phi3:   6.8 composite (33s/300tok, best speed)
  - openclaw_qwen:   8.0 composite (best quality, ~7B on CPU)
- Update matrix: replace fleet-offline markers with live scores
- Fix inventory/services.json: correct OpenClaw entry (Ollama direct, no LiteLLM)
updateMatrix() now strips any EMPIRICAL-RESULTS-START block and updates
the Empirical column directly in the Dynamic Tracking Table rows.
Section B is permanently retired — the unified table is the source of truth.
Removed Prop Score, Variability, and Emergent Property columns which were
only meaningful for Copilot Pro rows. All models now scored on identical
columns: Arch | Sec | UX/UI | Read. | Empirical | Best Use Case.
Rows sorted by empirical score (high to low) within each tier.
- Add versioned router policy spec to router.agent.md
- Create router-policy.test.js with 5 test cases
- Add test:router script to package.json
- Add test job to CI workflow for router policy validation
- Improve dashboard GitHub panel with retry/backoff and error UI
- Update ticket statuses in event log
Copilot AI review requested due to automatic review settings April 21, 2026 04:05
- Close #98: Added research/copilot-chat-hooks-research.md with adapter designs
- Close #118: Confirmed model in research/ticket-status-model.md and prototype in scripts/
- Updated ticket statuses in event log

chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0c09525307


// Match rows containing the label text, update the empirical column (second-to-last cell before Best Use Case)
const escaped = meta.label.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
content = content.replace(
new RegExp(`(\\|[^|]*${escaped}[^|]*\\|(?:[^|]*\\|){7})([^|]*)(\\|[^|]*\\|)`, 'm'),


P1: Fix row-matching regex so empirical scores are updated

updateMatrix never updates the matrix because the replacement regex does not match the current table shape in model-compare/design-analysis/LLM-EVALUATION-MATRIX.md (8 columns). The pattern at this line consumes too many |...| groups before trying to replace the empirical cell, so content.replace(...) is a no-op for every provider. Running ai-matrix-build-final.js after new eval runs will therefore leave stale empirical values in the matrix.

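A more robust way to target the Empirical cell than counting `|` groups in a single regex is to split each table row on `|` and index from the end. A minimal sketch, with the caveat that `updateEmpiricalCell` is a hypothetical helper, not the repo's `ai-matrix-build-final.js` code, and it assumes rows have leading and trailing pipes with the Empirical column second-to-last before Best Use Case:

```javascript
// Update the second-to-last real column (Empirical) of any markdown
// table row containing `label`. Splitting on '|' avoids the fragile
// "consume N groups" regex that the review flags as a no-op.
function updateEmpiricalCell(content, label, score) {
  return content.split('\n').map(line => {
    if (!line.includes(label) || !line.trim().startsWith('|')) return line;
    const cells = line.split('|'); // edge pipes yield '' at both ends
    if (cells.length < 4) return line;
    cells[cells.length - 3] = ` ${score} `; // second-to-last real column
    return cells.join('|');
  }).join('\n');
}
```

Indexing from the end also survives adding or removing leading columns, which a fixed `(?:[^|]*\|){7}` count does not.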

openrouter_qwen3coder: { label: 'OpenRouter — qwen/qwen3-coder:free', cost: '0x / free-cloud', tier: 'OpenRouter' },
openrouter_nemotron: { label: 'OpenRouter — nvidia/nemotron-super-120b:free', cost: '0x / free-cloud', tier: 'OpenRouter' },
openrouter_llama70b: { label: 'OpenRouter — llama-3.3-70b-instruct:free', cost: '0x / free-cloud', tier: 'OpenRouter' },
openrouter_hermes405b: { label: 'OpenRouter — hermes-3-llama-3.1-405b:free', cost: '0x / free-cloud', tier: 'OpenRouter' },


P2: Align provider label with matrix row for hermes key

The label for openrouter_hermes405b does not correspond to any row in the matrix (and does not match the model configured for that key in scripts/ai-matrix-providers.js). Because updateMatrix matches rows by exact label text, this provider's empirical score can never be written back to LLM-EVALUATION-MATRIX.md, so one provider is always skipped during final matrix generation.



Copilot AI left a comment


Pull request overview

Integrates a versioned router policy (v1.0.0) into the router agent docs and adds lightweight CI validation, alongside dashboard GitHub panel resilience improvements and new matrix-evaluation tooling/docs.

Changes:

  • Added router policy spec + a Node-based router policy test, and wired it into GitHub Actions.
  • Improved dashboard GitHub summary fetching with server-side retry/backoff and client-side timeout + error UI.
  • Added ai-matrix evaluation runner/scoring/provider configs and several new research/design-analysis artifacts.

Reviewed changes

Copilot reviewed 28 out of 28 changed files in this pull request and generated 12 comments.

Summary per file:

| File | Description |
|---|---|
| agents/router.agent.md | Documents router policy spec + embeds versioned policy JSON. |
| tests/router-policy.test.js | Adds a simple CLI-style test suite for lane classification. |
| .github/workflows/lint.yml | Adds a CI job to run router policy tests. |
| package.json | Bumps version and adds test:router; adds node-fetch dependency. |
| scripts/dashboard-server.js | Adds retry/backoff behavior for /api/github/summary. |
| dashboard/js/github-monitor.js | Adds client timeout + error UI for GitHub panel. |
| scripts/copilot-chat-adapter.js | Adds a Copilot Chat adapter skeleton. |
| scripts/ticket-assignment-suggester.js | Adds prototype role-suggestion logic for ticket metadata. |
| scripts/ai-matrix-updater.js | Adds controlled LLM evaluation runner and output persistence. |
| scripts/ai-matrix-scorer.js | Adds heuristic scoring for evaluation responses. |
| scripts/ai-matrix-providers.js | Adds provider configs and merges fleet providers. |
| scripts/ai-matrix-providers-fleet.js | Adds local fleet provider configs (OpenClaw/Ollama). |
| scripts/ai-matrix-matrix-writer.js | Writes empirical run results into the evaluation matrix markdown. |
| scripts/ai-matrix-build-final.js | Aggregates best-of-run results and updates the matrix. |
| model-compare/design-analysis/LLM-EVALUATION-MATRIX.md | Adds/initializes the evaluation matrix content. |
| model-compare/design-analysis/eval-prompt.md | Adds the evaluation prompt used for scoring. |
| model-compare/design-analysis/raptor-mini-design-analysis.md | Adds a design-analysis artifact. |
| model-compare/design-analysis/gpt-5-4-mini-design-analysis.md | Adds a design-analysis artifact (currently over lint line limit). |
| model-compare/design-analysis/fleet-resource-table-design-analysis-gpt-5.3-codex-2026-04-20.md | Adds a design-analysis artifact (currently over lint line limit). |
| model-compare/design-analysis/fleet-resource-table-design-analysis-claude-sonnet-4.6-2026-04-20.md | Adds a design-analysis artifact (currently over lint line limit). |
| research/router-matrix-router-analysis.md | Adds a router/matrix analysis research doc. |
| inventory/services.json | Reformats/expands service entries. |
| CHANGELOG.md | Adds 3.1.0 release notes (currently over lint line limit). |
| tickets/98-copilot-chat-hooks-research.md | Adds ticket tracking doc. |
| tickets/100-visual-qa-audit.md | Adds ticket tracking doc. |
| tickets/118-ticket-status-assignment-research.md | Adds ticket tracking doc. |
| tickets/119-unknown.md | Adds ticket tracking doc. |

Comment thread agents/router.agent.md
Comment on lines +129 to +130
"backend": "sonnet",
"recommendedModel": "Claude Sonnet 4.6",

Copilot AI Apr 21, 2026


The new “premium” lane recommends “Claude Sonnet 4.6”, but the router’s Tier 1 handoff is labeled as “Deep Work (Opus)” and routes to the architect agent. This creates conflicting guidance in the same doc (premium/“deep work” could mean Sonnet or Opus). Please reconcile the lane→model mapping with the tier/handoff definitions (e.g., make premium map to Opus/architect, or update the Tier 1 label/model guidance accordingly).

Suggested change
- "backend": "sonnet",
- "recommendedModel": "Claude Sonnet 4.6",
+ "backend": "opus",
+ "recommendedModel": "Claude Opus",

Copilot uses AI. Check for mistakes.
Comment thread CHANGELOG.md
Comment on lines 1 to +10
# Changelog

## [3.0.1] - 2026-04-14
## [3.1.0] - 2026-04-20

### Added — Router Policy Integration (#341)
- **Versioned router policy** (v1.0.0) in `router.agent.md` with keyword lanes and escalation
- **Router policy tests** (`tests/router-policy.test.js`): 5 test cases validating classification
- **CI test job** in `.github/workflows/lint.yml` for router policy validation
- **Dashboard GitHub panel improvements**: Server retry/backoff (3 attempts), client timeout (6s), error UI with retry button
- **Ticket status tracking**: Event log updates for ticket lifecycle management

Copilot AI Apr 21, 2026


This file now exceeds the repo’s ≤100-line limit enforced by scripts/lint.js (it currently has >100 lines), so npm run lint will fail in CI. Please split the changelog into smaller files/sections (or move older entries to an archive file) so each checked file stays ≤100 lines.

Comment on lines +10 to +42
- Reviewed local Karpathy LLM Wiki / governance material using repo search:
- `instructions/wiki-knowledge.instructions.md`
- `skills/llm-wiki-ops-portable/SKILL.md`
- `WIKI.md`
- `wiki/index.md`
- `wiki/concepts/wiki-pattern.md`
- `wiki/sources/karpathy-llm-wiki-pattern.md`
- `research/adr/007-llm-wiki-adoption.md`
- `raw/articles/karpathy-llm-wiki-pattern.md`
- Cross-checked existing design-analysis notes in sibling files:
- `model-compare/design-analysis/gpt-5-mini-design-analysis.md`
- `model-compare/design-analysis/grok-code-fast-1-design-analysis.md`
- `model-compare copy/design-analysis/gemini-3-flash-preview-design-analysis.md`
- Fetched current external references on structured-output, Vault, and GitHub Actions triggers.

## Web research summary
- OpenAI structured-output / function-calling docs redirected in the fetch tool, but the core guidance remains clear: prefer schema-constrained structured output, treat JSON as an enforced contract, and validate before consumption.
- HashiCorp Vault docs emphasize managed static secrets, identities/authentication, dynamic credentials, sensitive-data handling, and auditability. Recent 2026 HashiCorp blog posts highlight agentic runtime security, secret inventory reporting, and workload-identity federation.
- GitHub Actions workflow docs confirm the repo can use `schedule`, `issues`, `issue_comment`, `workflow_dispatch`, `workflow_run`, and `repository_dispatch` triggers. This supports an activity-triggered primary workflow plus a privileged follow-up workflow if needed.
- ArXiv search via the tool was noisy and the query endpoint returned 400, so I relied on repo-held synthesis files and the current research notes already embedded in the workspace for the consensus/validation strategy.

## Critical analysis — what is strong
- The earlier proposal is directionally right about separating concerns:
- deterministic fetch/extraction first,
- LLM normalization second,
- validation third,
- review/commit last.
- Using Karpathy Wiki material as the prompt seed is the right governance move because the repo already documents the wiki pattern as the compiled knowledge layer.
- Vault as the primary secrets store is the right security choice; it matches the repo’s security posture and keeps secrets out of Git and LLM prompts.
- A lock/unlock table with per-resource edit modals is a good UX pattern because it makes secret mutation explicit and auditable.

## Critical analysis — what is weak or risky
- **Too much trust in LLM consensus for facts.** Multiple LLMs can agree on the same wrong value. Consensus should only decide between candidate normalizations, not replace authoritative source checks.

Copilot AI Apr 21, 2026


This new markdown artifact is >100 lines (lint-enforced limit for .md files), which will cause npm run lint to fail. Split this analysis into multiple ≤100-line markdown files (e.g., part-1/part-2) or otherwise reduce length.

Suggested change
- Reviewed local Karpathy LLM Wiki / governance material using repo search:
- `instructions/wiki-knowledge.instructions.md`
- `skills/llm-wiki-ops-portable/SKILL.md`
- `WIKI.md`
- `wiki/index.md`
- `wiki/concepts/wiki-pattern.md`
- `wiki/sources/karpathy-llm-wiki-pattern.md`
- `research/adr/007-llm-wiki-adoption.md`
- `raw/articles/karpathy-llm-wiki-pattern.md`
- Cross-checked existing design-analysis notes in sibling files:
- `model-compare/design-analysis/gpt-5-mini-design-analysis.md`
- `model-compare/design-analysis/grok-code-fast-1-design-analysis.md`
- `model-compare copy/design-analysis/gemini-3-flash-preview-design-analysis.md`
- Fetched current external references on structured-output, Vault, and GitHub Actions triggers.
## Web research summary
- OpenAI structured-output / function-calling docs redirected in the fetch tool, but the core guidance remains clear: prefer schema-constrained structured output, treat JSON as an enforced contract, and validate before consumption.
- HashiCorp Vault docs emphasize managed static secrets, identities/authentication, dynamic credentials, sensitive-data handling, and auditability. Recent 2026 HashiCorp blog posts highlight agentic runtime security, secret inventory reporting, and workload-identity federation.
- GitHub Actions workflow docs confirm the repo can use `schedule`, `issues`, `issue_comment`, `workflow_dispatch`, `workflow_run`, and `repository_dispatch` triggers. This supports an activity-triggered primary workflow plus a privileged follow-up workflow if needed.
- ArXiv search via the tool was noisy and the query endpoint returned 400, so I relied on repo-held synthesis files and the current research notes already embedded in the workspace for the consensus/validation strategy.
## Critical analysis — what is strong
- The earlier proposal is directionally right about separating concerns:
- deterministic fetch/extraction first,
- LLM normalization second,
- validation third,
- review/commit last.
- Using Karpathy Wiki material as the prompt seed is the right governance move because the repo already documents the wiki pattern as the compiled knowledge layer.
- Vault as the primary secrets store is the right security choice; it matches the repo’s security posture and keeps secrets out of Git and LLM prompts.
- A lock/unlock table with per-resource edit modals is a good UX pattern because it makes secret mutation explicit and auditable.
## Critical analysis — what is weak or risky
- **Too much trust in LLM consensus for facts.** Multiple LLMs can agree on the same wrong value. Consensus should only decide between candidate normalizations, not replace authoritative source checks.
Suggested replacement:
- Reviewed local wiki/governance sources: `instructions/wiki-knowledge.instructions.md`, `skills/llm-wiki-ops-portable/SKILL.md`, `WIKI.md`, `wiki/index.md`, `wiki/concepts/wiki-pattern.md`, `wiki/sources/karpathy-llm-wiki-pattern.md`, `research/adr/007-llm-wiki-adoption.md`, and `raw/articles/karpathy-llm-wiki-pattern.md`.
- Cross-checked sibling analyses: `model-compare/design-analysis/gpt-5-mini-design-analysis.md`, `model-compare/design-analysis/grok-code-fast-1-design-analysis.md`, and `model-compare copy/design-analysis/gemini-3-flash-preview-design-analysis.md`.
- Reviewed current external guidance on structured output, Vault, and GitHub Actions triggers.
## Web research summary
- Structured output guidance still favors schema-constrained JSON with validation before use.
- Vault guidance emphasizes secret isolation, identity-based access, dynamic credentials, and auditability.
- GitHub Actions supports `schedule`, `issues`, `issue_comment`, `workflow_dispatch`, `workflow_run`, and `repository_dispatch`, enabling staged automation.
- ArXiv search was noisy, so repository-held synthesis was used for the consensus/validation portion.
## Critical analysis — strengths
- The proposed pipeline order is sound: deterministic fetch/extraction, LLM normalization, validation, then review/commit.
- Using Karpathy Wiki material as prompt seed aligns with the repo’s documented governance model.
- Vault is the correct primary secrets store because it keeps secrets out of Git and prompts.
- A lock/unlock table with edit modals is a strong UX choice for explicit, auditable secret changes.
## Critical analysis — risks
- **LLM consensus is not proof of truth.** Consensus should rank candidate normalizations, not replace authoritative source verification.

Comment on lines +88 to +103
const { getSummary } = require('./github-api');
// Retry on transient failures (gh CLI sometimes returns null)
const maxAttempts = 3;
let summary = null;
for (let attempt = 1; attempt <= maxAttempts; attempt++) {
  try {
    summary = getSummary();
  } catch (e) {
    summary = null;
  }
  if (summary && Object.keys(summary).length) break;
  // exponential-ish backoff
  await new Promise(r => setTimeout(r, 150 * attempt));
}
if (!summary) return jsonRes(res, 503, { error: 'github_unavailable', message: 'GitHub API unavailable or unauthenticated (gh CLI).' });
return jsonRes(res, 200, summary);

Copilot AI Apr 21, 2026


The retry/backoff loop as written won’t actually detect the “gh CLI returned null/unauthenticated” case because getSummary() always returns an object with keys, so Object.keys(summary).length is always truthy and the loop breaks on the first attempt. Consider changing getSummary() to return null (or throw) when core calls fail (e.g., repo info is null), or update the retry condition to check a specific signal (like repo.default_branch present) before breaking / deciding between 200 vs 503.
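The "check a specific signal" suggestion could look roughly like this. A hedged sketch only: it assumes `getSummary()` returns an object whose `repo.default_branch` is absent when the gh CLI is unauthenticated; `fetchSummaryWithRetry` is an illustrative name, not code from `scripts/dashboard-server.js`.

```javascript
// Retry helper that breaks only on a concrete success signal instead of
// "object has any keys", which is what made the original loop a no-op.
async function fetchSummaryWithRetry(getSummary, maxAttempts = 3) {
  let summary = null;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      summary = getSummary();
    } catch (e) {
      summary = null;
    }
    // Success signal (assumed shape): repo info with a default branch.
    if (summary && summary.repo && summary.repo.default_branch) return summary;
    if (attempt < maxAttempts) {
      await new Promise(r => setTimeout(r, 150 * attempt)); // backoff
    }
  }
  return null; // caller maps this to a 503
}
```

The caller can then return 503 whenever this resolves to null, and 200 otherwise, without the always-truthy `Object.keys` check.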

Comment on lines +13 to +19
const r = await fetch('/api/github/summary', { signal: controller.signal });
clearTimeout(to);
if (!r.ok) {
  const body = await r.json().catch(()=>({}));
  _ghCache = { error: body.error || 'github_error', message: body.message || `status:${r.status}` };
} else {
  _ghCache = await r.json();

Copilot AI Apr 21, 2026


The fetch timeout is only cleared on the success path; if fetch() throws/rejects (including AbortError), the setTimeout callback may still fire later. Wrap the fetch in a try/finally (or clear the timeout in a finally) so the timer is always cleaned up.

Suggested change
- const r = await fetch('/api/github/summary', { signal: controller.signal });
- clearTimeout(to);
- if (!r.ok) {
-   const body = await r.json().catch(()=>({}));
-   _ghCache = { error: body.error || 'github_error', message: body.message || `status:${r.status}` };
- } else {
-   _ghCache = await r.json();
+ try {
+   const r = await fetch('/api/github/summary', { signal: controller.signal });
+   if (!r.ok) {
+     const body = await r.json().catch(()=>({}));
+     _ghCache = { error: body.error || 'github_error', message: body.message || `status:${r.status}` };
+   } else {
+     _ghCache = await r.json();
+   }
+ } finally {
+   clearTimeout(to);

Comment on lines +6 to +48
## Summary Table
| Area | Prior suggestion | Critical assessment | Recommendation |
|---|---|---|---|
| Table schema | Wide, feature-rich table | Risk of horizontal overflow and low scannability | Use a strict core-column set + expandable row details |
| Auto-update orchestration | Multi-LLM + reviewer + auto-commit | Strong quality potential but high complexity/cost/failure surface | Phase rollout: single strict pipeline first, then add consensus tier |
| Prompt-builder | Code-generated prompts | Correct approach; must be versioned and testable | Prompt templates with schema-version pin + golden tests |
| JSON validation + resubmit | Retry loop per model | Correct; can loop on semantic errors | Two-stage validation: schema validation + semantic validators |
| Review layer | Third LLM reviewer | Good for confidence, weak if no deterministic gates | Reviewer is advisory; deterministic policy engine remains final gate |
| Secret editing UX | Lock/unlock + per-resource modal | Good governance; poor if session lock state leaks | Default lock on every render + short unlock TTL + audit log |
| Credential storage | Vault preferred | Correct industry standard | Vault KV v2 with least-privilege token + rotation workflow |
| Commit policy | Automated commits | Useful but risky for structural changes | Auto-commit low-risk deltas; PR-required for schema/table-structure changes |

## Actions Performed
1. Reviewed repository research/doc instructions and wiki operation conventions.
2. Reviewed local Karpathy wiki content for routing/context/governance alignment.
3. Performed external web research on:
- Structured outputs + strict schema controls
- Tool-use consistency and output-guardrail patterns
- Enterprise secret management baseline
4. Synthesized critical design tradeoffs and failure modes.

## Internal Karpathy Wiki Utilization
Used internal wiki pages as grounding context before conclusions:
- `wiki/concepts/model-routing.md`
- `wiki/concepts/context-flow.md`
- `wiki/sources/karpathy-llm-wiki-pattern.md`
- `WIKI.md` schema conventions and ingest/query/lint flow

## External Research (Cutting-edge references)
### Structured output and deterministic schema conformance
- OpenAI: Structured Outputs + strict JSON schema + refusal handling + schema subset constraints.
- https://openai.com/index/introducing-structured-outputs-in-the-api/
- https://developers.openai.com/api/docs/guides/structured-outputs
- https://developers.openai.com/api/docs/guides/function-calling
- Google Gemini: response_json_schema + schema subset + semantic validation caveat.
- https://ai.google.dev/gemini-api/docs/structured-output
- Vertex AI MaaS: structured output support for open models and JSON-mode caveat.
- https://docs.cloud.google.com/vertex-ai/generative-ai/docs/maas/capabilities/structured-output

### Consistency and guardrail patterns
- Anthropic: strict tool use and output consistency guardrail patterns.
- https://platform.claude.com/docs/en/docs/agents-and-tools/tool-use/overview
- https://platform.claude.com/docs/en/docs/test-and-evaluate/strengthen-guardrails/increase-consistency

Copilot AI Apr 21, 2026


This new markdown file is >100 lines (lint-enforced limit for .md files) and will fail npm run lint in CI. Please split into multiple ≤100-line files or move the long-form content into an archived format that’s compliant with the lint rule.

Suggested change
## Summary Table
| Area | Prior suggestion | Critical assessment | Recommendation |
|---|---|---|---|
| Table schema | Wide, feature-rich table | Risk of horizontal overflow and low scannability | Use a strict core-column set + expandable row details |
| Auto-update orchestration | Multi-LLM + reviewer + auto-commit | Strong quality potential but high complexity/cost/failure surface | Phase rollout: single strict pipeline first, then add consensus tier |
| Prompt-builder | Code-generated prompts | Correct approach; must be versioned and testable | Prompt templates with schema-version pin + golden tests |
| JSON validation + resubmit | Retry loop per model | Correct; can loop on semantic errors | Two-stage validation: schema validation + semantic validators |
| Review layer | Third LLM reviewer | Good for confidence, weak if no deterministic gates | Reviewer is advisory; deterministic policy engine remains final gate |
| Secret editing UX | Lock/unlock + per-resource modal | Good governance; poor if session lock state leaks | Default lock on every render + short unlock TTL + audit log |
| Credential storage | Vault preferred | Correct industry standard | Vault KV v2 with least-privilege token + rotation workflow |
| Commit policy | Automated commits | Useful but risky for structural changes | Auto-commit low-risk deltas; PR-required for schema/table-structure changes |
## Actions Performed
1. Reviewed repository research/doc instructions and wiki operation conventions.
2. Reviewed local Karpathy wiki content for routing/context/governance alignment.
3. Performed external web research on:
- Structured outputs + strict schema controls
- Tool-use consistency and output-guardrail patterns
- Enterprise secret management baseline
4. Synthesized critical design tradeoffs and failure modes.
## Internal Karpathy Wiki Utilization
Used internal wiki pages as grounding context before conclusions:
- `wiki/concepts/model-routing.md`
- `wiki/concepts/context-flow.md`
- `wiki/sources/karpathy-llm-wiki-pattern.md`
- `WIKI.md` schema conventions and ingest/query/lint flow
## External Research (Cutting-edge references)
### Structured output and deterministic schema conformance
- OpenAI: Structured Outputs + strict JSON schema + refusal handling + schema subset constraints.
- https://openai.com/index/introducing-structured-outputs-in-the-api/
- https://developers.openai.com/api/docs/guides/structured-outputs
- https://developers.openai.com/api/docs/guides/function-calling
- Google Gemini: response_json_schema + schema subset + semantic validation caveat.
- https://ai.google.dev/gemini-api/docs/structured-output
- Vertex AI MaaS: structured output support for open models and JSON-mode caveat.
- https://docs.cloud.google.com/vertex-ai/generative-ai/docs/maas/capabilities/structured-output
### Consistency and guardrail patterns
- Anthropic: strict tool use and output consistency guardrail patterns.
- https://platform.claude.com/docs/en/docs/agents-and-tools/tool-use/overview
- https://platform.claude.com/docs/en/docs/test-and-evaluate/strengthen-guardrails/increase-consistency
Suggested replacement:
> Condensed to satisfy the repository Markdown lint limit (`≤100` lines per file).
## Summary Table
| Area | Assessment | Recommendation |
|---|---|---|
| Table schema | Wide tables reduce scanability and can overflow | Keep core columns only; move extra fields to expandable row details |
| Auto-update orchestration | Multi-LLM review improves quality but adds cost and failure modes | Start with one strict pipeline; add consensus/reviewer stages later |
| Prompt builder | Generated prompts are appropriate if versioned and testable | Pin schema versions and add golden tests |
| Validation | Schema-only retries miss semantic errors | Use schema validation plus semantic validators |
| Review layer | Reviewer-only gating is insufficient | Keep deterministic policy checks as final gate |
| Secret editing UX | Unlock flows can leak risky session state | Default to locked state, short TTL, and audit logging |
| Credential storage | Vault is the right baseline | Use Vault KV v2 with least-privilege access and rotation |
| Commit policy | Full automation is risky for structural changes | Auto-commit low-risk deltas; require PRs for schema/table changes |
## Actions Performed
1. Reviewed repo documentation and wiki conventions.
2. Reviewed internal routing/context/governance references.
3. Compared current design against structured-output, guardrail, and secret-management guidance.
4. Synthesized failure modes and rollout recommendations.
## Internal References
- `wiki/concepts/model-routing.md`
- `wiki/concepts/context-flow.md`
- `wiki/sources/karpathy-llm-wiki-pattern.md`
- `WIKI.md`
## External Research
### Structured output
- OpenAI: https://openai.com/index/introducing-structured-outputs-in-the-api/
- OpenAI docs: https://developers.openai.com/api/docs/guides/structured-outputs
- OpenAI function calling: https://developers.openai.com/api/docs/guides/function-calling
- Gemini structured output: https://ai.google.dev/gemini-api/docs/structured-output
- Vertex AI MaaS structured output: https://docs.cloud.google.com/vertex-ai/generative-ai/docs/maas/capabilities/structured-output
### Guardrails and consistency
- Anthropic tool use: https://platform.claude.com/docs/en/docs/agents-and-tools/tool-use/overview
- Anthropic consistency guardrails: https://platform.claude.com/docs/en/docs/test-and-evaluate/strengthen-guardrails/increase-consistency

Comment on lines +1 to +8
# Critical Analysis — Fleet Resource Table Design
**Model**: Claude Sonnet 4.6 | **Date**: 2026-04-20
**Ref**: GPT-5.3-Codex analysis (same folder), GH Epic #331

---

## Grounding Sources


Copilot AI Apr 21, 2026


This new markdown file is >100 lines, but the repo’s lint script enforces a hard ≤100-line limit for .md files. Split this document into smaller parts so CI lint passes.

Comment on lines +89 to +105
// Retry on transient failures (gh CLI sometimes returns null)
const maxAttempts = 3;
let summary = null;
for (let attempt = 1; attempt <= maxAttempts; attempt++) {
  try {
    summary = getSummary();
  } catch (e) {
    summary = null;
  }
  if (summary && Object.keys(summary).length) break;
  // exponential-ish backoff
  await new Promise(r => setTimeout(r, 150 * attempt));
}
if (!summary) return jsonRes(res, 503, { error: 'github_unavailable', message: 'GitHub API unavailable or unauthenticated (gh CLI).' });
return jsonRes(res, 200, summary);
} catch (e) {
  return jsonRes(res, 500, { error: e.message });
Copilot AI Apr 21, 2026


This retry can significantly increase event-loop blocking because getSummary() uses multiple execSync calls, so retrying up to three times can leave the dashboard server unresponsive to other requests for much longer under failure conditions. Prefer making the github-api calls async (spawn/exec with callbacks), or at least skip retries when the failure mode is likely persistent (e.g., missing auth), log the failure, and return 503 quickly.

Suggested change

```diff
-// Retry on transient failures (gh CLI sometimes returns null)
-const maxAttempts = 3;
-let summary = null;
-for (let attempt = 1; attempt <= maxAttempts; attempt++) {
-  try {
-    summary = getSummary();
-  } catch (e) {
-    summary = null;
-  }
-  if (summary && Object.keys(summary).length) break;
-  // exponential-ish backoff
-  await new Promise(r => setTimeout(r, 150 * attempt));
-}
-if (!summary) return jsonRes(res, 503, { error: 'github_unavailable', message: 'GitHub API unavailable or unauthenticated (gh CLI).' });
-return jsonRes(res, 200, summary);
-} catch (e) {
-  return jsonRes(res, 500, { error: e.message });
+const summary = getSummary();
+if (!summary || !Object.keys(summary).length) {
+  return jsonRes(res, 503, { error: 'github_unavailable', message: 'GitHub API unavailable or unauthenticated (gh CLI).' });
+}
+return jsonRes(res, 200, summary);
+} catch (e) {
+  console.warn('GitHub summary unavailable:', e.message);
+  return jsonRes(res, 503, { error: 'github_unavailable', message: 'GitHub API unavailable or unauthenticated (gh CLI).' });
```
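
If retries are kept, a non-blocking variant can be sketched with an async helper. `retryAsync` is a hypothetical name; in the dashboard it would wrap a promisified `child_process.execFile('gh', ...)` call rather than the execSync-based getSummary():

```javascript
// Hypothetical sketch: retry an *async* operation with backoff so the
// event loop stays free between attempts, instead of blocking on execSync.
async function retryAsync(fn, { attempts = 3, baseDelayMs = 150 } = {}) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      const result = await fn();
      if (result && Object.keys(result).length) return result; // success
    } catch (e) {
      // swallow and fall through to backoff; a persistent failure mode
      // (e.g., missing auth) should ideally short-circuit here instead
    }
    if (attempt < attempts) {
      await new Promise(r => setTimeout(r, baseDelayMs * attempt));
    }
  }
  return null; // caller maps this to a 503
}
```

Because each attempt awaits rather than blocks, other dashboard requests keep being served during the backoff windows.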

Comment on lines +40 to +45
```js
if (gh && gh.error) return `<div class="gh-error">
<span style="font-size:1.5rem">⚠️</span>
<p>GitHub data unavailable</p>
<p style="font-size:0.72rem;color:var(--text-muted)">${esc(gh.message || gh.error)}</p>
<button class="gh-retry">Retry</button>
</div>`;
```
Copilot AI Apr 21, 2026


The error-state HTML renders a “Retry” button (.gh-retry), but there is no click handler anywhere in the dashboard JS, so the button does nothing. Either wire up an event handler (e.g., clear _ghCache and force a refresh cycle on click) or remove the button to avoid a dead control.
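
One way to wire the button, sketched with hypothetical names (`makeRetryHandler`, `state.ghCache`, `refresh`) so the cache-clearing logic stays testable outside a browser:

```javascript
// Hypothetical sketch: a click handler for the .gh-retry button that
// drops the cached GitHub summary and forces an immediate refresh.
// In the dashboard it would be attached via event delegation, e.g.:
//   document.addEventListener('click', e => {
//     if (e.target.closest('.gh-retry')) handler();
//   });
function makeRetryHandler(state, refresh) {
  return function onRetryClick() {
    state.ghCache = null; // drop the stale error payload
    refresh();            // kick off a refresh cycle right away
  };
}
```

Delegating from `document` also survives the panel being re-rendered, since the button element itself is replaced on every render.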

```js
openrouter_gemma4b: chat('https://openrouter.ai/api/v1/chat/completions', process.env.OPENROUTER_API_KEY, 'google/gemma-3-4b-it:free'),
openrouter_nemotron: chat('https://openrouter.ai/api/v1/chat/completions', process.env.OPENROUTER_API_KEY, 'nvidia/nemotron-3-super-120b-a12b:free'),
openrouter_llama70b: chat('https://openrouter.ai/api/v1/chat/completions', process.env.OPENROUTER_API_KEY, 'meta-llama/llama-3.3-70b-instruct:free'),
openrouter_hermes405b: chat('https://openrouter.ai/api/v1/chat/completions', process.env.OPENROUTER_API_KEY, 'nvidia/nemotron-3-nano-30b-a3b:free'),
```
Copilot AI Apr 21, 2026


Provider key openrouter_hermes405b is configured with model nvidia/nemotron-3-nano-30b-a3b:free, which doesn’t match the key/label expectations elsewhere (e.g., matrix labels treat this as a Hermes 405B entry). This mismatch will make run results hard to interpret and can cause the “best-of” aggregator to attribute scores to the wrong model. Please align the provider key, displayed label, and actual model id (either rename the key or change the model).

Suggested change

```diff
-openrouter_hermes405b: chat('https://openrouter.ai/api/v1/chat/completions', process.env.OPENROUTER_API_KEY, 'nvidia/nemotron-3-nano-30b-a3b:free'),
+openrouter_hermes405b: chat('https://openrouter.ai/api/v1/chat/completions', process.env.OPENROUTER_API_KEY, 'nousresearch/hermes-3-llama-3.1-405b:free'),
```
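
A lightweight consistency check could catch this kind of drift before a run. The heuristic below (match the key's trailing family/size hint against the model id) is illustrative, not an existing script in this repo:

```javascript
// Hypothetical sketch: flag provider keys whose family/size hint
// (e.g. "hermes405b") does not appear in the configured model id.
function mismatchedKeys(providers) {
  return Object.entries(providers)
    .filter(([key, model]) => {
      const m = key.match(/_([a-z]+)(\d+b)?$/); // "_hermes405b" -> "hermes", "405b"
      if (!m) return false;
      const [, family, size] = m;
      const id = model.toLowerCase().replace(/-/g, '');
      return !id.includes(family) || (size !== undefined && !id.includes(size));
    })
    .map(([key]) => key);
}
```

Running this against the provider map at startup (or in the unit tests) would flag `openrouter_hermes405b` as soon as its model id stops matching its label.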

Successfully merging this pull request may close: Document router policy spec from matrix analysis (#341).