feat: ensemble decision pipeline — panel, arbiter, judge, ChromaDB feedback (#721)
Z0mb13V1 wants to merge 1 commit into mindcraft-bots:develop
Conversation
Adds a 4-model parallel decision pipeline to replace single-model responses:
- src/ensemble/panel.js: queries 4 LLMs in parallel, collects candidate actions
- src/ensemble/arbiter.js: heuristic scoring with 0.08-margin escalation to the judge
- src/ensemble/judge.js: Gemini Flash LLM judge for close decisions (10s timeout)
- src/ensemble/feedback.js: ChromaDB feedback loop — embeds context, retrieves similar past decisions (similarity > 0.6), injects them as [PAST EXPERIENCE]
- src/ensemble/controller.js: orchestrates the full Panel -> Arbiter -> Judge flow
- src/ensemble/logger.js: structured ensemble decision logging
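The feedback injection described above can be sketched roughly as follows. The 0.6 similarity threshold and the [PAST EXPERIENCE] tag come from the description; `injectPastExperience` and the shape of `matches` are illustrative assumptions, not code from the PR.

```javascript
// Hypothetical sketch: prepend summaries of sufficiently similar past
// decisions to the system prompt before querying the panel.
function injectPastExperience(systemMessage, matches, minSimilarity = 0.6) {
    // keep only retrievals above the similarity threshold
    const relevant = matches.filter(m => m.similarity > minSimilarity);
    if (relevant.length === 0) return systemMessage;
    const notes = relevant.map(m => `- ${m.summary}`).join('\n');
    return `[PAST EXPERIENCE]\n${notes}\n\n${systemMessage}`;
}
```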
Pull request overview
This PR introduces an “ensemble decision pipeline” intended as a drop-in replacement for the existing single-LLM chat_model. It runs multiple LLMs in parallel (Panel), selects a winner with a heuristic scorer (Arbiter), optionally escalates close calls to an LLM judge (Judge), and persists/queries past decisions via ChromaDB (Feedback), with structured logging throughout.
Changes:
- Added Panel to query multiple configured models concurrently with per-model timeouts and command extraction.
- Added Arbiter heuristic scoring and a “low confidence” signal for judge escalation.
- Added LLMJudge, FeedbackCollector (ChromaDB), and EnsembleLogger, orchestrated by EnsembleModel in controller.js.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| src/ensemble/panel.js | Parallel panel querying with timeouts; extracts command + pre-command text from responses. |
| src/ensemble/arbiter.js | Heuristic scoring, majority bonus, latency penalty, and low-confidence thresholding. |
| src/ensemble/judge.js | LLM-based judge used when arbiter confidence is low. |
| src/ensemble/feedback.js | ChromaDB-backed vector “experience” storage + similarity retrieval. |
| src/ensemble/controller.js | Orchestrates Panel → Arbiter → (optional) Judge flow; injects “past experiences”; aggregates usage. |
| src/ensemble/logger.js | Persists decision logs and basic stats under ./bots/<agent>/ensemble_log.json. |
```js
constructor(config = {}) {
    this.strategy = config.strategy || 'heuristic';
    this.majorityBonus = config.majority_bonus ?? 0.2;
    this.latencyPenalty = config.latency_penalty_per_sec ?? 0.02;
    this._confidenceThreshold = config.confidence_threshold ?? 0.08;
    this._lastConfidence = 1.0; // set after each pick()
}
```
config.strategy is stored on the instance but never used by the arbiter logic. If strategy selection is planned, implement it; otherwise remove it to avoid dead config options and reduce maintenance overhead.
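If strategy selection is indeed planned, one minimal way to make the option meaningful is to dispatch on it inside pick(). This is a sketch only; `_pickByScore` and `_pickByMajority` are hypothetical helper names, not part of the PR.

```javascript
// Hypothetical sketch: make config.strategy drive winner selection.
class Arbiter {
    constructor(config = {}) {
        this.strategy = config.strategy || 'heuristic';
    }

    pick(proposals) {
        switch (this.strategy) {
            case 'majority':
                return this._pickByMajority(proposals);
            case 'heuristic':
            default:
                return this._pickByScore(proposals);
        }
    }

    // highest arbiter score wins
    _pickByScore(proposals) {
        return proposals.reduce((best, p) =>
            (p.score ?? 0) > (best.score ?? 0) ? p : best);
    }

    // most common extracted command wins
    _pickByMajority(proposals) {
        const counts = new Map();
        for (const p of proposals) {
            counts.set(p.command, (counts.get(p.command) ?? 0) + 1);
        }
        const top = [...counts.entries()].sort((a, b) => b[1] - a[1])[0][0];
        return proposals.find(p => p.command === top);
    }
}
```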
```js
import { ChromaClient } from 'chromadb';

const COLLECTION_NAME = 'ensemble_memory';
const CHROMADB_URL = process.env.CHROMADB_URL || 'http://localhost:8000';
```
chromadb is imported at module top-level, but the dependency is not present in package.json. This will throw on startup even if ChromaDB is meant to be optional. Consider adding chromadb to dependencies, or switching to a lazy/dynamic import inside _initAsync() so the app can run without it.
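A lazy-import variant along these lines would keep ChromaDB optional. This is a sketch under assumptions: the `enabled` flag, the warning text, and the constructor shape are illustrative, not the PR's actual code.

```javascript
const CHROMADB_URL = process.env.CHROMADB_URL || 'http://localhost:8000';

class FeedbackCollector {
    constructor(url = CHROMADB_URL) {
        this.url = url;
        this.client = null;
        this.enabled = false; // stays false when chromadb is not installed
    }

    async _initAsync() {
        try {
            // dynamic import: a missing package fails here, not at app startup
            const { ChromaClient } = await import('chromadb');
            this.client = new ChromaClient({ path: this.url });
            this.enabled = true;
        } catch (err) {
            console.warn(`[Ensemble] chromadb unavailable, feedback disabled: ${err.message}`);
        }
    }
}
```

Callers can then check `this.enabled` before storing or querying experiences, so the rest of the pipeline runs unchanged without the dependency.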
```js
export class EnsembleModel {
    static prefix = 'ensemble';

    /**
```
EnsembleModel declares static prefix = 'ensemble', but it lives under src/ensemble/ and is not discoverable by src/models/_model_map.js (which only auto-loads models from src/models/). As-is, selectAPI/createModel cannot construct this model from a profile, so it isn't actually a drop-in chat_model replacement. Consider moving/exporting this model from src/models/ (e.g., src/models/ensemble.js) or extending _model_map.js to include it.
```js
/**
 * @param {Proposal[]} proposals - successful proposals only
 * @param {string} systemMessage - the original system prompt (abbreviated)
 * @param {Array} turns - last few conversation turns for context
 * @returns {Promise<string|null>} winning agentId, or null if judge fails
 */
async judge(proposals, systemMessage, turns) {
    if (proposals.length === 0) return null;
    if (proposals.length === 1) return proposals[0].agentId;
```
The systemMessage parameter is documented and passed into judge(), but it's not used anywhere in the method. Either incorporate it into the judge prompt (if intended) or remove the parameter to avoid confusion and keep the API honest.
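If the intent is to use the parameter, one small sketch is to fold an abbreviated copy into the judge's instructions. `buildJudgeSystem`, the wording, and the 400-character cap are assumptions for illustration.

```javascript
// Hypothetical helper: include the (abbreviated) original system prompt in
// the judge's system message so task context informs the verdict.
function buildJudgeSystem(systemMessage, maxChars = 400) {
    const context = systemMessage ? systemMessage.slice(0, maxChars) : '';
    return [
        "You are a judge. Pick the proposal that best fits the agent's task.",
        context && `Original system prompt (abbreviated): ${context}`
    ].filter(Boolean).join('\n');
}
```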
```js
// Phase 2: LLM judge fallback when heuristic confidence is low
if (this.judge && this.arbiter.isLowConfidence() && successful.length >= 2) {
    console.log(`[Ensemble] Low confidence (margin=${this.arbiter._lastConfidence.toFixed(3)}), consulting LLM judge...`);
```
This log line reaches into this.arbiter._lastConfidence directly. Since _lastConfidence is effectively internal state, consider exposing it via a getter (e.g., lastConfidence) to avoid controller↔arbiter coupling on a private field name.
Suggested change:
```diff
- console.log(`[Ensemble] Low confidence (margin=${this.arbiter._lastConfidence.toFixed(3)}), consulting LLM judge...`);
+ console.log('[Ensemble] Low confidence detected, consulting LLM judge...');
```
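A minimal getter would look like this (a sketch; `pick()` is assumed to update `_lastConfidence` elsewhere, as the constructor comment suggests):

```javascript
class Arbiter {
    constructor(config = {}) {
        this._confidenceThreshold = config.confidence_threshold ?? 0.08;
        this._lastConfidence = 1.0; // updated by pick()
    }

    // public, read-only view of the private margin
    get lastConfidence() {
        return this._lastConfidence;
    }

    isLowConfidence() {
        return this._lastConfidence < this._confidenceThreshold;
    }
}
```

The controller could then log `this.arbiter.lastConfidence.toFixed(3)` without depending on the underscored field name.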
```js
 * @property {number} latencyMs - Time taken for this model's response
 * @property {string} status - "success" | "error" | "timeout"
 * @property {string|null} error - Error message if status !== "success"
 * @property {number|null} score - Set by Arbiter
```
The Proposal typedef doesn't mention the usage field, but _queryMember() returns usage: member.model._lastUsage || null. Either add usage to the typedef or remove it from the returned object to keep the shape consistent/documented.
Suggested change:
```diff
  * @property {number|null} score - Set by Arbiter
+ * @property {string|null} usage - Last recorded usage information for this model, or null
```
```js
// Log decision
if (this.logDecisions) {
    this.logger.logDecision(proposals, winner);
```
logDecision() is async but is called without await/.catch(). If the logger's _ready promise rejects (e.g., mkdir fails), this becomes an unhandled promise rejection. Either await the call, or explicitly handle it with .catch()/void plus internal error handling to keep logging failures from surfacing.
Suggested change:
```diff
- this.logger.logDecision(proposals, winner);
+ try {
+     await this.logger.logDecision(proposals, winner);
+ } catch (err) {
+     console.warn('[Ensemble] Failed to log decision:', err);
+ }
```
```js
const successful = allProposals.filter(p => p.status === 'success');
const commands = successful.map(p => p.command).filter(Boolean);
const uniqueCommands = [...new Set(commands)];
const agreement = uniqueCommands.length <= 1 && commands.length > 0
    ? 1.0
    : commands.length > 0
        ? Math.max(...uniqueCommands.map(c => commands.filter(x => x === c).length)) / commands.length
        : 0;

const entry = {
    timestamp: new Date().toISOString(),
    decision_id: this.decisionCount,
    proposals: allProposals.map(p => ({
        agent_id: p.agentId,
        model: p.modelName,
        status: p.status,
        command: p.command || null,
        pre_text: p.preCommandText ? p.preCommandText.slice(0, 100) : '',
        score: p.score ?? null,
        latency_ms: p.latencyMs,
        error: p.error || null
    })),
    winner: winner ? {
        agent_id: winner.agentId,
        command: winner.command,
        score: winner.score,
        reason: winner.winReason || 'highest_score'
    } : null,
    majority_command: this._findMajority(commands),
    panel_agreement: Math.round(agreement * 100) / 100
};

await this._ready;
let log = await this._readLog();
log.push(entry);

if (log.length > MAX_ENTRIES) {
    log = log.slice(log.length - TRIM_TO);
}

try {
    await writeFile(this.filePath, JSON.stringify(log, null, 2));
} catch (err) {
    console.error(`[Ensemble] Failed to write log: ${err.message}`);
```
logDecision() awaits this._ready but doesn't wrap the overall method in a try/catch. If directory creation fails, await this._ready will reject and propagate to the caller. Since logging is typically best-effort, consider catching _ready/file I/O errors so logDecision() never rejects.
Suggested change:
```js
try {
    const successful = allProposals.filter(p => p.status === 'success');
    const commands = successful.map(p => p.command).filter(Boolean);
    const uniqueCommands = [...new Set(commands)];
    const agreement = uniqueCommands.length <= 1 && commands.length > 0
        ? 1.0
        : commands.length > 0
            ? Math.max(...uniqueCommands.map(c => commands.filter(x => x === c).length)) / commands.length
            : 0;

    const entry = {
        timestamp: new Date().toISOString(),
        decision_id: this.decisionCount,
        proposals: allProposals.map(p => ({
            agent_id: p.agentId,
            model: p.modelName,
            status: p.status,
            command: p.command || null,
            pre_text: p.preCommandText ? p.preCommandText.slice(0, 100) : '',
            score: p.score ?? null,
            latency_ms: p.latencyMs,
            error: p.error || null
        })),
        winner: winner ? {
            agent_id: winner.agentId,
            command: winner.command,
            score: winner.score,
            reason: winner.winReason || 'highest_score'
        } : null,
        majority_command: this._findMajority(commands),
        panel_agreement: Math.round(agreement * 100) / 100
    };

    await this._ready;
    let log = await this._readLog();
    log.push(entry);

    if (log.length > MAX_ENTRIES) {
        log = log.slice(log.length - TRIM_TO);
    }

    try {
        await writeFile(this.filePath, JSON.stringify(log, null, 2));
    } catch (err) {
        console.error(`[Ensemble] Failed to write log: ${err.message}`);
    }
} catch (err) {
    console.error(`[Ensemble] Failed to log decision: ${err.message}`);
```
```js
try {
    const result = await Promise.race([
        model.sendRequest(judgeTurns, judgeSystem),
        new Promise((_, reject) =>
            setTimeout(() => reject(new Error('judge timeout')), this.timeoutMs)
        )
    ]);
```
The judge timeout uses setTimeout(...) inside Promise.race, but the timer isn't cleared when the model responds first. This keeps extra timers alive and can accumulate under load. Consider keeping the timer handle and clearTimeout() after the race settles (similar to Panel._queryMember).
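One way to do that is a small helper that keeps the timer handle and clears it in a finally block; `raceWithTimeout` is a hypothetical name for this sketch.

```javascript
// Race a promise against a timeout, clearing the timer once either settles
// so no stray timers accumulate under load.
async function raceWithTimeout(promise, timeoutMs, label = 'timeout') {
    let timer;
    const timeout = new Promise((_, reject) => {
        timer = setTimeout(() => reject(new Error(label)), timeoutMs);
    });
    try {
        return await Promise.race([promise, timeout]);
    } finally {
        clearTimeout(timer); // runs whether the model answered or timed out
    }
}
```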
Ensemble Decision Pipeline
Adds a 4-model parallel consensus pipeline that improves action quality by aggregating multiple LLM opinions before acting.
Architecture
```
User input → Panel (4x parallel LLMs) → Arbiter (heuristic) → [Judge] → Action
                        ↑
                ChromaDB feedback
```
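The flow above can be sketched end-to-end as follows. Component and method names follow the review discussion (Panel query, Arbiter pick/isLowConfidence, Judge judge); the exact signatures are assumptions.

```javascript
// Sketch of the Panel → Arbiter → (optional) Judge orchestration.
async function decide(ensemble, turns, systemMessage) {
    // 1. query all panel members in parallel
    const proposals = await ensemble.panel.query(turns, systemMessage);
    const successful = proposals.filter(p => p.status === 'success');

    // 2. heuristic winner
    let winner = ensemble.arbiter.pick(successful);

    // 3. escalate close calls to the LLM judge
    if (ensemble.judge && ensemble.arbiter.isLowConfidence() && successful.length >= 2) {
        const judgedId = await ensemble.judge.judge(successful, systemMessage, turns);
        const judged = successful.find(p => p.agentId === judgedId);
        if (judged) winner = judged; // fall back to heuristic pick if judge fails
    }
    return winner;
}
```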
New Files
- src/ensemble/panel.js
- src/ensemble/arbiter.js
- src/ensemble/judge.js
- src/ensemble/feedback.js
- src/ensemble/controller.js
- src/ensemble/logger.js
Why This Matters
Single-model Mindcraft bots frequently make locally optimal but globally poor decisions. Aggregating several models' proposals and escalating close calls to a judge is intended to reduce hallucinated actions and improve long-horizon task completion.