
feat: ensemble decision pipeline — panel, arbiter, judge, ChromaDB feedback#721

Open
Z0mb13V1 wants to merge 1 commit into mindcraft-bots:develop from Z0mb13V1:feat/pr2-ensemble-pipeline

Conversation

Z0mb13V1 commented Mar 4, 2026

Ensemble Decision Pipeline

Adds a 4-model parallel consensus pipeline that improves action quality by aggregating multiple LLM opinions before acting.

Architecture

```
User input → Panel (4x parallel LLMs) → Arbiter (heuristic) → [Judge] → Action
                 ↑                                                │
                 └──────────────── ChromaDB feedback ─────────────┘
```

New Files

| File | Description |
| --- | --- |
| `src/ensemble/panel.js` | Queries 4 LLMs in parallel, collects candidate actions |
| `src/ensemble/arbiter.js` | Heuristic scoring; escalates to judge when margin < 0.08 |
| `src/ensemble/judge.js` | Gemini Flash LLM judge for close decisions (10s timeout) |
| `src/ensemble/feedback.js` | ChromaDB feedback loop; embeds context, retrieves past decisions (similarity > 0.6), injects them as `[PAST EXPERIENCE]` |
| `src/ensemble/controller.js` | Orchestrates Panel → Arbiter → Judge flow |
| `src/ensemble/logger.js` | Structured decision logging |
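The arbiter row above (escalate to the judge when the margin is below 0.08) can be sketched as follows. This is a minimal illustration, not the contents of `arbiter.js`: the `pick()` helper and flat `score` field are hypothetical stand-ins, and the real arbiter also applies majority bonuses and latency penalties.

```javascript
// Sketch of margin-based judge escalation (illustrative only).
const CONFIDENCE_THRESHOLD = 0.08; // margin below which the judge is consulted

function pick(proposals) {
    // Sort candidate proposals by heuristic score, best first.
    const ranked = [...proposals].sort((a, b) => b.score - a.score);
    const winner = ranked[0];
    // The gap between the top two scores acts as a confidence signal.
    const margin = ranked.length > 1 ? winner.score - ranked[1].score : 1.0;
    return { winner, margin, needsJudge: margin < CONFIDENCE_THRESHOLD };
}

// Example: two nearly tied proposals trigger escalation.
const result = pick([
    { agentId: 'a', score: 0.71 },
    { agentId: 'b', score: 0.68 },
]);
// result.needsJudge === true (margin of roughly 0.03 is below 0.08)
```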

Why This Matters

Single-model Mindcraft bots frequently make locally optimal but globally poor decisions. Ensemble voting reduces the hallucination rate and improves long-horizon task completion.

Copilot AI review requested due to automatic review settings March 4, 2026 01:01

Copilot AI left a comment


Pull request overview

This PR introduces an “ensemble decision pipeline” intended to act as a drop-in replacement for the existing single-LLM chat_model, by running multiple LLMs in parallel (Panel), selecting via a heuristic scorer (Arbiter), optionally escalating close calls to an LLM judge (Judge), and persisting/querying past decisions via ChromaDB (Feedback), with structured logging.

Changes:

  • Added Panel to query multiple configured models concurrently with per-model timeouts and command extraction.
  • Added Arbiter heuristic scoring + “low confidence” signal for judge escalation.
  • Added LLMJudge, FeedbackCollector (ChromaDB), and EnsembleLogger, orchestrated by EnsembleModel in controller.js.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
src/ensemble/panel.js Parallel panel querying with timeouts; extracts command + pre-command text from responses.
src/ensemble/arbiter.js Heuristic scoring, majority bonus, latency penalty, and low-confidence thresholding.
src/ensemble/judge.js LLM-based judge used when arbiter confidence is low.
src/ensemble/feedback.js ChromaDB-backed vector “experience” storage + similarity retrieval.
src/ensemble/controller.js Orchestrates Panel → Arbiter → (optional) Judge flow; injects “past experiences”; aggregates usage.
src/ensemble/logger.js Persists decision logs and basic stats under ./bots/<agent>/ensemble_log.json.


Comment on lines +10 to +16

```js
constructor(config = {}) {
    this.strategy = config.strategy || 'heuristic';
    this.majorityBonus = config.majority_bonus ?? 0.2;
    this.latencyPenalty = config.latency_penalty_per_sec ?? 0.02;
    this._confidenceThreshold = config.confidence_threshold ?? 0.08;
    this._lastConfidence = 1.0; // set after each pick()
}
```
Copilot AI Mar 4, 2026

config.strategy is stored on the instance but never used by the arbiter logic. If strategy selection is planned, implement it; otherwise remove it to avoid dead config options and reduce maintenance overhead.

Comment on lines +1 to +5

```js
import { ChromaClient } from 'chromadb';

const COLLECTION_NAME = 'ensemble_memory';
const CHROMADB_URL = process.env.CHROMADB_URL || 'http://localhost:8000';
```
Copilot AI Mar 4, 2026

chromadb is imported at module top-level, but the dependency is not present in package.json. This will throw on startup even if ChromaDB is meant to be optional. Consider adding chromadb to dependencies, or switching to a lazy/dynamic import inside _initAsync() so the app can run without it.
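A lazy dynamic import keeps the app bootable without the package. The sketch below assumes feedback is best-effort; the `initChromaClient` helper and null-client convention are illustrative, not the PR's actual `_initAsync()`.

```javascript
// Sketch: lazily load chromadb so a missing dependency disables the
// feedback loop instead of crashing the whole app at import time.
async function initChromaClient(url) {
    try {
        const { ChromaClient } = await import('chromadb'); // resolved only when needed
        return new ChromaClient({ path: url });
    } catch (err) {
        console.warn('[Ensemble] chromadb unavailable, feedback disabled:', err.message);
        return null; // callers must treat a null client as "feedback off"
    }
}
```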

Comment on lines +15 to +18

```js
export class EnsembleModel {
    static prefix = 'ensemble';

    /**
```
Copilot AI Mar 4, 2026

EnsembleModel declares static prefix = 'ensemble', but it lives under src/ensemble/ and is not discoverable by src/models/_model_map.js (which only auto-loads models from src/models/). As-is, selectAPI/createModel cannot construct this model from a profile, so it isn't actually a drop-in chat_model replacement. Consider moving/exporting this model from src/models/ (e.g., src/models/ensemble.js) or extending _model_map.js to include it.

Comment on lines +29 to +36

```js
 * @param {Proposal[]} proposals - successful proposals only
 * @param {string} systemMessage - the original system prompt (abbreviated)
 * @param {Array} turns - last few conversation turns for context
 * @returns {Promise<string|null>} winning agentId, or null if judge fails
 */
async judge(proposals, systemMessage, turns) {
    if (proposals.length === 0) return null;
    if (proposals.length === 1) return proposals[0].agentId;
```
Copilot AI Mar 4, 2026

The systemMessage parameter is documented and passed into judge(), but it's not used anywhere in the method. Either incorporate it into the judge prompt (if intended) or remove the parameter to avoid confusion and keep the API honest.


```js
// Phase 2: LLM judge fallback when heuristic confidence is low
if (this.judge && this.arbiter.isLowConfidence() && successful.length >= 2) {
    console.log(`[Ensemble] Low confidence (margin=${this.arbiter._lastConfidence.toFixed(3)}), consulting LLM judge...`);
```
Copilot AI Mar 4, 2026

This log line reaches into this.arbiter._lastConfidence directly. Since _lastConfidence is effectively internal state, consider exposing it via a getter (e.g., lastConfidence) to avoid controller↔arbiter coupling on a private field name.

Suggested change

```diff
- console.log(`[Ensemble] Low confidence (margin=${this.arbiter._lastConfidence.toFixed(3)}), consulting LLM judge...`);
+ console.log('[Ensemble] Low confidence detected, consulting LLM judge...');
```
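Alternatively, exposing the margin through a read-only getter keeps the log line informative without the private-field coupling. A minimal sketch follows; this `Arbiter` stub models only the field in question and is not the PR's actual class.

```javascript
// Sketch: expose the arbiter's last margin via a read-only getter so
// callers never touch the private _lastConfidence field directly.
class Arbiter {
    constructor() {
        this._lastConfidence = 1.0; // updated after each pick()
    }
    get lastConfidence() {
        return this._lastConfidence;
    }
}

const arbiter = new Arbiter();
arbiter._lastConfidence = 0.042; // simulate a close decision
console.log(`[Ensemble] Low confidence (margin=${arbiter.lastConfidence.toFixed(3)}), consulting LLM judge...`);
```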

```js
 * @property {number} latencyMs - Time taken for this model's response
 * @property {string} status - "success" | "error" | "timeout"
 * @property {string|null} error - Error message if status !== "success"
 * @property {number|null} score - Set by Arbiter
```
Copilot AI Mar 4, 2026

The Proposal typedef doesn't mention the usage field, but _queryMember() returns usage: member.model._lastUsage || null. Either add usage to the typedef or remove it from the returned object to keep the shape consistent/documented.

Suggested change

```diff
  * @property {number|null} score - Set by Arbiter
+ * @property {string|null} usage - Last recorded usage information for this model, or null
```


```js
// Log decision
if (this.logDecisions) {
    this.logger.logDecision(proposals, winner);
```
Copilot AI Mar 4, 2026

logDecision() is async but is called without await/.catch(). If the logger's _ready promise rejects (e.g., mkdir fails), this becomes an unhandled promise rejection. Either await the call, or explicitly handle it with .catch()/void plus internal error handling to keep logging failures from surfacing.

Suggested change

```diff
- this.logger.logDecision(proposals, winner);
+ try {
+     await this.logger.logDecision(proposals, winner);
+ } catch (err) {
+     console.warn('[Ensemble] Failed to log decision:', err);
+ }
```

Comment on lines +23 to +66

```js
const successful = allProposals.filter(p => p.status === 'success');
const commands = successful.map(p => p.command).filter(Boolean);
const uniqueCommands = [...new Set(commands)];
const agreement = uniqueCommands.length <= 1 && commands.length > 0
    ? 1.0
    : commands.length > 0
        ? Math.max(...uniqueCommands.map(c => commands.filter(x => x === c).length)) / commands.length
        : 0;

const entry = {
    timestamp: new Date().toISOString(),
    decision_id: this.decisionCount,
    proposals: allProposals.map(p => ({
        agent_id: p.agentId,
        model: p.modelName,
        status: p.status,
        command: p.command || null,
        pre_text: p.preCommandText ? p.preCommandText.slice(0, 100) : '',
        score: p.score ?? null,
        latency_ms: p.latencyMs,
        error: p.error || null
    })),
    winner: winner ? {
        agent_id: winner.agentId,
        command: winner.command,
        score: winner.score,
        reason: winner.winReason || 'highest_score'
    } : null,
    majority_command: this._findMajority(commands),
    panel_agreement: Math.round(agreement * 100) / 100
};

await this._ready;
let log = await this._readLog();
log.push(entry);

if (log.length > MAX_ENTRIES) {
    log = log.slice(log.length - TRIM_TO);
}

try {
    await writeFile(this.filePath, JSON.stringify(log, null, 2));
} catch (err) {
    console.error(`[Ensemble] Failed to write log: ${err.message}`);
```
Copilot AI Mar 4, 2026

logDecision() awaits this._ready but doesn't wrap the overall method in a try/catch. If directory creation fails, await this._ready will reject and propagate to the caller. Since logging is typically best-effort, consider catching _ready/file I/O errors so logDecision() never rejects.

Suggested change (wrap the entire method body in a try/catch so `logDecision()` never rejects):

```js
try {
    // ...entire logDecision() body as quoted above...
} catch (err) {
    console.error(`[Ensemble] Failed to log decision: ${err.message}`);
}
```

Comment on lines +66 to +72

```js
try {
    const result = await Promise.race([
        model.sendRequest(judgeTurns, judgeSystem),
        new Promise((_, reject) =>
            setTimeout(() => reject(new Error('judge timeout')), this.timeoutMs)
        )
    ]);
```
Copilot AI Mar 4, 2026

The judge timeout uses setTimeout(...) inside Promise.race, but the timer isn't cleared when the model responds first. This keeps extra timers alive and can accumulate under load. Consider keeping the timer handle and clearTimeout() after the race settles (similar to Panel._queryMember).
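One way to apply the suggestion is to keep the timer handle and clear it once the race settles. The sketch below is illustrative; `raceWithTimeout` is a hypothetical helper, not code from the PR.

```javascript
// Sketch: race a request against a timeout, then clear the timer in a
// finally block so fast responses don't leave stray timers alive.
async function raceWithTimeout(promise, timeoutMs) {
    let timer;
    const timeout = new Promise((_, reject) => {
        timer = setTimeout(() => reject(new Error('judge timeout')), timeoutMs);
    });
    try {
        return await Promise.race([promise, timeout]);
    } finally {
        clearTimeout(timer); // runs whether the request won or timed out
    }
}
```

The judge call would then read `await raceWithTimeout(model.sendRequest(judgeTurns, judgeSystem), this.timeoutMs)`.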
