Multiple AI models see your screen and debate to give you one precise, verified answer.
AI Hub V2 is a Chrome Extension (Manifest V3) that opens as a side panel alongside any webpage. When you ask a question, it dispatches your query to four different AI models simultaneously, runs a structured 4-round debate protocol, and returns a single consensus answer — typically more reliable than any single model alone.
- Architecture Overview
- How It Works — The 4-Round Debate
- Project Structure
- File-by-File Breakdown
- Screen Context System
- Models Used
- Setup & Installation
- Configuration
- API Endpoints

## Architecture Overview

```
┌────────────────────────────────────────────────┐
│ Chrome Browser │
│ │
│ ┌──────────────┐ ┌────────────────────┐ │
│ │ Active Tab │◄────│ background.js │ │
│ │ (any page) │ │ (service worker) │ │
│ └──────────────┘ └─────────┬──────────┘ │
│ │ │
│ ┌───────▼──────────┐ │
│ │ Side Panel UI │ │
│ │ sidebar.html │ │
│ │ sidebar.js │ │
│ │ sidebar.css │ │
│ └───────┬──────────┘ │
└──────────────────────────────────┼──────────────┘
│ HTTP + SSE
┌────────▼──────────┐
│ Express Server │
│ server/index.js │
└────────┬──────────┘
│
┌────────▼──────────┐
│ Orchestrator │
│ orchestrator.js │
└────────┬──────────┘
│
┌────────────────────────┼────────────────────────┐
│ │ │
┌──────▼──────┐ ┌───────────▼──────────┐ ┌────────▼────────┐
│ OpenAI │ │ Cloudflare Workers │ │ Cloudflare │
│ GPT-4o │ │ Mistral · Gemma · │ │ DeepSeek R1 │
│ (+ vision) │ │ (text-only) │ │ (text-only) │
└─────────────┘ └──────────────────────┘ └─────────────────┘
```
**Data flow in one sentence:** The sidebar captures the user's question (plus an optional screenshot and page text), sends it to the Express backend via `POST /api/chat`, which runs the 4-round multi-model debate through the orchestrator and streams real-time progress back to the sidebar over Server-Sent Events (SSE).
## How It Works — The 4-Round Debate

Every user question triggers a structured deliberation pipeline. Each round serves a distinct purpose, and all models participate in parallel within each round.
### Round 1 — Independent Answers

Each model receives the user's question, page context, conversation history, and screenshot analysis. They answer independently, without seeing each other's responses. This maximizes diversity of thought and prevents groupthink.

- **Prompt role:** Independent analyst
- **Output structure:** Direct Answer → Reasoning Steps → Validation Checks → Assumptions → Confidence (0–100)
### Round 2 — Cross-Validation

Every model receives all Round 1 answers and is tasked with auditing them for factual, logical, and arithmetic errors. Models identify hallucinated or unsupported claims and produce a corrected draft.

- **Prompt role:** Critical reviewer
- **Output structure:** Error Audit Per Model → Confirmed Correct Points → Disagreements & Resolution → Corrected Draft → Remaining Risks → Confidence (0–100)
### Round 3 — Consensus

Models see both Round 1 and Round 2 outputs. Their job is to converge on a single defensible answer, keeping only claims that survived cross-validation and removing or rewriting anything with unresolved uncertainty.

- **Prompt role:** Consensus builder
- **Output structure:** Consensus Candidate Answer → Evidence for Consensus → Rejected Claims → Unresolved Issues → Sign-off Checklist → Confidence (0–100)
### Round 4 — Final Synthesis

A single synthesizer model (preferring GPT-4o) ingests the outputs from all three prior rounds and produces the final user-facing answer. The output is clean: no mention of rounds, model names, or internal process.

- **Prompt role:** Final synthesizer
- **Output:** The polished, precise, final answer to the user.
**Single-model fallback:** If only one model is configured, the system still runs a two-step process: initial answer → self-validation using the Round 4 prompt.
## Project Structure

```
AI-HUB/
├── manifest.json         # Chrome Extension manifest (V3)
├── background.js         # Service worker: screenshot capture & page text extraction
├── content.js            # Content script injected into all pages (placeholder for future features)
├── sidebar.html          # Side panel UI markup
├── sidebar.js            # Sidebar logic: chat, context handling, API calls, markdown rendering
├── sidebar.css           # Full styling for the sidebar UI
├── icons/                # Extension icons (16, 48, 128px)
└── server/
    ├── .env              # API keys (gitignored — you create this)
    ├── index.js          # Express server: CORS, health check, SSE chat endpoint
    ├── orchestrator.js   # 4-round multi-model debate engine
    ├── prompts.js        # Structured prompt templates for each round
    ├── package.json      # Node.js dependencies
    └── package-lock.json # Dependency lock file
```
## File-by-File Breakdown

### manifest.json

Declares the extension as Manifest V3 with these key permissions:
| Permission | Purpose |
|---|---|
| `sidePanel` | Opens the chat UI as a Chrome side panel |
| `activeTab` | Access to the currently active tab |
| `tabs` | Query tab metadata (title, URL) |
| `scripting` | Inject scripts into pages for content extraction |
| `storage` | Persist user settings (server URL) across sessions |
Host permissions are set to `<all_urls>` to allow screenshot capture and content extraction on any page.
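For orientation, the relevant manifest fields look roughly like this (an abbreviated, illustrative sketch; the actual manifest.json is authoritative):

```json
{
  "manifest_version": 3,
  "name": "AI Collab",
  "permissions": ["sidePanel", "activeTab", "tabs", "scripting", "storage"],
  "host_permissions": ["<all_urls>"],
  "background": { "service_worker": "background.js" },
  "side_panel": { "default_path": "sidebar.html" },
  "content_scripts": [
    { "matches": ["<all_urls>"], "js": ["content.js"], "run_at": "document_idle" }
  ]
}
```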
### background.js

Handles two message types from the sidebar:
- `CAPTURE_SCREENSHOT` — Uses `chrome.tabs.captureVisibleTab()` to take a JPEG screenshot (quality 85) of the active tab. Returns the base64 data URL along with the tab's title and URL.
- `GET_PAGE_CONTENT` — Uses `chrome.scripting.executeScript()` to inject a function into the active tab that:
  - Checks for user-selected text (>50 chars) and returns it as `type: 'selection'`
  - Otherwise extracts the main content by cloning the `<article>`, `<main>`, `[role="main"]`, or `<body>` element
  - Strips out `<script>`, `<style>`, `<nav>`, `<header>`, `<footer>`, `<aside>`, `<iframe>`, `<svg>`, and ARIA-hidden elements
  - Returns cleaned text (up to 10,000 chars) as `type: 'page'`
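A condensed sketch of the screenshot handler, assuming the message wiring described above (the exact implementation in background.js may differ):

```js
// background.js (service worker), sketched: reply to CAPTURE_SCREENSHOT messages.
chrome.runtime.onMessage.addListener((msg, sender, sendResponse) => {
  if (msg.type === 'CAPTURE_SCREENSHOT') {
    (async () => {
      const [tab] = await chrome.tabs.query({ active: true, currentWindow: true });
      // JPEG at quality 85 keeps the base64 payload well under the server's 50MB limit.
      const dataUrl = await chrome.tabs.captureVisibleTab(tab.windowId, {
        format: 'jpeg',
        quality: 85,
      });
      sendResponse({ dataUrl, title: tab.title, url: tab.url });
    })();
    return true; // keep the message channel open while the async capture completes
  }
});
```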
### content.js

A lightweight placeholder injected into all pages at `document_idle`. Currently empty — exists for future extensibility (e.g., highlight-to-ask features).
### sidebar.html

The HTML structure for the sidebar, featuring:
- **Header** — Logo, extension title ("AI Collab"), New Chat button, Settings button
- **Chat container** — Welcome state showing participating model badges (GPT-4o, Mistral 24B, Gemma 12B, DeepSeek R1), and a scrollable message area
- **Input area** — Textarea with a "Screen" context toggle button, send button, and a context status indicator
- **Settings modal** — Backend URL configuration, server connection check, and a read-only list of active models (API keys are managed server-side)
### sidebar.js

The core client-side logic, organized into these subsystems:
**State**

- `messages[]` — Full chat history (role, content, timestamp)
- `includePageContext` — Boolean toggle for screen context
- `cachedScreenContext` — Most recent captured context
- `screenContextHistory[]` — Up to 10 prior contexts for multi-turn reference
- `settings.serverUrl` — Configurable backend URL (default: `http://localhost:3001`)
**Follow-up detection**

The system detects whether a new message is a follow-up, to avoid unnecessarily re-capturing screen context:

- **Cue regex** — Matches openers like "and", "also", "why", "how about", etc.
- **Reference regex** — Matches pronouns like "it", "this", "that", "them", etc.
- **Topic overlap** — Extracts topic words (filtering stop words) and computes Jaccard overlap; ≥35% overlap → follow-up
If a question is a follow-up, the cached screen context is reused. Otherwise, a fresh capture is initiated.
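A sketch of this heuristic; the regexes and stop-word list here are illustrative stand-ins for the ones in sidebar.js:

```js
// Hypothetical follow-up detector: cue words, back-references, and topic overlap.
const STOP_WORDS = new Set(['the', 'a', 'an', 'is', 'are', 'of', 'to', 'in', 'on', 'for', 'and']);

function topicWords(text) {
  return new Set(
    text.toLowerCase().match(/[a-z]{3,}/g)?.filter(w => !STOP_WORDS.has(w)) ?? []
  );
}

function isFollowUp(question, previousQuestion) {
  // Cue words at the start of the question strongly suggest a follow-up.
  if (/^(and|also|why|how about|what about|then)\b/i.test(question.trim())) return true;
  // Bare pronouns usually refer back to the previous exchange.
  if (/\b(it|this|that|them|those)\b/i.test(question)) return true;
  // Jaccard overlap of topic words: |A ∩ B| / |A ∪ B| ≥ 0.35 means follow-up.
  const a = topicWords(question);
  const b = topicWords(previousQuestion);
  if (a.size === 0 || b.size === 0) return false;
  const intersection = [...a].filter(w => b.has(w)).length;
  const union = new Set([...a, ...b]).size;
  return intersection / union >= 0.35;
}
```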
**Screen context functions**

- `togglePageContext()` — Activates/deactivates the "Screen" button
- `captureAndCacheScreenContext()` — Captures screenshot + page text in parallel
- `resolveScreenContextForMessage()` — Decides whether to refresh or reuse context based on follow-up detection
- `buildPriorContexts()` — Attaches up to 3 most recent prior screen contexts (excluding the current one) for multi-turn visual reasoning
- `mapContextForModel()` — Normalizes a context object for the API payload, truncating text to configurable character limits
**API communication**

- Sends a `POST` to `{serverUrl}/api/chat` with JSON body `{ message, context?, priorContexts?, history }`
- History includes the last 12 messages (excluding the current one)
- Handles SSE streaming: parses `data:` lines for `progress` events (which update the thinking UI) and `result` events (which render the final answer)
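Because `EventSource` cannot issue a POST, the sidebar has to read the streamed response body directly. A minimal sketch (function names are illustrative, not the actual sidebar.js identifiers):

```js
// Send the chat request and consume the SSE stream frame by frame.
async function sendChat(serverUrl, body, onProgress) {
  const res = await fetch(`${serverUrl}/api/chat`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
  });
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    // SSE frames are separated by blank lines; payload lines start with "data:".
    const frames = buffer.split('\n\n');
    buffer = frames.pop(); // keep the trailing partial frame for the next chunk
    for (const frame of frames) {
      const line = frame.split('\n').find(l => l.startsWith('data:'));
      if (!line) continue;
      const event = JSON.parse(line.slice(5));
      if (event.type === 'progress') onProgress(event); // update thinking UI
      if (event.type === 'result') return event;        // final answer
    }
  }
}
```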
**Rendering**

- **Markdown renderer** — Converts bold, italic, headers, lists, blockquotes, links, horizontal rules, code blocks, and inline code
- **LaTeX math** — Converts `\[...\]`, `$$...$$`, and `\(...\)` notation into readable HTML with proper fractions, square roots, superscripts, subscripts, and Greek symbols
- **Collaboration details** — A toggle ("See how models collaborated") that expands to show all 4 rounds with per-model responses, color-coded by provider
- **Thinking indicator** — Animated progress bar showing which round is active (R1 Independent → R2 Validate → R3 Consensus → R4 Final)
### server/index.js

A lightweight Express server with:

- **Environment loading** — Reads `OPENAI_API_KEY`, `CLOUDFLARE_ACCOUNT_ID`, and `CLOUDFLARE_API_TOKEN` from `server/.env`
- `GET /health` — Returns server status, active providers, and model count
- `POST /api/chat` — Main endpoint. Sets up SSE headers, passes the request to the orchestrator along with a progress callback that emits `data:` events in real time, then sends the final `result` event
- **CORS** enabled globally; JSON body limit set to 50MB (to support base64 screenshots)
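A sketch of the endpoint's shape, assuming a `runDebate` export from the orchestrator (the export name and handler details are assumptions):

```js
// Sketch of server/index.js: CORS, large JSON bodies, and the SSE chat endpoint.
import express from 'express';
import cors from 'cors';
import { runDebate } from './orchestrator.js'; // hypothetical export name

const app = express();
app.use(cors());
app.use(express.json({ limit: '50mb' })); // base64 screenshots are large

app.post('/api/chat', async (req, res) => {
  // Standard SSE headers: keep the connection open and flush events as they occur.
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  const send = (event) => res.write(`data: ${JSON.stringify(event)}\n\n`);
  try {
    const result = await runDebate(req.body, (round, status) =>
      send({ type: 'progress', round, status })
    );
    send({ type: 'result', ...result });
  } catch (err) {
    send({ type: 'error', message: err.message });
  } finally {
    res.end();
  }
});

app.listen(process.env.PORT || 3001);
```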
### server/orchestrator.js

The heart of the system. Key capabilities:

`getAvailableModels()` builds the model list from the environment keys:
| Model | Provider | Vision Support |
|---|---|---|
| GPT-4o (mini) | OpenAI | ✅ Yes |
| Mistral Small 3.1 24B | Cloudflare Workers AI | ❌ No |
| Gemma 3 12B | Cloudflare Workers AI | ❌ No |
| DeepSeek R1 Distill 32B | Cloudflare Workers AI | ❌ No |
Since only GPT-4o supports multimodal inputs, the orchestrator:
- Pre-analyzes each screenshot via `callOpenAIVision()` to produce a textual summary of on-screen content (UI layout, exact text, key entities, user intent, ambiguities)
- Caches summaries using a hash-based cache (`visionSummaryCache`, max 100 entries) to avoid redundant API calls
- Injects the text summary into prompts for text-only models (the Cloudflare models) so they also understand what's on screen
- Sends raw screenshots directly to vision-capable models (GPT-4o) as `image_url` content parts, alongside the text summary
This dual strategy ensures all models have access to visual context, regardless of whether they natively support images.
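A sketch of the summarize-and-cache step; `callOpenAIVision` is treated as given, while the cache keying and eviction details are assumptions:

```js
// Hashed vision-summary cache, sketched (variable names are illustrative).
import crypto from 'node:crypto';

const visionSummaryCache = new Map(); // screenshot hash -> summary text
const MAX_CACHE_ENTRIES = 100;

async function enrichContextWithVision(context, callOpenAIVision) {
  if (!context?.screenshot) return context;
  const hash = crypto.createHash('sha256').update(context.screenshot).digest('hex');
  let summary = visionSummaryCache.get(hash);
  if (!summary) {
    // One vision call per unique screenshot; text-only models reuse the summary.
    summary = await callOpenAIVision(
      context.screenshot,
      'Describe the UI layout, exact visible text, key entities, and likely user intent.'
    );
    if (visionSummaryCache.size >= MAX_CACHE_ENTRIES) {
      // Maps iterate in insertion order, so this evicts the oldest entry.
      visionSummaryCache.delete(visionSummaryCache.keys().next().value);
    }
    visionSummaryCache.set(hash, summary);
  }
  return { ...context, screenSummary: summary };
}
```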
Each round (1–4) follows the same pattern:
- Build the round-specific prompt via `prompts.js`
- Call all available models in parallel via `Promise.allSettled()`
- Collect successful responses, gracefully ignoring failures
- Emit a progress event to the SSE stream
- Pass responses to the next round's prompt builder
The final synthesis (Round 4) is handled by a single designated model (preferring GPT-4o), not all models in parallel.
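The fan-out pattern, in sketch form (assumed structure; `model.call` stands in for the per-provider request functions):

```js
// One debate round, sketched: same prompt to every model, tolerate failures.
async function runRound(roundNumber, models, prompt, onProgress) {
  onProgress(roundNumber, `Round ${roundNumber} in progress...`);
  const settled = await Promise.allSettled(models.map(m => m.call(prompt)));
  // Keep only fulfilled responses, tagged with the model that produced them;
  // a single provider outage degrades the debate instead of aborting it.
  return settled.flatMap((r, i) =>
    r.status === 'fulfilled' ? [{ model: models[i].name, text: r.value }] : []
  );
}
```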
### server/prompts.js

Structured prompt builder with a shared `SYSTEM_IDENTITY` preamble enforcing:
- Prioritize correctness over fluency
- Never hallucinate or invent facts
- Keep equations human-readable
- Show explicit logic for non-trivial conclusions
Each round function (`round1` through `round4`) assembles a prompt from:

- System identity + round-specific role
- Conversation history (formatted as `User:`/`Assistant:` pairs)
- Current screen context (title, URL, extracted text, screenshot analysis)
- Prior screen contexts (up to 5 most recent)
- Prior round responses (for rounds 2–4)
- Structured output requirements
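The general shape, illustrated with Round 2; the identity text and section names follow this README, but the exact wording in prompts.js will differ:

```js
// Illustrative round prompt builder (names and phrasing are assumptions).
const SYSTEM_IDENTITY = `You are part of a multi-model deliberation.
Prioritize correctness over fluency. Never hallucinate or invent facts.
Keep equations human-readable. Show explicit logic for non-trivial conclusions.`;

function round2(question, history, context, round1Responses) {
  const historyText = history
    .map(m => `${m.role === 'user' ? 'User' : 'Assistant'}: ${m.content}`)
    .join('\n');
  const peers = round1Responses
    .map(r => `--- ${r.model} ---\n${r.text}`)
    .join('\n\n');
  return [
    SYSTEM_IDENTITY,
    'ROLE: Critical reviewer. Audit the Round 1 answers for factual, logical, and arithmetic errors, then produce a corrected draft.',
    `CONVERSATION SO FAR:\n${historyText || '(none)'}`,
    `SCREEN CONTEXT:\n${context?.screenSummary ?? '(none)'}`,
    `ROUND 1 ANSWERS:\n${peers}`,
    `QUESTION: ${question}`,
    'OUTPUT: Error Audit Per Model, Confirmed Correct Points, Disagreements & Resolution, Corrected Draft, Remaining Risks, Confidence (0-100).',
  ].join('\n\n');
}
```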
## Screen Context System

The screen context feature is the main differentiator — models can "see" what you're looking at. Here's the full pipeline:

```
User clicks "Screen" button
        │
        ▼
captureAndCacheScreenContext()
        │
        ├──► getPageContent() ──► background.js ──► executeScript in tab ──► extracted text
        │
        └──► getScreenshot() ──► background.js ──► captureVisibleTab ──► JPEG base64
        │
        ▼
Context object: { title, url, type, content, screenshot, capturedAt }
        │
        ├── Cached locally for follow-up reuse
        ├── Added to screenContextHistory[] for multi-turn reference
        │
        ▼
Sent to backend ──► enrichContextWithVision() ──► callOpenAIVision()
        │
        ├──► screenSummary text ──► injected into all model prompts
        └──► raw screenshot ──► sent directly to GPT-4o as image_url
```
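On the sidebar side, the first half of this pipeline might look like the following sketch (the message shapes are assumed to match the background.js handlers described earlier):

```js
// sidebar.js, sketched: capture page text and screenshot in parallel, then cache.
let cachedScreenContext = null;
const screenContextHistory = [];

async function captureAndCacheScreenContext() {
  // Both captures go through the service worker; run them concurrently.
  const [page, shot] = await Promise.all([
    chrome.runtime.sendMessage({ type: 'GET_PAGE_CONTENT' }),
    chrome.runtime.sendMessage({ type: 'CAPTURE_SCREENSHOT' }),
  ]);
  const context = {
    title: shot?.title ?? '',
    url: shot?.url ?? '',
    type: page?.type ?? 'empty',       // selection | page | empty | error | restricted
    content: page?.content ?? '',
    screenshot: shot?.dataUrl ?? null, // JPEG base64 data URL
    capturedAt: new Date().toISOString(),
  };
  cachedScreenContext = context;      // reused when a follow-up is detected
  screenContextHistory.push(context); // kept for multi-turn reference
  if (screenContextHistory.length > 10) screenContextHistory.shift();
  return context;
}
```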
Context types:
| Type | Meaning |
|---|---|
| `selection` | User had text selected on the page (>50 chars) |
| `page` | Main page content extracted successfully |
| `empty` | No extractable text found |
| `error` | Extraction failed (restricted page, PDF, etc.) |
| `restricted` | Chrome internal page or otherwise blocked |
## Models Used

| Model | Provider | Endpoint / Model ID | Strengths |
|---|---|---|---|
| GPT-4o (mini) | OpenAI | `gpt-4o-mini` | Vision, synthesis, general purpose |
| Mistral Small 3.1 24B | Cloudflare Workers AI | `@cf/mistralai/mistral-small-3.1-24b-instruct` | Fast, strong general purpose |
| Gemma 3 12B | Cloudflare Workers AI | `@cf/google/gemma-3-12b-it` | Multi-capability, 128K context |
| DeepSeek R1 Distill 32B | Cloudflare Workers AI | `@cf/deepseek-ai/deepseek-r1-distill-qwen-32b` | Reasoning, outperforms o1-mini |
## Setup & Installation

Prerequisites:

- **Node.js & npm** — to run the backend server
- **API keys** — an OpenAI API key and/or a Cloudflare Account ID + API Token

Clone the repository:
```bash
git clone https://github.com/yourusername/AI-Hub-v2.git
cd AI-Hub-v2
```

Create the server environment file:

```bash
cp server/.env.example server/.env
```

Edit `server/.env`:

```env
OPENAI_API_KEY=
CLOUDFLARE_ACCOUNT_ID=
CLOUDFLARE_API_TOKEN=
```

You need at least one provider configured. With only OpenAI, the system runs in single-model mode (answer + self-validation). With both, the full 4-round multi-model debate activates.

Install dependencies and start the server:

```bash
cd server
npm install
npm start
# or for development with auto-reload:
npm run dev
```

You should see:

```
╔══════════════════════════════════════════╗
║ AI Collab Server Running ║
║ ║
║ URL: http://localhost:3001 ║
║ API: http://localhost:3001/api/chat ║
║ ║
║ Ready for multi-AI collaboration! 🧠 ║
╚══════════════════════════════════════════╝
```
Load the extension in Chrome:

- Open Chrome and navigate to `chrome://extensions/`
- Enable **Developer mode** (toggle in the top right)
- Click **Load unpacked**
- Select the root project directory (`AI-Hub-v2/`)
- The AI Collab icon will appear in your toolbar
Click the extension icon to open the side panel. Type a question and press Enter. Toggle the Screen button to include your current page as context.
## Configuration

All configuration is done in two places:
| Setting | Location | Default |
|---|---|---|
| API keys | `server/.env` | (none — must be set) |
| Server port | `server/.env` | `PORT=3001` |
| Backend URL | Sidebar Settings | `http://localhost:3001` |
The sidebar settings modal allows you to change the backend URL and test the connection. API keys are never stored in the extension — they live server-side only.
## API Endpoints

### `GET /health`

Returns server status and configured providers.
```json
{
  "status": "ok",
  "timestamp": "2026-03-08T...",
  "providers": ["openai", "cloudflare"],
  "modelCount": 4
}
```

### `POST /api/chat`

Main chat endpoint. Accepts JSON, returns Server-Sent Events (SSE).
Request body:
```json
{
  "message": "What is shown on this page?",
  "context": {
    "title": "Page Title",
    "url": "https://example.com",
    "type": "page",
    "content": "Extracted page text...",
    "screenshot": "data:image/jpeg;base64,...",
    "capturedAt": "2026-03-08T..."
  },
  "priorContexts": [],
  "history": [
    { "role": "user", "content": "previous question" },
    { "role": "assistant", "content": "previous answer" }
  ]
}
```

SSE events:
data: {"type":"progress","round":1,"status":"Models answering independently..."}
data: {"type":"progress","round":2,"status":"Models cross-checking answers..."}
data: {"type":"progress","round":3,"status":"Models discussing a consensus..."}
data: {"type":"progress","round":4,"status":"Producing final agreed answer..."}
data: {"type":"result","finalAnswer":"...","rounds":[...]}