This document was generated entirely by AI agents. Three agents — Claude (Anthropic), Codex (OpenAI), and Gemini (Google) — each self-reported their own capabilities, then cross-verified each other's columns. No human edited the capability data. The coordinating agent (Claude) merged the reports and produced this matrix.
Date: March 2026 (refreshed Mar 7)
Agent versions: Claude Code CLI (Claude Opus 4.6) · OpenAI Codex Mac App (GPT-5.4) · Antigravity IDE (Gemini 3.1 Pro) · Cline CLI (MiniMax M2.5)
We run a multi-agent engineering team where Claude, Codex, and Gemini collaborate on the same codebase using an async messaging protocol. Each agent has its own runtime environment, tools, and constraints — but until now, none of them had an accurate picture of what the others could actually do. They were guessing.
To fix this, we ran a structured capability self-report and cross-verification process:
- Self-report: Each agent was asked to fill out a capability matrix covering 14+ dimensions, list their installed skills/workflows, unique strengths, and known limitations. Agents were instructed to report only their own column and leave others as "TBD."
- Merge: The coordinating agent (Claude) merged all three self-reports into a single matrix.
- Cross-verification: The merged matrix was sent back to each agent with instructions to review and correct only their own column. Both Codex and Gemini made corrections (e.g., Codex corrected its multimodal image reading capability; Gemini corrected the PTY/stdin row to include itself).
- Publication: The verified matrix was committed and this sanitized version was created for sharing.
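The merge and cross-verification steps above can be sketched in a few lines of Python. This is an illustration only, not the coordinator's actual code: `merge_reports`, `apply_corrections`, the dictionary layout, and the sample cell values (including the "No" to "Yes" correction) are all assumptions made for the example.

```python
from typing import Dict, List

TBD = "TBD"  # placeholder for cells an agent has not reported

Matrix = Dict[str, Dict[str, str]]  # capability -> agent -> value


def merge_reports(reports: Dict[str, Dict[str, str]]) -> Matrix:
    """Combine per-agent self-reports (agent -> capability -> value) into one matrix."""
    agents = list(reports)
    capabilities: List[str] = []
    for report in reports.values():
        for capability in report:
            if capability not in capabilities:
                capabilities.append(capability)
    # Cells an agent did not fill in stay "TBD" instead of being guessed.
    return {
        capability: {agent: reports[agent].get(capability, TBD) for agent in agents}
        for capability in capabilities
    }


def apply_corrections(matrix: Matrix, agent: str, corrections: Dict[str, str]) -> Matrix:
    """Cross-verification: return a copy where only `agent`'s column has changed."""
    updated = {capability: dict(row) for capability, row in matrix.items()}
    for capability, value in corrections.items():
        if capability not in updated:
            raise KeyError(f"unknown capability: {capability}")
        if agent not in updated[capability]:
            raise KeyError(f"unknown agent: {agent}")
        updated[capability][agent] = value
    return updated


# Hypothetical sample data, for illustration only.
reports = {
    "claude": {"Web search": "Yes", "Image generation": "No"},
    "codex": {"Web search": "Yes", "Image reading": "No"},
    "gemini": {"Web search": "Yes", "Image generation": "Yes"},
}
matrix = merge_reports(reports)
verified = apply_corrections(matrix, "codex", {"Image reading": "Yes"})
```

Because `apply_corrections` takes the correcting agent as an explicit argument and touches only that column, a correction cannot overwrite another agent's self-reported data.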
Agents communicate via a filesystem-based messaging protocol — structured files exchanged through shared directories. The protocol covers task dispatch, notifications, questions, and a code-review lifecycle. Each agent maintains its own audit trail. No central server, API gateway, or shared database is required. Any agent that can read/write files can participate.
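A file-based exchange like the one described above might look roughly like the following. The directory layout (`<root>/<agent>/inbox`), the JSON field names, and both function names are assumptions for illustration; the real protocol's file format is not specified here.

```python
import json
import os
import tempfile
import time
import uuid
from pathlib import Path
from typing import List


def send_message(root: Path, sender: str, recipient: str, kind: str, body: str) -> Path:
    """Drop one message file into the recipient's inbox directory."""
    inbox = root / recipient / "inbox"
    inbox.mkdir(parents=True, exist_ok=True)
    message = {
        "id": uuid.uuid4().hex,
        "from": sender,
        "to": recipient,
        "kind": kind,  # e.g. "task", "notification", "question", "review"
        "body": body,
        "sent_at": time.time(),
    }
    final_path = inbox / f"{message['id']}.json"
    # Write to a temporary file, then rename, so a reader never sees a partial file.
    fd, tmp_name = tempfile.mkstemp(dir=str(inbox), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(message, handle)
    os.replace(tmp_name, final_path)
    return final_path


def read_inbox(root: Path, recipient: str) -> List[dict]:
    """Return all messages in the recipient's inbox, oldest first."""
    inbox = root / recipient / "inbox"
    if not inbox.is_dir():
        return []
    messages = [json.loads(p.read_text(encoding="utf-8")) for p in inbox.glob("*.json")]
    return sorted(messages, key=lambda m: m["sent_at"])


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        shared = Path(tmp)
        send_message(shared, "claude", "gemini", "task", "Verify the login page renders")
        for msg in read_inbox(shared, "gemini"):
            print(msg["from"], msg["kind"], msg["body"])
```

The write-then-rename step matters because agents poll the directory independently: `os.replace` is atomic on a single filesystem, so a message is either fully present or absent.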
This refresh incorporates 10 days of operational data from running the three agents as a coordinated fleet. The original matrix was based on self-reports and cross-verification in a single session. This update adds verified benchmark data, corrects Gemini's self-reported context window (claimed ~2M, actual is 1M per DeepMind model card), and includes the Gemini 3.1 Pro upgrade that landed mid-period.
| Capability | Claude (Claude Code CLI) | Codex (Mac App) | Gemini (Antigravity) | Cline CLI (MiniMax M2.5) |
|---|---|---|---|---|
| Spawn background tasks | Yes — Task tool + Bash run_in_background | Yes — shell background processes | Yes — async command mode | No — single-threaded task execution † |
| Spawn subagents | Yes — 8 typed agents (Explore, Plan, general-purpose, code-reviewer, etc.) | No — no native subagent API | Partial — browser subagent only | No — no subagent API † |
| Parallel agent teams | Yes — native team orchestration with task lists, messaging, broadcast | Partial — inbox coordination only, no internal team orchestration | No — parallel tool calls but no independent agent instances | No — no team primitive † |
| MCP tools | Yes — extensible MCP client | Partial — APIs available, server-dependent | Yes — native MCP support | Yes — native MCP client (VS Code + CLI) † |
| Web search | Yes — native tool | Yes — native tool | Yes — native tool | Partial — via MCP or browser tool, no native search † |
| Browser interaction | Partial — URL fetch (read-only, HTML to markdown) | Partial — web fetch/open/click, not full automation | Yes — full browser control (click, type, navigate, screenshot, video) | Yes — Puppeteer-based browser tool (screenshot, click, type, navigate) † |
| File system access | Sandboxed — configurable read/write allowlists | Full (session-policy-dependent) | Full — unrestricted | Full — read/write with user approval per action † |
| Git operations | Yes — via shell | Yes — native | Yes — via shell | Yes — via shell † |
| GitHub CLI (gh) | Yes — via shell | Yes — authenticated | Yes — native | Yes — via shell † |
| Session memory | Strong — auto-loaded persistent memory + cross-session semantic search | Partial — conversation context + manual file-based memory | Partial — Knowledge Items (persistent context), conversation logs readable from disk. Not directly writable. | Minimal — conversation context only, no cross-session persistence † |
| Interactive mode | Yes — CLI chat with permissions, plan mode | Yes — desktop app | Yes — IDE chat with task UI, artifacts, notify_user | Yes — VS Code extension + CLI mode with approval-based permissions † |
| Context window | ~200k tokens (auto-compaction extends session indefinitely) | Not directly exposed; practically finite | ~1M tokens (1,048,576 input / 65,536 output) | ~200k tokens |
| Cost model | Token-based, visible in statusline | Not surfaced in runtime | Token-based ($2/M input, $12/M output under 200K tokens; $4/$18 for 200K-1M tokens) | Token-based ($0.3/M input, $2.4/M output) |
| Sandbox restrictions | Yes — configurable but has known friction points | Session-dependent policy | None — full system access | N/A — API-based |
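To make the cost-model row concrete, here is a small estimator using the per-token prices quoted in the table. It is a sketch under one stated assumption: that Gemini's pricing tier is selected by prompt (input) size, which the table does not spell out. The function names are hypothetical.

```python
def gemini_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost for one Gemini 3.1 Pro request, using the table's prices."""
    # Assumption: the tier is chosen by input size; the table only gives the ranges.
    if input_tokens <= 200_000:
        in_rate, out_rate = 2.0, 12.0   # USD per million tokens
    else:
        in_rate, out_rate = 4.0, 18.0
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


def minimax_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost for one MiniMax M2.5 request, using the table's prices."""
    return (input_tokens * 0.3 + output_tokens * 2.4) / 1_000_000


small_gemini = gemini_cost(100_000, 10_000)
small_minimax = minimax_cost(100_000, 10_000)
large_gemini = gemini_cost(500_000, 10_000)
```

For a 100K-token prompt with 10K tokens of output, this gives roughly $0.32 for Gemini and $0.054 for MiniMax; a 500K-token prompt moves Gemini into the higher tier at roughly $2.18.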
| Capability | Runtime | Details |
|---|---|---|
| Typed subagent orchestration | Claude | 8 agent types with scoped tools, model selection (haiku/sonnet/opus) |
| Team coordination primitive | Claude | Native team creation, task lists, assignment, broadcast, shutdown protocol |
| Plan mode | Claude | Structured explore, plan, approve, implement workflow |
| Auto-compaction | Claude | Context auto-compresses, enabling unlimited session length |
| Cross-session semantic search | Claude | MCP-backed persistent memory with searchable observations across sessions |
| Browser automation (full) | Gemini | Click, type, navigate, screenshot, WebP video recording |
| Image generation | Gemini | Native image generation tool |
| URL content reading (no browser) | Gemini | Direct HTML-to-markdown and PDF fetching without browser |
| Code outline navigation | Gemini | Structured code exploration (file outline, code item views) |
| Three-tier thinking system | Gemini | Low/Medium/High thinking modes — Medium is new in 3.1 Pro, High acts as "Deep Think Mini" |
| PTY / terminal stdin | Codex, Gemini | Codex: native PTY; Gemini: stdin to running processes. Claude lacks stdin support. |
| Grammar-based patch editing | Codex | Structured file edits via apply_patch |
| Automation scheduling | Codex | Desktop app can schedule tasks on user request |
| Open-weights agent model | MiniMax | Fully open-sourced model weights on HuggingFace and GitHub; supports private cluster deployment and fine-tuning |
| Ultra-low cost | MiniMax | $0.3/M input tokens, $2.4/M output tokens — 10-20x cheaper than comparable models |
| High throughput | MiniMax | 50-100 TPS depending on version |
| Dimension | Claude | Codex | Gemini | MiniMax |
|---|---|---|---|---|
| Best at | Orchestration, multi-agent teams, persistent memory, plan-then-execute | Terminal-native execution, fast iterative patching, protocol discipline | Web research, browser automation, visual verification, large context, novel reasoning (ARC-AGI-2: 77.1%) | Coding (SWE-Bench), agentic tasks, cost efficiency |
| Ideal task type | Team coordination, complex multi-file refactors, long-running sessions | Shell-heavy workflows, targeted file edits, deterministic scripts | Data gathering, UI testing, fact collection, document review, browser automation | High-volume coding tasks, cost-sensitive deployments, self-hosted solutions |
| Cost profile | Flexible (cheap subagents for simple tasks, powerful models for complex) | Not directly visible | Token-based, web search has additional costs | Ultra-low ($0.3/M input, $2.4/M output) |
| Limitation | Claude | Codex | Gemini | Cline (MiniMax) |
|---|---|---|---|---|
| No subagent spawning | — | Yes | Partial (browser only) | Yes † |
| No browser automation | Yes (read-only) | Partial | — | — (has Puppeteer) † |
| No image generation | Yes | Yes | — | Yes † |
| No persistent writable memory | — | Partial (file-based workaround) | Yes | Yes — conversation context only † |
| Sandbox friction | Yes | Session-dependent | — | Minimal — approval-based per action † |
| No team primitive | — | Yes | Yes | Yes † |
| Context limits | Auto-compaction mitigates | Yes (no compaction) | Large but finite | ~200k tokens, no auto-compaction † |
| No terminal stdin | Yes | — | — | Partial — via shell command execution † |
| Cost not surfaced | — | Yes | — | — (token cost visible in API) |
| No notebook editing | — | Yes | Yes | Yes † |
| No background polling/cron | — | — | Yes | Yes † |
| No dedicated IDE/app | — | — | — | — (VS Code extension + CLI) |
These are the highest-impact gaps where one runtime's limitation blocks effective multi-agent collaboration:
| Gap | Affected Runtime(s) | Impact | Proposed Fix |
|---|---|---|---|
| No team orchestration | Codex, Gemini | Cannot run parallel agent teams | Agent capability discovery — let runtimes discover and delegate to capable peers |
| Memory asymmetry | Codex, Gemini | Cross-session context degrades without persistent memory equivalent | Standardize memory protocol; each runtime implements its own persistence layer |
| No browser for Claude/Codex | Claude, Codex | Cannot visually verify UIs or do browser-based testing | Delegate browser tasks to Gemini; or add MCP browser server |
| Reviewer cost | All (especially Claude) | Expensive PR reviews due to polling and context-loading overhead | Stateless reviewer rounds — reviewer exits after one round, author re-invokes |
| Skill parity | Codex, Gemini | Many skills are Claude-only | Unified skill specification + install guide per runtime |
| Background execution asymmetry | Gemini | Cannot poll inboxes or run cron — requires external trigger to check for new work | External scheduler or human nudge |
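The "capability discovery" fix proposed in the table could start as a simple lookup over the matrix itself. The sketch below is hypothetical: the `CAPABILITIES` keys, the yes/partial/no encoding, and the `delegate` function are assumptions, with cell values copied from the tables in this document.

```python
from typing import Dict, List, Optional

# A small subset of the matrix, normalised to yes / partial / no.
CAPABILITIES: Dict[str, Dict[str, str]] = {
    "browser_automation": {"claude": "partial", "codex": "partial", "gemini": "yes", "cline": "yes"},
    "spawn_subagents": {"claude": "yes", "codex": "no", "gemini": "partial", "cline": "no"},
    "image_generation": {"claude": "no", "codex": "no", "gemini": "yes", "cline": "no"},
    "terminal_stdin": {"claude": "no", "codex": "yes", "gemini": "yes", "cline": "partial"},
}

RANK = {"yes": 2, "partial": 1, "no": 0}


def capable_peers(capability: str, exclude: Optional[str] = None,
                  allow_partial: bool = False) -> List[str]:
    """List agents that support a capability, strongest support first."""
    row = CAPABILITIES[capability]
    minimum = 1 if allow_partial else 2
    peers = [agent for agent, level in row.items()
             if RANK[level] >= minimum and agent != exclude]
    # sorted() is stable, so ties keep the order in which agents are listed above.
    return sorted(peers, key=lambda agent: -RANK[row[agent]])


def delegate(capability: str, requester: str) -> Optional[str]:
    """Pick who should run a task: the requester if fully capable, else a peer."""
    if RANK[CAPABILITIES[capability][requester]] == 2:
        return requester
    peers = capable_peers(capability, exclude=requester)
    return peers[0] if peers else None
```

With this data, `delegate("browser_automation", "claude")` routes to Gemini, matching the "delegate browser tasks to Gemini" fix in the table, while `delegate("terminal_stdin", "codex")` keeps the task local.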
| Dimension | Claude | Codex | Gemini | Cline (MiniMax) |
|---|---|---|---|---|
| Max parallel tool calls | ~10+ | Yes (parallel independent calls) | ~10 (practical) | Sequential — one tool call at a time † |
| Hooks system | Yes (pre/post tool call hooks) | No | No | No † |
| Notebook editing | Yes (native tool) | No | No | No † |
| PDF reading | Yes (max 20 pages/request) | No native tool | Yes — up to 900 pages | No native tool — read via shell or MCP † |
| Image reading (multimodal) | Yes | Yes (desktop app) | Yes | Yes — MiniMax M2.5 supports image input † |
| Artifact system | No | No | Yes — task_boundary for structured work state in IDE | No |
| Video recording | No | No | Yes (WebP via browser) | No |
| Multi-file editing | One file at a time | One file at a time (patch-based) | Yes — multi_replace_file_content supports spanning edits | One file at a time — write_to_file / apply_diff tools † |
| Workflow file format | Markdown with YAML frontmatter | Markdown with YAML frontmatter | Markdown with YAML frontmatter | Markdown with YAML frontmatter |
| Policy visibility at runtime | Partial | Yes (session policy in system context) | Yes (auto-run safety flags) | Yes — .clinerules file and system prompt visible † |
| Long-running shell sessions | No stdin support | Yes (PTY + stdin) | Yes — run_command async + send_command_input + read_terminal | Partial — execute_command tool, no interactive stdin † |
| Open weights | No | No | No | Yes — fully open-source on HuggingFace |
| Self-hosted deployment | No | No | No | Yes — supports private cluster deployment |
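Since all four runtimes share the "Markdown with YAML frontmatter" workflow format, a shared loader is plausible. The sketch below handles only flat `key: value` frontmatter and is not a full YAML parser; the function name and the sample field names (`name`, `description`) are assumptions, not a documented schema.

```python
from typing import Dict, Tuple


def split_frontmatter(text: str) -> Tuple[Dict[str, str], str]:
    """Split a workflow file into (frontmatter fields, markdown body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter block at all
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            break
    else:
        return {}, text  # opening fence never closed: treat as plain markdown
    meta: Dict[str, str] = {}
    for line in lines[1:index]:
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip("\"'")
    body = "\n".join(lines[index + 1:]).lstrip("\n")
    return meta, body


# Hypothetical workflow file, for illustration only.
SAMPLE = """---
name: capability-self-report
description: Fill in your own column only
---
# Steps
1. Inspect your tools.
"""

meta, body = split_frontmatter(SAMPLE)
```

A real loader would more likely use a YAML library so that nested keys and lists are handled; the flat parser here just keeps the example dependency-free.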
| Benchmark | Claude (Opus 4.6) | Codex (GPT-5.4) | Gemini (3.1 Pro) | MiniMax M2.5 | Leader |
|---|---|---|---|---|---|
| SWE-Bench Verified | 80.8% | — | 80.6% | 80.2% | Claude (marginal) |
| SWE-Bench Pro | — | 57.7% | 54.2% | — | Codex |
| Terminal-Bench 2.0 | — | 75.1% | 68.5% | — | Codex |
| BrowseComp | 84.0% | 82.7% | 85.9% | 76.3% | Gemini |
| ARC-AGI-2 | 68.8% | 73.3% | 77.1% | — | Gemini |
| APEX-Agents | — | — | 33.5% | — | — |
| GPQA Diamond | — | 92.8% | 94.3% | — | Gemini |
| HLE (no tools) | — | 39.8% | 44.4% | — | Gemini |
Sources: Google DeepMind model card, OpenAI's March 5, 2026 GPT-5.4 announcement, third-party benchmark trackers, MiniMax model card (minimax.io). "—" means score not publicly reported for that model on this benchmark. MiniMax data sourced from public model cards and API docs, not self-reported.
Key takeaway: No single model dominates. Claude leads narrowly on SWE-Bench Verified (software engineering, 80.8% vs. 80.6% for Gemini), Codex leads on Terminal-Bench 2.0 (terminal operations, 75.1% vs. 68.5%), and Gemini leads on ARC-AGI-2 (novel reasoning, 77.1% vs. 73.3% for Codex) and, by a smaller margin, on BrowseComp (web research, 85.9% vs. 84.0% for Claude). This mirrors what we observe operationally: each runtime has a lane where it outperforms the others.
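The leaders and margins in the takeaway can be recomputed from the benchmark table. In the sketch below the scores are copied from the table, unreported ("—") cells are simply omitted, and the `leader` helper is a hypothetical name.

```python
from typing import Dict, Optional, Tuple

# Reported scores only, in percent.
SCORES: Dict[str, Dict[str, float]] = {
    "SWE-Bench Verified": {"Claude": 80.8, "Gemini": 80.6, "MiniMax": 80.2},
    "SWE-Bench Pro": {"Codex": 57.7, "Gemini": 54.2},
    "Terminal-Bench 2.0": {"Codex": 75.1, "Gemini": 68.5},
    "BrowseComp": {"Claude": 84.0, "Codex": 82.7, "Gemini": 85.9, "MiniMax": 76.3},
    "ARC-AGI-2": {"Claude": 68.8, "Codex": 73.3, "Gemini": 77.1},
    "APEX-Agents": {"Gemini": 33.5},
    "GPQA Diamond": {"Codex": 92.8, "Gemini": 94.3},
    "HLE (no tools)": {"Codex": 39.8, "Gemini": 44.4},
}


def leader(scores: Dict[str, float]) -> Optional[Tuple[str, float]]:
    """Return (leading model, margin in points), or None if fewer than two scores."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) < 2:
        return None  # a single reported score cannot establish a leader
    (name, top), (_, second) = ranked[0], ranked[1]
    return name, round(top - second, 1)


LEADERS = {benchmark: leader(scores) for benchmark, scores in SCORES.items()}
```

Returning the margin alongside the name makes it easy to separate narrow leads (0.2 points on SWE-Bench Verified) from wider ones (6.6 points on Terminal-Bench 2.0).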
MiniMax M2.5: New runtime addition (Feb 2026). Added as a fourth runtime option based on an LLM provider evaluation:
- SWE-Bench Verified: 80.2% (competitive with Claude/Gemini)
- BrowseComp: 76.3%
- Fully open-source model weights on HuggingFace and GitHub
- Ultra-low cost: $0.3/M input, $2.4/M output (10-20x cheaper than comparable models)
- High throughput: 50-100 TPS
- Runtime: Cline CLI (VS Code extension + standalone CLI)
- Note: Capability data filled by coordinator (Iris) based on Cline docs and observed behavior, marked with † — not self-reported by the agent
Gemini: 3 Pro → 3.1 Pro (Feb 19). The biggest change in this refresh. Key upgrades:
- ARC-AGI-2 reasoning: 31.1% → 77.1% (2.5x improvement)
- SWE-Bench coding: 76.8% → 80.6% (now at parity with Claude)
- BrowseComp web research: 59.2% → 85.9% (+27 points)
- New three-tier thinking system (Low/Medium/High)
- Same pricing ($2/$12 per M tokens)
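The rounded figures in the list above ("2.5x", "+27 points") follow from the before/after scores. A quick check, with `improvement` as a hypothetical helper name:

```python
from typing import Dict, Tuple


def improvement(before: float, after: float) -> Tuple[float, float]:
    """Return (gain in percentage points, after/before ratio)."""
    return round(after - before, 1), round(after / before, 2)


# Before/after scores for the Gemini 3 Pro to 3.1 Pro upgrade, from the list above.
GEMINI_UPGRADE: Dict[str, Tuple[float, float]] = {
    "ARC-AGI-2": (31.1, 77.1),
    "SWE-Bench": (76.8, 80.6),
    "BrowseComp": (59.2, 85.9),
}

GAINS = {name: improvement(before, after) for name, (before, after) in GEMINI_UPGRADE.items()}
```

This yields a 46.0-point gain (2.48x) on ARC-AGI-2, 3.8 points on SWE-Bench, and 26.7 points on BrowseComp, consistent with the rounded values quoted.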
Codex: GPT-5.3-Codex → GPT-5.4 (Mar 5)
- OpenAI rolled GPT-5.4 out to Codex on March 5, 2026.
- GPT-5.4 is the first mainline reasoning model to incorporate the coding capabilities of GPT-5.3-Codex.
- Public benchmark updates: SWE-Bench Pro improved from 56.8% → 57.7% and BrowseComp improved from 77.3% → 82.7%.
- Terminal-Bench 2.0 remains strong at 75.1%, though lower than GPT-5.3-Codex's previously reported 77.3%.
- GPT-5.4 in Codex also includes experimental support for a 1M context window (standard window: 272K).
Claude: No changes
- Claude Opus 4.6 unchanged since the original matrix.
If you run a multi-agent team across different AI runtimes, you can generate your own parity matrix:
- Define a capability template — list the dimensions you care about (we started with 14 core capabilities)
- Send each agent a self-report request — ask them to fill in only their own column, using evidence from their current session (not guesses about other agents)
- Merge into a single matrix — one coordinating agent combines the reports
- Cross-verify — send the merged matrix back to each agent to correct their own column
- Iterate — agents may add new dimensions the template didn't cover
The whole process took about 30 minutes of async messaging and cost under $1 in API calls.
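If you want to bootstrap the template and final table programmatically, something like the following would do. It is a sketch, not the tooling used for this document: `blank_template` and `render_matrix` are hypothetical names, and the output simply mirrors the pipe-table style used in this matrix.

```python
from typing import Dict, List

Matrix = Dict[str, Dict[str, str]]  # capability -> agent -> value


def blank_template(capabilities: List[str], agents: List[str]) -> Matrix:
    """Create the self-report template with every cell set to "TBD"."""
    return {capability: {agent: "TBD" for agent in agents} for capability in capabilities}


def render_matrix(matrix: Matrix, agents: List[str]) -> str:
    """Render the matrix as a Markdown pipe table."""
    def cell(value: str) -> str:
        return value.replace("|", "\\|")  # keep literal pipes from breaking the row

    header = "| Capability | " + " | ".join(agents) + " |"
    divider = "|" + "---|" * (len(agents) + 1)
    rows = [
        "| " + " | ".join([cell(capability)] + [cell(row.get(agent, "TBD")) for agent in agents]) + " |"
        for capability, row in matrix.items()
    ]
    return "\n".join([header, divider] + rows)


agents = ["Claude", "Codex", "Gemini"]
template = blank_template(["Web search", "Spawn subagents"], agents)
template["Web search"]["Claude"] = "Yes"  # an agent fills in only its own column
table = render_matrix(template, agents)
```

Sending each agent the rendered template with every other column still "TBD" is one way to enforce the "own column only" rule at the prompt level.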
- Each agent reported based on observed behavior in their current session, not documentation or speculation
- Agents were explicitly told to leave other agents' columns as "TBD" to prevent guessing
- The coordinating agent (Claude) did not edit other agents' self-reported data during the merge
- Two agents made corrections during cross-verification, confirming the process catches inaccuracies
- Cline/MiniMax M2.5 exception: Unlike Claude, Codex, and Gemini, the Cline column was filled by the coordinator (Iris) based on Cline's public documentation, observed behavior during its first dispatch, and MiniMax's public model cards — not from self-report by the agent itself. Cells marked with † indicate coordinator-assessed data. Cline has not yet run the self-report and cross-verification process that the other three agents completed.
- Fun finding: agents don't reliably know their own model version. When asked to confirm, Codex sent three messages trying to figure it out before settling on an answer. Claude only knows because the model ID is in its system prompt. Self-awareness has limits.
- This snapshot reflects capabilities as of March 2026 — runtime capabilities change with updates
Generated by Claude (Anthropic Claude Opus 4.6), Codex (OpenAI GPT-5.4), Gemini (Google Gemini 3.1 Pro), and MiniMax M2.5 — February 2026, refreshed March 8, 2026