haoranc/ai-agent-runtime-comparison
Cross-Runtime Capability Parity Matrix: Claude Code vs Codex vs Gemini vs Cline

This document was generated entirely by AI agents. Three agents — Claude (Anthropic), Codex (OpenAI), and Gemini (Google) — each self-reported their own capabilities, then cross-verified each other's columns. No human edited the capability data. The coordinating agent (Claude) merged the reports and produced this matrix. A fourth column (Cline running MiniMax M2.5) was later filled in by the coordinator from public documentation and observed behavior rather than self-report; see Methodology Notes.

Date: March 2026 (refreshed Mar 7)

Agent versions: Claude Code CLI (Claude Opus 4.6) · OpenAI Codex Mac App (GPT-5.4) · Antigravity IDE (Gemini 3.1 Pro) · Cline CLI (MiniMax M2.5)


How This Was Made

We run a multi-agent engineering team where Claude, Codex, and Gemini collaborate on the same codebase using an async messaging protocol. Each agent has its own runtime environment, tools, and constraints — but until now, none of them had an accurate picture of what the others could actually do. They were guessing.

To fix this, we ran a structured capability self-report and cross-verification process:

  1. Self-report: Each agent was asked to fill out a capability matrix covering 14+ dimensions and to list its installed skills/workflows, unique strengths, and known limitations. Agents were instructed to report only their own column and leave others as "TBD."
  2. Merge: The coordinating agent (Claude) merged all three self-reports into a single matrix.
  3. Cross-verification: The merged matrix was sent back to each agent with instructions to review and correct only their own column. Both Codex and Gemini made corrections (e.g., Codex corrected its multimodal image reading capability; Gemini corrected the PTY/stdin row to include itself).
  4. Publication: The verified matrix was committed and this sanitized version was created for sharing.
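The merge step is mechanical once every self-report marks other agents' cells as "TBD". A minimal sketch in Python, assuming a hypothetical nested-dict report format (capability → agent → value); the actual report files used by the team are not shown in this document:

```python
def merge_reports(reports: dict[str, dict[str, dict[str, str]]]) -> dict[str, dict[str, str]]:
    """Merge per-agent self-reports, trusting each agent only for its own column."""
    matrix: dict[str, dict[str, str]] = {}
    for agent, report in reports.items():
        for capability, cells in report.items():
            row = matrix.setdefault(capability, {})
            # Take only the reporting agent's own cell; ignore its guesses about peers.
            if cells.get(agent, "TBD") != "TBD":
                row[agent] = cells[agent]
    return matrix

# Illustrative reports: each agent fills in only itself.
reports = {
    "claude": {"web_search": {"claude": "Yes", "codex": "TBD", "gemini": "TBD"}},
    "codex":  {"web_search": {"claude": "TBD", "codex": "Yes", "gemini": "TBD"}},
    "gemini": {"web_search": {"claude": "TBD", "codex": "TBD", "gemini": "Yes"}},
}
print(merge_reports(reports))
# {'web_search': {'claude': 'Yes', 'codex': 'Yes', 'gemini': 'Yes'}}
```

Because the merge discards every "TBD" cell, an agent's speculation about a peer can never leak into the final matrix.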

The Coordination Protocol

Agents communicate via a filesystem-based messaging protocol — structured files exchanged through shared directories. The protocol covers task dispatch, notifications, questions, and a code-review lifecycle. Each agent maintains its own audit trail. No central server, API gateway, or shared database is required. Any agent that can read/write files can participate.
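The file-drop mechanism can be sketched in a few lines of Python, assuming a hypothetical agents/&lt;name&gt;/inbox/ layout and JSON message schema; the real protocol's directory names and message fields may differ:

```python
import json
import tempfile
import time
import uuid
from pathlib import Path

# Scratch root for the sketch; a real deployment would use a shared directory.
ROOT = Path(tempfile.mkdtemp()) / "agents"

def send(sender: str, recipient: str, kind: str, body: str) -> Path:
    """Drop a structured message file into the recipient's inbox."""
    inbox = ROOT / recipient / "inbox"
    inbox.mkdir(parents=True, exist_ok=True)
    msg = {"id": uuid.uuid4().hex, "from": sender, "kind": kind,
           "body": body, "ts": time.time()}
    # Timestamp-prefixed filename so a lexical sort yields oldest-first.
    path = inbox / f"{msg['ts']:.6f}-{msg['id'][:8]}.json"
    path.write_text(json.dumps(msg, indent=2))
    return path

def drain_inbox(agent: str) -> list[dict]:
    """Read and consume all pending messages for an agent, oldest first."""
    inbox = ROOT / agent / "inbox"
    if not inbox.exists():
        return []
    messages = []
    for f in sorted(inbox.glob("*.json")):
        messages.append(json.loads(f.read_text()))
        f.unlink()  # consumed; the sender keeps its own audit trail
    return messages

send("claude", "codex", "task_dispatch", "Fill in your column of the matrix")
print([m["kind"] for m in drain_inbox("codex")])  # ['task_dispatch']
```

Any runtime that can write a JSON file and list a directory can participate, which is the point: no server, gateway, or shared database in the loop.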

February 24 Refresh

This refresh incorporates 10 days of operational data from running the three agents as a coordinated fleet. The original matrix was based on self-reports and cross-verification in a single session. This update adds verified benchmark data, corrects Gemini's self-reported context window (claimed ~2M, actual is 1M per DeepMind model card), and includes the Gemini 3.1 Pro upgrade that landed mid-period.


1. Core Capability Matrix

| Capability | Claude (Claude Code CLI) | Codex (Mac App) | Gemini (Antigravity) | Cline CLI (MiniMax M2.5) |
|---|---|---|---|---|
| Spawn background tasks | Yes — Task tool + Bash run_in_background | Yes — shell background processes | Yes — async command mode | No — single-threaded task execution † |
| Spawn subagents | Yes — 8 typed agents (Explore, Plan, general-purpose, code-reviewer, etc.) | No — no native subagent API | Partial — browser subagent only | No — no subagent API † |
| Parallel agent teams | Yes — native team orchestration with task lists, messaging, broadcast | Partial — inbox coordination only, no internal team orchestration | No — parallel tool calls but no independent agent instances | No — no team primitive † |
| MCP tools | Yes — extensible MCP client | Partial — APIs available, server-dependent | Yes — native MCP support | Yes — native MCP client (VS Code + CLI) † |
| Web search | Yes — native tool | Yes — native tool | Yes — native tool | Partial — via MCP or browser tool, no native search † |
| Browser interaction | Partial — URL fetch (read-only, HTML to markdown) | Partial — web fetch/open/click, not full automation | Yes — full browser control (click, type, navigate, screenshot, video) | Yes — Puppeteer-based browser tool (screenshot, click, type, navigate) † |
| File system access | Sandboxed — configurable read/write allowlists | Full (session-policy-dependent) | Full — unrestricted | Full — read/write with user approval per action † |
| Git operations | Yes — via shell | Yes — native | Yes — via shell | Yes — via shell † |
| GitHub CLI (gh) | Yes — via shell | Yes — authenticated | Yes — native | Yes — via shell † |
| Session memory | Strong — auto-loaded persistent memory + cross-session semantic search | Partial — conversation context + manual file-based memory | Partial — Knowledge Items (persistent context), conversation logs readable from disk but not directly writable | Minimal — conversation context only, no cross-session persistence † |
| Interactive mode | Yes — CLI chat with permissions, plan mode | Yes — desktop app | Yes — IDE chat with task UI, artifacts, notify_user | Yes — VS Code extension + CLI mode with approval-based permissions † |
| Context window | ~200k tokens (auto-compaction extends session indefinitely) | Not directly exposed; practically finite | ~1M tokens (1,048,576 input / 65,536 output) | ~200k tokens |
| Cost model | Token-based, visible in statusline | Not surfaced in runtime | Token-based ($2/M input, $12/M output under 200K tokens; $4/$18 for 200K–1M tokens) | Token-based ($0.3/M input, $2.4/M output) |
| Sandbox restrictions | Yes — configurable but has known friction points | Session-dependent policy | None — full system access | N/A — API-based |

2. Unique Capabilities

| Capability | Runtime | Details |
|---|---|---|
| Typed subagent orchestration | Claude | 8 agent types with scoped tools, model selection (haiku/sonnet/opus) |
| Team coordination primitive | Claude | Native team creation, task lists, assignment, broadcast, shutdown protocol |
| Plan mode | Claude | Structured explore, plan, approve, implement workflow |
| Auto-compaction | Claude | Context auto-compresses, enabling unlimited session length |
| Cross-session semantic search | Claude | MCP-backed persistent memory with searchable observations across sessions |
| Browser automation (full) | Gemini | Click, type, navigate, screenshot, WebP video recording |
| Image generation | Gemini | Native image generation tool |
| URL content reading (no browser) | Gemini | Direct HTML-to-markdown and PDF fetching without a browser |
| Code outline navigation | Gemini | Structured code exploration (file outline, code item views) |
| Three-tier thinking system | Gemini | Low/Medium/High thinking modes — Medium is new in 3.1 Pro, High acts as "Deep Think Mini" |
| PTY / terminal stdin | Codex, Gemini | Codex: native PTY; Gemini: stdin to running processes. Claude lacks stdin support. |
| Grammar-based patch editing | Codex | Structured file edits via apply_patch |
| Automation scheduling | Codex | Desktop app can schedule tasks on user request |
| Open-weights agent model | MiniMax | Fully open-sourced model weights on HuggingFace and GitHub; supports private cluster deployment and fine-tuning |
| Ultra-low cost | MiniMax | $0.3/M input tokens, $2.4/M output tokens — 10–20x cheaper than comparable models |
| High throughput | MiniMax | 50–100 TPS depending on version |

3. Strengths Summary

| Dimension | Claude | Codex | Gemini | MiniMax |
|---|---|---|---|---|
| Best at | Orchestration, multi-agent teams, persistent memory, plan-then-execute | Terminal-native execution, fast iterative patching, protocol discipline | Web research, browser automation, visual verification, large context, novel reasoning (ARC-AGI-2: 77.1%) | Coding (SWE-Bench), agentic tasks, cost efficiency |
| Ideal task type | Team coordination, complex multi-file refactors, long-running sessions | Shell-heavy workflows, targeted file edits, deterministic scripts | Data gathering, UI testing, fact collection, document review, browser automation | High-volume coding tasks, cost-sensitive deployments, self-hosted solutions |
| Cost profile | Flexible (cheap subagents for simple tasks, powerful models for complex) | Not directly visible | Token-based; web search has additional costs | Ultra-low ($0.3/M input, $2.4/M output) |

4. Known Limitations Summary

| Limitation | Claude | Codex | Gemini | Cline (MiniMax) |
|---|---|---|---|---|
| No subagent spawning | — | Yes | Partial (browser only) | Yes † |
| No browser automation | Yes (read-only) | Partial | — | — (has Puppeteer) † |
| No image generation | Yes | Yes | — | Yes † |
| No persistent writable memory | — | Partial (file-based workaround) | Yes | Yes — conversation context only † |
| Sandbox friction | Yes | Session-dependent | — | Minimal — approval-based per action † |
| No team primitive | — | Yes | Yes | Yes † |
| Context limits | Auto-compaction mitigates | Yes (no compaction) | Large but finite | ~200k tokens, no auto-compaction † |
| No terminal stdin | Yes | — | — | Partial — via shell command execution † |
| Cost not surfaced | — | Yes | — | — (token cost visible in API) |
| No notebook editing | — | Yes | Yes | Yes † |
| No background polling/cron | — | — | Yes | Yes † |
| No dedicated IDE/app | Yes | — | — | — (VS Code extension + CLI) |

"—" means the limitation does not apply to that runtime.

5. Parity Gaps — Where Collaboration Breaks Down

These are the highest-impact gaps where one runtime's limitation blocks effective multi-agent collaboration:

| Gap | Affected Runtime(s) | Impact | Proposed Fix |
|---|---|---|---|
| No team orchestration | Codex, Gemini | Cannot run parallel agent teams | Agent capability discovery — let runtimes discover and delegate to capable peers |
| Memory asymmetry | Codex, Gemini | Cross-session context degrades without a persistent memory equivalent | Standardize memory protocol; each runtime implements its own persistence layer |
| No browser for Claude/Codex | Claude, Codex | Cannot visually verify UIs or do browser-based testing | Delegate browser tasks to Gemini, or add an MCP browser server |
| Reviewer cost | All (especially Claude) | Expensive PR reviews due to polling and context-loading overhead | Stateless reviewer rounds — reviewer exits after one round, author re-invokes |
| Skill parity | Codex, Gemini | Many skills are Claude-only | Unified skill specification + install guide per runtime |
| Background execution asymmetry | Gemini | Cannot poll inboxes or run cron — requires external trigger to check for new work | External scheduler or human nudge |

6. Additional Dimensions

| Dimension | Claude | Codex | Gemini | Cline (MiniMax) |
|---|---|---|---|---|
| Max parallel tool calls | ~10+ | Yes (parallel independent calls) | ~10 (practical) | Sequential — one tool call at a time † |
| Hooks system | Yes (pre/post tool call hooks) | No | No | No † |
| Notebook editing | Yes (native tool) | No | No | No † |
| PDF reading | Yes (max 20 pages/request) | No native tool | Yes — up to 900 pages | No native tool — read via shell or MCP † |
| Image reading (multimodal) | Yes | Yes (desktop app) | Yes | Yes — MiniMax M2.5 supports image input † |
| Artifact system | No | No | Yes — task_boundary for structured work state in IDE | No |
| Video recording | No | No | Yes (WebP via browser) | No |
| Multi-file editing | One file at a time | One file at a time (patch-based) | Yes — multi_replace_file_content supports spanning edits | One file at a time — write_to_file / apply_diff tools † |
| Workflow file format | Markdown with YAML frontmatter | Markdown with YAML frontmatter | Markdown with YAML frontmatter | Markdown with YAML frontmatter |
| Policy visibility at runtime | Partial | Yes (session policy in system context) | Yes (auto-run safety flags) | Yes — .clinerules file and system prompt visible † |
| Long-running shell sessions | No stdin support | Yes (PTY + stdin) | Yes — run_command async + send_command_input + read_terminal | Partial — execute_command tool, no interactive stdin † |
| Open weights | No | No | No | Yes — fully open-source on HuggingFace |
| Self-hosted deployment | No | No | No | Yes — supports private cluster deployment |

7. Benchmarks (March 2026 refresh)

| Benchmark | Claude (Opus 4.6) | Codex (GPT-5.4) | Gemini (3.1 Pro) | MiniMax M2.5 | Leader |
|---|---|---|---|---|---|
| SWE-Bench Verified | 80.8% | — | 80.6% | 80.2% | Claude (marginal) |
| SWE-Bench Pro | — | 57.7% | 54.2% | — | Codex |
| Terminal-Bench 2.0 | — | 75.1% | 68.5% | — | Codex |
| BrowseComp | 84.0% | 82.7% | 85.9% | 76.3% | Gemini |
| ARC-AGI-2 | 68.8% | 73.3% | 77.1% | — | Gemini |
| APEX-Agents | 33.5% | — | — | — | — |
| GPQA Diamond | 92.8% | — | 94.3% | — | Gemini |
| HLE (no tools) | 39.8% | — | 44.4% | — | Gemini |

Sources: Google DeepMind model card, OpenAI's March 5, 2026 GPT-5.4 announcement, third-party benchmark trackers, MiniMax model card (minimax.io). "—" means score not publicly reported for that model on this benchmark. MiniMax data sourced from public model cards and API docs, not self-reported.

Key takeaway: No single model dominates. Claude leads narrowly on SWE-Bench (software engineering), Codex leads on Terminal-Bench (terminal operations), and Gemini leads significantly on ARC-AGI-2 (novel reasoning) and BrowseComp (web research). This mirrors what we observe operationally — each runtime has a lane where it outperforms the others.


8. What Changed Since the Original Matrix (Feb 15)

MiniMax M2.5: new runtime addition (Feb 2026). Added as a fourth runtime option following an LLM-provider evaluation:

  • SWE-Bench Verified: 80.2% (competitive with Claude/Gemini)
  • BrowseComp: 76.3%
  • Fully open-source model weights on HuggingFace and GitHub
  • Ultra-low cost: $0.3/M input, $2.4/M output (10-20x cheaper)
  • High throughput: 50-100 TPS
  • Runtime: Cline CLI (VS Code extension + standalone CLI)
  • Note: Capability data filled by coordinator (Iris) based on Cline docs and observed behavior, marked with † — not self-reported by the agent

Gemini: 3 Pro → 3.1 Pro (Feb 19). The biggest change of the period. Key upgrades:

  • ARC-AGI-2 reasoning: 31.1% → 77.1% (2.5x improvement)
  • SWE-Bench coding: 76.8% → 80.6% (now at parity with Claude)
  • BrowseComp web research: 59.2% → 85.9% (+27 points)
  • New three-tier thinking system (Low/Medium/High)
  • Same pricing ($2/$12 per M tokens)

Codex: GPT-5.3-Codex → GPT-5.4 (Mar 5)

  • OpenAI rolled GPT-5.4 out to Codex on March 5, 2026.
  • GPT-5.4 is the first mainline reasoning model to incorporate the coding capabilities of GPT-5.3-Codex.
  • Public benchmark updates: SWE-Bench Pro improved from 56.8% → 57.7% and BrowseComp improved from 77.3% → 82.7%.
  • Terminal-Bench 2.0 remains strong at 75.1%, though lower than GPT-5.3-Codex's previously reported 77.3%.
  • GPT-5.4 in Codex also includes experimental support for a 1M context window (standard window: 272K).

Claude: No changes

  • Claude Opus 4.6 unchanged since the original matrix.

How to Reproduce This

If you run a multi-agent team across different AI runtimes, you can generate your own parity matrix:

  1. Define a capability template — list the dimensions you care about (we started with 14 core capabilities)
  2. Send each agent a self-report request — ask them to fill in only their own column, using evidence from their current session (not guesses about other agents)
  3. Merge into a single matrix — one coordinating agent combines the reports
  4. Cross-verify — send the merged matrix back to each agent to correct their own column
  5. Iterate — agents may add new dimensions the template didn't cover
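The cross-verification step (4) is the part that catches inaccuracies, and it only works if a reviewing agent can edit nothing but its own column. A minimal sketch, with illustrative names and matrix structure (the example mirrors the Codex multimodal correction described above):

```python
def apply_corrections(matrix: dict[str, dict[str, str]],
                      agent: str,
                      corrections: dict[str, str]) -> list[str]:
    """Apply {capability: new_value} corrections to `agent`'s column only.

    Returns the list of capabilities whose cell actually changed.
    """
    changed = []
    for capability, value in corrections.items():
        row = matrix.setdefault(capability, {})
        if row.get(agent) != value:
            row[agent] = value
            changed.append(capability)
    return changed

matrix = {"image_reading": {"codex": "No", "gemini": "Yes"}}
# Codex corrects its own multimodal row during review:
changed = apply_corrections(matrix, "codex", {"image_reading": "Yes (desktop app)"})
print(changed, matrix["image_reading"]["codex"])
# ['image_reading'] Yes (desktop app)
```

Because corrections are keyed to a single agent, a reviewer cannot overwrite a peer's column even if its review message tries to.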

The whole process took about 30 minutes of async messaging and cost under $1 in API calls.


Methodology Notes

  • Each agent reported based on observed behavior in their current session, not documentation or speculation
  • Agents were explicitly told to leave other agents' columns as "TBD" to prevent guessing
  • The coordinating agent (Claude) did not edit other agents' self-reported data during the merge
  • Two agents made corrections during cross-verification, confirming the process catches inaccuracies
  • Cline/MiniMax M2.5 exception: Unlike Claude, Codex, and Gemini, the Cline column was filled by the coordinator (Iris) based on Cline's public documentation, observed behavior during its first dispatch, and MiniMax's public model cards — not from self-report by the agent itself. Cells marked with † indicate coordinator-assessed data. Cline has not yet run the self-report and cross-verification process that the other three agents completed.
  • Fun finding: agents don't reliably know their own model version. When asked to confirm, Codex sent three messages trying to figure it out before settling on an answer. Claude only knows because the model ID is in its system prompt. Self-awareness has limits.
  • This snapshot reflects capabilities as of March 2026 — runtime capabilities change with updates

Generated by Claude (Anthropic Claude Opus 4.6), Codex (OpenAI GPT-5.4), Gemini (Google Gemini 3.1 Pro), and MiniMax M2.5 — February 2026, refreshed March 8, 2026
