An active oversight layer (cockpit) for Claude Code agents — not an agent itself.
Cockpit doesn't run your tasks. Claude Code's agents do that. Cockpit is the control surface that sits between you and them — it forces agents to declare their plan before acting, surfaces every micro-decision for human review, makes agent thinking visible during execution, and pauses on critical choices so you can override.
Think of it like a pilot's cockpit: the pilot still flies the plane, but the cockpit makes the plane controllable. Cockpit makes Claude Code agents controllable.
> "AI tools are honestly unusable without running in yolo mode.
> You have to baby every single little command. It is utterly miserable and awful."
> — Hacker News, 255 points
Cockpit is the middle ground between "yolo" and "baby every command".
I'm Cathy — Senior PM at DiDi, pivoting to AI-native product. I started using Claude Code daily in early 2026 and hit the wall the HN comment above describes: either lose visibility ("yolo mode") or babysit every command. Neither works for product thinking.
agent-cockpit operationalizes "human-in-the-loop" for the IDE-agent era. It's how I delegate to AI agents the way I delegate to a junior PM — explicit decision points, surfaced thinking, and the ability to redirect mid-flight.
This is also my first major contribution to the AI agent infrastructure layer. If you're building agent products and want to compare notes on human-AI collaboration patterns, open an issue.
```bash
# 1. Clone Cockpit
git clone https://github.com/cathyzhang0905/agent-cockpit.git

# 2. Make hooks executable
cd agent-cockpit && chmod +x hooks/*.py

# 3. Apply for a free GLM-4-Flash key (recommended — completely free)
# 👉 https://open.bigmodel.cn/ (sign up with a phone number; no credit card required)
# Then add to your shell:
echo 'export ZHIPU_API_KEY="your-glm-key"' >> ~/.zshrc
source ~/.zshrc

# 4. Run Claude Code with Cockpit
claude --plugin-dir /path/to/agent-cockpit
```

That's it. Cockpit now auto-classifies your tasks (using GLM-4-Flash for the ambiguous ones) and surfaces decision points before tool execution.
💡 Why GLM-4-Flash? Zhipu AI's GLM-4-Flash model is completely free with strong Chinese support, making it Cockpit's recommended zero-cost on-ramp. If you already have another LLM key (OpenAI / DeepSeek / Qwen / Kimi, etc.), those are supported directly too; see the LLM provider config below.
When you delegate a task to an AI agent, you lose access to its decision-making at exactly the moments when you most need to be involved.
Three failure modes this creates:
- Pre-execution blindness — You hit Enter; you don't know what the agent is about to do until it's already doing it.
- Mid-execution invisibility — Agent makes 5-50 micro-decisions per task (which tool, which source, what assumption). All invisible. Each one shapes the output silently.
- Post-execution attribution gap — When output is wrong, you can't trace it back to the decision that caused it.
Cockpit's five core functions map directly:
| Failure mode | Cockpit solution |
|---|---|
| ① Pre-execution blindness | Plan-First — agent declares plan; β verification ensures plan exists before any tool runs |
| ② Mid-execution invisibility | Decision Markers (6 commit tools) + Thinking Checkpoint (every tool) + Assumption Marker (data analysis) |
| ③ Post-execution attribution gap | Brief — agent flags decisions worth review + JSONL log (mirror disler schema) |
The human decision step must be explicit and never bypassed.
This is humanism implemented at the product layer. The agent never makes a hidden decision; the human is always one keystroke away from intervention.
Cockpit is complementary, not competitive, with existing Claude Code observability:
| Tool | What it does | What Cockpit adds |
|---|---|---|
| disler/claude-code-hooks-multi-agent-observability | Real-time event dashboard for what happened | Active oversight for what's about to happen + thinking visibility |
| simple10/agents-observe | Multi-agent hierarchy visualization | Pre-execution decision review + assumption markers |
Cockpit's ./.cockpit/[date].jsonl decision log uses an event schema compatible with disler's, so you can run both: Cockpit for active control, dashboards for retrospective analysis.
- Claude Code installed and authenticated
- (Optional) An LLM API key for accurate scenario classification — see "LLM provider config" below
No API key required — Cockpit uses a heuristic (keyword-based) classifier by default and falls back to "plan-worthy" for ambiguous cases. An LLM is optional, for finer accuracy.
```bash
git clone https://github.com/cathyzhang0905/agent-cockpit.git
cd agent-cockpit
chmod +x hooks/*.py
```

Then run Claude Code with the plugin:

```bash
claude --plugin-dir /path/to/agent-cockpit
```

In Claude Code, run:

```
/cockpit status
```

It should show your current mode (default: auto).
```
/cockpit auto   # Default — Haiku judge classifies each prompt
/cockpit on     # Force plan-first for ALL tasks this session
/cockpit off    # Disable Cockpit entirely (silent, no overhead)
```

The mode persists in ~/.claude/cockpit.json across sessions.
Trivial task (e.g., "today's weather"):
- Haiku judge → skip
- Zero Cockpit interference. No plan, no dialogs, no brief.

Plan-worthy task (e.g., "research lindy.ai"):
- Haiku judge → plan
- Agent outputs a `<cockpit-plan>` block (informational; β-verified)
- Before each WebSearch/WebFetch/Task/Edit/Write/NotebookEdit:

```
🛂 Cockpit Decision Point — WebSearch
Agent wants to: query: "lindy.ai funding 2026"
Approve this choice? (Choose 'no' to redirect agent.)
```

- After each tool call, agent outputs a `<cockpit-thought>`:

```
Got: Lindy raised $50M Series A March 2026
Insight: Mid-stage, not early — earlier hypothesis wrong
Next: Fetch about page to confirm vision-thesis match
```

- When making non-trivial assumptions:

```
<cockpit-assumption>
Assuming: user_segment by last_30d_orders bucketed
Why: No explicit segment field in data
Impact: Distribution may not support clean tertiles
</cockpit-assumption>
```

- End of task — Cockpit Brief in a framed box:

```
╭─ 🛂 Cockpit Brief ────────────────────────╮
│ Decisions made: 5                         │
│ Outputs: ./lindy_profile.md               │
│ Time taken: ~3 min                        │
│ Recommend review: Step 3 (vision mismatch │
│ with thesis is the key finding)           │
╰───────────────────────────────────────────╯
```
Every tool call is appended to ./.cockpit/[date].jsonl (see the example log entry at the end of this README).
Plus scenario_judge entries from Haiku classifier.
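Because the log is plain JSONL, it is easy to analyze with a few lines of code. A minimal sketch of a consumer, assuming the field names shown in the example log entry at the end of this README (adjust if your hook version emits different keys):

```python
import json

def summarize_decisions(jsonl_lines):
    """Count surfaced Cockpit decisions by user action in a decision log.

    Field names follow the example log-entry schema shown in this README.
    """
    counts = {}
    for line in jsonl_lines:
        entry = json.loads(line)
        decision = entry.get("cockpit_decision")
        if decision and decision.get("surfaced"):
            action = decision.get("user_action", "unknown")
            counts[action] = counts.get(action, 0) + 1
    return counts

# Usage with a day's log file:
# with open("./.cockpit/2026-04-29.jsonl") as f:
#     print(summarize_decisions(f))
```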
~/.claude/cockpit.json:
```json
{
  "mode": "auto",
  "llm_providers": ["anthropic", "openai", "ollama"]
}
```

- `mode` — `auto` (default) / `on` (force plan-first for all tasks) / `off` (disable Cockpit)
- `llm_providers` — the order of LLM providers to try when the heuristic is ambiguous. Default: `["anthropic", "openai", "ollama"]`
Hook behavior is configured in hooks/hooks.json (matchers + script paths). Edit the PreToolUse matcher to extend / restrict which tools trigger Cockpit dialogs.
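For orientation, a hypothetical sketch of what such a matcher entry could look like — this follows the general shape of Claude Code hook settings, not necessarily this plugin's actual hooks/hooks.json schema, and the script path is illustrative; check the shipped file for the real structure:

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "WebSearch|WebFetch|Task|Edit|Write|NotebookEdit",
        "hooks": [
          { "type": "command", "command": "hooks/pre_tool_use.py" }
        ]
      }
    ]
  }
}
```

Narrowing the matcher regex (e.g. to `Edit|Write` only) would restrict Cockpit dialogs to file-modifying tools.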
Cockpit uses a 2-stage classifier for plan-worthy detection:
- Heuristic (always runs, no API call) — keyword + length rules cover ~80% of prompts
- LLM fallback (optional) — when heuristic is ambiguous, try LLM providers in your configured chain
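A minimal sketch of what the stage-1 heuristic could look like — the keywords and length thresholds here are illustrative, not the plugin's actual rules:

```python
# Illustrative stage-1 heuristic: keyword + length rules, with an explicit
# "ambiguous" outcome that triggers the optional LLM fallback chain.
# Keywords and thresholds are made up for this sketch.
PLAN_KEYWORDS = ("research", "analyze", "refactor", "design", "migrate")
TRIVIAL_KEYWORDS = ("weather", "what time", "define")

def heuristic_classify(prompt: str) -> str:
    """Return 'plan', 'skip', or 'ambiguous' for a user prompt."""
    lowered = prompt.lower()
    if any(k in lowered for k in PLAN_KEYWORDS) or len(prompt) > 200:
        return "plan"
    if any(k in lowered for k in TRIVIAL_KEYWORDS) and len(prompt) < 60:
        return "skip"
    return "ambiguous"  # hand off to the LLM provider chain
```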
| Provider | Env var needed | Model | Cost | Notes |
|---|---|---|---|---|
| `anthropic` | `ANTHROPIC_API_KEY` | `claude-haiku-4-5` | ~$0.0001/call | Same ecosystem as Claude Code |
| `openai` | `OPENAI_API_KEY` | `gpt-4o-mini` | ~$0.0001/call | Most common dev key |
| `deepseek` | `DEEPSEEK_API_KEY` | `deepseek-chat` (V3) | very cheap | Strong CN/EN support |
| `qwen` | `DASHSCOPE_API_KEY` | `qwen-turbo` | very cheap | Alibaba, strong Chinese support |
| `glm` | `ZHIPU_API_KEY` | `glm-4-flash` | FREE tier | Zhipu AI, no-cost option |
| `kimi` | `MOONSHOT_API_KEY` | `kimi-latest` | very cheap | Moonshot AI, long context |
| `minimax` | `MINIMAX_API_KEY` (+ `MINIMAX_GROUP_ID` optional) | `abab6.5s-chat` | very cheap | Chinese-focused |
| `siliconflow` | `SILICONFLOW_API_KEY` | `Qwen/Qwen2.5-7B-Instruct` (default) | very cheap, free tier | SiliconFlow, aggregates many open-source models |
| `ollama` | None (local) | `llama3.2:3b` (default) | $0 | Fully offline + private |
Cockpit tries each provider in the order specified in llm_providers; the first one with available auth that succeeds wins. A provider with no API key, or whose service is down, is skipped silently. Default chain: anthropic → openai → deepseek → qwen → glm → kimi → minimax → ollama. If none succeeds, Cockpit falls back conservatively to "plan" — it works without any LLM.
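The chain-walking logic can be sketched as follows — the env-var mapping mirrors the table above, but the selection code itself is illustrative, not the plugin's implementation:

```python
import os

# Env-var mapping per the provider table above; ollama is local and needs no key.
PROVIDER_ENV = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",
    "qwen": "DASHSCOPE_API_KEY",
    "glm": "ZHIPU_API_KEY",
    "kimi": "MOONSHOT_API_KEY",
    "minimax": "MINIMAX_API_KEY",
    "ollama": None,
}

def pick_provider(chain, env=os.environ):
    """Return the first provider in the chain with available auth, else None."""
    for name in chain:
        if name not in PROVIDER_ENV:
            continue  # unknown provider name: skip silently
        var = PROVIDER_ENV[name]
        if var is None or env.get(var):
            return name
    return None  # caller falls back to the conservative "plan" verdict
```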
Recommended setup for Chinese users:
```bash
# Free option — GLM-4-Flash from Zhipu AI
# 1. Apply at https://open.bigmodel.cn/ (free tier, no credit card)
echo 'export ZHIPU_API_KEY="your-key"' >> ~/.zshrc
```

Then configure Cockpit to prefer GLM in ~/.claude/cockpit.json:

```json
{ "llm_providers": ["glm", "ollama"] }
```

Recommended setup for power users with multiple keys:
```bash
# Set whichever you have:
export ANTHROPIC_API_KEY="..."   # Anthropic
export OPENAI_API_KEY="..."      # OpenAI
export DEEPSEEK_API_KEY="..."    # DeepSeek
export DASHSCOPE_API_KEY="..."   # Qwen
export ZHIPU_API_KEY="..."       # GLM
export MOONSHOT_API_KEY="..."    # Kimi
export MINIMAX_API_KEY="..."     # MiniMax
# Cockpit auto-uses the first one it finds in chain order
```

```json
{
  "llm_providers": ["openai", "ollama"]  // skip Anthropic, try OpenAI first
}
```

```json
{
  "llm_providers": ["ollama"]  // local-only, no cloud
}
```

```json
{
  "llm_providers": []  // disable LLM entirely, heuristic + fallback only
}
```

If you want a fully local LLM with no API key:
```bash
# Install Ollama: https://ollama.com/
brew install ollama   # macOS

# Start Ollama
ollama serve

# Pull a small model (in another terminal)
ollama pull llama3.2:3b
# or qwen2.5:3b for better Chinese support
```

Set custom Ollama config (optional):

```bash
export OLLAMA_HOST="http://localhost:11434"  # default
export OLLAMA_MODEL="qwen2.5:3b"             # better for Chinese prompts
```

✅ Plan-First (β hard enforcement)
✅ Decision Markers (6 commit-direction tools)
✅ Thinking Checkpoint (every tool call, 3-field structured)
✅ Assumption Marker (data analysis, dual-track plan + mid-execution)
✅ Brief (2-tier adaptive)
✅ JSONL log (mirror disler schema)
✅ Haiku scenario filter + user override
✅ /cockpit on/off/auto slash command
v0.1 solves ~83% of original spec promise (FM1: 85%, FM2: 85%, FM3: 80%). See v01-completion-analysis.md for honest gap analysis.
- Override redirect (deny + user-provided alternative)
- `/cockpit status full` with recent Haiku decisions
- Pattern learning ("you've overridden type X 7 times, set a rule?")
- Optional web dashboard or direct export to disler
- Cross-AI support (Cursor / Devin / OpenAI Agents SDK)
- Mid-task check-in (every N tool calls, agent pauses)
- Confidence scoring per tool call
See archived/architecture.md §7 for complete roadmap.
- spec.md — Product spec (what + why)
- architecture.md — Implementation architecture (how + why this way), includes 21-decision Decision Log
- user-research.md — Real user voice quotes from HN / V2EX validating the pain
- v01-completion-analysis.md — Honest assessment of what v0.1 solves vs spec promise
- disler/claude-code-hooks-multi-agent-observability — established the hook-based observability pattern (1.4k stars)
- simple10/agents-observe — inspired multi-agent hierarchy thinking (510 stars)
- Microsoft HAX Toolkit / Saleema Amershi's 18 Human-AI Interaction Guidelines — design language for human-AI teaming
- Anthropic "Building Effective Agents" — agent system design framework
Special call-out: Anthropic shipped Claude Code with a userPromptKeywords.ts regex that scans user messages for frustration words (per analysis of the leaked source). They detect frustration; Cockpit prevents it.
MIT — see LICENSE.
Built by Cathy (@cathyzhang0905) — AI-native PM. Senior PM at DiDi, pivoting to AI-native product roles. I build tools at the intersection of agent oversight and personal information processing.
Interested in agent collaboration / vertical agents / AI PM workflows / eval methodology? Open an issue or reach me at @cathyzhang0905.
Example decision-log entry (./.cockpit/[date].jsonl):

```json
{
  "timestamp": "2026-04-29T...",
  "session_id": "...",
  "tool_use_id": "...",
  "event_type": "tool_call",
  "tool_name": "WebSearch",
  "tool_input": { "query": "..." },
  "output_preview": "...",
  "cockpit_decision": { "surfaced": true, "user_action": "approved" }
}
```