Summary
An 8-agent trajectory audit of SkillsBench Opus 4.7 with-skills Trial 1 (84 tasks) found 14 tasks with infrastructure bugs that caused false failures or prevented agent execution. These bugs reduce the measured score from an estimated 52-55% to 40.3%.
Model: claude-opus-4-7 | Agent: claude-agent-acp | Environment: Daytona | Concurrency: 2
Trial job dir: experiments/skillsbench/jobs/opus47-with-skills-t1/2026-04-24__16-00-14/
Category 1: Verifier Broken (5 tasks)
Agent did correct work but verifier failed to evaluate it. These are FALSE FAILs — reward=0.0 when it should be 1.0.
1. suricata-custom-exfil
- Bug: The verifier's `pip install pytest` installs into a different Python than `/usr/bin/python3`. The verifier then runs `python3 -m pytest`, which fails with `No module named pytest`.
- Agent work: Wrote correct Suricata rule, tested it, sid:1000001 fires on positive pcap.
- Trial dir: `suricata-custom-exfil__4473b0c0` (retry that ran)
- Reproduce: Run the task, then check the verifier log for `No module named pytest`
- Fix: Pin the Python path in the verifier test.sh, or use `uv run pytest` instead of `python3 -m pytest`
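The failure above comes from installing a package for one interpreter and running tests with another. A minimal preflight sketch, assuming the verifier can run a short Python snippet under the same interpreter it later uses for pytest; `missing_modules` is a hypothetical helper name, not part of SkillsBench:

```python
import importlib.util


def missing_modules(names):
    """Return the top-level modules in `names` that this interpreter cannot import.

    Only top-level names are supported here; dotted names with a missing
    parent package would raise instead of returning None.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Called as `missing_modules(["pytest"])` with `/usr/bin/python3`, a non-empty result would surface the mismatch before any test runs. The same check would also apply to the `openpyxl` and `torch` failures reported for other tasks.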
2. video-tutorial-indexer
- Bug: Verifier installs `pytest-json-ctrf==0.1.0`, but the plugin registers under a different name: `pytest -p ctrf` fails with `No module named 'ctrf'`. Zero tests ran.
- Agent work: Transcribed video with Whisper, wrote tutorial_index.json with 29 chapters.
- Trial dir: `video-tutorial-indexer__4d31a011`
- Fix: Update the ctrf plugin import name in test.sh, or remove the `-p ctrf` flag
3. sales-pivot-analysis
- Bug: Verifier `test.sh` only installs `pytest` + `pytest-json-ctrf`, but `test_outputs.py` imports `openpyxl` at line 9, raising `ModuleNotFoundError`. Zero tests ran.
- Agent work: Parsed PDF + XLSX, wrote demographic_analysis.xlsx with pivot tables.
- Trial dir: `sales-pivot-analysis__f7896973`
- Fix: Add `openpyxl` to the verifier test.sh pip install
4. pddl-tpp-planning
- Bug: Verifier `validate_plan()` opens `task01.pkl` (reference solution pickle), which was never provisioned into `/tests`, raising `FileNotFoundError`.
- Agent work: Read PDDL domain/problem files, wrote valid plan files.
- Trial dir: `pddl-tpp-planning__0ad04ccc`
- Fix: Include reference .pkl files in task tests/ directory
5. fix-build-google-auto
- Bug: Verifier test.sh calls `uv`, which is not installed in the sandbox. The Dockerfile comment says "uv is installed" but nothing installs it. All 5 test lines fail with `uv: command not found`.
- Agent work: Identified 4 cascading JDK 11 build issues, wrote 8 patches, got `mvn verify -DskipTests=true` to BUILD SUCCESS.
- Trial dir: `fix-build-google-auto__dc1de096`
- Fix: Add `uv` install to Dockerfile
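A missing executable such as `uv` can be detected before the test lines run. The sketch below is one possible check, using only the standard library; `missing_commands` is a hypothetical helper name:

```python
import shutil


def missing_commands(commands):
    """Return the commands in `commands` that cannot be found on PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]
```

For this task, `missing_commands(["uv"])` returning `["uv"]` would let the verifier report an infrastructure error instead of recording reward=0.0.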
Category 2: Sandbox Permissions (4 tasks)
Agent did correct work but couldn't write to required output paths. sandbox_user=agent can't write to root-owned directories.
6. syzkaller-ppdev-syzlang
- Bug: `/opt/syzkaller/sys/linux/` is owned by root. Agent (uid=1001) gets EACCES writing `dev_ppdev.txt`. Tried sudo (not installed), chmod (not permitted).
- Agent work: Correctly identified ppdev ioctl interface, computed ioctl numbers, wrote proper syzlang description.
- Trial dir: `syzkaller-ppdev-syzlang__d928daed`
- Fix: `chmod -R 777 /opt/syzkaller/sys/linux/` in Dockerfile, or add agent to an appropriate group
7. taxonomy-tree-merge
- Bug: (a) Environment setup took 369s (vs typical 30-100s), eating 41% of the 900s timeout. (b) `/root/output/` is owned by root, so the agent can't write to it.
- Agent work: Processed 8,718 records through embedding + clustering pipeline. Completed successfully but hit PermissionError on output.
- Trial dir: `taxonomy-tree-merge__ce812f15`
- Fix: Create output dir with agent permissions in Dockerfile; investigate slow build
8. lean4-proof
- Bug: Lean toolchain at `/root/.elan/bin/` is not accessible to sandbox_user=agent. Agent can't run `lake` or `lean` to type-check proofs.
- Agent work: Wrote a Lean 4 proof (blind, couldn't verify).
- Trial dir: `lean4-proof__336be05e`
- Fix: Install the Lean toolchain in a shared path, or grant agent access to `/root/.elan/`
9. multilingual-video-dubbing
- Bug: (a) The `/outputs` directory is root-owned, so the agent can't write to it. (b) The verifier itself crashes with `No module named 'torch'` because torch is not installed for the verifier.
- Agent work: Generated TTS via Gemini API, did audio processing.
- Trial dir: `multilingual-video-dubbing__eb0c860f`
- Fix: chmod /outputs in Dockerfile; add torch to verifier deps
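The permission failures in this category share one pattern: a path the task requires is not writable by the sandbox user. A minimal sketch of a check that could run as that user before the agent starts; `unwritable_dirs` is a hypothetical helper, and the result is only meaningful when it is not run as root, since root typically passes access checks:

```python
import os


def unwritable_dirs(paths):
    """Return paths that are missing, not directories, or not writable
    (and searchable) by the current user."""
    return [
        p for p in paths
        if not (os.path.isdir(p) and os.access(p, os.W_OK | os.X_OK))
    ]
```

Running `unwritable_dirs(["/opt/syzkaller/sys/linux/", "/root/output/", "/outputs"])` as the agent user in the matching task images would be expected to flag the directories listed above.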
Category 3: Environment Provisioning (2 tasks)
Input data missing from sandbox — agent has nothing to work with.
10. organize-messy-files
- Bug: Sandbox only had 3 files (DAMOP.pptx, paper_file_1.docx, paper_file_2.docx). Verifier expects ~100 PDFs across 5+ subject folders.
- Trial dir: `organize-messy-files__7283c7f8`
- Fix: Ensure all PDFs are included in the Docker build context / COPY step
11. azure-bgp-oscillation-route-leak
- Bug: Dockerfile's inline DATAGEN Python script should generate JSON files into `/app/data/`. In the sandbox, `/app/data/` was empty.
- Trial dir: `azure-bgp-oscillation-route-leak__d4273147`
- Fix: Debug the DATAGEN heredoc in Dockerfile — may fail silently during Daytona image build
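Both provisioning bugs could be caught by counting the expected input files before the agent starts, or at the end of the image build so that a silent failure becomes a loud one. A minimal sketch, assuming such a check is acceptable in the task setup; `count_files` is a hypothetical helper and any threshold compared against it would be task-specific:

```python
from pathlib import Path


def count_files(root, pattern="*"):
    """Count regular files under `root` (recursively) whose name matches `pattern`.

    Returns 0 when `root` does not exist or is not a directory.
    """
    base = Path(root)
    if not base.is_dir():
        return 0
    return sum(1 for p in base.rglob(pattern) if p.is_file())
```

For example, `count_files("/app/data", "*.json") == 0` would have flagged the empty DATAGEN output, and a PDF count far below the roughly 100 the verifier expects would have flagged organize-messy-files.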
Category 4: Agent Launch / SDK Failures (3 tasks)
12. fix-build-agentops
- Bug: Agent binary `claude-agent-acp` not found (rc=127). bugswarm/cached-images base image has PATH issues for Node.js/npm.
- Trial dir: `fix-build-agentops__a25f005e`
- Fix: Ensure Node.js 22 install works on bugswarm base image
13. latex-formula-extraction
- Bug: Daytona SDK 0.168.0/0.169.0 `SessionCommandLogsResponse` pydantic validation error. Returns empty string instead of dict.
- Trial dir: `latex-formula-extraction__266e48be`
- Fix: Upstream Daytona SDK bug — report to Daytona
14. react-performance-debugging
- Bug: Verifier pytest-playwright `page` fixture not found. 5 tests SKIPPED (async), 6 tests ERROR. Agent's React optimizations never evaluated.
- Agent work: 21 tool calls, 7 code edits (parallelized API calls, added memoization, replaced lodash).
- Trial dir: `react-performance-debugging__f89a18bd`
- Fix: Fix conftest.py / pytest-playwright version compatibility in verifier
Additional Issues (not bugs, but score-impacting)
Disk space (2 tasks)
- seismic-phase-picking and video-filler-word-remover: Sandbox 74% full at start. Agent wastes 5+ minutes clearing pip cache before doing real work. Combined with tight timeouts, causes failures.
- Fix: Ensure sandbox images have sufficient free disk space
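Disk pressure can be measured at sandbox start rather than discovered by the agent mid-task. A small sketch using the standard library; `used_fraction` is a hypothetical helper, and its value may differ slightly from what `df` reports because of filesystem-reserved blocks:

```python
import shutil


def used_fraction(path="/"):
    """Return the fraction (0.0 to 1.0) of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total
```

A harness could log `used_fraction("/")` per task and flag images that start near the 74% level observed here.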
DinD Compose (4 tasks)
- gh-repo-analytics, pedestrian-traffic-counting, pg-essay-to-audiobook, scheduling-email-assistant all fail with `Process closed stdout (rc=None)` because SSH pipes break through DinD layers.
- Fix: PR on benchflow `fix/dind-pty-acp` branch, which uses Daytona PTY WebSocket instead of SSH. Tested successfully on pedestrian-traffic-counting (reward=0.014, 7 tool calls).
Impact
| Category | Tasks | Potential score gain |
| --- | --- | --- |
| Verifier broken (false FAILs) | 5 | +5 passes |
| Sandbox permissions | 4 | +3-4 passes |
| Environment provisioning | 2 | +1-2 passes |
| Agent launch / SDK | 3 | +1-2 passes |
| Total | 14 | +10-13 passes |
Current: 33.4/83 = 40.3%
Estimated if fixed: 43-46/83 = 52-55%
This would put Opus 4.7 above Gemini Flash (48.7%) on the leaderboard.
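The totals can be re-derived from the per-category ranges. The sketch below sums the table's low and high gains and converts the report's rounded pass counts (43 and 46, taken as given) into percentages over 83; the function names are illustrative only:

```python
# Per-category gain ranges (low, high), copied from the Impact table.
GAINS = {
    "Verifier broken (false FAILs)": (5, 5),
    "Sandbox permissions": (3, 4),
    "Environment provisioning": (1, 2),
    "Agent launch / SDK": (1, 2),
}


def total_gain(gains):
    """Sum the low ends and the high ends of each category's estimated gain."""
    low = sum(lo for lo, _ in gains.values())
    high = sum(hi for _, hi in gains.values())
    return low, high


def pass_rate(passes, total):
    """Pass rate as a percentage."""
    return 100.0 * passes / total
```

`total_gain(GAINS)` reproduces the table's +10-13 total, and `pass_rate(43, 83)` and `pass_rate(46, 83)` round to the 52% and 55% quoted above.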