14 infrastructure bugs found in SkillsBench Trial 1 audit: verifier, permissions, provisioning (#192)

## Summary

An 8-agent trajectory audit of SkillsBench Opus 4.7 with-skills Trial 1 (84 tasks) found **14 tasks with infrastructure bugs** that caused false failures or prevented agent execution. These bugs hold the measured score at 40.3%, against an estimated 52-55% if they were fixed.

Model: claude-opus-4-7 | Agent: claude-agent-acp | Environment: Daytona | Concurrency: 2

Trial job dir: `experiments/skillsbench/jobs/opus47-with-skills-t1/2026-04-24__16-00-14/`

---

## Category 1: Verifier Broken (5 tasks)

The agent did correct work, but the verifier failed to evaluate it. These are false FAILs: `reward=0.0` where it should be 1.0.

### 1. suricata-custom-exfil
- **Bug:** Verifier `pip install pytest` installs to a different Python than `/usr/bin/python3`. Verifier runs `python3 -m pytest` → `No module named pytest`.
- **Agent work:** Wrote correct Suricata rule, tested it, sid:1000001 fires on positive pcap.
- **Trial dir:** `suricata-custom-exfil__4473b0c0` (retry that ran)
- **Reproduce:** Run the task, check verifier log for `No module named pytest`
- **Fix:** Pin Python path in verifier test.sh, or use `uv run pytest` instead of `python3 -m pytest`
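
One way to pin the interpreter, sketched under assumptions: drive both the install and the test run through `sys.executable`, so pip and pytest cannot resolve to different Pythons. The helper names and the `/tests/test_outputs.py` path are illustrative, not the task's actual verifier code.

```python
import subprocess
import sys


def pip_install_command(*packages: str) -> list[str]:
    """Build a pip invocation bound to the interpreter running this script."""
    return [sys.executable, "-m", "pip", "install", *packages]


def pytest_command(test_path: str) -> list[str]:
    """Build a pytest invocation bound to the same interpreter."""
    # sys.executable is the absolute path of the current interpreter, so
    # "-m pytest" looks in the same site-packages that pip just wrote to.
    return [sys.executable, "-m", "pytest", test_path]


def run_verifier(test_path: str) -> int:
    """Install pytest, then run the tests, both via one interpreter."""
    subprocess.run(pip_install_command("pytest"), check=True)
    return subprocess.run(pytest_command(test_path)).returncode
```

Calling `run_verifier("/tests/test_outputs.py")` (hypothetical path) from `test.sh` would replace the separate `pip install` and `python3 -m pytest` lines.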

### 2. video-tutorial-indexer
- **Bug:** Verifier installs `pytest-json-ctrf==0.1.0` but the plugin registers differently — `pytest -p ctrf` fails with `No module named 'ctrf'`. Zero tests ran.
- **Agent work:** Transcribed video with Whisper, wrote tutorial_index.json with 29 chapters.
- **Trial dir:** `video-tutorial-indexer__4d31a011`
- **Fix:** Update ctrf plugin import name in test.sh or remove `-p ctrf` flag

### 3. sales-pivot-analysis
- **Bug:** Verifier `test.sh` only installs `pytest` + `pytest-json-ctrf`, but `test_outputs.py` imports `openpyxl` at line 9. `ModuleNotFoundError`. Zero tests ran.
- **Agent work:** Parsed PDF + XLSX, wrote demographic_analysis.xlsx with pivot tables.
- **Trial dir:** `sales-pivot-analysis__f7896973`
- **Fix:** Add `openpyxl` to verifier test.sh pip install
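
A preflight check along these lines could surface this class of bug before any test is collected. It is a minimal sketch: `missing_modules` is a hypothetical helper, and the module list would have to be kept in sync with what `test_outputs.py` actually imports.

```python
import importlib.util


def missing_modules(names: list[str]) -> list[str]:
    """Return the top-level modules that cannot be found by this interpreter."""
    # find_spec returns None for a top-level module that is not installed,
    # without importing it.
    return [name for name in names if importlib.util.find_spec(name) is None]


def require_modules(names: list[str]) -> None:
    """Exit with a readable message instead of a mid-collection ImportError."""
    missing = missing_modules(names)
    if missing:
        raise SystemExit(f"verifier deps not installed: {', '.join(missing)}")
```

For this task, `require_modules(["pytest", "openpyxl"])` would report the missing dependency by name before pytest starts.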

### 4. pddl-tpp-planning
- **Bug:** Verifier `validate_plan()` opens `task01.pkl` (reference solution pickle) which was never provisioned into `/tests`. `FileNotFoundError`.
- **Agent work:** Read PDDL domain/problem files, wrote valid plan files.
- **Trial dir:** `pddl-tpp-planning__0ad04ccc`
- **Fix:** Include reference .pkl files in task tests/ directory

### 5. fix-build-google-auto
- **Bug:** Verifier test.sh calls `uv` which is not installed in the sandbox. Dockerfile comment says "uv is installed" but never installs it. All 5 test lines fail with `uv: command not found`.
- **Agent work:** Identified 4 cascading JDK 11 build issues, wrote 8 patches, got `mvn verify -DskipTests=true` to BUILD SUCCESS.
- **Trial dir:** `fix-build-google-auto__dc1de096`
- **Fix:** Add `uv` install to Dockerfile
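
Bugs 4 and 5 are both "something the verifier needs is absent". A sketch of a check that could run at image build time follows; the function names are hypothetical, while `uv`, `/tests`, and `task01.pkl` come from the reports above.

```python
import shutil
from pathlib import Path


def missing_tools(tools: list[str]) -> list[str]:
    """Return the executables that are not on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def missing_files(root: str, names: list[str]) -> list[str]:
    """Return the expected files that are absent under root."""
    return [name for name in names if not (Path(root) / name).is_file()]


def preflight(tools: list[str], root: str, names: list[str]) -> None:
    """Fail loudly, listing everything that is missing."""
    problems = missing_tools(tools) + missing_files(root, names)
    if problems:
        raise SystemExit(f"verifier prerequisites missing: {problems}")
```

For example, `preflight(["uv"], "/tests", ["task01.pkl"])` would fail the build rather than the trial.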

---

## Category 2: Sandbox Permissions (4 tasks)

The agent did correct work but couldn't write to required output paths (or, for lean4-proof, run the required toolchain). The sandbox user (`sandbox_user=agent`) has no access to root-owned directories.

### 6. syzkaller-ppdev-syzlang
- **Bug:** `/opt/syzkaller/sys/linux/` is owned by root. Agent (uid=1001) gets EACCES writing `dev_ppdev.txt`. Tried sudo (not installed), chmod (not permitted).
- **Agent work:** Correctly identified ppdev ioctl interface, computed ioctl numbers, wrote proper syzlang description.
- **Trial dir:** `syzkaller-ppdev-syzlang__d928daed`
- **Fix:** `chmod -R 777 /opt/syzkaller/sys/linux/` in Dockerfile or add agent to appropriate group

### 7. taxonomy-tree-merge
- **Bug:** (a) Environment setup took 369s (vs typical 30-100s), eating 41% of 900s timeout. (b) `/root/output/` owned by root, agent can't write.
- **Agent work:** Processed 8,718 records through embedding + clustering pipeline. Completed successfully but hit PermissionError on output.
- **Trial dir:** `taxonomy-tree-merge__ce812f15`
- **Fix:** Create output dir with agent permissions in Dockerfile; investigate slow build

### 8. lean4-proof
- **Bug:** Lean toolchain at `/root/.elan/bin/` not accessible to sandbox_user=agent. Agent can't run `lake` or `lean` to type-check proofs.
- **Agent work:** Wrote a Lean 4 proof (blind, couldn't verify).
- **Trial dir:** `lean4-proof__336be05e`
- **Fix:** Install Lean toolchain in a shared path or grant agent access to `/root/.elan/`

### 9. multilingual-video-dubbing
- **Bug:** (a) `/outputs` directory root-owned, agent can't write. (b) Verifier itself crashes: `No module named 'torch'` — torch not installed for verifier.
- **Agent work:** Generated TTS via Gemini API, did audio processing.
- **Trial dir:** `multilingual-video-dubbing__eb0c860f`
- **Fix:** chmod /outputs in Dockerfile; add torch to verifier deps
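
The write failures in this category could be caught with a check like the one below, run as the sandbox user before the agent starts. It is a sketch: `unwritable_dirs` is a hypothetical helper, and the directory list would come from each task's expected output paths.

```python
import os


def unwritable_dirs(paths: list[str]) -> list[str]:
    """Return the paths that are missing or not writable by the current user."""
    bad = []
    for path in paths:
        # Creating a file in a directory needs both write and search (execute)
        # permission. os.access checks the real uid, so this must run as the
        # sandbox user; as root it would report everything as writable.
        if not (os.path.isdir(path) and os.access(path, os.W_OK | os.X_OK)):
            bad.append(path)
    return bad
```

Running `unwritable_dirs(["/opt/syzkaller/sys/linux", "/root/output", "/outputs"])` as `agent` would have flagged tasks 6, 7, and 9 up front.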

---

## Category 3: Environment Provisioning (2 tasks)

Input data missing from sandbox — agent has nothing to work with.

### 10. organize-messy-files
- **Bug:** Sandbox only had 3 files (DAMOP.pptx, paper_file_1.docx, paper_file_2.docx). Verifier expects ~100 PDFs across 5+ subject folders.
- **Trial dir:** `organize-messy-files__7283c7f8`
- **Fix:** Ensure all PDFs are included in the Docker build context / COPY step

### 11. azure-bgp-oscillation-route-leak
- **Bug:** Dockerfile's inline DATAGEN Python script should generate JSON files into `/app/data/`. In sandbox, `/app/data/` was empty.
- **Trial dir:** `azure-bgp-oscillation-route-leak__d4273147`
- **Fix:** Debug the DATAGEN heredoc in Dockerfile — may fail silently during Daytona image build
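
To make a silent DATAGEN failure visible, the build could assert that the generator actually produced output. The sketch below assumes the generated files match `*.json` directly under `/app/data/`; the function names are hypothetical.

```python
from pathlib import Path


def generated_files(data_dir: str, pattern: str = "*.json") -> list[str]:
    """Return the names of non-empty files in data_dir matching pattern."""
    directory = Path(data_dir)
    return sorted(
        path.name
        for path in directory.glob(pattern)
        if path.is_file() and path.stat().st_size > 0
    )


def assert_data_generated(data_dir: str) -> list[str]:
    """Abort with a non-zero exit status if the generator produced nothing."""
    files = generated_files(data_dir)
    if not files:
        raise SystemExit(f"DATAGEN produced no JSON files in {data_dir}")
    return files
```

Invoking `assert_data_generated("/app/data")` in a `RUN` step right after the heredoc would turn an empty directory into a failed image build.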

---

## Category 4: Agent Launch / SDK Failures (3 tasks)

### 12. fix-build-agentops
- **Bug:** Agent binary `claude-agent-acp` not found (rc=127). bugswarm/cached-images base image has PATH issues for Node.js/npm.
- **Trial dir:** `fix-build-agentops__a25f005e`
- **Fix:** Ensure Node.js 22 install works on bugswarm base image

### 13. latex-formula-extraction
- **Bug:** Daytona SDK 0.168.0/0.169.0 `SessionCommandLogsResponse` pydantic validation error. Returns empty string instead of dict.
- **Trial dir:** `latex-formula-extraction__266e48be`
- **Fix:** Upstream Daytona SDK bug — report to Daytona

### 14. react-performance-debugging
- **Bug:** Verifier pytest-playwright `page` fixture not found. 5 tests SKIPPED (async), 6 tests ERROR. Agent's React optimizations never evaluated.
- **Agent work:** 21 tool calls, 7 code edits (parallelized API calls, added memoization, replaced lodash).
- **Trial dir:** `react-performance-debugging__f89a18bd`
- **Fix:** Fix conftest.py / pytest-playwright version compatibility in verifier

---

## Additional Issues (not bugs, but score-impacting)

### Disk space (2 tasks)
- **seismic-phase-picking** and **video-filler-word-remover**: Sandbox 74% full at start. Agent wastes 5+ minutes clearing pip cache before doing real work. Combined with tight timeouts, causes failures.
- **Fix:** Ensure sandbox images have sufficient free disk space
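
A disk headroom check at sandbox start could look like the sketch below. The 50% threshold is an arbitrary assumption, and `used / total` can differ slightly from the percentage `df` reports, which accounts for reserved blocks.

```python
import shutil


def used_fraction(path: str = "/") -> float:
    """Fraction of the filesystem containing path that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def over_budget(fraction: float, max_used: float = 0.5) -> bool:
    """True if the used fraction exceeds the allowed budget."""
    return fraction > max_used
```

With the 74% figure reported above, `over_budget(0.74)` is true, so the image would be rejected before the agent spends time clearing caches.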

### DinD Compose (4 tasks)
- gh-repo-analytics, pedestrian-traffic-counting, pg-essay-to-audiobook, scheduling-email-assistant all fail with `Process closed stdout (rc=None)` because SSH pipes break through DinD layers.
- **Fix:** PR on benchflow `fix/dind-pty-acp` branch — uses Daytona PTY WebSocket instead of SSH. Tested successfully on pedestrian-traffic-counting (reward=0.014, 7 tool calls).

---

## Impact

| Category | Tasks | Potential score gain |
|----------|-------|---------------------|
| Verifier broken (false FAILs) | 5 | +5 passes |
| Sandbox permissions | 4 | +3-4 passes |
| Environment provisioning | 2 | +1-2 passes |
| Agent launch / SDK | 3 | +1-2 passes |
| **Total** | **14** | **+10-13 passes** |

Current: 33.4/83 = 40.3%
Estimated if fixed: 43-46/83 = **52-55%**
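
The estimate can be reproduced from the table, as a sketch. It assumes the 43-46 figure comes from adding the total gain to the current 33.4 passes and dropping the fractional part.

```python
# Per-category pass gains (low, high), copied from the Impact table.
GAINS = {
    "Verifier broken (false FAILs)": (5, 5),
    "Sandbox permissions": (3, 4),
    "Environment provisioning": (1, 2),
    "Agent launch / SDK": (1, 2),
}
SCORED_TASKS = 83      # denominator used in the issue
CURRENT_PASSES = 33.4  # current pass total as reported in the issue


def total_gain() -> tuple[int, int]:
    """Sum the low and high ends of the per-category gains."""
    low = sum(lo for lo, _ in GAINS.values())
    high = sum(hi for _, hi in GAINS.values())
    return low, high


def estimated_range() -> tuple[float, float]:
    """Estimated score range in percent, using whole passes as in the issue."""
    low, high = total_gain()
    passes = (int(CURRENT_PASSES + low), int(CURRENT_PASSES + high))
    return tuple(round(100 * p / SCORED_TASKS, 1) for p in passes)
```

Here `total_gain()` gives `(10, 13)` and `estimated_range()` gives `(51.8, 55.4)`, which rounds to the 52-55% quoted above.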

This would put Opus 4.7 above Gemini Flash (48.7%) on the leaderboard.
