
14 infrastructure bugs found in SkillsBench Trial 1 audit — verifier, permissions, provisioning #192

@xdotli

Summary

An 8-agent trajectory audit of SkillsBench Opus 4.7 with-skills Trial 1 (84 tasks) found 14 tasks with infrastructure bugs that caused false failures or prevented agent execution. These bugs reduce the measured score from a potential ~52-55% to 40.3%.

Model: claude-opus-4-7 | Agent: claude-agent-acp | Environment: Daytona | Concurrency: 2

Trial job dir: experiments/skillsbench/jobs/opus47-with-skills-t1/2026-04-24__16-00-14/


Category 1: Verifier Broken (5 tasks)

Agent did correct work but verifier failed to evaluate it. These are FALSE FAILs — reward=0.0 when it should be 1.0.

1. suricata-custom-exfil

  • Bug: Verifier pip install pytest installs to a different Python than /usr/bin/python3. Verifier then runs python3 -m pytest, which fails with No module named pytest.
  • Agent work: Wrote correct Suricata rule, tested it, sid:1000001 fires on positive pcap.
  • Trial dir: suricata-custom-exfil__4473b0c0 (retry that ran)
  • Reproduce: Run the task, check verifier log for No module named pytest
  • Fix: Pin Python path in verifier test.sh, or use uv run pytest instead of python3 -m pytest
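
The interpreter mismatch described above can be caught before pytest is invoked. The following is a minimal sketch, assuming the verifier is allowed to run a small Python preflight; `module_available` is a hypothetical helper and not part of SkillsBench.

```python
import subprocess
import sys


def module_available(python_path: str, module: str) -> bool:
    """Return True if `module` can be imported by the interpreter at `python_path`."""
    result = subprocess.run(
        [python_path, "-c", f"import {module}"],
        capture_output=True,
    )
    return result.returncode == 0


if __name__ == "__main__":
    # Check the interpreter the verifier actually runs, not the one pip happens to target.
    print(module_available(sys.executable, "pytest"))
```

Installing with `python3 -m pip install pytest` rather than a bare `pip install pytest` should also keep the install and the run on the same interpreter.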

2. video-tutorial-indexer

  • Bug: Verifier installs pytest-json-ctrf==0.1.0 but the plugin registers differently — pytest -p ctrf fails with No module named 'ctrf'. Zero tests ran.
  • Agent work: Transcribed video with Whisper, wrote tutorial_index.json with 29 chapters.
  • Trial dir: video-tutorial-indexer__4d31a011
  • Fix: Update ctrf plugin import name in test.sh or remove -p ctrf flag
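
One way to find the name a plugin actually registers under is to list the `pytest11` entry points, which is the group pytest uses to discover installed plugins. This is a sketch only: `pytest_plugin_entry_points` is a hypothetical helper, and it makes no claim about what name pytest-json-ctrf registers.

```python
from importlib import metadata


def pytest_plugin_entry_points() -> dict:
    """Map each installed pytest plugin's registered name to its module path."""
    eps = metadata.entry_points()
    # Python 3.10+ exposes .select(); older versions return a plain dict of groups.
    if hasattr(eps, "select"):
        group = eps.select(group="pytest11")
    else:
        group = eps.get("pytest11", [])
    return {ep.name: ep.value for ep in group}


if __name__ == "__main__":
    for name, target in sorted(pytest_plugin_entry_points().items()):
        print(f"{name} -> {target}")
```

The argument to `-p` needs to be an importable module name or one of the entry-point names printed here, so running this inside the verifier environment should show whether `ctrf` is a valid value.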

3. sales-pivot-analysis

  • Bug: Verifier test.sh only installs pytest + pytest-json-ctrf, but test_outputs.py imports openpyxl at line 9. ModuleNotFoundError. Zero tests ran.
  • Agent work: Parsed PDF + XLSX, wrote demographic_analysis.xlsx with pivot tables.
  • Trial dir: sales-pivot-analysis__f7896973
  • Fix: Add openpyxl to verifier test.sh pip install
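
Missing verifier dependencies of this kind can be detected statically before any test runs. The sketch below parses a test file's imports and reports those that cannot be resolved in the current environment; `missing_imports` is a hypothetical helper, and it only looks at top-level package names.

```python
import ast
import importlib.util


def missing_imports(source: str) -> list:
    """Return top-level module names imported by `source` that cannot be found."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports, which are not installable packages.
            names.add(node.module.split(".")[0])
    return sorted(n for n in names if importlib.util.find_spec(n) is None)
```

Passing the text of test_outputs.py to this function inside the verifier environment would be expected to list openpyxl until it is added to the pip install line.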

4. pddl-tpp-planning

  • Bug: Verifier validate_plan() opens task01.pkl (reference solution pickle) which was never provisioned into /tests. FileNotFoundError.
  • Agent work: Read PDDL domain/problem files, wrote valid plan files.
  • Trial dir: pddl-tpp-planning__0ad04ccc
  • Fix: Include reference .pkl files in task tests/ directory
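
A provisioning check for verifier fixtures could run before the trial starts. This is a sketch under the assumption that each task can declare the files its verifier needs; `missing_files` is a hypothetical helper, and only task01.pkl and /tests are named in the report.

```python
from pathlib import Path


def missing_files(tests_dir: str, required: list) -> list:
    """Return the names in `required` that are not regular files under `tests_dir`."""
    root = Path(tests_dir)
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    # Only task01.pkl is named in the report; a real list would come from the task.
    print(missing_files("/tests", ["task01.pkl"]))
```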

5. fix-build-google-auto

  • Bug: Verifier test.sh calls uv which is not installed in the sandbox. Dockerfile comment says "uv is installed" but never installs it. All 5 test lines fail with uv: command not found.
  • Agent work: Identified 4 cascading JDK 11 build issues, wrote 8 patches, got mvn verify -DskipTests=true to BUILD SUCCESS.
  • Trial dir: fix-build-google-auto__dc1de096
  • Fix: Add uv install to Dockerfile
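
A command-availability check would surface this before the verifier runs. The sketch below uses the standard library's `shutil.which`; `missing_tools` is a hypothetical helper, not an existing SkillsBench function.

```python
import shutil


def missing_tools(tools: list) -> list:
    """Return the commands in `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    # "uv" is the command the verifier's test.sh calls in this task.
    print(missing_tools(["uv"]))
```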

Category 2: Sandbox Permissions (4 tasks)

Agent did correct work but couldn't write to required output paths. sandbox_user=agent can't write to root-owned directories.

6. syzkaller-ppdev-syzlang

  • Bug: /opt/syzkaller/sys/linux/ is owned by root. Agent (uid=1001) gets EACCES writing dev_ppdev.txt. Tried sudo (not installed), chmod (not permitted).
  • Agent work: Correctly identified ppdev ioctl interface, computed ioctl numbers, wrote proper syzlang description.
  • Trial dir: syzkaller-ppdev-syzlang__d928daed
  • Fix: chmod -R 777 /opt/syzkaller/sys/linux/ in Dockerfile or add agent to appropriate group

7. taxonomy-tree-merge

  • Bug: (a) Environment setup took 369s (vs typical 30-100s), eating 41% of 900s timeout. (b) /root/output/ owned by root, agent can't write.
  • Agent work: Processed 8,718 records through embedding + clustering pipeline. Completed successfully but hit PermissionError on output.
  • Trial dir: taxonomy-tree-merge__ce812f15
  • Fix: Create output dir with agent permissions in Dockerfile; investigate slow build

8. lean4-proof

  • Bug: Lean toolchain at /root/.elan/bin/ not accessible to sandbox_user=agent. Agent can't run lake or lean to type-check proofs.
  • Agent work: Wrote a Lean 4 proof (blind, couldn't verify).
  • Trial dir: lean4-proof__336be05e
  • Fix: Install Lean toolchain in a shared path or grant agent access to /root/.elan/

9. multilingual-video-dubbing

  • Bug: (a) /outputs directory root-owned, agent can't write. (b) Verifier itself crashes: No module named 'torch' — torch not installed for verifier.
  • Agent work: Generated TTS via Gemini API, did audio processing.
  • Trial dir: multilingual-video-dubbing__eb0c860f
  • Fix: chmod /outputs in Dockerfile; add torch to verifier deps
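
The four permission failures above share one shape, so a single check run as the sandbox user could cover them. This is a minimal sketch: `inaccessible` is a hypothetical helper, and the paths in the example are the ones quoted in the reports above.

```python
import os


def inaccessible(paths: list, mode: int = os.W_OK) -> list:
    """Return the paths the current user cannot access with `mode` (missing paths included)."""
    return [p for p in paths if not os.access(p, mode)]


if __name__ == "__main__":
    # Output locations the agent must be able to write to.
    print(inaccessible(["/opt/syzkaller/sys/linux/", "/root/output/", "/outputs"]))
    # Toolchain directory the agent must be able to traverse and execute from.
    print(inaccessible(["/root/.elan/bin/"], mode=os.X_OK))
```

The check is only meaningful when executed as sandbox_user=agent, since root passes the write check on any path that exists.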

Category 3: Environment Provisioning (2 tasks)

Input data missing from sandbox — agent has nothing to work with.

10. organize-messy-files

  • Bug: Sandbox only had 3 files (DAMOP.pptx, paper_file_1.docx, paper_file_2.docx). Verifier expects ~100 PDFs across 5+ subject folders.
  • Trial dir: organize-messy-files__7283c7f8
  • Fix: Ensure all PDFs are included in the Docker build context / COPY step

11. azure-bgp-oscillation-route-leak

  • Bug: Dockerfile's inline DATAGEN Python script should generate JSON files into /app/data/. In sandbox, /app/data/ was empty.
  • Trial dir: azure-bgp-oscillation-route-leak__d4273147
  • Fix: Debug the DATAGEN heredoc in Dockerfile — may fail silently during Daytona image build
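
Both provisioning failures could be caught by counting input files in the built image before the agent starts. The sketch below assumes such a preflight step exists; `count_files` is a hypothetical helper, and the expected counts would have to come from each task's own definition.

```python
from pathlib import Path


def count_files(root: str, pattern: str = "*") -> int:
    """Count regular files under `root` matching `pattern`, searching recursively."""
    base = Path(root)
    if not base.is_dir():
        return 0
    return sum(1 for p in base.rglob(pattern) if p.is_file())


if __name__ == "__main__":
    # /app/data/ is the directory the DATAGEN script should have populated.
    print(count_files("/app/data", "*.json"))
```

A count of zero for `/app/data` would flag a silent DATAGEN failure at build time rather than during the trial.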

Category 4: Agent Launch / SDK Failures (3 tasks)

12. fix-build-agentops

  • Bug: Agent binary claude-agent-acp not found (rc=127). bugswarm/cached-images base image has PATH issues for Node.js/npm.
  • Trial dir: fix-build-agentops__a25f005e
  • Fix: Ensure Node.js 22 install works on bugswarm base image

13. latex-formula-extraction

  • Bug: Daytona SDK 0.168.0/0.169.0 SessionCommandLogsResponse pydantic validation error. Returns empty string instead of dict.
  • Trial dir: latex-formula-extraction__266e48be
  • Fix: Upstream Daytona SDK bug — report to Daytona

14. react-performance-debugging

  • Bug: Verifier pytest-playwright page fixture not found. 5 tests SKIPPED (async), 6 tests ERROR. Agent's React optimizations never evaluated.
  • Agent work: 21 tool calls, 7 code edits (parallelized API calls, added memoization, replaced lodash).
  • Trial dir: react-performance-debugging__f89a18bd
  • Fix: Fix conftest.py / pytest-playwright version compatibility in verifier

Additional Issues (not bugs, but score-impacting)

Disk space (2 tasks)

  • seismic-phase-picking and video-filler-word-remover: Sandbox 74% full at start. Agent wastes 5+ minutes clearing pip cache before doing real work. Combined with tight timeouts, causes failures.
  • Fix: Ensure sandbox images have sufficient free disk space
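
Disk headroom can be checked when the sandbox starts. This is a minimal sketch using `shutil.disk_usage`; `used_fraction` is a hypothetical helper, and the 0.74 threshold simply mirrors the 74% figure reported above rather than a recommended limit.

```python
import shutil


def used_fraction(path: str = "/") -> float:
    """Return the fraction of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


if __name__ == "__main__":
    fraction = used_fraction("/")
    # Threshold taken from the 74% observed in the two affected tasks.
    print(f"{fraction:.0%} used", "(low headroom)" if fraction >= 0.74 else "")
```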

DinD Compose (4 tasks)

  • gh-repo-analytics, pedestrian-traffic-counting, pg-essay-to-audiobook, scheduling-email-assistant all fail with Process closed stdout (rc=None) because SSH pipes break through DinD layers.
  • Fix: PR on benchflow fix/dind-pty-acp branch — uses Daytona PTY WebSocket instead of SSH. Tested successfully on pedestrian-traffic-counting (reward=0.014, 7 tool calls).

Impact

| Category | Tasks | Potential score gain |
| --- | --- | --- |
| Verifier broken (false FAILs) | 5 | +5 passes |
| Sandbox permissions | 4 | +3-4 passes |
| Environment provisioning | 2 | +1-2 passes |
| Agent launch / SDK | 3 | +1-2 passes |
| Total | 14 | +10-13 passes |

Current: 33.4/83 = 40.3%
Estimated if fixed: 43-46/83 = 52-55%

This would put Opus 4.7 above Gemini Flash (48.7%) on the leaderboard.
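
The projected range can be reproduced with a few lines of arithmetic. The sketch below uses only the figures quoted above (43 to 46 passes out of 83, and 48.7% for Gemini Flash); `percent` is a hypothetical helper.

```python
def percent(passes: float, total: int) -> float:
    """Return passes/total as a percentage rounded to one decimal place."""
    return round(100 * passes / total, 1)


TOTAL_TASKS = 83          # denominator used in the Impact section
GEMINI_FLASH_SCORE = 48.7  # leaderboard figure quoted above

low = percent(43, TOTAL_TASKS)
high = percent(46, TOTAL_TASKS)

if __name__ == "__main__":
    print(f"projected range: {low}% to {high}%")
    print("above Gemini Flash at the low end:", low > GEMINI_FLASH_SCORE)
```

Even the low end of the range clears the 48.7% comparison point, so the leaderboard claim does not depend on all 13 potential passes materializing.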
