Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion docs/docs.json
Original file line number Diff line number Diff line change
Expand Up @@ -202,7 +202,8 @@
"zh/benchmarks/osworld",
"zh/benchmarks/gaia",
"zh/benchmarks/tau-bench",
"zh/benchmarks/cybench"
"zh/benchmarks/cybench",
"zh/benchmarks/cybergym"
]
},
{
Expand Down
189 changes: 189 additions & 0 deletions docs/zh/benchmarks/cybergym.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,189 @@
---
title: "CyberGym"
description: "使用 `qitos.benchmark.cybergym`、外部同步的 `cybergym_agent` 和 QitOS trace,在 CyberGym 上运行安全漏洞 PoC benchmark。"
---

CyberGym 是一个面向漏洞触发与 PoC 生成的安全 benchmark。任务通常会提供漏洞版本代码、描述信息和一个 `submit.sh`,agent 需要构造输入,让 vulnerable 版本触发异常,并最终通过 patched 版本验证。

这次集成把 CyberGym 放进了 QitOS 的正式 benchmark 结构里:

- `qitos.benchmark.cybergym`
- `qitos.recipes.benchmarks.cybergym`
- `examples/benchmarks/cybergym_eval.py`

同时把运行产物统一放到 `runs/cybergym/` 下,便于直接用 `qita` 查看轨迹。

注意:这个 PR 不直接提交 `cybergym_agent` 代码。运行前需要维护者手动把 `cybergym_agent` 仓库同步到 `qitos/benchmark/cybergym/agent/`。

## 集成结构

QitOS 侧的主要文件:

- `qitos/benchmark/cybergym/adapter.py`
- `qitos/benchmark/cybergym/runtime.py`
- `qitos/benchmark/cybergym/evaluator.py`
- `qitos/benchmark/cybergym/scorer.py`
- `qitos/benchmark/cybergym/runner.py`
- `qitos/recipes/benchmarks/cybergym.py`
- `examples/benchmarks/cybergym_eval.py`

职责划分:

- `adapter.py`:把 CyberGym 的 `task_id` 转成 QitOS `Task`
- `runtime.py`:生成 task 目录并附加运行时元信息
- `runner.py`:调起外部同步进来的 `cybergym_agent`、写 trace、返回标准结果
- `recipe`:可复现 baseline
- `example`:最薄的一层命令入口

## 同步 `cybergym_agent`

运行前,需要先把 `cybergym_agent` 仓库同步到 QitOS 里。下面命令在 QitOS 仓库根目录执行:

```bash
mkdir -p qitos/benchmark/cybergym/agent
rsync -a \
--exclude .git \
--exclude __pycache__ \
--exclude test_agent.py \
../cybergym_agent/ \
qitos/benchmark/cybergym/agent/
```

如果没有这一步,`qitos.benchmark.cybergym.runner` 会在运行时直接报错,提示先复制 `cybergym_agent`。

## 准备工作

<Steps>
<Step title="准备 CyberGym 数据目录">
确保可以访问 `cybergym_data/data`,例如放在 QitOS 同级目录的 `../cybergym/cybergym_data/data`。
</Step>
<Step title="启动 CyberGym server">
需要一台能运行 Docker 镜像的 CyberGym server。示例:

```bash
cd ../cybergym
python -m cybergym.server \
--host 127.0.0.1 \
--port 8669 \
--log_dir ../qitos/runs/cybergym/server_poc \
--db_path ../qitos/runs/cybergym/server_poc/poc.db
```
</Step>
<Step title="设置模型与 verify key">
```bash
export CYBERGYM_CLAUDE_AUTH_TOKEN="your-model-key"
export CYBERGYM_API_KEY="your-verify-key"
```
</Step>
</Steps>

## 单任务运行

从 QitOS 仓库根目录运行:

```bash
python examples/benchmarks/cybergym_eval.py \
--task-id arvo:1065 \
--data-dir ../cybergym/cybergym_data/data \
--out-dir runs/cybergym/workspace/arvo_1065 \
--server http://127.0.0.1:8669 \
--difficulty level1 \
--model-name GLM-5.1-sii \
--api-key "$CYBERGYM_CLAUDE_AUTH_TOKEN" \
--base-url https://your-openai-compatible-endpoint/v1 \
--max-steps 30 \
--trace-logdir runs/cybergym/traces
```

## 批量跑 100 个任务

假设 `tasks.txt` 每行一个 `task_id`:

```text
arvo:1065
arvo:3938
oss-fuzz:42535201
...
```

从 QitOS 仓库根目录顺序跑:

```bash
export TASKS_FILE=./tasks.txt
export SERVER=http://your-cybergym-server:8669

while read -r TASK_ID; do
[ -z "$TASK_ID" ] && continue
SLUG="${TASK_ID/:/_}"
echo "===== START $TASK_ID ====="
python examples/benchmarks/cybergym_eval.py \
--task-id "$TASK_ID" \
--data-dir ../cybergym/cybergym_data/data \
--out-dir "runs/cybergym/workspace/$SLUG" \
--server "$SERVER" \
--difficulty level1 \
--model-name GLM-5.1-sii \
--api-key "$CYBERGYM_CLAUDE_AUTH_TOKEN" \
--base-url https://your-openai-compatible-endpoint/v1 \
--max-steps 30 \
--trace-logdir runs/cybergym/traces
echo "===== END $TASK_ID ====="
done < "$TASKS_FILE" | tee runs/cybergym/run-100.log
```

小并发时可以自行改成 `xargs -P 2` 或 `xargs -P 4`,但建议先验证模型端和 server 端的稳定性。

## 批量 verify

CyberGym 的 public `/submit-vul` 只返回 vulnerable 侧结果。完整 benchmark 判定还需要 fix 侧 verify。

在 `cybergym` 仓库根目录运行:

```bash
python scripts/verify_batch_results.py \
--logs_dir ../qitos/runs/cybergym/logs \
--server http://your-cybergym-server:8669 \
--pocdb_path ../qitos/runs/cybergym/server_poc/poc.db \
--summary_json ../qitos/runs/cybergym/verify-summary.json
```

只看当前数据库状态、不真正发 verify:

```bash
python scripts/verify_batch_results.py \
--logs_dir ../qitos/runs/cybergym/logs \
--server http://your-cybergym-server:8669 \
--pocdb_path ../qitos/runs/cybergym/server_poc/poc.db \
--skip_verify
```

## trace 与产物

运行后主要产物在:

- `runs/cybergym/workspace/`
- `runs/cybergym/server_poc/`
- `runs/cybergym/traces/`

查看轨迹:

```bash
qita board --logdir runs/cybergym/traces
```

QitOS trace 会写出:

- `manifest.json`
- `events.jsonl`
- `steps.jsonl`

## 当前状态

这次集成已经验证了:

- CyberGym task 能转成 QitOS `Task`
- benchmark family 已注册到 `qitos.benchmark`
- recipe 和 thin example 能正常调用同一条 runner
- 真实 smoke 可以生成 task、初始化 `GLM-5.1-sii`、写出 QitOS trace

当前已知限制是模型协议还没有完全对齐。`GLM-5.1-sii` 目前更倾向输出 `<tool_call>...` 风格内容,而当前 agent 还走 `JsonDecisionParser` 期望纯 JSON,因此在短 smoke 里会停在 parser error。这是下一步要处理的模型协议适配问题,不是 CyberGym benchmark 结构接入问题。
3 changes: 2 additions & 1 deletion docs/zh/benchmarks/overview.mdx
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
---
title: "基准测试"
description: "通过统一的官方 QitOS benchmark 路径运行 desktop-starter、OSWorld、GAIA、Tau-Bench 与 CyBench,并产出统一结果与 trace artifacts。"
description: "通过统一的官方 QitOS benchmark 路径运行 desktop-starter、OSWorld、GAIA、Tau-Bench、CyBenchCyberGym,并产出统一结果与 trace artifacts。"
---

在 QitOS 里,benchmark 不是另一套平行 runtime,而是同一条 agent runtime 叙事的延伸。
Expand All @@ -23,6 +23,7 @@ description: "通过统一的官方 QitOS benchmark 路径运行 desktop-starter
| [GAIA](/zh/benchmarks/gaia) | 通用 AI assistant 任务 | Exact match |
| [Tau-Bench](/zh/benchmarks/tau-bench) | Tool-agent-user 交互 | Reward / pass^k |
| [CyBench](/zh/benchmarks/cybench) | CTF 风格安全评测 | Guided subtask score |
| [CyberGym](/zh/benchmarks/cybergym) | 漏洞 PoC 生成与 differential verify | `vul_exit_code != 0` 且 `fix_exit_code == 0` |

## 官方 benchmark 入口

Expand Down
17 changes: 17 additions & 0 deletions examples/benchmarks/cybergym_eval.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
"""Thin CyberGym benchmark entrypoint backed by the canonical recipe."""

from qitos.recipes.benchmarks.cybergym import (
main,
run_cybergym_agent_task,
run_cybergym_recipe_task,
)

__all__ = [
"main",
"run_cybergym_agent_task",
"run_cybergym_recipe_task",
]


if __name__ == "__main__":
raise SystemExit(main())
12 changes: 12 additions & 0 deletions qitos/benchmark/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -40,6 +40,12 @@
"load_cybench_tasks": (".cybench", "load_cybench_tasks"),
"run_cybench_task": (".cybench", "run_cybench_task"),
"score_cybench_submission": (".cybench", "score_cybench_submission"),
"CyberGymBenchmarkAdapter": (".cybergym", "CyberGymBenchmarkAdapter"),
"CyberGymEvaluator": (".cybergym", "CyberGymEvaluator"),
"CyberGymRuntimeHook": (".cybergym", "CyberGymRuntimeHook"),
"CyberGymScorer": (".cybergym", "CyberGymScorer"),
"load_cybergym_tasks": (".cybergym", "load_cybergym_tasks"),
"run_cybergym_task": (".cybergym", "run_cybergym_task"),
"GaiaAdapter": (".gaia", "GaiaAdapter"),
"GaiaEvaluator": (".gaia", "GaiaEvaluator"),
"GaiaRuntimeHook": (".gaia", "GaiaRuntimeHook"),
Expand Down Expand Up @@ -100,6 +106,12 @@ def __getattr__(name: str):
"run_cybench_task",
"score_cybench_submission",
"load_cybench_tasks",
"CyberGymBenchmarkAdapter",
"CyberGymEvaluator",
"CyberGymRuntimeHook",
"CyberGymScorer",
"load_cybergym_tasks",
"run_cybergym_task",
"GaiaAdapter",
"GaiaEvaluator",
"GaiaRuntimeHook",
Expand Down
43 changes: 43 additions & 0 deletions qitos/benchmark/cybergym/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,43 @@
"""CyberGym benchmark integration."""

from __future__ import annotations

import importlib

_LAZY_ATTRS = {
"CyberGymBenchmarkAdapter": (".adapter", "CyberGymBenchmarkAdapter"),
"load_cybergym_tasks": (".adapter", "load_cybergym_tasks"),
"task_slug": (".adapter", "task_slug"),
"CyberGymEvaluator": (".evaluator", "CyberGymEvaluator"),
"CyberGymRuntimeHook": (".runtime", "CyberGymRuntimeHook"),
"prepare_task_dir": (".runtime", "prepare_task_dir"),
"CyberGymScorer": (".scorer", "CyberGymScorer"),
"make_trace_writer": (".runner", "make_trace_writer"),
"run_cybergym_agent_task": (".runner", "run_cybergym_agent_task"),
"run_cybergym_task": (".runner", "run_cybergym_task"),
}


def __getattr__(name: str):
target = _LAZY_ATTRS.get(name)
if target is None:
raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
module_name, attr_name = target
module = importlib.import_module(module_name, __name__)
value = getattr(module, attr_name)
globals()[name] = value
return value


__all__ = [
"CyberGymBenchmarkAdapter",
"CyberGymEvaluator",
"CyberGymRuntimeHook",
"CyberGymScorer",
"load_cybergym_tasks",
"make_trace_writer",
"prepare_task_dir",
"run_cybergym_agent_task",
"run_cybergym_task",
"task_slug",
]
85 changes: 85 additions & 0 deletions qitos/benchmark/cybergym/adapter.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,85 @@
"""CyberGym benchmark task adapter."""

from __future__ import annotations

from typing import Any, Iterable, Mapping, Optional

from qitos.core import Task, TaskBudget

from ..base import BenchmarkAdapter, BenchmarkSource


def task_slug(task_id: str) -> str:
return str(task_id).replace(":", "_")


class CyberGymBenchmarkAdapter(BenchmarkAdapter):
"""Convert CyberGym task ids into QitOS Task objects."""

def __init__(self, *, difficulty: str = "level1") -> None:
self.source = BenchmarkSource(name="cybergym", split=str(difficulty))
self.difficulty = str(difficulty)

def load_records(
self,
*,
task_ids: Iterable[str],
limit: Optional[int] = None,
) -> list[dict[str, Any]]:
records: list[dict[str, Any]] = []
for raw_task_id in task_ids:
task_id = str(raw_task_id).strip()
if not task_id:
continue
records.append({"task_id": task_id})
if limit is not None and len(records) >= int(limit):
break
return records

def to_task(
self,
record: Mapping[str, Any],
*,
split: str | None = None,
idx: int = 0,
) -> Task:
task_id = str(record.get("task_id") or record.get("id") or "").strip()
if not task_id:
raise ValueError("CyberGym record requires task_id")
difficulty = str(split or record.get("difficulty") or self.difficulty)
return Task(
id=task_id,
objective=f"Solve CyberGym task {task_id}.",
inputs={"task_id": task_id, "difficulty": difficulty},
budget=TaskBudget(max_steps=int(record.get("max_steps", 30))),
metadata={
"benchmark": "cybergym",
"split": difficulty,
"task_index": int(idx),
"raw_record": dict(record),
},
)

def to_tasks(
self,
records: Iterable[Mapping[str, Any]],
split: str,
limit: Optional[int] = None,
) -> list[Task]:
tasks: list[Task] = []
for idx, record in enumerate(records):
tasks.append(self.to_task(record, split=split, idx=idx))
if limit is not None and len(tasks) >= int(limit):
break
return tasks


def load_cybergym_tasks(
*,
task_ids: Iterable[str],
difficulty: str = "level1",
limit: Optional[int] = None,
) -> list[Task]:
adapter = CyberGymBenchmarkAdapter(difficulty=difficulty)
records = adapter.load_records(task_ids=task_ids, limit=limit)
return adapter.to_tasks(records, split=difficulty, limit=limit)
Loading