1 change: 1 addition & 0 deletions .gitignore
@@ -171,6 +171,7 @@ examples/qitos_tau_workspace/
qitos_cybench_workspace/
examples/qitos_cybench_workspace/
examples/playground/
qitos/benchmark/cybergym/agent/

# Auto-generated API reference pages (built by docs hook)
docs/reference/api_generated/
77 changes: 77 additions & 0 deletions docs/benchmarks/cybergym.mdx
@@ -0,0 +1,77 @@
# CyberGym

QitOS integrates CyberGym as a benchmark family with a dedicated agent runtime under `qitos/benchmark/cybergym/`.

## Current Integration Notes

The current integration is optimized for long-running PoC-generation tasks and keeps the benchmark-specific logic split across:

- `qitos/benchmark/cybergym/runtime.py`
- `qitos/benchmark/cybergym/runner.py`
- `qitos/recipes/benchmarks/cybergym.py`
- `qitos/benchmark/cybergym/agent/`

## Important Runtime Behavior

### 1. Task workspace layout

Single-task recipe runs now place prepared task files under:

```text
<out_dir>/workspace/<task_slug>/
```

instead of writing task files directly into `<out_dir>`.

This layout keeps:

- benchmark-level files such as `run.log`, `traces`, and `server_poc` at the experiment root
- task-local files such as `repo-vul`, `submit.sh`, `.cybergym`, and generated PoCs inside the task workspace
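
The split between experiment-root files and task-local files can be sketched as a small path helper. This is an illustration only: `task_workspace`, `locate`, and `BENCHMARK_LEVEL` are hypothetical names, not functions from the QitOS codebase, and the sketch assumes that any name not listed as benchmark-level is task-local.

```python
from pathlib import Path

# Benchmark-level names taken from the list above. Treating everything else
# as task-local is an assumption of this sketch.
BENCHMARK_LEVEL = {"run.log", "traces", "server_poc"}


def task_workspace(out_dir, task_slug) -> Path:
    """Return <out_dir>/workspace/<task_slug> (hypothetical helper)."""
    return Path(out_dir) / "workspace" / task_slug


def locate(out_dir, task_slug, name) -> Path:
    """Resolve where a named file or directory lives under this layout."""
    if name in BENCHMARK_LEVEL:
        # Benchmark-level files stay at the experiment root.
        return Path(out_dir) / name
    # Task-local files go inside the per-task workspace.
    return task_workspace(out_dir, task_slug) / name
```

For example, `locate("out", "task-a", "submit.sh")` resolves inside `out/workspace/task-a/`, while `locate("out", "task-a", "run.log")` stays directly under `out/`.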

### 2. Model transport defaults

OpenAI-compatible harness presets now default to:

- request timeout: `120s`
- lightweight retry on transient request failures, including timeout cases

This is handled in the shared OpenAI-compatible model layer rather than only in the benchmark wrapper.
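
A minimal sketch of what a lightweight retry with a request timeout might look like. Only the `120s` timeout comes from the notes above. The attempt count, the backoff, the set of exceptions treated as transient, and the name `call_with_retry` are assumptions made for illustration and do not describe the actual model layer.

```python
import time


def call_with_retry(send, *, timeout_s=120.0, max_attempts=3,
                    backoff_s=1.0, sleep=time.sleep):
    """Call send(timeout=...) and retry on transient failures.

    `send` is any callable that accepts a `timeout` keyword argument.
    max_attempts and backoff_s are illustrative values, not QitOS defaults.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return send(timeout=timeout_s)
        except (TimeoutError, ConnectionError) as exc:
            # Timeouts are retried like any other transient failure.
            last_error = exc
            if attempt < max_attempts:
                sleep(backoff_s * attempt)
    raise last_error
```

Placing a wrapper like this in the shared model layer means every OpenAI-compatible preset picks up the same behavior without each benchmark wrapper reimplementing it.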

### 3. Tool-result budget

CyberGym benchmark runs use a larger tool-result budget than the generic engine default.

The current CyberGym runner sets:

```text
tool_result_max_chars = 60000
```

This reduces destructive truncation for long `READ` and `BASH` outputs during exploit-development tasks.
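
To illustrate what the budget controls, here is a sketch of a character-based clip. The source only states the `60000` value. The head-only truncation strategy, the marker text, and the name `clip_tool_result` are assumptions, and the engine's real behavior may differ.

```python
TOOL_RESULT_MAX_CHARS = 60000  # value set by the CyberGym runner, per the notes above


def clip_tool_result(text: str, max_chars: int = TOOL_RESULT_MAX_CHARS) -> str:
    """Keep at most max_chars characters and note how much was dropped.

    Hypothetical helper: shows the effect of the budget, not the engine's code.
    """
    if len(text) <= max_chars:
        return text
    omitted = len(text) - max_chars
    return text[:max_chars] + f"\n[truncated {omitted} chars]"
```

With a larger budget, fewer long `READ` and `BASH` outputs reach the truncation branch at all.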

## Agent-Side Context Retention

The CyberGym agent keeps the full step chain and uses content-level compression rather than round deletion:

- full step history is retained
- the newest 10 distinct steps remain raw
- the earliest 3 distinct steps remain raw
- older long tool results are moved into artifacts with preview metadata
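
The retention policy above can be sketched roughly as follows. The head and tail sizes (3 and 10) come from the list. The step shape, the length threshold, the preview size, and the name `compact_history` are assumptions, and the sketch treats every list entry as a distinct step.

```python
def compact_history(steps, keep_head=3, keep_tail=10,
                    max_chars=2000, preview_chars=200):
    """Return (compacted_steps, artifacts) without dropping any step.

    steps: list of dicts with a string under "tool_result" (assumed shape).
    Long results outside the raw head/tail windows are moved into
    `artifacts` and replaced by preview metadata.
    """
    artifacts = {}
    compacted = []
    n = len(steps)
    for i, step in enumerate(steps):
        keep_raw = i < keep_head or i >= n - keep_tail
        result = step.get("tool_result", "")
        if keep_raw or len(result) <= max_chars:
            compacted.append(dict(step))
            continue
        artifact_id = f"step-{i}"
        artifacts[artifact_id] = result
        compacted.append({
            **step,
            "tool_result": {
                "artifact": artifact_id,
                "preview": result[:preview_chars],
                "chars": len(result),
            },
        })
    return compacted, artifacts
```

The step count never changes. Only the content of older long results is replaced, which is the difference between content-level compression and round deletion.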

## Verification Focus

For public-server runs that only expose vulnerable-binary behavior, the CyberGym agent/runtime contract treats the following combination as a success stop condition:

- `verification_scope == "vul_only"`
- `vul_exit_code != 0`
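
Expressed as a predicate, the two conditions might look like this. The field names come from the conditions above. The dict-shaped result, the treatment of a missing exit code as "not a success", and the name `is_success_stop` are assumptions of this sketch.

```python
def is_success_stop(result: dict) -> bool:
    """True when a vul-only verification saw a non-zero exit code.

    Hypothetical helper. A missing vul_exit_code is treated as "not a
    success" here, which is an assumption rather than documented behavior.
    """
    exit_code = result.get("vul_exit_code")
    return (
        result.get("verification_scope") == "vul_only"
        and exit_code is not None
        and exit_code != 0
    )
```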

## Local Validation

The integration is covered by targeted tests around:

- recipe workspace layout
- history retention and compaction
- model retry and timeout defaults
- runtime prompt/tool-path preservation
1 change: 1 addition & 0 deletions docs/reference/model-family-matrix.mdx
@@ -6,6 +6,7 @@ description: "The built-in QitOS v0.4 gold presets and their default harness pol
| Family | Transport | Default protocol | Fallback chain | Tool delivery | Notes |
|---|---|---|---|---|---|
| Qwen | OpenAI-compatible | `json_decision_v1` | `xml_decision_v1 -> react_text_v1` | `api_parameter` | Native tool calls are preferred when the endpoint returns `tool_calls` |
| GLM | OpenAI-compatible | `json_decision_v1` | `xml_decision_v1 -> react_text_v1` | `api_parameter` | Native tool calls are preferred when the endpoint returns `tool_calls`; tuned for GLM-5.1 style OpenAI-compatible serving |
| Kimi | OpenAI-compatible | `json_decision_v1` | `react_text_v1` | `api_parameter` | Keep the same coding-agent shape with minimal prompt churn |
| MiniMax | OpenAI-compatible | `minimax_tool_call_v1` | `terminus_xml_v1 -> terminus_json_v1 -> json_decision_v1` | `api_parameter` | Preserves the MiniMax-specific parser advantage |
| `gpt-oss` | OpenAI-compatible | `json_decision_v1` | `react_text_v1` | `api_parameter` | Targets open-weight or third-party compatible serving |