
[Discussion] Run-level evidence layer: structured archive for agent runs #18

@quzhiii


Background

I've been working on a project called archive-first-harness that tries to answer a question OpenHarness doesn't currently address:

After a run completes — succeeds, fails, or regresses — can you explain why with evidence instead of intuition?

The diagnose skill I contributed in #17 hints at this direction, but the skill only works if the underlying artifacts exist. Right now, OpenHarness doesn't produce them.

The gap

OpenHarness has great infrastructure for running agents (tools, skills, permissions, swarm). What it doesn't have is a structured record of what happened during a run.

Concretely, there's no built-in way to answer:

| Question | Why it matters |
| --- | --- |
| Why did this run fail? | Failure without localization isn't actionable |
| Where exactly did it fail? | Was it a tool error, permission block, or bad output? |
| Why is this run worse than last time? | Without comparison, optimization is guesswork |
| Did the run actually produce the right artifact? | "Looks successful" isn't enough |

What I'm proposing

A lightweight, optional run archive layer that writes structured evidence after each run.

After each run, a directory is written under artifacts/runs/<run_id>/ containing:

  • manifest.json — run metadata: task, model, timestamps, tool call count
  • execution_trace.jsonl — one entry per tool call: name, input, output, duration
  • verification_report.json — did the output match expectations?
  • failure_signature.json — if failed: which stage, what error, stack context
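For illustration, here is what the first two files might contain. Only the fields named above (task, model, timestamps, tool call count; tool name, input, output, duration) come from the proposal; the exact key names and values are hypothetical:

```python
import json

# Hypothetical manifest.json contents. The proposal specifies task, model,
# timestamps, and tool call count; the key names here are assumptions.
manifest = {
    "run_id": "run-001",
    "task": "summarize-repo",
    "model": "example-model",
    "started_at": "2024-05-01T12:00:00Z",
    "finished_at": "2024-05-01T12:03:42Z",
    "tool_call_count": 7,
}

# One execution_trace.jsonl entry per tool call: name, input, output, duration.
trace_entry = {
    "tool": "read_file",
    "input": {"path": "README.md"},
    "output": "...",
    "duration_ms": 12,
}

print(json.dumps(manifest, indent=2))
print(json.dumps(trace_entry))
```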

Key design constraints I'd suggest:

  • Opt-in — disabled by default, enabled via --archive flag or config
  • Append-only — never modifies past runs, only writes new ones
  • Zero dependencies — plain JSON/JSONL files, no database
  • CLI helpers — oh archive --latest, oh archive --run-id <id>, oh archive --compare <id1> <id2>
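To make the constraints concrete, here is a minimal sketch of what the writer could look like, assuming a Python implementation. The function name and signature are hypothetical; the point is that opt-in, append-only, and zero-dependency each fall out of a few lines of standard library code:

```python
import json
import tempfile
from pathlib import Path

def write_archive(run_id, manifest, trace_entries,
                  root="artifacts/runs", enabled=False):
    """Hypothetical archive writer sketch.

    Opt-in: does nothing unless enabled (the --archive flag would set this).
    Append-only: refuses to overwrite an existing run directory.
    Zero dependencies: plain JSON/JSONL via the standard library.
    """
    if not enabled:
        return None
    run_dir = Path(root) / run_id
    if run_dir.exists():
        raise FileExistsError(
            f"run {run_id} already archived; archives are append-only")
    run_dir.mkdir(parents=True)
    (run_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    with open(run_dir / "execution_trace.jsonl", "w") as f:
        for entry in trace_entries:
            f.write(json.dumps(entry) + "\n")
    return run_dir

# Demo against a temporary directory.
root = tempfile.mkdtemp()
run_dir = write_archive("run-001", {"task": "demo"},
                        [{"tool": "read_file"}], root=root, enabled=True)
```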

Why this fits OpenHarness

OpenHarness's philosophy is "lightweight and inspectable." A file-based archive layer is exactly that — no hosted service, no database, just structured files you can read, grep, and diff.

It also makes the diagnose skill actually useful: right now the skill tells the model how to diagnose, but there's nothing to diagnose against.
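As a sketch of that connection: once a failure_signature.json exists, a diagnose step can localize a failure mechanically instead of guessing. The helper below is hypothetical and assumes the file carries stage and error fields, consistent with the layout proposed above:

```python
import json
import tempfile
from pathlib import Path

def localize_failure(run_dir):
    """Hypothetical helper: turn a failure_signature.json into a one-line
    localization that a diagnose skill could present as evidence."""
    sig_path = Path(run_dir) / "failure_signature.json"
    if not sig_path.exists():
        return "run completed without a recorded failure"
    sig = json.loads(sig_path.read_text())
    # Field names (stage, error) follow the proposed layout; anything
    # beyond that is an assumption.
    return f"failed at stage {sig['stage']!r}: {sig['error']}"

# Demo: a fabricated failure signature in a temporary run directory.
run_dir = Path(tempfile.mkdtemp())
(run_dir / "failure_signature.json").write_text(
    json.dumps({"stage": "tool_call", "error": "permission denied"}))
print(localize_failure(run_dir))  # → failed at stage 'tool_call': permission denied
```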

Questions for maintainers

  1. Is this direction something you'd want in core, or better as an optional plugin/extension?
  2. Is there an existing pattern in the codebase I should follow (e.g. the session storage layer)?
  3. Any concerns about the file layout or the oh archive CLI surface?

Happy to prototype this if there's interest. The design is already proven in archive-first-harness — the main work would be adapting it to fit OpenHarness's architecture.
