## Background
I've been working on a project called archive-first-harness that tries to answer a question OpenHarness doesn't currently address:
> After a run completes — succeeds, fails, or regresses — can you explain why with evidence instead of intuition?
The `diagnose` skill I contributed in #17 hints at this direction, but the skill only works if the underlying artifacts exist. Right now, OpenHarness doesn't produce them.
## The gap
OpenHarness has great infrastructure for running agents (tools, skills, permissions, swarm). What it doesn't have is a structured record of what happened during a run.
Concretely, there's no built-in way to answer:
| Question | Why it matters |
| --- | --- |
| Why did this run fail? | Failure without localization isn't actionable |
| Where exactly did it fail? | Was it a tool error, permission block, or bad output? |
| Why is this run worse than last time? | Without comparison, optimization is guesswork |
| Did the run actually produce the right artifact? | "Looks successful" isn't enough |
## What I'm proposing
A lightweight, optional run archive layer that writes structured evidence after each run.
When enabled, each run writes a directory under `artifacts/runs/<run_id>/` containing:

- `manifest.json` — run metadata: task, model, timestamps, tool call count
- `execution_trace.jsonl` — one entry per tool call: name, input, output, duration
- `verification_report.json` — did the output match expectations?
- `failure_signature.json` — if the run failed: which stage, what error, stack context
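To make the shape concrete, here's a minimal sketch of a writer for these files. Everything in it is an illustrative assumption: the `ArchiveWriter` name, method signatures, and field names are hypothetical, not an existing OpenHarness API.

```python
# Hypothetical sketch of an archive writer. Class name, signatures, and
# field names are illustrative assumptions, not an OpenHarness API.
import json
from pathlib import Path


class ArchiveWriter:
    def __init__(self, root: Path, run_id: str):
        self.run_dir = root / "runs" / run_id
        self.run_dir.mkdir(parents=True, exist_ok=True)
        self.trace_path = self.run_dir / "execution_trace.jsonl"

    def write_manifest(self, task: str, model: str, started_at: str,
                       finished_at: str, tool_call_count: int) -> None:
        # Run metadata in one small, greppable JSON file.
        manifest = {
            "task": task,
            "model": model,
            "started_at": started_at,
            "finished_at": finished_at,
            "tool_call_count": tool_call_count,
        }
        (self.run_dir / "manifest.json").write_text(
            json.dumps(manifest, indent=2))

    def append_trace(self, tool: str, tool_input: dict, output: str,
                     duration_s: float) -> None:
        # JSONL keeps the trace append-only: one record per tool call.
        entry = {"tool": tool, "input": tool_input,
                 "output": output, "duration_s": duration_s}
        with self.trace_path.open("a") as f:
            f.write(json.dumps(entry) + "\n")
```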
Key design constraints I'd suggest:
- Opt-in — disabled by default, enabled via an `--archive` flag or config
- Append-only — never modifies past runs, only writes new ones
- Zero dependencies — plain JSON/JSONL files, no database
- CLI helpers — `oh archive --latest`, `oh archive --run-id <id>`, `oh archive --compare <id1> <id2>`
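To illustrate the comparison semantics behind `--compare`, here's a hedged sketch that diffs two manifests field by field, assuming the manifest shape from the writer sketch above (all names are still hypothetical):

```python
# Hypothetical sketch of the diff behind `oh archive --compare <id1> <id2>`.
# Assumes the manifest.json shape from the writer sketch above.
import json
from pathlib import Path


def compare_runs(root: Path, run_a: str, run_b: str) -> dict:
    """Return the manifest fields that differ between two runs."""
    def load(run_id: str) -> dict:
        path = root / "runs" / run_id / "manifest.json"
        return json.loads(path.read_text())

    a, b = load(run_a), load(run_b)
    return {
        key: {"a": a.get(key), "b": b.get(key)}
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }
```

A regression in duration or tool call count then shows up as a concrete key-level diff rather than a hunch.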
## Why this fits OpenHarness
OpenHarness's philosophy is "lightweight and inspectable." A file-based archive layer is exactly that — no hosted service, no database, just structured files you can read, grep, and diff.
It also makes the `diagnose` skill actually useful: right now the skill tells the model how to diagnose, but there's nothing to diagnose against.
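As a purely illustrative example (the run id and field names are assumptions based on the layout above), the skill could start from the failure signature instead of guessing:

```python
# Hypothetical: how the diagnose skill could use the archive. The run id
# and field names ("stage", "error") are assumptions, not a real schema.
import json
from pathlib import Path

sig_path = Path("artifacts/runs/run_042/failure_signature.json")
sig = json.loads(sig_path.read_text())
print(f"Run failed at stage {sig['stage']!r}: {sig['error']}")
```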
## Questions for maintainers
- Is this direction something you'd want in core, or better as an optional plugin/extension?
- Is there an existing pattern in the codebase I should follow (e.g. the session storage layer)?
- Any concerns about the file layout or the `oh archive` CLI surface?
Happy to prototype this if there's interest. The design is already proven in archive-first-harness — the main work would be adapting it to fit OpenHarness's architecture.