
[Discussion] Run-level evidence layer: structured archive for agent runs #18

@quzhiii


Background

I've been working on a project called archive-first-harness that tries to answer a question OpenHarness doesn't currently address:

After a run completes — succeeds, fails, or regresses — can you explain why with evidence instead of intuition?

The diagnose skill I contributed in #17 hints at this direction, but the skill only works if the underlying artifacts exist. Right now, OpenHarness doesn't produce them.

The gap

OpenHarness has great infrastructure for running agents (tools, skills, permissions, swarm). What it doesn't have is a structured record of what happened during a run.

Concretely, there's no built-in way to answer:

| Question | Why it matters |
| --- | --- |
| Why did this run fail? | Failure without localization isn't actionable |
| Where exactly did it fail? | Was it a tool error, permission block, or bad output? |
| Why is this run worse than last time? | Without comparison, optimization is guesswork |
| Did the run actually produce the right artifact? | "Looks successful" isn't enough |

What I'm proposing

A lightweight, optional run archive layer that writes structured evidence after each run.

After each run, a directory is written under artifacts/runs/<run_id>/ containing:

  • manifest.json — run metadata: task, model, timestamps, tool call count
  • execution_trace.jsonl — one entry per tool call: name, input, output, duration
  • verification_report.json — did the output match expectations?
  • failure_signature.json — if failed: which stage, what error, stack context
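For illustration, here is what the first two files might contain. Only the fields named above (task, model, timestamps, tool call count; tool name, input, output, duration) come from the proposal; the exact key names and values are hypothetical:

```python
import json

# Hypothetical manifest.json contents. The proposal specifies task, model,
# timestamps, and tool call count; the key names here are assumptions.
manifest = {
    "run_id": "run-001",
    "task": "summarize-repo",
    "model": "example-model",
    "started_at": "2024-05-01T12:00:00Z",
    "finished_at": "2024-05-01T12:03:42Z",
    "tool_call_count": 7,
}

# One execution_trace.jsonl entry per tool call: name, input, output, duration.
trace_entry = {
    "tool": "read_file",
    "input": {"path": "README.md"},
    "output": "...",
    "duration_ms": 12,
}

print(json.dumps(manifest, indent=2))
print(json.dumps(trace_entry))
```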

Key design constraints I'd suggest:

  • Opt-in — disabled by default, enabled via --archive flag or config
  • Append-only — never modifies past runs, only writes new ones
  • Zero dependencies — plain JSON/JSONL files, no database
  • CLI helpers — oh archive --latest, oh archive --run-id <id>, oh archive --compare <id1> <id2>
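To make the constraints concrete, here is a minimal sketch of what the writer could look like, assuming a Python implementation. The function name and signature are hypothetical; the point is that opt-in, append-only, and zero-dependency each fall out of a few lines of standard library code:

```python
import json
import tempfile
from pathlib import Path

def write_archive(run_id, manifest, trace_entries,
                  root="artifacts/runs", enabled=False):
    """Hypothetical archive writer sketch.

    Opt-in: does nothing unless enabled (the --archive flag would set this).
    Append-only: refuses to overwrite an existing run directory.
    Zero dependencies: plain JSON/JSONL via the standard library.
    """
    if not enabled:
        return None
    run_dir = Path(root) / run_id
    if run_dir.exists():
        raise FileExistsError(
            f"run {run_id} already archived; archives are append-only")
    run_dir.mkdir(parents=True)
    (run_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    with open(run_dir / "execution_trace.jsonl", "w") as f:
        for entry in trace_entries:
            f.write(json.dumps(entry) + "\n")
    return run_dir

# Demo against a temporary directory.
root = tempfile.mkdtemp()
run_dir = write_archive("run-001", {"task": "demo"},
                        [{"tool": "read_file"}], root=root, enabled=True)
```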

Why this fits OpenHarness

OpenHarness's philosophy is "lightweight and inspectable." A file-based archive layer is exactly that — no hosted service, no database, just structured files you can read, grep, and diff.

It also makes the diagnose skill actually useful: right now the skill tells the model how to diagnose, but there's nothing to diagnose against.
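As a sketch of that connection: once a failure_signature.json exists, a diagnose step can localize a failure mechanically instead of guessing. The helper below is hypothetical and assumes the file carries stage and error fields, consistent with the layout proposed above:

```python
import json
import tempfile
from pathlib import Path

def localize_failure(run_dir):
    """Hypothetical helper: turn a failure_signature.json into a one-line
    localization that a diagnose skill could present as evidence."""
    sig_path = Path(run_dir) / "failure_signature.json"
    if not sig_path.exists():
        return "run completed without a recorded failure"
    sig = json.loads(sig_path.read_text())
    # Field names (stage, error) follow the proposed layout; anything
    # beyond that is an assumption.
    return f"failed at stage {sig['stage']!r}: {sig['error']}"

# Demo: a fabricated failure signature in a temporary run directory.
run_dir = Path(tempfile.mkdtemp())
(run_dir / "failure_signature.json").write_text(
    json.dumps({"stage": "tool_call", "error": "permission denied"}))
print(localize_failure(run_dir))  # → failed at stage 'tool_call': permission denied
```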

Questions for maintainers

  1. Is this direction something you'd want in core, or better as an optional plugin/extension?
  2. Is there an existing pattern in the codebase I should follow (e.g. the session storage layer)?
  3. Any concerns about the file layout or the oh archive CLI surface?

Happy to prototype this if there's interest. The design is already proven in archive-first-harness — the main work would be adapting it to fit OpenHarness's architecture.
