add deterministic inter-agent message record/replay with JSONL spec and verifier CLI

### Description

`RecordingProvider` + `MockProvider` today capture only LLM provider calls. Within a single process, a multi-agent workflow executes as a sequence of LLM calls, so replaying those calls makes the whole workflow deterministic transitively. Cross-process workflows and explicit inter-agent event recording are not supported.

Concrete gaps today:

- **No structured event log**. The recording file shape is keyed by `step_id`/`prompt_hash`; cross-process causality is not encoded.
- **No reproducibility guarantee**. Two replays of the same recording can diverge silently if any non-determinism leaks (event ordering, dict iteration, tool side effects).
- **No verifier**. Nothing in the toolchain asserts "this recording is reproducible".

This blocks two practical use cases:

- Regression tests for multi-agent workflows that need bit-for-bit reproducibility across CI runs.
- Post-incident replay where the goal is to reproduce a faulty multi-agent interaction exactly, not just hit similar prompts.

### Proposal

**1. JSONL format with a published JSON Schema** covering six event types: `task_sent`, `task_received`, `response_sent`, `tool_call`, `tool_return`, `state_transition`. Each event carries:

- `event_id` (UUID v4)
- `agent_id` (string)
- `timestamp_ns` (monotonic per-process; informational only, not used for ordering)
- `parent_event_id` (UUID v4 or null) — encodes dep-graph causality
- `payload_hash` (SHA-256, 16-char prefix)
- type-specific fields (e.g. `task_sent` carries `recipient_agent_id`, `payload_ref`)
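To make the field list above concrete, here is a minimal sketch of one event serialized as a JSONL line. The class name `InterAgentEvent`, the `event_type` key, the helper `hash_payload`, and the choice of a hex digest over canonical JSON are all assumptions for illustration; the proposal does not fix them.

```python
import hashlib
import json
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


def hash_payload(payload: Any) -> str:
    """16-char prefix of SHA-256 over canonical JSON (hex encoding is an assumption)."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


@dataclass(frozen=True)
class InterAgentEvent:  # hypothetical name, not an existing AgentLoom class
    event_type: str                        # e.g. "task_sent"; key name is assumed
    agent_id: str
    timestamp_ns: int                      # informational only, never used for ordering
    payload_hash: str
    parent_event_id: Optional[str] = None  # causal parent, or None for a root event
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    extra: dict = field(default_factory=dict)  # type-specific fields

    def to_jsonl(self) -> str:
        """Serialize to a single JSON line with sorted keys for stable bytes."""
        record = asdict(self)
        record.update(record.pop("extra"))  # flatten type-specific fields
        return json.dumps(record, sort_keys=True, separators=(",", ":"))
```

Sorting keys and fixing separators keeps the serialized bytes independent of dict insertion order, which matters if the verifier compares output byte for byte.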

**2. Deterministic replay by dependency-graph ordering**, not by timestamp (timestamps are monotonic only within a process, so they cannot order events across processes). `parent_event_id` encodes causal ordering; replay re-executes events in topological order. Events with no dependency between them are sorted by `event_id` for stable tie-breaking.
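One way to read the ordering rule above is Kahn's algorithm with a min-heap as the ready set, so that whenever several events are eligible the smallest `event_id` replays first. This is a sketch under that reading, not the committed algorithm; the function name `replay_order` is hypothetical, and events are assumed to be plain dicts with the fields listed in item 1.

```python
import heapq
from collections import defaultdict
from typing import Iterable


def replay_order(events: Iterable[dict]) -> list[dict]:
    """Topologically order events by parent_event_id, tie-breaking on event_id."""
    by_id = {e["event_id"]: e for e in events}
    children = defaultdict(list)
    ready: list[str] = []  # min-heap of event_ids whose parent has been replayed

    for eid, event in by_id.items():
        parent = event.get("parent_event_id")
        if parent is None:
            heapq.heappush(ready, eid)
        elif parent in by_id:
            children[parent].append(eid)
        else:
            raise ValueError(f"event {eid} references unknown parent {parent}")

    ordered = []
    while ready:
        eid = heapq.heappop(ready)  # smallest event_id among eligible events
        ordered.append(by_id[eid])
        for child in children[eid]:
            heapq.heappush(ready, child)

    if len(ordered) != len(by_id):
        # Events on a parent cycle never become ready.
        raise ValueError("cycle detected in parent_event_id links")
    return ordered
```

Because the heap holds only event IDs and never consults write order or timestamps, the result depends solely on the set of events in the recording.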

**3. CLI `agentloom verify-determinism <recording>`**: runs the recording through `MockProvider` replay and asserts byte-for-byte identical output across two independent runs. Exit code non-zero on any divergence; diff printed to stderr.
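The core check behind the CLI could look roughly like the sketch below: run the replay twice, compare the raw bytes, and print a unified diff on mismatch. The function `verify_determinism` and its `replay` callable are hypothetical stand-ins; how a recording is actually fed through `MockProvider` is left out because the proposal does not specify it.

```python
import difflib
import sys
from typing import Callable


def verify_determinism(replay: Callable[[], bytes], stream=None) -> int:
    """Return 0 if two independent replays produce identical bytes, else 1.

    `replay` is a placeholder for "replay this recording and return its
    serialized output"; `stream` defaults to stderr and receives the diff.
    """
    stream = sys.stderr if stream is None else stream
    first, second = replay(), replay()
    if first == second:
        return 0

    # Decode leniently so the diff can be shown even for non-UTF-8 output.
    diff = difflib.unified_diff(
        first.decode("utf-8", errors="replace").splitlines(keepends=True),
        second.decode("utf-8", errors="replace").splitlines(keepends=True),
        fromfile="run-1",
        tofile="run-2",
    )
    stream.writelines(diff)
    return 1
```

A CLI entry point would then pass the return value to `sys.exit`, which gives the non-zero exit code on divergence described above.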

**4. Extend `RecordingProvider`** with an observer hook for inter-agent events (opt-in, activated when the runtime emits multi-agent steps). Existing single-LLM recordings keep their format unchanged; multi-agent recordings nest the new events under an `_inter_agent_events` key.
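As a rough illustration of the backward-compatibility intent, the sketch below collects events through an observer and nests them under `_inter_agent_events` only when any exist. `InterAgentObserver`, `ListObserver`, and `attach_inter_agent_events` are hypothetical names, and the recording is assumed to be a plain dict; the real `RecordingProvider` hook may look quite different.

```python
from typing import Protocol


class InterAgentObserver(Protocol):  # hypothetical hook interface
    def on_event(self, event: dict) -> None: ...


class ListObserver:
    """Simplest possible observer: keep events in memory in arrival order."""

    def __init__(self) -> None:
        self.events: list[dict] = []

    def on_event(self, event: dict) -> None:
        self.events.append(event)


def attach_inter_agent_events(recording: dict, events: list[dict]) -> dict:
    """Return a copy of `recording` with events nested under `_inter_agent_events`.

    With no events the copy has exactly the original keys, so single-LLM
    recordings keep their existing shape.
    """
    merged = dict(recording)  # shallow copy; the input is not mutated
    if events:
        merged["_inter_agent_events"] = list(events)
    return merged
```

Keeping the new data under a single reserved key means a loader that ignores unknown keys can still read the existing entries.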

### Scope

- `src/agentloom/record_replay/inter_agent.py` — event types + JSONL writer.
- `src/agentloom/record_replay/schema.json` — committed JSON Schema (queryable at `agentloom.record_replay.SCHEMA_URI`).
- `src/agentloom/record_replay/verify.py` — replay verifier (topological sort + diff).
- `src/agentloom/cli/verify_determinism.py` — CLI entry point.
- `docs/record-replay-spec.md` — format specification + dependency-graph algorithm prose.
- `agentloom.contracts.experimental` — re-export the event types under the experimental tier.

### Regression tests

- `test_jsonl_schema_validates` — known-good and known-bad event payloads.
- `test_dep_graph_ordering_deterministic` — same DAG, different traversal order at write time → identical replay output.
- `test_verify_cli_exits_zero_on_identical_runs`.
- `test_verify_cli_exits_nonzero_on_divergence` — inject a non-determinism (e.g. `random.random()` in a tool) and assert exit code ≠ 0 + diff printed.
- `test_inter_agent_events_extend_recordings_without_breaking_v2_format` — existing single-LLM `RecordingProvider` recordings load and replay unchanged.

### Notes

- Delta vs OSS observability tools (LangSmith, Phoenix, Langfuse): they trace multi-agent runs but none specify a deterministic-replay contract with an executable verifier.
- Algorithm choice (dep-graph ordering vs vector clocks vs Lamport timestamps): dep-graph is the simplest deterministic ordering compatible with AgentLoom's existing DAG layer model. Vector clocks would scale to N processes but require coordinated state per agent; Lamport timestamps are not unique across processes, so they would still need a tie-breaker to give a total order. Dep-graph also matches how the engine already enumerates parallel layers, so the writer side adds no new ordering machinery.
- Depends on: #107 (record/replay fixes — concurrent write race, streaming capture, hash-key coverage). Implementation can start once #107 lands.
- Target: 0.5.0, alongside #125.
- `agentloom.contracts.experimental` is the right tier for the event schemas — they will iterate during initial adoption and graduate to `.stable` once external consumers settle on a stable shape.

