Merged
28 changes: 28 additions & 0 deletions CHANGELOG.md
@@ -8,6 +8,34 @@ All notable changes to this project are documented here.

### Added

**Deterministic replay**

- `RLM.Replay` — replay orchestrator that re-executes a previously recorded run using
the original LLM responses. Replays re-evaluate all code but skip live LLM calls,
enabling debugging, regression testing, and model comparison. Supports:
- `:patch` option — `%{iteration => code}` map to replace code at specific iterations
while maintaining tape alignment (LLM response is still consumed)
- `:config` option — config overrides for the replay run
- `RLM.Replay.Tape` — struct representing an ordered sequence of recorded LLM responses,
with `from_events/1` builder that sources from EventLog Agent or TraceStore fallback
- `RLM.Replay.LLM` — `RLM.LLM` behaviour implementation that returns responses from a
tape via process dictionary (matching the existing process-dict pattern in `RLM.Eval`)
- `RLM.replay/2` — public API delegation to `RLM.Replay.replay/2`
- `enable_replay_recording` config flag (default: `false`) — when enabled, records the
full raw LLM response text for each iteration as `:llm_response` events in the trace
- `[:rlm, :llm, :response, :recorded]` telemetry event — emitted by Worker after each
successful LLM call when recording is enabled
- `original_context` and `original_query` fields in `:node_start` events for depth-0
workers — enables replay to recover the original inputs
- `:replay_patches` field on `RLM.Worker` struct — applied before eval to substitute
code at specific iterations during replay
- `RLM.Replay.FallbackLLM` — LLM behaviour impl that tries tape entries first,
then falls back to a live LLM module when the tape is exhausted
- `:fallback` option on `RLM.replay/2` — `:error` (default) or `:live` to switch
to live LLM calls when the tape runs out (e.g., because a patch caused extra iterations)
- 17 tests covering recording, tape construction, replay LLM, replay orchestration,
patching, fallback behavior, and the public API
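
The new telemetry event can be observed with a standard `:telemetry.attach/4` handler. A minimal sketch — the handler id and the metadata key read here are illustrative assumptions, not part of the documented event contract:

```elixir
# Hypothetical handler: logs each recorded LLM response during a run.
# The :run_id metadata key is an assumption for illustration.
:telemetry.attach(
  "replay-recording-logger",
  [:rlm, :llm, :response, :recorded],
  fn _event_name, _measurements, metadata, _handler_config ->
    IO.puts("recorded LLM response (run #{inspect(metadata[:run_id])})")
  end,
  nil
)
```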

**Distributed Erlang node support**

- `RLM.Node` — lightweight wrapper for OTP distribution with three public functions:
9 changes: 7 additions & 2 deletions CLAUDE.md
@@ -64,7 +64,7 @@ rlm/

Enforced at compile time via the `boundary` library:

- **`RLM`** — Core engine. Zero web dependencies. Exports: Config, Run, Worker, EventLog, TraceStore, Helpers, Span, IEx, Node, Telemetry, Tool, ToolRegistry.
- **`RLM`** — Core engine. Zero web dependencies. Exports: Config, Run, Worker, EventLog, TraceStore, Helpers, Span, IEx, Node, Replay, Replay.Tape, Replay.LLM, Replay.FallbackLLM, Telemetry, Tool, ToolRegistry.
- **`RLMWeb`** — Phoenix web layer. Depends on `RLM`. Exports: Endpoint.
- **`RLM.Application`** — Top-level. Depends on both `RLM` and `RLMWeb`. Starts the unified supervision tree.

@@ -180,7 +180,7 @@ Default models:

| Module | Purpose |
|---|---|
| `RLM` | Public API: `run/3`, `start_session/1`, `send_message/3`, `history/1`, `status/1` |
| `RLM` | Public API: `run/3`, `replay/2`, `start_session/1`, `send_message/3`, `history/1`, `status/1` |
| `RLM.Run` | Per-run coordinator; owns worker DynSup + eval TaskSup, ETS worker tree, crash propagation |
| `RLM.Config` | Config struct; loads from app env + keyword overrides |
| `RLM.Worker` | GenServer per execution node; iterate loop + keep_alive mode; delegates spawning to Run |
@@ -202,6 +202,10 @@ Default models:
| `RLM.Telemetry.Logger` | Structured logging handler |
| `RLM.Telemetry.PubSub` | Phoenix.PubSub broadcast handler |
| `RLM.Telemetry.EventLogHandler` | Routes telemetry events to EventLog Agent AND TraceStore |
| `RLM.Replay` | Replay orchestrator: `replay/2` replays a recorded run with optional code patches |
| `RLM.Replay.Tape` | Tape struct + `from_events/1` builder; ordered sequence of recorded LLM responses |
| `RLM.Replay.LLM` | LLM behaviour impl that returns responses from a tape (process-dict based) |
| `RLM.Replay.FallbackLLM` | LLM behaviour impl that tries tape first, then falls back to a live LLM module |
| `RLM.Application` | OTP application; starts unified supervision tree (core + web) |

### Filesystem Tools
Expand Down Expand Up @@ -262,6 +266,7 @@ Read-only Phoenix LiveView dashboard. Serves on `http://localhost:4000`.
| `enable_otel` | `false` | Enable OpenTelemetry integration |
| `enable_event_log` | `true` | Enable per-run EventLog trace agents |
| `event_log_capture_full_stdout` | `false` | Store full stdout in traces (vs truncated) |
| `enable_replay_recording` | `false` | Record full LLM responses for deterministic replay |
| `llm_module` | `RLM.LLM` | Swappable for `RLM.Test.MockLLM` |
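
Since `RLM.Config` loads from app env plus keyword overrides, the replay-recording flag can be set either way. A minimal sketch — the `:rlm` OTP app name is assumed from the module namespace:

```elixir
# config/config.exs — application-wide default (app name :rlm is an assumption)
config :rlm, enable_replay_recording: true

# Or per run, as a keyword override:
{:ok, answer, run_id} = RLM.run(context, query, enable_replay_recording: true)
```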

## Testing Conventions
26 changes: 26 additions & 0 deletions README.md
@@ -27,6 +27,7 @@ runs in a persistent REPL, with recursive sub-agent spawning and built-in filesy
- [Interactive sessions](#interactive-sessions)
- [IEx helpers](#iex-helpers)
- [Configuration overrides](#configuration-overrides)
- [Deterministic replay](#deterministic-replay)
- [Sandbox API](#sandbox-api)
- [Architecture](#architecture)
- [OTP supervision tree](#otp-supervision-tree)
@@ -143,6 +144,31 @@ watch(session) # attach a live telemetry stream
)
```

### Deterministic replay

Record a run and replay it later — useful for debugging, regression testing, and cost
optimization (replay re-evaluates all code without making LLM calls).

```elixir
# Step 1: Record a run (enable_replay_recording captures full LLM responses)
{:ok, answer, run_id} = RLM.run(context, query, enable_replay_recording: true)

# Step 2: Replay — same code runs, same result, no LLM calls
{:ok, same_answer, replay_run_id} = RLM.replay(run_id)

# Step 3: Replay with a patch — try different code at a specific iteration
{:ok, new_answer, _} = RLM.replay(run_id,
patch: %{0 => ~s(final_answer = "patched result")}
)

# Step 4: Patch + fallback — if the patch causes extra iterations,
# fall back to live LLM calls instead of erroring
{:ok, answer, _} = RLM.replay(run_id,
patch: %{0 => "x = 42"},
fallback: :live
)
```

### Sandbox API

Functions available inside the LLM's eval'd code:
29 changes: 29 additions & 0 deletions lib/rlm.ex
@@ -11,6 +11,10 @@ defmodule RLM do
Span,
IEx,
Node,
Replay,
Replay.Tape,
Replay.LLM,
Replay.FallbackLLM,
Telemetry,
Telemetry.PubSub,
Tool,
@@ -209,6 +213,31 @@ defmodule RLM do
:exit, _ -> {:error, :not_found}
end

# ---------------------------------------------------------------------------
# Replay API
# ---------------------------------------------------------------------------

@doc """
Replay a previously recorded run.

Requires the original run to have been executed with `enable_replay_recording: true`.
The replay re-executes all eval'd code using the recorded LLM responses.

## Options

* `:patch` — `%{iteration => code}` map to replace code at specific iterations
* `:fallback` — `:error` (default) or `:live` to switch to live LLM calls
when the tape runs out (e.g., because a patch caused extra iterations)
* `:config` — config overrides for the replay run (set `llm_module` here
to control which module handles live fallback calls)

Returns `{:ok, answer, replay_run_id}` or `{:error, reason}`.
"""
@spec replay(String.t(), keyword()) :: {:ok, any(), String.t()} | {:error, any()}
def replay(run_id, opts \\ []) do
RLM.Replay.replay(run_id, opts)
end

# ---------------------------------------------------------------------------
# Run management
# ---------------------------------------------------------------------------
2 changes: 2 additions & 0 deletions lib/rlm/config.ex
@@ -26,6 +26,7 @@ defmodule RLM.Config do
:enable_otel,
:enable_event_log,
:event_log_capture_full_stdout,
:enable_replay_recording,
:llm_module
]

@@ -57,6 +58,7 @@ defmodule RLM.Config do
enable_otel: get(overrides, :enable_otel, false),
enable_event_log: get(overrides, :enable_event_log, true),
event_log_capture_full_stdout: get(overrides, :event_log_capture_full_stdout, false),
enable_replay_recording: get(overrides, :enable_replay_recording, false),
llm_module: get(overrides, :llm_module, RLM.LLM)
}
end
109 changes: 109 additions & 0 deletions lib/rlm/replay.ex
@@ -0,0 +1,109 @@
defmodule RLM.Replay do
@moduledoc """
Replay a previously recorded RLM run.

Replays use the recorded LLM responses but re-execute all eval'd code.
This verifies that the same code still works against the current environment.

## Options

* `:patch` — `%{iteration => code}` map. At the specified iteration, the
patched code replaces the LLM's code before eval. The LLM response is
still consumed from the tape (to maintain iteration alignment).
* `:fallback` — what to do when the tape runs out:
- `:error` (default) — return an error
- `:live` — switch to live LLM calls for remaining iterations
* `:config` — config overrides applied to the replay run. When using
`fallback: :live`, set `llm_module` here to control which module
handles the live calls (defaults to `RLM.LLM`).
"""

@spec replay(String.t(), keyword()) :: {:ok, any(), String.t()} | {:error, any()}
def replay(run_id, opts \\ []) do
patches = Keyword.get(opts, :patch, %{})
config_overrides = Keyword.get(opts, :config, [])
fallback = Keyword.get(opts, :fallback, :error)

with {:ok, tape} <- RLM.Replay.Tape.from_events(run_id) do
# Resolve the LLM module: tape-only or tape-with-fallback
{llm_module, fallback_module} =
case fallback do
:live ->
# The fallback module comes from config overrides, or the default
user_config = RLM.Config.load(config_overrides)
{RLM.Replay.FallbackLLM, user_config.llm_module}

:error ->
{RLM.Replay.LLM, nil}
end

config =
RLM.Config.load(
Keyword.merge(config_overrides,
llm_module: llm_module,
enable_replay_recording: false
)
)

# Extract original context and query from the tape
context = tape.context || ""
query = tape.query || context

replay_run_id = RLM.Span.generate_run_id()
span_id = RLM.Span.generate_id()

run_opts = [run_id: replay_run_id, config: config]

case DynamicSupervisor.start_child(RLM.RunSup, {RLM.Run, run_opts}) do
{:ok, run_pid} ->
worker_opts = [
span_id: span_id,
run_id: replay_run_id,
context: context,
query: query,
config: config,
depth: 0,
model: config.model_large,
caller: self(),
replay_tape: tape,
replay_patches: patches,
replay_fallback_module: fallback_module
]

case RLM.Run.start_worker(run_pid, worker_opts) do
{:ok, pid} ->
ref = Process.monitor(pid)
total_timeout = config.eval_timeout * 2

receive do
{:rlm_result, ^span_id, {:ok, answer}} ->
Process.demonitor(ref, [:flush])
{:ok, answer, replay_run_id}

{:rlm_result, ^span_id, {:error, reason}} ->
Process.demonitor(ref, [:flush])
{:error, reason}

{:DOWN, ^ref, :process, ^pid, :normal} ->
{:error, "Replay worker exited without result"}

{:DOWN, ^ref, :process, ^pid, reason} ->
{:error, "Replay worker crashed: #{inspect(reason)}"}
after
total_timeout ->
Process.demonitor(ref, [:flush])
RLM.terminate_run(run_pid)
{:error, "Replay timed out after #{total_timeout}ms"}
end

{:error, reason} ->
RLM.terminate_run(run_pid)
{:error, "Failed to start replay worker: #{inspect(reason)}"}
end

{:error, reason} ->
{:error, "Failed to start replay run: #{inspect(reason)}"}
end
end
end
end
42 changes: 42 additions & 0 deletions lib/rlm/replay/fallback_llm.ex
@@ -0,0 +1,42 @@
defmodule RLM.Replay.FallbackLLM do
@moduledoc """
An LLM module that replays from a tape, falling back to a live LLM
module when the tape is exhausted.

Used by `RLM.Replay.replay/2` with `fallback: :live`. The fallback
module is stored in the process dictionary alongside the tape entries.
"""

@behaviour RLM.LLM

@impl true
def chat(messages, model, config, opts \\ []) do
case pop_entry() do
nil ->
fallback_module = Process.get(:rlm_replay_fallback_module, RLM.LLM)
fallback_module.chat(messages, model, config, opts)

entry ->
{:ok, entry.response, entry.usage}
end
end

@doc "Initialize the replay state with tape entries and a fallback module."
@spec load_tape(RLM.Replay.Tape.t(), module()) :: :ok
def load_tape(%RLM.Replay.Tape{entries: entries}, fallback_module) do
Process.put(:rlm_replay_entries, entries)
Process.put(:rlm_replay_fallback_module, fallback_module)
:ok
end

defp pop_entry do
case Process.get(:rlm_replay_entries, []) do
[] ->
nil

[entry | rest] ->
Process.put(:rlm_replay_entries, rest)
entry
end
end
end
47 changes: 47 additions & 0 deletions lib/rlm/replay/llm.ex
@@ -0,0 +1,47 @@
defmodule RLM.Replay.LLM do
@moduledoc """
An LLM module that replays responses from a recorded tape.
Used by `RLM.Replay.replay/2` to re-execute a run deterministically.

Responses are consumed in order. If the tape runs out, returns an error.

## Process dictionary

This uses the process dictionary because the `RLM.LLM` behaviour's `chat/4`
callback doesn't have a slot for replay state. Since Worker calls `chat/4`
synchronously from its own process, the tape state lives in the Worker's
process dict. This matches the existing pattern where `RLM.Eval` uses
process dict for `worker_pid`, `cwd`, etc.
"""

@behaviour RLM.LLM

@impl true
def chat(_messages, _model, _config, _opts \\ []) do
case pop_entry() do
nil ->
{:error, "Replay tape exhausted — no more recorded responses"}

entry ->
{:ok, entry.response, entry.usage}
end
end

@doc "Initialize the replay state for the current process."
@spec load_tape(RLM.Replay.Tape.t()) :: :ok
def load_tape(%RLM.Replay.Tape{entries: entries}) do
Process.put(:rlm_replay_entries, entries)
:ok
end

defp pop_entry do
case Process.get(:rlm_replay_entries, []) do
[] ->
nil

[entry | rest] ->
Process.put(:rlm_replay_entries, rest)
entry
end
end
end