Merged
28 changes: 28 additions & 0 deletions CHANGELOG.md
@@ -8,6 +8,34 @@ All notable changes to this project are documented here.

### Added

**Deterministic replay**

- `RLM.Replay` — replay orchestrator that re-executes a previously recorded run using
the original LLM responses. Replays re-evaluate all code but skip live LLM calls,
enabling debugging, regression testing, and model comparison. Supports:
- `:patch` option — `%{iteration => code}` map to replace code at specific iterations
while maintaining tape alignment (LLM response is still consumed)
- `:config` option — config overrides for the replay run
- `RLM.Replay.Tape` — struct representing an ordered sequence of recorded LLM responses,
with `from_events/1` builder that sources from EventLog Agent or TraceStore fallback
- `RLM.Replay.LLM` — `RLM.LLM` behaviour implementation that returns responses from a
tape via process dictionary (matching the existing process-dict pattern in `RLM.Eval`)
- `RLM.replay/2` — public API delegation to `RLM.Replay.replay/2`
- `enable_replay_recording` config flag (default: `false`) — when enabled, records the
full raw LLM response text for each iteration as `:llm_response` events in the trace
- `[:rlm, :llm, :response, :recorded]` telemetry event — emitted by Worker after each
successful LLM call when recording is enabled
- `original_context` and `original_query` fields in `:node_start` events for depth-0
workers — enables replay to recover the original inputs
- `:replay_patches` field on `RLM.Worker` struct — applied before eval to substitute
code at specific iterations during replay
- `RLM.Replay.FallbackLLM` — LLM behaviour impl that tries tape entries first,
then falls back to a live LLM module when the tape is exhausted
- `:fallback` option on `RLM.replay/2` — `:error` (default) or `:live` to switch
to live LLM calls when the tape runs out (e.g., because a patch caused extra iterations)
- 17 tests covering recording, tape construction, replay LLM, replay orchestration,
patching, fallback behavior, and the public API
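
The new telemetry event can be observed with a standard `:telemetry.attach/4` handler. A minimal sketch — the handler id and the metadata key read here are illustrative assumptions, not part of the documented event contract:

```elixir
# Hypothetical handler: logs each recorded LLM response during a run.
# The :run_id metadata key is an assumption for illustration.
:telemetry.attach(
  "replay-recording-logger",
  [:rlm, :llm, :response, :recorded],
  fn _event_name, _measurements, metadata, _handler_config ->
    IO.puts("recorded LLM response (run #{inspect(metadata[:run_id])})")
  end,
  nil
)
```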

**Distributed Erlang node support**

- `RLM.Node` — lightweight wrapper for OTP distribution with three public functions:
9 changes: 7 additions & 2 deletions CLAUDE.md
@@ -64,7 +64,7 @@ rlm/

Enforced at compile time via the `boundary` library:

- **`RLM`** — Core engine. Zero web dependencies. Exports: Config, Run, Worker, EventLog, TraceStore, Helpers, Span, IEx, Node, Telemetry, Tool, ToolRegistry.
- **`RLM`** — Core engine. Zero web dependencies. Exports: Config, Run, Worker, EventLog, TraceStore, Helpers, Span, IEx, Node, Replay, Replay.Tape, Replay.LLM, Replay.FallbackLLM, Telemetry, Tool, ToolRegistry.
- **`RLMWeb`** — Phoenix web layer. Depends on `RLM`. Exports: Endpoint.
- **`RLM.Application`** — Top-level. Depends on both `RLM` and `RLMWeb`. Starts the unified supervision tree.

@@ -180,7 +180,7 @@ Default models:

| Module | Purpose |
|---|---|
| `RLM` | Public API: `run/3`, `start_session/1`, `send_message/3`, `history/1`, `status/1` |
| `RLM` | Public API: `run/3`, `replay/2`, `start_session/1`, `send_message/3`, `history/1`, `status/1` |
| `RLM.Run` | Per-run coordinator; owns worker DynSup + eval TaskSup, ETS worker tree, crash propagation |
| `RLM.Config` | Config struct; loads from app env + keyword overrides |
| `RLM.Worker` | GenServer per execution node; iterate loop + keep_alive mode; delegates spawning to Run |
@@ -202,6 +202,10 @@ Default models:
| `RLM.Telemetry.Logger` | Structured logging handler |
| `RLM.Telemetry.PubSub` | Phoenix.PubSub broadcast handler |
| `RLM.Telemetry.EventLogHandler` | Routes telemetry events to EventLog Agent AND TraceStore |
| `RLM.Replay` | Replay orchestrator: `replay/2` replays a recorded run with optional code patches |
| `RLM.Replay.Tape` | Tape struct + `from_events/1` builder; ordered sequence of recorded LLM responses |
| `RLM.Replay.LLM` | LLM behaviour impl that returns responses from a tape (process-dict based) |
| `RLM.Replay.FallbackLLM` | LLM behaviour impl that tries tape first, then falls back to a live LLM module |
| `RLM.Application` | OTP application; starts unified supervision tree (core + web) |

### Filesystem Tools
Expand Down Expand Up @@ -262,6 +266,7 @@ Read-only Phoenix LiveView dashboard. Serves on `http://localhost:4000`.
| `enable_otel` | `false` | Enable OpenTelemetry integration |
| `enable_event_log` | `true` | Enable per-run EventLog trace agents |
| `event_log_capture_full_stdout` | `false` | Store full stdout in traces (vs truncated) |
| `enable_replay_recording` | `false` | Record full LLM responses for deterministic replay |
| `llm_module` | `RLM.LLM` | Swappable for `RLM.Test.MockLLM` |
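
Since `RLM.Config` loads from app env plus keyword overrides, the replay-recording flag can be set either way. A minimal sketch — the `:rlm` OTP app name is assumed from the module namespace:

```elixir
# config/config.exs — application-wide default (app name :rlm is an assumption)
config :rlm, enable_replay_recording: true

# Or per run, as a keyword override:
{:ok, answer, run_id} = RLM.run(context, query, enable_replay_recording: true)
```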

## Testing Conventions
26 changes: 26 additions & 0 deletions README.md
@@ -27,6 +27,7 @@ runs in a persistent REPL, with recursive sub-agent spawning and built-in filesy
- [Interactive sessions](#interactive-sessions)
- [IEx helpers](#iex-helpers)
- [Configuration overrides](#configuration-overrides)
- [Deterministic replay](#deterministic-replay)
- [Sandbox API](#sandbox-api)
- [Architecture](#architecture)
- [OTP supervision tree](#otp-supervision-tree)
@@ -143,6 +144,31 @@ watch(session) # attach a live telemetry stream
)
```

### Deterministic replay

Record a run and replay it later — useful for debugging, regression testing, and cost
optimization (replay re-evaluates all code without making LLM calls).

```elixir
# Step 1: Record a run (enable_replay_recording captures full LLM responses)
{:ok, answer, run_id} = RLM.run(context, query, enable_replay_recording: true)

# Step 2: Replay — same code runs, same result, no LLM calls
{:ok, same_answer, replay_run_id} = RLM.replay(run_id)

# Step 3: Replay with a patch — try different code at a specific iteration
{:ok, new_answer, _} = RLM.replay(run_id,
patch: %{0 => ~s(final_answer = "patched result")}
)

# Step 4: Patch + fallback — if the patch causes extra iterations,
# fall back to live LLM calls instead of erroring
{:ok, answer, _} = RLM.replay(run_id,
patch: %{0 => "x = 42"},
fallback: :live
)
```

### Sandbox API

Functions available inside the LLM's eval'd code:
29 changes: 29 additions & 0 deletions lib/rlm.ex
@@ -11,6 +11,10 @@ defmodule RLM do
Span,
IEx,
Node,
Replay,
Replay.Tape,
Replay.LLM,
Replay.FallbackLLM,
Telemetry,
Telemetry.PubSub,
Tool,
@@ -209,6 +213,31 @@ defmodule RLM do
:exit, _ -> {:error, :not_found}
end

# ---------------------------------------------------------------------------
# Replay API
# ---------------------------------------------------------------------------

@doc """
Replay a previously recorded run.

Requires the original run to have been executed with `enable_replay_recording: true`.
The replay re-executes all eval'd code using the recorded LLM responses.

## Options

* `:patch` — `%{iteration => code}` map to replace code at specific iterations
* `:fallback` — `:error` (default) or `:live` to switch to live LLM calls
when the tape runs out (e.g., because a patch caused extra iterations)
* `:config` — config overrides for the replay run (set `llm_module` here
to control which module handles live fallback calls)

Returns `{:ok, answer, replay_run_id}` or `{:error, reason}`.
"""
@spec replay(String.t(), keyword()) :: {:ok, any(), String.t()} | {:error, any()}
def replay(run_id, opts \\ []) do
RLM.Replay.replay(run_id, opts)
end

# ---------------------------------------------------------------------------
# Run management
# ---------------------------------------------------------------------------
2 changes: 2 additions & 0 deletions lib/rlm/config.ex
@@ -26,6 +26,7 @@ defmodule RLM.Config do
:enable_otel,
:enable_event_log,
:event_log_capture_full_stdout,
:enable_replay_recording,
:llm_module
]

@@ -57,6 +58,7 @@ defmodule RLM.Config do
enable_otel: get(overrides, :enable_otel, false),
enable_event_log: get(overrides, :enable_event_log, true),
event_log_capture_full_stdout: get(overrides, :event_log_capture_full_stdout, false),
enable_replay_recording: get(overrides, :enable_replay_recording, false),
llm_module: get(overrides, :llm_module, RLM.LLM)
}
end
109 changes: 109 additions & 0 deletions lib/rlm/replay.ex
@@ -0,0 +1,109 @@
defmodule RLM.Replay do
@moduledoc """
Replay a previously recorded RLM run.

Replays use the recorded LLM responses but re-execute all eval'd code.
This verifies that the same code still works against the current environment.

## Options

* `:patch` — `%{iteration => code}` map. At the specified iteration, the
patched code replaces the LLM's code before eval. The LLM response is
still consumed from the tape (to maintain iteration alignment).
* `:fallback` — what to do when the tape runs out:
- `:error` (default) — return an error
- `:live` — switch to live LLM calls for remaining iterations
* `:config` — config overrides applied to the replay run. When using
`fallback: :live`, set `llm_module` here to control which module
handles the live calls (defaults to `RLM.LLM`).
"""

@spec replay(String.t(), keyword()) :: {:ok, any(), String.t()} | {:error, any()}
def replay(run_id, opts \\ []) do
patches = Keyword.get(opts, :patch, %{})
config_overrides = Keyword.get(opts, :config, [])
fallback = Keyword.get(opts, :fallback, :error)

with {:ok, tape} <- RLM.Replay.Tape.from_events(run_id) do
# Resolve the LLM module: tape-only or tape-with-fallback
{llm_module, fallback_module} =
case fallback do
:live ->
# The fallback module comes from config overrides, or the default
user_config = RLM.Config.load(config_overrides)
{RLM.Replay.FallbackLLM, user_config.llm_module}

:error ->
{RLM.Replay.LLM, nil}
end

config =
RLM.Config.load(
Keyword.merge(config_overrides,
llm_module: llm_module,
enable_replay_recording: false
)
)

# Extract original context and query from the tape
context = tape.context || ""
query = tape.query || context

replay_run_id = RLM.Span.generate_run_id()
span_id = RLM.Span.generate_id()

run_opts = [run_id: replay_run_id, config: config]

case DynamicSupervisor.start_child(RLM.RunSup, {RLM.Run, run_opts}) do
{:ok, run_pid} ->
worker_opts = [
span_id: span_id,
run_id: replay_run_id,
context: context,
query: query,
config: config,
depth: 0,
model: config.model_large,
caller: self(),
replay_tape: tape,
replay_patches: patches,
replay_fallback_module: fallback_module
]

case RLM.Run.start_worker(run_pid, worker_opts) do
{:ok, pid} ->
ref = Process.monitor(pid)
total_timeout = config.eval_timeout * 2

receive do
{:rlm_result, ^span_id, {:ok, answer}} ->
Process.demonitor(ref, [:flush])
{:ok, answer, replay_run_id}

{:rlm_result, ^span_id, {:error, reason}} ->
Process.demonitor(ref, [:flush])
{:error, reason}

{:DOWN, ^ref, :process, ^pid, :normal} ->
{:error, "Replay worker exited without result"}

{:DOWN, ^ref, :process, ^pid, reason} ->
{:error, "Replay worker crashed: #{inspect(reason)}"}
after
total_timeout ->
Process.demonitor(ref, [:flush])
RLM.terminate_run(run_pid)
{:error, "Replay timed out after #{total_timeout}ms"}
end

{:error, reason} ->
RLM.terminate_run(run_pid)
{:error, "Failed to start replay worker: #{inspect(reason)}"}
end

{:error, reason} ->
{:error, "Failed to start replay run: #{inspect(reason)}"}
end
end
end
end
42 changes: 42 additions & 0 deletions lib/rlm/replay/fallback_llm.ex
@@ -0,0 +1,42 @@
defmodule RLM.Replay.FallbackLLM do
@moduledoc """
An LLM module that replays from a tape, falling back to a live LLM
module when the tape is exhausted.

Used by `RLM.Replay.replay/2` with `fallback: :live`. The fallback
module is stored in the process dictionary alongside the tape entries.
"""

@behaviour RLM.LLM

@impl true
def chat(messages, model, config, opts \\ []) do
case pop_entry() do
nil ->
fallback_module = Process.get(:rlm_replay_fallback_module, RLM.LLM)
fallback_module.chat(messages, model, config, opts)

entry ->
{:ok, entry.response, entry.usage}
end
end

@doc "Initialize the replay state with tape entries and a fallback module."
@spec load_tape(RLM.Replay.Tape.t(), module()) :: :ok
def load_tape(%RLM.Replay.Tape{entries: entries}, fallback_module) do
Process.put(:rlm_replay_entries, entries)
Process.put(:rlm_replay_fallback_module, fallback_module)
:ok
end

defp pop_entry do
case Process.get(:rlm_replay_entries, []) do
[] ->
nil

[entry | rest] ->
Process.put(:rlm_replay_entries, rest)
entry
end
end
end
47 changes: 47 additions & 0 deletions lib/rlm/replay/llm.ex
@@ -0,0 +1,47 @@
defmodule RLM.Replay.LLM do
@moduledoc """
An LLM module that replays responses from a recorded tape.
Used by `RLM.Replay.replay/2` to re-execute a run deterministically.

Responses are consumed in order. If the tape runs out, returns an error.

## Process dictionary

This uses the process dictionary because the `RLM.LLM` behaviour's `chat/4`
callback doesn't have a slot for replay state. Since Worker calls `chat/4`
synchronously from its own process, the tape state lives in the Worker's
process dict. This matches the existing pattern where `RLM.Eval` uses
process dict for `worker_pid`, `cwd`, etc.
"""

@behaviour RLM.LLM

@impl true
def chat(_messages, _model, _config, _opts \\ []) do
case pop_entry() do
nil ->
{:error, "Replay tape exhausted — no more recorded responses"}

entry ->
{:ok, entry.response, entry.usage}
end
end

@doc "Initialize the replay state for the current process."
@spec load_tape(RLM.Replay.Tape.t()) :: :ok
def load_tape(%RLM.Replay.Tape{entries: entries}) do
Process.put(:rlm_replay_entries, entries)
:ok
end

defp pop_entry do
case Process.get(:rlm_replay_entries, []) do
[] ->
nil

[entry | rest] ->
Process.put(:rlm_replay_entries, rest)
entry
end
end
end