56 changes: 56 additions & 0 deletions CHANGELOG.md
@@ -8,6 +8,39 @@ All notable changes to this project are documented here.

### Added

**Multi-provider LLM support via req_llm**

- `RLM.LLM.ReqLLM` — new default LLM backend that delegates to `req_llm` v1.6,
supporting Anthropic, OpenAI, Ollama (local models), Google Gemini, Groq, and any
other provider that `req_llm` supports. Model specs use the `"provider:model-name"`
convention (e.g., `"anthropic:claude-sonnet-4-6"`, `"ollama:qwen3.5:35b"`). Bare
model names without a provider prefix are treated as Anthropic for backward compatibility.
- `RLM.LLM.Anthropic` — the previous hand-rolled Anthropic Messages API client,
preserved as a fallback for users who need direct Anthropic-specific control.
Select via `llm_module: RLM.LLM.Anthropic`.
- `RLM.LLM` refactored to a pure behaviour module + shared utilities
(`extract_structured/1`, `response_schema/0`); no longer contains an implementation.
- `models` config field — `%{atom() => String.t()}` map of symbolic keys to
model specs. Default: `%{large: "claude-sonnet-4-6", small: "claude-haiku-4-5"}`.
Bare names are auto-prefixed with `"anthropic:"` by `ReqLLM`. Pass custom maps
for Ollama/OpenAI: `models: %{large: "ollama:qwen3.5:35b", small: "ollama:qwen3.5:9b"}`
- `RLM.Config.resolve_model/2` — looks up a model key in the `models` map
- `RLM.Config.context_window_for/2` — resolves context window size for a model key
(legacy fields for `:large`/`:small`, default 128k for custom keys)
- `model_key` option on Workers — replaces inline `config.model_large`/`config.model_small`
lookups with named model map resolution

### Changed

- Default `llm_module` changed from `RLM.LLM` (which was the implementation) to
`RLM.LLM.ReqLLM` (the new multi-provider adapter)
- API key resolution now checks `ANTHROPIC_API_KEY` first, falls back to `CLAUDE_API_KEY`
- `RLM.Worker` uses `model_key` (`:large`, `:small`, or custom atom) to resolve model
specs via `Config.resolve_model/2` instead of reading `config.model_large`/`model_small`
- `RLM.run/3`, `RLM.run_async/3`, `RLM.start_session/1`, `RLM.Replay.replay/2` pass
`model_key:` instead of `model:` in worker opts
- `req_llm` (`~> 1.6`) added as a dependency

**Deterministic replay**

- `RLM.Replay` — replay orchestrator that re-executes a previously recorded run using
@@ -33,9 +66,32 @@ All notable changes to this project are documented here.
then falls back to a live LLM module when the tape is exhausted
- `:fallback` option on `RLM.replay/2` — `:error` (default) or `:live` to switch
to live LLM calls when the tape runs out (e.g., because a patch caused extra iterations)
- `examples/local_models.exs` — new example demonstrating Ollama/local model usage
with no API key required. Registered as `mix rlm.examples local_models`
- `test/rlm/config_test.exs` — 16 new unit tests for `Config.load/1`,
`Config.resolve_model/2`, and `Config.context_window_for/2`
- 17 tests covering recording, tape construction, replay LLM, replay orchestration,
patching, fallback behavior, and the public API

### Fixed

- `RLM.LLM.ReqLLM.encode_object/1` now returns an explicit error instead of silently
falling back to an empty string when the LLM response contains no usable content
- `RLM.LLM.ReqLLM.extract_usage/1` logs a warning when token usage extraction fails
(all fields nil despite non-empty response), preventing silent zero-cost reporting
- `RLM.Replay.Tape.get_events/1` now catches `:noproc` exits specifically and logs
a warning for unexpected exit reasons, instead of broadly swallowing all exits
- `RLM.Replay.FallbackLLM` now logs when switching from tape replay to live LLM calls
- `RLM.Config.context_window_for/2` now logs a warning when using the 128k default
for custom model keys, making it easier to diagnose compaction behavior
- `RLM.Replay` moduledoc corrected: fallback default is `RLM.LLM.ReqLLM` (not `RLM.LLM`)
- `RLM.Worker` moduledoc updated to be provider-agnostic (no longer references "Claude's
output_config" specifically)
- `CLAUDE.md` — removed stale `cost_per_1k_*` config fields; fixed `models` default to
match actual bare-name defaults; updated env var references to `ANTHROPIC_API_KEY`
- All examples updated from `CLAUDE_API_KEY` to `ANTHROPIC_API_KEY`; smoke test checks
both env vars

**Distributed Erlang node support**

- `RLM.Node` — lightweight wrapper for OTP distribution with three public functions:
139 changes: 116 additions & 23 deletions CLAUDE.md
@@ -13,7 +13,10 @@ rlm/
│ │ ├── run.ex # Per-run coordinator GenServer
│ │ ├── worker.ex # RLM GenServer (iterate loop + keep_alive)
│ │ ├── eval.ex # Sandboxed Code.eval_string
│ │ ├── llm.ex # Anthropic Messages API client
│ │ ├── llm.ex # LLM behaviour + shared utilities
│ │ ├── llm/
│ │ │ ├── req_llm.ex # Multi-provider backend via req_llm (default)
│ │ │ └── anthropic.ex # Direct Anthropic API client (legacy fallback)
│ │ ├── helpers.ex # chunks/2, grep/2, preview/2, list_bindings/0
│ │ ├── sandbox.ex # Eval sandbox: helpers + LLM calls + tool wrappers
│ │ ├── prompt.ex # System prompt + message formatting
@@ -83,10 +86,10 @@ mix test
# Run tests with trace output
mix test --trace

# Run live API tests (requires CLAUDE_API_KEY env var)
# Run live API tests (requires ANTHROPIC_API_KEY or CLAUDE_API_KEY env var)
mix test --include live_api

# Live smoke test (requires CLAUDE_API_KEY env var)
# Live smoke test (requires ANTHROPIC_API_KEY or CLAUDE_API_KEY env var)
mix rlm.smoke

# Interactive shell
@@ -162,15 +165,27 @@ retrieve the execution trace via `RLM.EventLog`. On failure it returns `{:error,
A `Process.monitor` on the Worker ensures crashes surface as errors rather than hangs.

### LLM Client
Uses the Anthropic Messages API (not OpenAI format). System messages are
extracted and sent as the top-level `system` field. Requires `CLAUDE_API_KEY` env var.
The default backend is `RLM.LLM.ReqLLM`, which delegates to the `req_llm` package
and supports any provider: Anthropic, OpenAI, Ollama (local), Gemini, Groq, etc.
Model specs use the `"provider:model-name"` convention (e.g., `"anthropic:claude-sonnet-4-6"`,
`"ollama:qwen3.5:35b"`). Bare names without a prefix are treated as Anthropic for
backward compatibility. Requires `ANTHROPIC_API_KEY` (or `CLAUDE_API_KEY` as fallback).

LLM responses use structured output (`output_config` with `json_schema`) to constrain
responses to `{"reasoning": "...", "code": "..."}` JSON objects. This eliminates regex-based
code extraction and provides clean separation of reasoning from executable code. Feedback
messages after eval are also structured JSON.
The legacy hand-rolled Anthropic client is preserved as `RLM.LLM.Anthropic` and can
be selected via `llm_module: RLM.LLM.Anthropic`.

Default models:
LLM responses use structured output (JSON schema) to constrain responses to
`{"reasoning": "...", "code": "..."}` objects. Feedback messages after eval are also
structured JSON.

The `models` config field maps symbolic keys to model specs:

```elixir
RLM.run(context, query,
models: %{large: "ollama:qwen3.5:35b", small: "ollama:qwen3.5:9b"})
```

Default models (bare names; `ReqLLM` auto-prefixes with `"anthropic:"`):
- Large: `claude-sonnet-4-6`
- Small: `claude-haiku-4-5`

@@ -186,7 +201,9 @@ Default models:
| `RLM.Worker` | GenServer per execution node; iterate loop + keep_alive mode; delegates spawning to Run |
| `RLM.Eval` | Sandboxed `Code.eval_string` with async IO capture + cwd injection |
| `RLM.Sandbox` | Functions injected into eval'd code (helpers + LLM calls + tool wrappers) |
| `RLM.LLM` | Anthropic Messages API client with structured output (`extract_structured/1`) |
| `RLM.LLM` | LLM behaviour + shared utilities (`extract_structured/1`, `response_schema/0`) |
| `RLM.LLM.ReqLLM` | Multi-provider LLM backend via `req_llm` (default) |
| `RLM.LLM.Anthropic` | Direct Anthropic Messages API client (legacy fallback) |
| `RLM.Prompt` | System prompt loading + structured JSON feedback message formatting |
| `RLM.Helpers` | `chunks/2`, `grep/2`, `preview/2`, `list_bindings/0` |
| `RLM.Truncate` | Head+tail string truncation for stdout overflow |
@@ -246,9 +263,10 @@ Read-only Phoenix LiveView dashboard. Serves on `http://localhost:4000`.
| Field | Default | Notes |
|---|---|---|
| `api_base_url` | `"https://api.anthropic.com"` | Anthropic API base URL |
| `api_key` | `CLAUDE_API_KEY` env var | API key for LLM requests |
| `model_large` | `claude-sonnet-4-6` | Used for parent workers |
| `model_small` | `claude-haiku-4-5` | Used for subcalls |
| `api_key` | `ANTHROPIC_API_KEY` env var | API key for LLM requests (falls back to `CLAUDE_API_KEY`) |
| `models` | `%{large: "claude-sonnet-4-6", small: "claude-haiku-4-5"}` | Named model map; keys are atoms, values are model specs. Bare names are auto-prefixed with `"anthropic:"` by `ReqLLM` |
| `model_large` | `claude-sonnet-4-6` | Legacy; used to build default `models` map |
| `model_small` | `claude-haiku-4-5` | Legacy; used to build default `models` map |
| `max_iterations` | `25` | Per-worker LLM turn limit |
| `max_depth` | `5` | Recursive subcall depth limit |
| `max_concurrent_subcalls` | `10` | Parallel subcall limit per worker |
@@ -259,23 +277,19 @@ Read-only Phoenix LiveView dashboard. Serves on `http://localhost:4000`.
| `eval_timeout` | `300_000` | ms per eval (5 min) |
| `llm_timeout` | `120_000` | ms per LLM request (2 min) |
| `subcall_timeout` | `600_000` | ms per subcall (10 min) |
| `cost_per_1k_prompt_tokens_large` | `0.003` | Cost tracking for large model input |
| `cost_per_1k_prompt_tokens_small` | `0.0008` | Cost tracking for small model input |
| `cost_per_1k_completion_tokens_large` | `0.015` | Cost tracking for large model output |
| `cost_per_1k_completion_tokens_small` | `0.004` | Cost tracking for small model output |
| `enable_otel` | `false` | Enable OpenTelemetry integration |
| `enable_event_log` | `true` | Enable per-run EventLog trace agents |
| `event_log_capture_full_stdout` | `false` | Store full stdout in traces (vs truncated) |
| `enable_replay_recording` | `false` | Record full LLM responses for deterministic replay |
| `llm_module` | `RLM.LLM` | Swappable for `RLM.Test.MockLLM` |
| `llm_module` | `RLM.LLM.ReqLLM` | Default LLM backend; swap to `RLM.LLM.Anthropic` or `RLM.Test.MockLLM` |

## Testing Conventions

- Tests use `RLM.Test.MockLLM` (global ETS-based response queue) for deterministic testing
- Worker/keep_alive tests run `async: false` since MockLLM uses global state
- Tool tests and sandbox tests can run `async: true` (no global state)
- Live API tests tagged with `@moduletag :live_api` and excluded by default
- `mix test --include live_api` requires `CLAUDE_API_KEY` env var
- `mix test --include live_api` requires `ANTHROPIC_API_KEY` (or `CLAUDE_API_KEY`) env var
- Test support files in `test/support/`
- Tool tests use a per-test temp directory (created in `setup`, cleaned in `on_exit`)
- Worker concurrency/depth tests use `RLM.Test.Helpers.start_test_run/1` to create a Run, then spawn Workers via `RLM.Run.start_worker/2`
@@ -286,7 +300,7 @@ Read-only Phoenix LiveView dashboard. Serves on `http://localhost:4000`.
- Workers use `restart: :temporary` — they terminate normally after completion
- The `llm_module` config field enables dependency injection for testing
- Bash tool uses `Task.async` + `Task.yield/2` (not `System.cmd` — it has no `:timeout` option)
- `.env` file with `CLAUDE_API_KEY` should exist at project root but must not be committed
- `.env` file with `ANTHROPIC_API_KEY` (or `CLAUDE_API_KEY`) should exist at project root but must not be committed
- `RLM.run/3` monitors the Worker with `Process.monitor` so crashes return `{:error, reason}`
rather than hanging indefinitely
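
For example, a caller can rely on that monitor to pattern-match both outcomes (sketch; bindings are illustrative):

```elixir
case RLM.run(context, query, max_iterations: 10) do
  {:ok, result, run_id} ->
    # Success: the answer plus the run id for EventLog trace lookup
    {result, run_id}

  {:error, reason} ->
    # Worker crash or failure, surfaced via the Process.monitor
    {:failed, reason}
end
```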

Expand All @@ -309,20 +323,99 @@ The dashboard is a Phoenix 1.8 LiveView application. Key conventions:

## Orientation for Coding Agents

### Getting Started

When starting a task, read these files in order:

1. **`CLAUDE.md`** (this file) — architecture, invariants, module map
2. **`config/config.exs`** — runtime defaults
3. The specific module(s) relevant to your task (see Module Map above)
4. The corresponding test file to understand expected behaviour

Key invariants **never to break**:
### Key Invariants (Never Break These)

- Raw input data must not enter any LLM context window (use `preview/2` or metadata only)
- Workers are `:temporary` — do not change their restart strategy
- The async-eval pattern in `RLM.Worker` is intentional; do not make eval synchronous
- All session tests must use `async: false` (MockLLM is global ETS state)
- Run → Worker communication is always `send/2`, never `GenServer.call` (deadlock prevention)

### Key Contracts & Interfaces

**LLM Behaviour** (`RLM.LLM`):
```elixir
@callback chat(messages :: [map()], model :: String.t(), config :: RLM.Config.t(), opts :: keyword()) ::
{:ok, json_string :: String.t(), usage :: usage()} | {:error, String.t()}
```
All LLM modules (`ReqLLM`, `Anthropic`, `MockLLM`, `Replay.LLM`, `Replay.FallbackLLM`) implement
this same callback. The `json_string` return is always a JSON-encoded string, never a parsed map.

**Usage type**: `%{prompt_tokens: integer | nil, completion_tokens: integer | nil, total_tokens: integer | nil, cache_creation_input_tokens: integer | nil, cache_read_input_tokens: integer | nil}`

**Model resolution**: Use `RLM.Config.resolve_model(config, :large | :small | atom())` → `{:ok, "provider:model-name"}` or `{:error, reason}`. In Worker, use `resolve_model!/2` (raises on unknown keys).
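
A sketch of the resolution flow (values are illustrative, and `Config.load/1` is assumed to accept keyword overrides as in the config tests):

```elixir
# Illustrative only: model specs and keys here are examples.
config = RLM.Config.load(models: %{large: "ollama:qwen3.5:35b", small: "ollama:qwen3.5:9b"})

{:ok, spec} = RLM.Config.resolve_model(config, :large)
# spec is the model spec for :large, e.g. "ollama:qwen3.5:35b"

# Unknown keys return an error tuple; the Worker's resolve_model!/2 raises instead.
{:error, _reason} = RLM.Config.resolve_model(config, :no_such_key)
```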

**Tool Behaviour** (`RLM.Tool`):
```elixir
@callback name() :: String.t()
@callback description() :: String.t()
@callback execute(map()) :: {:ok, String.t()} | {:error, String.t()}
```

### Dependency Injection Pattern

The `llm_module` config field is the primary injection point:
- **Production**: `RLM.LLM.ReqLLM` (default) — multi-provider via `req_llm`
- **Testing**: `RLM.Test.MockLLM` — ETS-based response queue, set in `config/test.exs`
- **Legacy**: `RLM.LLM.Anthropic` — direct Anthropic HTTP client
- **Replay**: `RLM.Replay.LLM` / `RLM.Replay.FallbackLLM` — tape-based, set by `RLM.Replay`

When adding a new LLM feature, implement it in the behaviour callback — the Worker
calls `config.llm_module.chat(...)` and is provider-agnostic.

### Testing Patterns

**MockLLM usage** — queue expected responses before running Workers:
```elixir
RLM.Test.MockLLM.enqueue(%{
"reasoning" => "I'll count the lines",
"code" => ~s(final_answer = 4)
})
```
MockLLM is global ETS state. Tests using it must be `async: false`.

**Creating a test Run** — use `RLM.Test.Helpers.start_test_run/1`:
```elixir
{run_pid, run_id} = RLM.Test.Helpers.start_test_run(config)
{:ok, worker_pid, span_id} = RLM.Run.start_worker(run_pid, worker_opts)
```

**Tool tests** — use per-test temp dirs (created in `setup`, cleaned in `on_exit`);
these can run `async: true` since tools have no global state.

### Common Modification Patterns

**Adding a new config field:**
1. Add to `defstruct` in `config.ex`
2. Add to `load/1` with `get(overrides, :key, default)`
3. Add row to CLAUDE.md Config Fields table
4. Add to CHANGELOG.md
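
A sketch of steps 1 and 2 (`:my_field` and its default are hypothetical):

```elixir
# lib/rlm/config.ex (:my_field is a hypothetical example field)
defstruct [
  # ...existing fields...
  :my_field
]

def load(overrides) do
  %__MODULE__{
    # ...existing fields...
    my_field: get(overrides, :my_field, 5_000)
  }
end
```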

**Adding a new tool:**
1. Create `lib/rlm/tools/my_tool.ex` implementing `RLM.Tool`
2. Add to `RLM.ToolRegistry.all/0`
3. Add wrapper function to `RLM.Sandbox`
4. Add to system prompt in `priv/system_prompt.md`
5. Add row to CLAUDE.md Module Map (Filesystem Tools section)
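
Steps 1 and 2 might look like this (the tool itself is hypothetical; callback shapes follow the `RLM.Tool` behaviour above):

```elixir
# lib/rlm/tools/my_tool.ex (hypothetical example tool)
defmodule RLM.Tools.MyTool do
  @behaviour RLM.Tool

  @impl true
  def name, do: "my_tool"

  @impl true
  def description, do: "One-line description shown to the LLM in the system prompt."

  @impl true
  def execute(%{"path" => path}), do: {:ok, "processed #{path}"}
  def execute(_args), do: {:error, "missing required \"path\" argument"}
end
```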

**Adding a new LLM behaviour implementation:**
1. Create module with `@behaviour RLM.LLM`
2. Implement `chat/4` returning `{:ok, json_string, usage}` or `{:error, string}`
3. Users select it via `llm_module:` config override
4. Add row to CLAUDE.md Module Map
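
A minimal sketch of such a module (the canned response is illustrative; `Jason` is assumed available, as is standard in Phoenix apps):

```elixir
defmodule RLM.LLM.Canned do
  @behaviour RLM.LLM

  @impl true
  def chat(_messages, _model, _config, _opts) do
    # Return a JSON-encoded string, never a parsed map (see the behaviour contract).
    json = Jason.encode!(%{"reasoning" => "stub", "code" => "final_answer = :ok"})

    usage = %{
      prompt_tokens: nil,
      completion_tokens: nil,
      total_tokens: nil,
      cache_creation_input_tokens: nil,
      cache_read_input_tokens: nil
    }

    {:ok, json, usage}
  end
end
```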

### Before Committing

Before committing, always run:
Always run:
```bash
mix compile --warnings-as-errors
mix test
```
19 changes: 13 additions & 6 deletions README.md
@@ -9,8 +9,9 @@ I wanted to take further, and the design philosophy behind
[pi](https://github.com/badlogic/pi-mono/) — a coding agent that keeps things simple and
transparent. This is very much a learning project, but it works and it's been fun to build.

A single Phoenix application: an AI execution engine where Claude writes Elixir code that
A single Phoenix application: an AI execution engine where LLMs write Elixir code that
runs in a persistent REPL, with recursive sub-agent spawning and built-in filesystem tools.
Supports multiple LLM providers via `req_llm`: Anthropic, OpenAI, Ollama (local), Gemini, and more.

**One engine, two modes:**
1. **One-shot** — `RLM.run/3` processes data and returns a result
@@ -81,7 +82,7 @@ Three invariants the engine enforces:
Requires Elixir ≥ 1.19 / OTP 27 and an [Anthropic API key](https://console.anthropic.com/).

```bash
export CLAUDE_API_KEY=sk-ant-...
export ANTHROPIC_API_KEY=sk-ant-... # or CLAUDE_API_KEY as fallback
mix deps.get && mix compile
mix test # excludes live API tests
iex -S mix # interactive shell
```
@@ -136,12 +137,18 @@ watch(session) # attach a live telemetry stream
### Configuration overrides

```elixir
# Use custom Anthropic models
{:ok, result, run_id} = RLM.run(context, query,
max_iterations: 10,
max_depth: 3,
model_large: "claude-opus-4-6",
models: %{large: "anthropic:claude-opus-4-6", small: "anthropic:claude-haiku-4-5"},
eval_timeout: 60_000
)

# Use local Ollama models (no API key needed)
{:ok, result, run_id} = RLM.run(context, query,
models: %{large: "ollama:qwen3.5:35b", small: "ollama:qwen3.5:9b"}
)
```

### Deterministic replay
@@ -356,9 +363,9 @@ RLM_COOKIE=secret # shared secret for node authentication

RLM executes LLM-generated Elixir code via `Code.eval_string` with full access to the
host filesystem, network, and shell. **Do not expose RLM to untrusted users or untrusted
LLM providers.** It is designed for local development, trusted API backends (Anthropic),
and controlled environments. There is no sandboxing beyond process-level isolation and
configurable timeouts.
LLM providers.** It is designed for local development, trusted API backends (Anthropic,
OpenAI, local Ollama), and controlled environments. There is no sandboxing beyond
process-level isolation and configurable timeouts.

---

9 changes: 6 additions & 3 deletions config/runtime.exs
@@ -35,12 +35,15 @@ if config_env() == :prod do
You can generate one by calling: mix phx.gen.secret
"""

# CLAUDE_API_KEY is required in prod for LLM calls.
System.get_env("CLAUDE_API_KEY") ||
# An API key is required in prod for LLM calls.
# ANTHROPIC_API_KEY is preferred; CLAUDE_API_KEY is accepted as a fallback.
unless System.get_env("ANTHROPIC_API_KEY") || System.get_env("CLAUDE_API_KEY") do
raise """
environment variable CLAUDE_API_KEY is missing.
environment variable ANTHROPIC_API_KEY is missing.
Set it to your Anthropic API key to enable LLM functionality.
(CLAUDE_API_KEY is also accepted as a fallback.)
"""
end

host = System.get_env("PHX_HOST") || "example.com"

2 changes: 1 addition & 1 deletion examples/code_review.exs
@@ -11,7 +11,7 @@
# - Filesystem tool usage visible in code blocks
#
# Usage:
# export CLAUDE_API_KEY=sk-ant-...
# export ANTHROPIC_API_KEY=sk-ant-...
# mix run examples/code_review.exs
#
# Or via the Mix task: