1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -9,6 +9,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Added

- Native tool / function calling across providers (#116). New `tools`, `tool_choice`, and `max_tool_iterations` fields on `llm_call`; the LLM step dispatches via the existing `ToolRegistry`, feeds results back, and re-prompts until the model stops asking (capped by `max_tool_iterations`, default 5). Each adapter translates the unified `ToolDefinition` list to its native shape (OpenAI / Ollama / Anthropic / Google). Parallel tool calls dispatch concurrently and preserve order; failures are reported back as text so the model can recover. Sandbox (#105), budget (#108), and per-step retry (#106) apply unchanged. `MockProvider` recordings accept a list of turns per step so offline replay drives the loop end-to-end; `examples/35_tool_calling.yaml` ships a ReAct-style example. The OpenAI-shaped parser also handles Ollama-compat responses (no `type` field, `arguments` as a decoded dict) — without this, Ollama tool calling silently dropped every call; see the parsing sketch after this list.
- Per-run experiment metadata logging (#77). Every workflow execution now writes a self-contained JSON record (`run_id`, ISO timestamp, AgentLoom version, Python version, workflow `sha256` hash, list of `provider/model` pairs used, status, total cost, total tokens, step count, duration) to `./agentloom_runs/<run_id>.json`. Override the directory via the `runs_dir` constructor argument on `RunHistoryWriter` or the `AGENTLOOM_RUNS_DIR` env var. Disk I/O happens in a worker thread so the write does not block the event loop. Records carry a `_schema_version: 1` field; failures during the write are logged and swallowed so a broken history directory cannot prevent the engine from returning the result. New `agentloom history` CLI subcommand lists records most-recent-first and accepts `--workflow`, `--provider`, `--since YYYY-MM-DD`, `--until YYYY-MM-DD`, `--min-cost`, `--max-cost`, `--limit`, and `--json` filters — covering the full filter surface (date, workflow, cost, provider) called for in the original issue.
- Quality annotations attachable to `WorkflowResult` (#59). New `WorkflowResult.annotate(target, quality_score=..., source=..., **metadata)` method appends a typed `QualityAnnotation` (`target`, `quality_score`, `source`, `metadata`) to the result so evaluators, human reviewers, or downstream scoring code can record output quality after the run completes. **The annotation is auto-emitted as an OTel span** the moment `annotate()` runs, provided the engine returned the result with a tracing context attached (the default for any workflow run with observability enabled) — `result.annotate("answer", quality_score=4.5, source="human_feedback")` becomes immediately visible in Jaeger with no additional plumbing on the caller side. Each annotation is published as a standalone `quality:<target>` span (the workflow span has already closed, so retroactive attribute attachment is not viable). Quality spans carry `workflow.run_id` and `workflow.name` plus `agentloom.quality.score`, `agentloom.quality.source`, `agentloom.quality.target`, and free-form `agentloom.quality.metadata.*` attributes — Jaeger / Tempo can group quality spans with the original trace by run_id, and dashboards can filter for `agentloom.quality.score < threshold` to surface regressions. Offline / replay paths that build a `WorkflowResult` without a live tracer keep working — `annotate()` still records the data on the result, the OTel emission just no-ops. The `agentloom.observability.quality.emit_quality_annotation` / `emit_quality_annotations` helpers remain available for callers that build annotations outside the engine flow (e.g. batch evaluators reading historical results from disk). A usage sketch follows this list.
- OTel span and metric schema centralization with GenAI semantic conventions (#125). The schema is a clean break — no compatibility shims for pre-#125 attribute or metric names. New `agentloom.observability.schema` module is the single source of truth for span / attribute / metric names; downstream consumers (Grafana, AgentTest, Jaeger plugins) parse a stable contract instead of grepping for ad-hoc strings. **Metrics renamed and retyped** to match the OTel GenAI registry: `agentloom_tokens_total` (counter) → `gen_ai.client.token.usage` (histogram, `{token}` unit) with `gen_ai.token.type` attribute (`input` / `output` / `reasoning`); `agentloom_provider_latency_seconds` (histogram) → `gen_ai.client.operation.duration` (histogram, `s`) with `gen_ai.operation.name` + `gen_ai.provider.name` attributes; `agentloom_time_to_first_token_seconds` → `gen_ai.client.operation.time_to_first_chunk`. AgentLoom-specific metrics (`agentloom_workflow_*`, `agentloom_step_*`, `agentloom_provider_calls_total`, `agentloom_cost_usd_total`, `agentloom_circuit_breaker_state`, `agentloom_budget_remaining_usd`, HITL / mock / recording counters) keep their `agentloom_` prefix — they have no OTel equivalent. The bundled Grafana dashboard is updated to query the new metric / label names. The legacy `Observer.on_provider_call` hook (which duplicated the metric emission already done by `on_provider_call_end`) is removed; the engine no longer fires it. The `tokens: int` positional argument on `on_step_end` is removed — callers now pass `prompt_tokens` / `completion_tokens` as kwargs. Span attributes follow the **canonical OTel GenAI registry** as of the May 2026 spec — `gen_ai.provider.name` (the deprecated `gen_ai.system` is **not** emitted), `gen_ai.operation.name`, `gen_ai.request.model`, `gen_ai.request.temperature`, `gen_ai.request.max_tokens`, `gen_ai.request.stream`, `gen_ai.response.model`, `gen_ai.response.finish_reasons` (array of strings, per spec), `gen_ai.response.time_to_first_chunk`, `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, `gen_ai.usage.reasoning.output_tokens`. Errored inference spans also emit the OTel general-conventions attribute `error.type` alongside the AgentLoom-specific `step.error` so OTel-aware consumers (Jaeger error filters, Tempo) light up. Inference spans use the canonical name template `"{operation_name} {model}"` (e.g. `"chat gpt-4o-mini"`); workflow / step orchestration spans keep the AgentLoom-specific `workflow:*` / `step:*` names. AgentLoom-specific fields stay under `workflow.*` / `step.*` / `tool.*` / `agentloom.*` namespaces. Provider names are translated from AgentLoom internal names to OTel registry values via `to_genai_provider_name` (e.g. `google` → `gcp.gemini`). Notable additions:
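
To make the #116 Ollama-compat note above concrete, here is a minimal parsing sketch. Only the two quirks it tolerates (missing `type`, pre-decoded `arguments`) come from the entry; the surrounding field layout and the helper name are illustrative assumptions, not AgentLoom's actual parser.

```python
import json
from typing import Any

def parse_tool_calls(message: dict[str, Any]) -> list[dict[str, Any]]:
    """Normalize OpenAI-shaped tool calls, tolerating Ollama-compat output."""
    calls = []
    for raw in message.get("tool_calls", []):
        # OpenAI sends type == "function"; Ollama's compat endpoint omits
        # the field entirely, so treat a missing type as "function".
        if raw.get("type", "function") != "function":
            continue
        fn = raw.get("function", {})
        args = fn.get("arguments", {})
        # OpenAI encodes arguments as a JSON string; Ollama returns an
        # already-decoded dict. Accept both.
        if isinstance(args, str):
            args = json.loads(args) if args else {}
        calls.append({"id": raw.get("id", ""), "name": fn.get("name", ""), "arguments": args})
    return calls
```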
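For the #77 run-history entry, a sketch of reading the records back in plain Python. The directory, the `_schema_version` field, and the newest-first ordering come from the entry above; the exact JSON key names (`timestamp`, `total_cost`, `workflow`) are assumptions, so check a real record before relying on them.

```python
import json
from pathlib import Path

RUNS_DIR = Path("agentloom_runs")  # default; AGENTLOOM_RUNS_DIR overrides it

def load_runs(min_cost: float = 0.0, workflow: str | None = None) -> list[dict]:
    """Replay a subset of the `agentloom history` filters offline."""
    records = []
    for path in RUNS_DIR.glob("*.json"):
        record = json.loads(path.read_text())
        if record.get("_schema_version") != 1:
            continue  # unknown schema version: skip rather than misread
        if record["total_cost"] < min_cost:
            continue
        if workflow is not None and record["workflow"] != workflow:
            continue
        records.append(record)
    # ISO-8601 timestamps sort lexicographically, so this is newest-first
    records.sort(key=lambda r: r["timestamp"], reverse=True)
    return records
```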
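Using the #59 API is a one-liner after the run. A minimal sketch, assuming only that some async engine entry point returns a `WorkflowResult` (`engine` and `workflow` are stand-ins; the `annotate()` call itself is from the entry above):

```python
async def review(engine, workflow) -> None:
    result = await engine.run(workflow)  # stand-in for the real entry point
    # Emitted immediately as a `quality:answer` span when the result carries
    # a tracing context (the default with observability enabled); offline /
    # replay paths still record the annotation, the OTel emission no-ops.
    result.annotate(
        "answer",
        quality_score=4.5,
        source="human_feedback",
        reviewer="alice",  # extra kwargs land in the annotation's metadata
    )
```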
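Since the #125 schema ships with no compatibility shims, dashboard migrations need the old-to-new mapping applied in one pass. A sketch of that mapping as a plain dict (the names come from the entry above; holding them in a dict is purely illustrative):

```python
# AgentLoom pre-#125 metric name -> canonical OTel GenAI name.
METRIC_RENAMES: dict[str, str] = {
    # counter -> histogram (unit: {token}), split by gen_ai.token.type
    "agentloom_tokens_total": "gen_ai.client.token.usage",
    # histogram (s), tagged gen_ai.operation.name + gen_ai.provider.name
    "agentloom_provider_latency_seconds": "gen_ai.client.operation.duration",
    "agentloom_time_to_first_token_seconds": "gen_ai.client.operation.time_to_first_chunk",
}

# Provider names translate via to_genai_provider_name, e.g. "google" -> "gcp.gemini".
```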
16 changes: 16 additions & 0 deletions docs/examples.md
@@ -295,3 +295,19 @@ fixtures and CI.
# Depends on the recording captured from example 31
agentloom run examples/32_yaml_mock.yaml --lite
```

## Tool calling

### 35 — Native tool/function calling

ReAct-style agent: the model decides to invoke `http_request` against `httpbin.org/get`, receives the JSON, and emits a final natural-language answer. Sandbox is on with `allowed_domains: ["httpbin.org"]`, so the model-dispatched call goes through the same security policy as static `tool` steps (#105).

**Demonstrates:** `tools` declaration on `llm_call`, `tool_choice: auto`, `max_tool_iterations`, model-driven dispatch via `ToolRegistry`, sandboxed tool execution, replay support for tool-iteration loops.

```bash
# Mock-replay against the committed recording (no API calls)
agentloom run examples/35_tool_calling.yaml --lite

# Real call: pass --provider + --model to drive a live model
agentloom run examples/35_tool_calling.yaml --provider openai --model gpt-4o-mini
```
33 changes: 32 additions & 1 deletion docs/workflow-yaml.md
@@ -191,6 +191,37 @@ Per-provider translation:

Reasoning tokens are billed at the output rate. `TokenUsage.reasoning_tokens` and `billable_completion_tokens` track the spend; `calculate_cost()` includes them automatically. See [Reasoning models](providers.md#reasoning-models) for per-provider details, including the Ollama caveat that `eval_count` is not split.
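
A worked sketch of that accounting, assuming a made-up output rate (the attribute names mirror the paragraph above; `calculate_cost()` does this for you):

```python
# Reasoning tokens bill at the output rate, so the billable completion
# count is the visible completion tokens plus the hidden reasoning tokens.
completion_tokens, reasoning_tokens = 200, 800
billable_completion_tokens = completion_tokens + reasoning_tokens  # 1000

output_rate_per_1k_usd = 0.0006  # hypothetical rate, not a real price
cost_usd = billable_completion_tokens / 1000 * output_rate_per_1k_usd  # 0.0006
```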

**Tool calling:**

The model can pick tools at runtime. Declare them on the step; the engine dispatches via the workflow's `ToolRegistry`, feeds results back, and re-prompts until the model stops asking for tools.

```yaml
- id: ask
type: llm_call
prompt: "What is the user's account balance?"
tools:
- name: lookup_account
description: "Retrieve account info by ID."
parameters:
type: object
properties:
account_id: { type: string }
required: [account_id]
tool_choice: auto # auto | required | none | {name: lookup_account}
max_tool_iterations: 5 # bound the loop; default 5
output: answer
```

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `tools` | `list[ToolDefinition]` | `[]` | Tool declarations the model can pick. `parameters` is JSON Schema. Names resolve against the registered `ToolRegistry`; an unknown name is reported back as a tool failure rather than aborting the loop. |
| `tool_choice` | `string \| dict` | `"auto"` | `"auto"` lets the model decide; `"required"` forces a call; `"none"` disables tools for this turn; `{"name": "..."}` pins to a specific tool. Anthropic ignores `"none"` (omits the field); Ollama ignores `tool_choice` entirely (model-side support decides). |
| `max_tool_iterations` | `int` | `5` | Cap on call→result→re-prompt loops. When hit, `finish_reason` becomes `"max_tool_iterations"` so callers can detect runaway behavior. |

The dispatched tool runs through the existing sandbox (#105), so `http_request`, `shell_command`, `file_read`, `file_write` honor the workflow's `sandbox:` config. Multiple tool calls in one response are dispatched concurrently (anyio task group); results preserve order in the conversation. Cost and tokens accumulate across iterations on the surfaced `StepResult`.
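
A minimal sketch of that dispatch pattern — concurrent execution with order-preserving results and failures folded back as text (illustrative only; names like `registry.execute` are assumptions, not AgentLoom internals):

```python
import anyio

async def dispatch_tool_calls(calls: list, registry) -> list[str]:
    """Run all calls from one model turn concurrently; keep request order."""
    results: list[str] = [""] * len(calls)

    async def run_one(i: int, call) -> None:
        try:
            results[i] = await registry.execute(call.name, call.arguments)
        except Exception as exc:
            # Report failures back as text so the model can recover.
            results[i] = f"Tool {call.name} failed: {exc}"

    async with anyio.create_task_group() as tg:
        for i, call in enumerate(calls):
            tg.start_soon(run_one, i, call)
    return results  # slot i matches calls[i] regardless of finish order
```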

The legacy `tool` step (static DAG node, author chooses the tool) keeps working unchanged — `tools=` on `llm_call` is the new dynamic, model-driven path.

**Retry config:**

| Field | Type | Default | Description |
@@ -231,7 +262,7 @@ Evaluates conditions against state and activates a target step. Steps not activa

### `tool`

Executes a registered tool (shell command, HTTP request, etc.).
Executes a registered tool with author-chosen arguments — the workflow author decides which tool to call, not the model. For model-driven tool selection, use the `tools=` field on an `llm_call` step (see [tool calling](#llm_call) above).

```yaml
- id: fetch
51 changes: 51 additions & 0 deletions examples/35_tool_calling.yaml
@@ -0,0 +1,51 @@
name: tool-calling-agent
version: "1.0"
description: |
Demonstrates native tool/function calling. The model decides
at runtime to invoke ``http_request`` to fetch JSON from an API,
receives the result on the next turn, and emits a final answer.

Runs offline against the committed recording fixture. With
``--provider openai --model gpt-4o-mini`` it makes real calls and
the model picks the tool autonomously.

Producing a fresh recording:
agentloom run examples/35_tool_calling.yaml --provider openai \
--model gpt-4o-mini --record recordings/tool_calling.json

config:
provider: mock
model: gpt-4o-mini
responses_file: recordings/tool_calling.json
latency_model: constant
latency_ms: 0
sandbox:
enabled: true
allow_network: true
allowed_domains: ["httpbin.org"]
allowed_schemes: ["https"]

state:
question: "What HTTP method does the httpbin /get endpoint accept?"

steps:
- id: ask
type: llm_call
prompt: "{state.question}"
tools:
- name: http_request
description: "Make an HTTP request to a given URL and return the response body."
parameters:
type: object
properties:
url:
type: string
description: "Full URL to GET."
method:
type: string
enum: ["GET", "POST"]
description: "HTTP method."
required: [url]
tool_choice: auto
max_tool_iterations: 3
output: answer
27 changes: 27 additions & 0 deletions recordings/tool_calling.json
@@ -0,0 +1,27 @@
{
"ask": [
{
"content": "",
"provider": "openai",
"model": "gpt-4o-mini",
"tool_calls": [
{
"id": "call_demo_1",
"name": "http_request",
"arguments": {"url": "https://httpbin.org/get", "method": "GET"}
}
],
"usage": {"prompt_tokens": 35, "completion_tokens": 18, "total_tokens": 53},
"cost_usd": 0.000018,
"finish_reason": "tool_calls"
},
{
"content": "The /get endpoint of httpbin accepts the GET HTTP method, as confirmed by the successful response from https://httpbin.org/get.",
"provider": "openai",
"model": "gpt-4o-mini",
"usage": {"prompt_tokens": 320, "completion_tokens": 26, "total_tokens": 346},
"cost_usd": 0.000061,
"finish_reason": "stop"
}
]
}
19 changes: 19 additions & 0 deletions src/agentloom/core/models.py
@@ -108,6 +108,19 @@ class ThinkingConfig(BaseModel):
capture_reasoning: bool = True


class ToolDefinition(BaseModel):
"""LLM-callable tool declared on an ``llm_call`` step.

``parameters`` is a JSON Schema object; provider adapters translate it
to each API's native shape. ``name`` resolves against the workflow's
``tool_registry`` for dispatch.
"""

name: str
description: str = ""
parameters: dict[str, Any] = Field(default_factory=dict)


class StepDefinition(BaseModel):
"""Definition of a single workflow step."""

@@ -155,6 +168,12 @@ class StepDefinition(BaseModel):
# Reasoning / extended thinking
thinking: ThinkingConfig | None = None

# Tool calling — LLM picks tools at runtime.
# ``tool_choice``: ``"auto"`` | ``"required"`` | ``"none"`` | ``{"name": "..."}``.
tools: list[ToolDefinition] = Field(default_factory=list)
tool_choice: Any = "auto"
max_tool_iterations: int = 5


class SandboxConfig(BaseModel):
"""Sandbox configuration for built-in tools.
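
For reference, constructing the model above directly mirrors the YAML shape documented in `docs/workflow-yaml.md`; a quick sketch (the import path follows this diff's file location):

```python
from agentloom.core.models import ToolDefinition

lookup_account = ToolDefinition(
    name="lookup_account",  # must resolve in the workflow's ToolRegistry
    description="Retrieve account info by ID.",
    parameters={  # plain JSON Schema; adapters translate it per provider
        "type": "object",
        "properties": {"account_id": {"type": "string"}},
        "required": ["account_id"],
    },
)
```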
43 changes: 43 additions & 0 deletions src/agentloom/observability/metrics.py
@@ -64,6 +64,8 @@ def __init__(
self._attachment_counter: Any = None
self._stream_counter: Any = None
self._ttft_histogram: Any = None
self._tool_call_counter: Any = None
self._tool_call_histogram: Any = None
self._mock_replay_counter: Any = None
self._recording_capture_counter: Any = None
self._recording_latency_histogram: Any = None
@@ -152,6 +154,18 @@ def _setup_otel(self, endpoint: str) -> None:
"agentloom_stream_responses_total",
description="Total streamed LLM responses",
)
# Tool calls dispatched by the model (#116). Tagged by tool name +
# status so dashboards can split successes from failures and spot
# tools that consistently fail or hang.
self._tool_call_counter = meter.create_counter(
"agentloom_tool_calls_total",
description="Total model-dispatched tool calls",
)
self._tool_call_histogram = meter.create_histogram(
"agentloom_tool_call_duration_seconds",
description="Tool-call execution duration",
unit="s",
)
# Canonical OTel GenAI metric — replaces the AgentLoom-prefixed
# ``agentloom_time_to_first_token_seconds`` with the spec name.
self._time_to_first_chunk_histogram = meter.create_histogram(
@@ -270,6 +284,17 @@ def _setup_prom(
"Total streamed LLM responses",
["provider", "model"],
)
self._prom_counters["tool_calls"] = prom.Counter(
"agentloom_tool_calls_total",
"Total model-dispatched tool calls",
["tool_name", "status"],
)
self._prom_histograms["tool_call_duration"] = prom.Histogram(
"agentloom_tool_call_duration_seconds",
"Tool-call execution duration",
["tool_name"],
buckets=[0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10, 30, 60],
)
self._prom_histograms["time_to_first_chunk"] = prom.Histogram(
"gen_ai_client_operation_time_to_first_chunk_seconds",
"GenAI streaming time-to-first-chunk (per OTel GenAI conventions)",
@@ -465,6 +490,24 @@ def record_attachments(self, step_type: str, count: int) -> None:
else: # pragma: no cover — prom fallback
self._prom_counters["attachments"].labels(step_type=step_type).inc(count)

def record_tool_call(self, tool_name: str, success: bool, duration_s: float) -> None:
"""Record a model-dispatched tool call (#116).

Counter is tagged ``status=success|failure`` so dashboards can
plot a per-tool failure rate; histogram tracks execution latency.
"""
if not self._enabled:
return
status = "success" if success else "failure"
if self._backend == "otel":
self._tool_call_counter.add(1, {"tool_name": tool_name, "status": status})
self._tool_call_histogram.record(duration_s, {"tool_name": tool_name})
else: # pragma: no cover — prom fallback
self._prom_counters["tool_calls"].labels(tool_name=tool_name, status=status).inc()
self._prom_histograms["tool_call_duration"].labels(tool_name=tool_name).observe(
duration_s
)

def record_stream_response(self, provider: str, model: str) -> None:
if not self._enabled:
return
3 changes: 3 additions & 0 deletions src/agentloom/observability/noop.py
@@ -139,6 +139,9 @@ def on_provider_call_end(
def on_provider_error(self, provider: str, error_type: str, **kwargs: Any) -> None:
pass

def on_tool_call(self, **kwargs: Any) -> None:
pass

def on_stream_response(self, provider: str, model: str, ttft_s: float, **kwargs: Any) -> None:
pass

38 changes: 38 additions & 0 deletions src/agentloom/observability/observer.py
@@ -304,6 +304,44 @@ def on_provider_call_end(
else:
span.end()

def on_tool_call(
self,
*,
step_id: str,
call_id: str,
tool_name: str,
args_hash: str,
result_hash: str,
duration_ms: float,
success: bool,
**kwargs: Any,
) -> None:
"""Record a model-dispatched tool call (#116).

Emits a child span under the active step span carrying the canonical
``execute_tool {name}`` name plus tool attrs, and records the
per-tool counter + histogram. Args / result are hashed (not raw)
so PII never lands on the trace.
"""
if self._metrics:
self._metrics.record_tool_call(tool_name, success, duration_ms / 1000.0)
if self._tracing:
attrs: dict[str, Any] = {
SpanAttr.TOOL_CALL_ID: call_id,
SpanAttr.TOOL_NAME: tool_name,
SpanAttr.TOOL_ARGS_HASH: args_hash,
SpanAttr.TOOL_RESULT_HASH: result_hash,
SpanAttr.TOOL_DURATION_MS: duration_ms,
SpanAttr.TOOL_SUCCESS: success,
}
if self._run_id:
attrs[SpanAttr.WORKFLOW_RUN_ID] = self._run_id
span = self._tracing.start_span(
SpanName.GEN_AI_TOOL_CALL.format(tool_name=tool_name),
attributes=attrs,
)
self._tracing.end_span(span)

def on_provider_error(
self,
provider: str,
8 changes: 8 additions & 0 deletions src/agentloom/observability/schema.py
@@ -109,9 +109,11 @@ class SpanAttr:
PROVIDER_ATTEMPT_OUTCOME = "agentloom.provider.attempt_outcome"

# Tool calls
TOOL_CALL_ID = "tool.call_id"
TOOL_NAME = "tool.name"
TOOL_ARGS_HASH = "tool.args_hash"
TOOL_RESULT_HASH = "tool.result_hash"
TOOL_DURATION_MS = "tool.duration_ms"
TOOL_SUCCESS = "tool.success"

# Prompt metadata (AgentLoom-specific, no full-prompt capture by default)
@@ -181,6 +183,12 @@ class MetricName:
# Streaming response counter (no OTel equivalent — AgentLoom-specific).
STREAM_RESPONSES_TOTAL = "agentloom_stream_responses_total"

# Tool calls dispatched by the model (#116). Counter tagged by tool
# name + status (success / failure); histogram captures execution
# latency per tool.
TOOL_CALLS_TOTAL = "agentloom_tool_calls_total"
TOOL_CALL_DURATION_SECONDS = "agentloom_tool_call_duration_seconds"

# Resilience gauges
CIRCUIT_BREAKER_STATE = "agentloom_circuit_breaker_state"
BUDGET_REMAINING_USD = "agentloom_budget_remaining_usd"