1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -9,6 +9,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Added

- Native tool / function calling across providers (#116). New `tools`, `tool_choice`, and `max_tool_iterations` fields on `llm_call`; the LLM step dispatches via the existing `ToolRegistry`, feeds results back, and re-prompts until the model stops asking (capped by `max_tool_iterations`, default 5). Each adapter translates the unified `ToolDefinition` list to its native shape (OpenAI / Ollama / Anthropic / Google). Parallel tool calls dispatch concurrently and preserve order; failures are reported back as text so the model can recover. Sandbox (#105), budget (#108), and per-step retry (#106) apply unchanged. `MockProvider` recordings accept a list of turns per step so offline replay drives the loop end-to-end; `examples/35_tool_calling.yaml` ships a ReAct-style example. The OpenAI-shaped parser also handles Ollama-compat responses (no `type` field, `arguments` as a decoded dict) — without this, Ollama tool calling silently dropped every call; see the parsing sketch after this list.
- Per-run experiment metadata logging (#77). Every workflow execution now writes a self-contained JSON record (`run_id`, ISO timestamp, AgentLoom version, Python version, workflow `sha256` hash, list of `provider/model` pairs used, status, total cost, total tokens, step count, duration) to `./agentloom_runs/<run_id>.json`. Override the directory via the `runs_dir` constructor argument on `RunHistoryWriter` or the `AGENTLOOM_RUNS_DIR` env var. Disk I/O happens in a worker thread so the write does not block the event loop. Records carry a `_schema_version: 1` field; failures during the write are logged and swallowed so a broken history directory cannot prevent the engine from returning the result. New `agentloom history` CLI subcommand lists records most-recent-first and accepts `--workflow`, `--provider`, `--since YYYY-MM-DD`, `--until YYYY-MM-DD`, `--min-cost`, `--max-cost`, `--limit`, and `--json` filters — covering the full filter surface (date, workflow, cost, provider) called for in the original issue.
- Quality annotations attachable to `WorkflowResult` (#59). New `WorkflowResult.annotate(target, quality_score=..., source=..., **metadata)` method appends a typed `QualityAnnotation` (`target`, `quality_score`, `source`, `metadata`) to the result so evaluators, human reviewers, or downstream scoring code can record output quality after the run completes. **The annotation is auto-emitted as an OTel span** the moment `annotate()` runs, provided the engine returned the result with a tracing context attached (the default for any workflow run with observability enabled) — `result.annotate("answer", quality_score=4.5, source="human_feedback")` becomes immediately visible in Jaeger with no additional plumbing on the caller side. Each annotation is published as a standalone `quality:<target>` span (the workflow span has already closed, so retroactive attribute attachment is not viable). Quality spans carry `workflow.run_id` and `workflow.name` plus `agentloom.quality.score`, `agentloom.quality.source`, `agentloom.quality.target`, and free-form `agentloom.quality.metadata.*` attributes — Jaeger / Tempo can group quality spans with the original trace by run_id, and dashboards can filter for `agentloom.quality.score < threshold` to surface regressions. Offline / replay paths that build a `WorkflowResult` without a live tracer keep working — `annotate()` still records the data on the result, the OTel emission just no-ops. The `agentloom.observability.quality.emit_quality_annotation` / `emit_quality_annotations` helpers remain available for callers that build annotations outside the engine flow (e.g. batch evaluators reading historical results from disk). A usage sketch follows this list.
- OTel span and metric schema centralization with GenAI semantic conventions (#125). The schema is a clean break — no compatibility shims for pre-#125 attribute or metric names. New `agentloom.observability.schema` module is the single source of truth for span / attribute / metric names; downstream consumers (Grafana, AgentTest, Jaeger plugins) parse a stable contract instead of grepping for ad-hoc strings. **Metrics renamed and retyped** to match the OTel GenAI registry: `agentloom_tokens_total` (counter) → `gen_ai.client.token.usage` (histogram, `{token}` unit) with `gen_ai.token.type` attribute (`input` / `output` / `reasoning`); `agentloom_provider_latency_seconds` (histogram) → `gen_ai.client.operation.duration` (histogram, `s`) with `gen_ai.operation.name` + `gen_ai.provider.name` attributes; `agentloom_time_to_first_token_seconds` → `gen_ai.client.operation.time_to_first_chunk`. AgentLoom-specific metrics (`agentloom_workflow_*`, `agentloom_step_*`, `agentloom_provider_calls_total`, `agentloom_cost_usd_total`, `agentloom_circuit_breaker_state`, `agentloom_budget_remaining_usd`, HITL / mock / recording counters) keep their `agentloom_` prefix — they have no OTel equivalent. The bundled Grafana dashboard is updated to query the new metric / label names. The legacy `Observer.on_provider_call` hook (which duplicated the metric emission already done by `on_provider_call_end`) is removed; the engine no longer fires it. The `tokens: int` positional argument on `on_step_end` is removed — callers now pass `prompt_tokens` / `completion_tokens` as kwargs. Span attributes follow the **canonical OTel GenAI registry** as of the May 2026 spec — `gen_ai.provider.name` (the deprecated `gen_ai.system` is **not** emitted), `gen_ai.operation.name`, `gen_ai.request.model`, `gen_ai.request.temperature`, `gen_ai.request.max_tokens`, `gen_ai.request.stream`, `gen_ai.response.model`, `gen_ai.response.finish_reasons` (array of strings, per spec), `gen_ai.response.time_to_first_chunk`, `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, `gen_ai.usage.reasoning.output_tokens`. Errored inference spans also emit the OTel general-conventions attribute `error.type` alongside the AgentLoom-specific `step.error` so OTel-aware consumers (Jaeger error filters, Tempo) light up. Inference spans use the canonical name template `"{operation_name} {model}"` (e.g. `"chat gpt-4o-mini"`); workflow / step orchestration spans keep the AgentLoom-specific `workflow:*` / `step:*` names. AgentLoom-specific fields stay under `workflow.*` / `step.*` / `tool.*` / `agentloom.*` namespaces. Provider names are translated from AgentLoom internal names to OTel registry values via `to_genai_provider_name` (e.g. `google` → `gcp.gemini`). Notable additions:
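
To make the #116 Ollama-compat note above concrete, here is a minimal parsing sketch. Only the two quirks it tolerates (missing `type`, pre-decoded `arguments`) come from the entry; the surrounding field layout and the helper name are illustrative assumptions, not AgentLoom's actual parser.

```python
import json
from typing import Any

def parse_tool_calls(message: dict[str, Any]) -> list[dict[str, Any]]:
    """Normalize OpenAI-shaped tool calls, tolerating Ollama-compat output."""
    calls = []
    for raw in message.get("tool_calls", []):
        # OpenAI sends type == "function"; Ollama's compat endpoint omits
        # the field entirely, so treat a missing type as "function".
        if raw.get("type", "function") != "function":
            continue
        fn = raw.get("function", {})
        args = fn.get("arguments", {})
        # OpenAI encodes arguments as a JSON string; Ollama returns an
        # already-decoded dict. Accept both.
        if isinstance(args, str):
            args = json.loads(args) if args else {}
        calls.append({"id": raw.get("id", ""), "name": fn.get("name", ""), "arguments": args})
    return calls
```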
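For the #77 run-history entry, a sketch of reading the records back in plain Python. The directory, the `_schema_version` field, and the newest-first ordering come from the entry above; the exact JSON key names (`timestamp`, `total_cost`, `workflow`) are assumptions, so check a real record before relying on them.

```python
import json
from pathlib import Path

RUNS_DIR = Path("agentloom_runs")  # default; AGENTLOOM_RUNS_DIR overrides it

def load_runs(min_cost: float = 0.0, workflow: str | None = None) -> list[dict]:
    """Replay a subset of the `agentloom history` filters offline."""
    records = []
    for path in RUNS_DIR.glob("*.json"):
        record = json.loads(path.read_text())
        if record.get("_schema_version") != 1:
            continue  # unknown schema version: skip rather than misread
        if record["total_cost"] < min_cost:
            continue
        if workflow is not None and record["workflow"] != workflow:
            continue
        records.append(record)
    # ISO-8601 timestamps sort lexicographically, so this is newest-first
    records.sort(key=lambda r: r["timestamp"], reverse=True)
    return records
```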
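Using the #59 API is a one-liner after the run. A minimal sketch, assuming only that some async engine entry point returns a `WorkflowResult` (`engine` and `workflow` are stand-ins; the `annotate()` call itself is from the entry above):

```python
async def review(engine, workflow) -> None:
    result = await engine.run(workflow)  # stand-in for the real entry point
    # Emitted immediately as a `quality:answer` span when the result carries
    # a tracing context (the default with observability enabled); offline /
    # replay paths still record the annotation, the OTel emission no-ops.
    result.annotate(
        "answer",
        quality_score=4.5,
        source="human_feedback",
        reviewer="alice",  # extra kwargs land in the annotation's metadata
    )
```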
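Since the #125 schema ships with no compatibility shims, dashboard migrations need the old-to-new mapping applied in one pass. A sketch of that mapping as a plain dict (the names come from the entry above; holding them in a dict is purely illustrative):

```python
# AgentLoom pre-#125 metric name -> canonical OTel GenAI name.
METRIC_RENAMES: dict[str, str] = {
    # counter -> histogram (unit: {token}), split by gen_ai.token.type
    "agentloom_tokens_total": "gen_ai.client.token.usage",
    # histogram (s), tagged gen_ai.operation.name + gen_ai.provider.name
    "agentloom_provider_latency_seconds": "gen_ai.client.operation.duration",
    "agentloom_time_to_first_token_seconds": "gen_ai.client.operation.time_to_first_chunk",
}

# Provider names translate via to_genai_provider_name, e.g. "google" -> "gcp.gemini".
```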
16 changes: 16 additions & 0 deletions docs/examples.md
@@ -295,3 +295,19 @@ fixtures and CI.
# Depends on the recording captured from example 31
agentloom run examples/32_yaml_mock.yaml --lite
```

## Tool calling

### 35 — Native tool/function calling

ReAct-style agent: the model decides to invoke `http_request` against `httpbin.org/get`, receives the JSON, and emits a final natural-language answer. Sandbox is on with `allowed_domains: ["httpbin.org"]`, so the model-dispatched call goes through the same security policy as static `tool` steps (#105).

**Demonstrates:** `tools` declaration on `llm_call`, `tool_choice: auto`, `max_tool_iterations`, model-driven dispatch via `ToolRegistry`, sandboxed tool execution, replay support for tool-iteration loops.

```bash
# Mock-replay against the committed recording (no API calls)
agentloom run examples/35_tool_calling.yaml --lite

# Real call: pass --provider + --model to drive a live model
agentloom run examples/35_tool_calling.yaml --provider openai --model gpt-4o-mini
```
33 changes: 32 additions & 1 deletion docs/workflow-yaml.md
@@ -191,6 +191,37 @@ Per-provider translation:

Reasoning tokens are billed at the output rate. `TokenUsage.reasoning_tokens` and `billable_completion_tokens` track the spend; `calculate_cost()` includes them automatically. See [Reasoning models](providers.md#reasoning-models) for per-provider details, including the Ollama caveat that `eval_count` is not split.
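
A worked sketch of that accounting, assuming a made-up output rate (the attribute names mirror the paragraph above; `calculate_cost()` does this for you):

```python
# Reasoning tokens bill at the output rate, so the billable completion
# count is the visible completion tokens plus the hidden reasoning tokens.
completion_tokens, reasoning_tokens = 200, 800
billable_completion_tokens = completion_tokens + reasoning_tokens  # 1000

output_rate_per_1k_usd = 0.0006  # hypothetical rate, not a real price
cost_usd = billable_completion_tokens / 1000 * output_rate_per_1k_usd  # 0.0006
```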

**Tool calling:**

The model can pick tools at runtime. Declare them on the step; the engine dispatches via the workflow's `ToolRegistry`, feeds results back, and re-prompts until the model stops asking for tools.

```yaml
- id: ask
type: llm_call
prompt: "What is the user's account balance?"
tools:
- name: lookup_account
description: "Retrieve account info by ID."
parameters:
type: object
properties:
account_id: { type: string }
required: [account_id]
tool_choice: auto # auto | required | none | {name: lookup_account}
max_tool_iterations: 5 # bound the loop; default 5
output: answer
```

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `tools` | `list[ToolDefinition]` | `[]` | Tool declarations the model can pick. `parameters` is JSON Schema. Names resolve against the registered `ToolRegistry`; an unknown name is reported back as a tool failure rather than aborting the loop. |
| `tool_choice` | `string \| dict` | `"auto"` | `"auto"` lets the model decide; `"required"` forces a call; `"none"` disables tools for this turn; `{"name": "..."}` pins to a specific tool. Anthropic ignores `"none"` (omits the field); Ollama ignores `tool_choice` entirely (model-side support decides). |
| `max_tool_iterations` | `int` | `5` | Cap on call→result→re-prompt loops. When hit, `finish_reason` becomes `"max_tool_iterations"` so callers can detect runaway behavior. |

The dispatched tool runs through the existing sandbox (#105), so `http_request`, `shell_command`, `file_read`, `file_write` honor the workflow's `sandbox:` config. Multiple tool calls in one response are dispatched concurrently (anyio task group); results preserve order in the conversation. Cost and tokens accumulate across iterations on the surfaced `StepResult`.
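
A minimal sketch of that dispatch pattern — concurrent execution with order-preserving results and failures folded back as text (illustrative only; names like `registry.execute` are assumptions, not AgentLoom internals):

```python
import anyio

async def dispatch_tool_calls(calls: list, registry) -> list[str]:
    """Run all calls from one model turn concurrently; keep request order."""
    results: list[str] = [""] * len(calls)

    async def run_one(i: int, call) -> None:
        try:
            results[i] = await registry.execute(call.name, call.arguments)
        except Exception as exc:
            # Report failures back as text so the model can recover.
            results[i] = f"Tool {call.name} failed: {exc}"

    async with anyio.create_task_group() as tg:
        for i, call in enumerate(calls):
            tg.start_soon(run_one, i, call)
    return results  # slot i matches calls[i] regardless of finish order
```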

The legacy `tool` step (static DAG node, author chooses the tool) keeps working unchanged — `tools=` on `llm_call` is the new dynamic, model-driven path.

**Retry config:**

| Field | Type | Default | Description |
@@ -231,7 +262,7 @@ Evaluates conditions against state and activates a target step. Steps not activa

### `tool`

Executes a registered tool (shell command, HTTP request, etc.).
Executes a registered tool with author-chosen arguments — the workflow author decides which tool to call, not the model. For model-driven tool selection, use the `tools=` field on an `llm_call` step (see [tool calling](#llm_call) above).

```yaml
- id: fetch
51 changes: 51 additions & 0 deletions examples/35_tool_calling.yaml
@@ -0,0 +1,51 @@
name: tool-calling-agent
version: "1.0"
description: |
Demonstrates native tool/function calling. The model decides
at runtime to invoke ``http_request`` to fetch JSON from an API,
receives the result on the next turn, and emits a final answer.

Runs offline against the committed recording fixture. With
``--provider openai --model gpt-4o-mini`` it makes real calls and
the model picks the tool autonomously.

Producing a fresh recording:
agentloom run examples/35_tool_calling.yaml --provider openai \
--model gpt-4o-mini --record recordings/tool_calling.json

config:
provider: mock
model: gpt-4o-mini
responses_file: recordings/tool_calling.json
latency_model: constant
latency_ms: 0
sandbox:
enabled: true
allow_network: true
allowed_domains: ["httpbin.org"]
allowed_schemes: ["https"]

state:
question: "What HTTP method does the httpbin /get endpoint accept?"

steps:
- id: ask
type: llm_call
prompt: "{state.question}"
tools:
- name: http_request
description: "Make an HTTP request to a given URL and return the response body."
parameters:
type: object
properties:
url:
type: string
description: "Full URL to GET."
method:
type: string
enum: ["GET", "POST"]
description: "HTTP method."
required: [url]
tool_choice: auto
max_tool_iterations: 3
output: answer
27 changes: 27 additions & 0 deletions recordings/tool_calling.json
@@ -0,0 +1,27 @@
{
"ask": [
{
"content": "",
"provider": "openai",
"model": "gpt-4o-mini",
"tool_calls": [
{
"id": "call_demo_1",
"name": "http_request",
"arguments": {"url": "https://httpbin.org/get", "method": "GET"}
}
],
"usage": {"prompt_tokens": 35, "completion_tokens": 18, "total_tokens": 53},
"cost_usd": 0.000018,
"finish_reason": "tool_calls"
},
{
"content": "The /get endpoint of httpbin accepts the GET HTTP method, as confirmed by the successful response from https://httpbin.org/get.",
"provider": "openai",
"model": "gpt-4o-mini",
"usage": {"prompt_tokens": 320, "completion_tokens": 26, "total_tokens": 346},
"cost_usd": 0.000061,
"finish_reason": "stop"
}
]
}
19 changes: 19 additions & 0 deletions src/agentloom/core/models.py
@@ -108,6 +108,19 @@ class ThinkingConfig(BaseModel):
capture_reasoning: bool = True


class ToolDefinition(BaseModel):
"""LLM-callable tool declared on an ``llm_call`` step.

``parameters`` is a JSON Schema object; provider adapters translate it
to each API's native shape. ``name`` resolves against the workflow's
``tool_registry`` for dispatch.
"""

name: str
description: str = ""
parameters: dict[str, Any] = Field(default_factory=dict)


class StepDefinition(BaseModel):
"""Definition of a single workflow step."""

@@ -155,6 +168,12 @@ class StepDefinition(BaseModel):
# Reasoning / extended thinking
thinking: ThinkingConfig | None = None

# Tool calling — LLM picks tools at runtime.
# ``tool_choice``: ``"auto"`` | ``"required"`` | ``"none"`` | ``{"name": "..."}``.
tools: list[ToolDefinition] = Field(default_factory=list)
tool_choice: Any = "auto"
max_tool_iterations: int = 5


class SandboxConfig(BaseModel):
"""Sandbox configuration for built-in tools.
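
For reference, constructing the model above directly mirrors the YAML shape documented in `docs/workflow-yaml.md`; a quick sketch (the import path follows this diff's file location):

```python
from agentloom.core.models import ToolDefinition

lookup_account = ToolDefinition(
    name="lookup_account",  # must resolve in the workflow's ToolRegistry
    description="Retrieve account info by ID.",
    parameters={  # plain JSON Schema; adapters translate it per provider
        "type": "object",
        "properties": {"account_id": {"type": "string"}},
        "required": ["account_id"],
    },
)
```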
43 changes: 43 additions & 0 deletions src/agentloom/observability/metrics.py
@@ -64,6 +64,8 @@ def __init__(
self._attachment_counter: Any = None
self._stream_counter: Any = None
self._ttft_histogram: Any = None
self._tool_call_counter: Any = None
self._tool_call_histogram: Any = None
self._mock_replay_counter: Any = None
self._recording_capture_counter: Any = None
self._recording_latency_histogram: Any = None
@@ -152,6 +154,18 @@ def _setup_otel(self, endpoint: str) -> None:
"agentloom_stream_responses_total",
description="Total streamed LLM responses",
)
# Tool calls dispatched by the model (#116). Tagged by tool name +
# status so dashboards can split successes from failures and spot
# tools that consistently fail or hang.
self._tool_call_counter = meter.create_counter(
"agentloom_tool_calls_total",
description="Total model-dispatched tool calls",
)
self._tool_call_histogram = meter.create_histogram(
"agentloom_tool_call_duration_seconds",
description="Tool-call execution duration",
unit="s",
)
# Canonical OTel GenAI metric — replaces the AgentLoom-prefixed
# ``agentloom_time_to_first_token_seconds`` with the spec name.
self._time_to_first_chunk_histogram = meter.create_histogram(
@@ -270,6 +284,17 @@ def _setup_prom(
"Total streamed LLM responses",
["provider", "model"],
)
self._prom_counters["tool_calls"] = prom.Counter(
"agentloom_tool_calls_total",
"Total model-dispatched tool calls",
["tool_name", "status"],
)
self._prom_histograms["tool_call_duration"] = prom.Histogram(
"agentloom_tool_call_duration_seconds",
"Tool-call execution duration",
["tool_name"],
buckets=[0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10, 30, 60],
)
self._prom_histograms["time_to_first_chunk"] = prom.Histogram(
"gen_ai_client_operation_time_to_first_chunk_seconds",
"GenAI streaming time-to-first-chunk (per OTel GenAI conventions)",
@@ -465,6 +490,24 @@ def record_attachments(self, step_type: str, count: int) -> None:
else: # pragma: no cover — prom fallback
self._prom_counters["attachments"].labels(step_type=step_type).inc(count)

def record_tool_call(self, tool_name: str, success: bool, duration_s: float) -> None:
"""Record a model-dispatched tool call (#116).

Counter is tagged ``status=success|failure`` so dashboards can
plot a per-tool failure rate; histogram tracks execution latency.
"""
if not self._enabled:
return
status = "success" if success else "failure"
if self._backend == "otel":
self._tool_call_counter.add(1, {"tool_name": tool_name, "status": status})
self._tool_call_histogram.record(duration_s, {"tool_name": tool_name})
else: # pragma: no cover — prom fallback
self._prom_counters["tool_calls"].labels(tool_name=tool_name, status=status).inc()
self._prom_histograms["tool_call_duration"].labels(tool_name=tool_name).observe(
duration_s
)

def record_stream_response(self, provider: str, model: str) -> None:
if not self._enabled:
return
3 changes: 3 additions & 0 deletions src/agentloom/observability/noop.py
@@ -139,6 +139,9 @@ def on_provider_call_end(
def on_provider_error(self, provider: str, error_type: str, **kwargs: Any) -> None:
pass

def on_tool_call(self, **kwargs: Any) -> None:
pass

def on_stream_response(self, provider: str, model: str, ttft_s: float, **kwargs: Any) -> None:
pass

38 changes: 38 additions & 0 deletions src/agentloom/observability/observer.py
@@ -304,6 +304,44 @@ def on_provider_call_end(
else:
span.end()

def on_tool_call(
self,
*,
step_id: str,
call_id: str,
tool_name: str,
args_hash: str,
result_hash: str,
duration_ms: float,
success: bool,
**kwargs: Any,
) -> None:
"""Record a model-dispatched tool call (#116).

Emits a child span under the active step span carrying the canonical
``execute_tool {name}`` name plus tool attrs, and records the
per-tool counter + histogram. Args / result are hashed (not raw)
so PII never lands on the trace.
"""
if self._metrics:
self._metrics.record_tool_call(tool_name, success, duration_ms / 1000.0)
if self._tracing:
attrs: dict[str, Any] = {
SpanAttr.TOOL_CALL_ID: call_id,
SpanAttr.TOOL_NAME: tool_name,
SpanAttr.TOOL_ARGS_HASH: args_hash,
SpanAttr.TOOL_RESULT_HASH: result_hash,
SpanAttr.TOOL_DURATION_MS: duration_ms,
SpanAttr.TOOL_SUCCESS: success,
}
if self._run_id:
attrs[SpanAttr.WORKFLOW_RUN_ID] = self._run_id
span = self._tracing.start_span(
SpanName.GEN_AI_TOOL_CALL.format(tool_name=tool_name),
attributes=attrs,
)
self._tracing.end_span(span)

def on_provider_error(
self,
provider: str,
8 changes: 8 additions & 0 deletions src/agentloom/observability/schema.py
@@ -109,9 +109,11 @@ class SpanAttr:
PROVIDER_ATTEMPT_OUTCOME = "agentloom.provider.attempt_outcome"

# Tool calls
TOOL_CALL_ID = "tool.call_id"
TOOL_NAME = "tool.name"
TOOL_ARGS_HASH = "tool.args_hash"
TOOL_RESULT_HASH = "tool.result_hash"
TOOL_DURATION_MS = "tool.duration_ms"
TOOL_SUCCESS = "tool.success"

# Prompt metadata (AgentLoom-specific, no full-prompt capture by default)
@@ -181,6 +183,12 @@ class MetricName:
# Streaming response counter (no OTel equivalent — AgentLoom-specific).
STREAM_RESPONSES_TOTAL = "agentloom_stream_responses_total"

# Tool calls dispatched by the model (#116). Counter tagged by tool
# name + status (success / failure); histogram captures execution
# latency per tool.
TOOL_CALLS_TOTAL = "agentloom_tool_calls_total"
TOOL_CALL_DURATION_SECONDS = "agentloom_tool_call_duration_seconds"

# Resilience gauges
CIRCUIT_BREAKER_STATE = "agentloom_circuit_breaker_state"
BUDGET_REMAINING_USD = "agentloom_budget_remaining_usd"