
feat(providers): add native tool/function calling#145

Open
cchinchilla-dev wants to merge 4 commits into main from feat/tool-calling-116

Conversation

@cchinchilla-dev
Owner

What

Adds native tool / function calling across all four LLM providers. The model decides at runtime which tool to invoke; the engine dispatches via the existing ToolRegistry, feeds results back, and re-prompts until the model stops asking for tools (capped by max_tool_iterations, default 5).

  • Unified surface — StepDefinition.tools, tool_choice, max_tool_iterations on llm_call. New ToolDefinition model; new ProviderResponse.tool_calls: list[ToolCall]. LLMCallStep._run_tool_loop accumulates tokens + cost and surfaces finish_reason="max_tool_iterations" when the cap fires.
  • Per-provider wire translation — each adapter maps the unified spec to its native shape: OpenAI / Ollama use tools=[{type:"function", function:...}], Anthropic tools=[{name, input_schema}], Google groups under function_declarations. Tool-loop messages ({"role":"assistant","tool_calls":[...]}, {"role":"tool","tool_call_id":...}, Gemini parts shapes) round-trip through _format_messages verbatim — without this passthrough, three of four providers silently dropped iteration 2+ messages.
  • Parallel dispatch — multiple tool calls in one response execute concurrently via an anyio task group; failures are reported back as text so the model can recover on the next turn.
  • Streaming events — typed StreamEvent hierarchy (TextDelta, ToolCallDelta, ToolCallComplete, StreamDone) plus sr.events() on StreamResponse. Backwards-compat: async for chunk in sr keeps yielding plain text strings.
  • Observability — per-call execute_tool {name} child span carrying tool.{call_id,name,args_hash,result_hash,duration_ms,success} and workflow.run_id; new agentloom_tool_calls_total{tool_name,status} counter + agentloom_tool_call_duration_seconds{tool_name} histogram. Args / result are SHA-256 hashed (truncated to 16 hex) so PII never lands on the trace.
  • Sandbox / budget / retry — dispatched calls go through tool_registry.get(name).execute(args), honoring #105 sandbox enforcement; loop respects budget (#108) and per-step retry policy (#106).
  • MockProvider replay — recordings carry a list of turns per step, each with its own tool_calls block hydrated as ToolCall objects so offline replay drives the loop end-to-end.
  • Bug fix found via real-Ollama smoke — parse_tool_calls_from_openai now handles two wire variants: OpenAI canonical (type:"function" + arguments as JSON string) and Ollama-compat (no type field, arguments as decoded dict). Without this, Ollama tool calling silently dropped every call.
  • Google tool_choice={"name": "..."} — translates to ANY mode + allowedFunctionNames rather than silently falling through to AUTO.
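The loop and parallel dispatch described in the bullets above can be sketched as follows. This is a minimal illustration, not the PR's code: run_tool_loop and dispatch are hypothetical stand-ins for LLMCallStep._run_tool_loop and the ToolRegistry, provider responses are simplified to dicts, and concurrency uses asyncio.gather where the PR uses an anyio task group. As in the PR, tool failures are reported back as text so the model can recover, and the cap surfaces finish_reason="max_tool_iterations".

```python
import asyncio
import json

MAX_TOOL_ITERATIONS = 5  # the PR's default cap


async def dispatch(registry, call):
    """Run one tool call. Arguments may arrive as a JSON string (OpenAI
    canonical) or an already-decoded dict (Ollama-compat)."""
    args = call["arguments"]
    if isinstance(args, str):
        args = json.loads(args)
    return registry[call["name"]](**args)


async def run_tool_loop(provider, registry, messages, max_iters=MAX_TOOL_ITERATIONS):
    """Re-prompt until the model stops requesting tools, or the cap fires."""
    for _ in range(max_iters):
        resp = await provider.chat(messages)
        if not resp.get("tool_calls"):
            return resp  # final answer
        # Echo the assistant turn, then dispatch all requested calls concurrently.
        messages.append({"role": "assistant", "tool_calls": resp["tool_calls"]})
        results = await asyncio.gather(
            *(dispatch(registry, c) for c in resp["tool_calls"]),
            return_exceptions=True,
        )
        for call, result in zip(resp["tool_calls"], results):
            # Failures come back as text so the model can recover next turn.
            text = f"error: {result}" if isinstance(result, Exception) else str(result)
            messages.append(
                {"role": "tool", "tool_call_id": call["id"], "content": text}
            )
    return {"content": None, "finish_reason": "max_tool_iterations"}
```

In the real step, each iteration's tokens and cost are accumulated, and the messages synthesized here must pass through each provider's _format_messages untouched (the passthrough fix called out above).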
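The per-provider wire translation and the Google tool_choice mapping might look like the following sketch. Helper names are hypothetical; the output shapes follow the bullets above (OpenAI/Ollama function wrapper, Anthropic input_schema, Gemini function_declarations, and ANY + allowedFunctionNames for a named tool_choice).

```python
def to_openai(tool: dict) -> dict:
    """OpenAI / Ollama wire shape: a function wrapper around the spec."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool["description"],
            "parameters": tool["parameters"],
        },
    }


def to_anthropic(tool: dict) -> dict:
    """Anthropic wire shape: the JSON schema rides under input_schema."""
    return {
        "name": tool["name"],
        "description": tool["description"],
        "input_schema": tool["parameters"],
    }


def to_google(tools: list[dict]) -> list[dict]:
    """Gemini groups all declarations under one function_declarations entry."""
    return [{
        "function_declarations": [
            {"name": t["name"], "description": t["description"], "parameters": t["parameters"]}
            for t in tools
        ]
    }]


def google_tool_config(tool_choice) -> dict:
    """Map tool_choice={'name': ...} to ANY + allowed names instead of
    silently falling through to AUTO."""
    if isinstance(tool_choice, dict) and "name" in tool_choice:
        return {"function_calling_config": {
            "mode": "ANY",
            "allowed_function_names": [tool_choice["name"]],
        }}
    return {"function_calling_config": {"mode": "AUTO"}}
```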
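The PII-safe hashing in the observability bullet (SHA-256 over args/result, truncated to 16 hex characters) can be sketched like this; the function name and the canonical-JSON serialization are illustrative assumptions, not the PR's exact code.

```python
import hashlib
import json


def privacy_hash(obj) -> str:
    """Hash args/results before attaching them to a span, so raw values
    (possible PII) never land on the trace. Keys are sorted so equal
    payloads hash equally regardless of dict ordering."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]
```

A span would then carry tool.args_hash=privacy_hash(args) rather than the arguments themselves.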

Why

Closes the gap that the engine couldn't surface tool decisions to the LLM — ToolStep was a static DAG node, never a model-driven dispatch. Without this, ReAct loops, deep-research agents, function-calling assistants, and any benchmark that compares tool selection across models were inexpressible. Foundation for #119 (conversation history) and #120 (Agent primitive).

Closes #116

Testing

  • uv run pytest — 1205 passed
  • uv run ruff check src/ tests/ — clean
  • uv run mypy src/ — clean
  • CLI smoke (mock replay): tool loop drives 2 iterations against the committed recording, final answer reaches state
  • Real-provider smoke (Ollama llama3.1:8b): full roundtrip, tool dispatched with correct args (17+25=42), 296 tokens across 2 iterations
  • Docker → Jaeger smoke (Ollama, real OTel collector): 4 spans visible — workflow → step:solve → chat × 2 iterations, all sharing workflow.run_id, canonical OTel gen_ai.* attrs per turn
  • Kubernetes smoke (kind cluster, host Ollama via host.docker.internal): tool dispatched (7+11=18), 346 tokens
  • Regression tests for the 3 critical provider-format passthrough bugs (OpenAI/Ollama/Google iteration 2+)
  • External auditor pass: spec coverage CRITICAL → addressed; observability + streaming events + Google tool_choice dict → addressed
  • Coverage on touched files: observability/{observer,noop,schema}.py 100%, metrics.py 99%, providers 90-95%

Notes

ToolStep (the static DAG node) keeps working unchanged for workflows that want explicit author-driven tool execution. The new tools= field on llm_call is the dynamic, model-driven path; nothing existing breaks.

Streaming tool-call events: the StreamEvent API surface is implemented with a default events() wrapper that emits TextDelta per chunk + StreamDone at the end. Per-provider native streaming of ToolCallDelta / ToolCallComplete deltas is a follow-up — adapters can register a typed event iterator via _set_event_iterator once SSE parsers are extended.
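The default events() wrapper described above might be sketched as follows. This is a minimal illustration: the dataclasses stand in for the PR's typed StreamEvent hierarchy, and default_events is a hypothetical name for the fallback that wraps each text chunk in a TextDelta and closes with StreamDone.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class TextDelta:
    """One incremental chunk of model text."""
    text: str


@dataclass
class StreamDone:
    """Terminal event: the stream has finished."""


async def default_events(chunks):
    """Fallback typed-event stream: wrap each plain-text chunk, then signal
    completion. Providers with native SSE tool-call parsing would instead
    register their own iterator that also yields tool-call events."""
    async for chunk in chunks:
        yield TextDelta(text=chunk)
    yield StreamDone()
```

This keeps `async for chunk in sr` (plain strings) and `sr.events()` (typed events) consistent with each other until per-provider SSE parsers land.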

Anthropic rolls thinking tokens into output_tokens (no separate field on the wire), so workflows combining thinking + tools on Claude show reasoning_tokens=0 even when the model used extended thinking — documented limitation, cost is still correct.

@cchinchilla-dev cchinchilla-dev added enhancement New feature or request providers Provider gateway and adapters core Core engine, DAG, state labels May 10, 2026
Copilot AI review requested due to automatic review settings May 10, 2026 12:30
@github-actions github-actions Bot added documentation Documentation improvements observability Tracing, metrics, logging labels May 10, 2026
codecov Bot commented May 10, 2026
Copilot AI left a comment


Pull request overview

Adds first-class, model-driven tool/function calling to AgentLoom’s llm_call step, with a unified tool schema and per-provider translations/parsing (OpenAI, Ollama, Anthropic, Google). This enables ReAct-style loops where the model selects tools at runtime, the engine dispatches via ToolRegistry, and the step re-prompts until completion or an iteration cap.

Changes:

  • Introduces unified tool-calling models (ToolDefinition, ToolCall) and plumbs tools, tool_choice, and max_tool_iterations through LLMCallStep and provider adapters.
  • Adds tool-loop message synthesis + parallel tool dispatch helpers, plus typed streaming event primitives (StreamEvent + subclasses).
  • Extends observability with per-tool-call span/metrics support and adds a mock replay fixture + example workflow + docs/changelog updates.

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 7 comments.

File Description
tests/providers/test_tool_calling.py End-to-end and per-provider regression tests for translation/parsing, tool-loop iteration, passthrough formatting, streaming events wrapper, and dispatch observability hook.
tests/observability/test_observer.py Verifies WorkflowObserver.on_tool_call emits the expected span and records metrics.
tests/observability/test_noop.py Ensures the noop observer implements the new on_tool_call hook.
tests/observability/test_metrics.py Validates MetricsManager.record_tool_call counter/histogram labeling and recording.
src/agentloom/steps/llm_call.py Implements the tool-call loop inside llm_call, accumulating tokens/cost across iterations and surfacing iteration-cap finish reason.
src/agentloom/steps/_tools.py Adds tool translation/parsing helpers, parallel dispatch (anyio task group), and provider-specific message builders for tool loop turns.
src/agentloom/providers/openai.py Adds tool definitions/tool_choice wiring, tool_call parsing, and passthrough formatting for tool-loop messages.
src/agentloom/providers/ollama.py Adds tool definitions wiring, tool_call parsing via OpenAI-shaped parser, and passthrough formatting for tool-loop messages.
src/agentloom/providers/mock.py Supports multi-turn step recordings (for tool loops) and hydrates recorded tool_calls for replay.
src/agentloom/providers/google.py Adds Gemini tool declaration + tool_choice mapping, tool_call parsing, and passthrough formatting for parts-based tool-loop messages.
src/agentloom/providers/base.py Introduces ToolCall, typed streaming event models, and exposes tool_calls on ProviderResponse / StreamResponse.
src/agentloom/providers/anthropic.py Adds tool declaration + tool_choice mapping, tool_use parsing, and passthrough behavior for tool blocks.
src/agentloom/observability/schema.py Adds span-attr keys for tool-call id/duration and metric names for tool-call counters/histograms.
src/agentloom/observability/observer.py Implements on_tool_call to emit execute_tool {name} spans and record per-tool metrics.
src/agentloom/observability/noop.py Adds noop on_tool_call hook.
src/agentloom/observability/metrics.py Adds tool-call counter + histogram creation and record_tool_call() API.
src/agentloom/core/models.py Adds ToolDefinition and new StepDefinition fields for tool calling (tools, tool_choice, max_tool_iterations).
recordings/tool_calling.json Adds a multi-turn mock recording demonstrating a tool-call + follow-up completion.
examples/35_tool_calling.yaml New example workflow showing model-driven tool calling (mock replay by default, optional live provider run).
docs/workflow-yaml.md Documents the new llm_call tool-calling fields and clarifies static tool step vs model-driven tools.
docs/examples.md Adds documentation entry for example 35 (native tool calling).
CHANGELOG.md Changelog entry describing the new tool calling capability and related behavior.

