__
(___()'`;
/, /`
\\"--\\
Sniff out every bug in your agent workflow.
AgentHound is the pytest-native testing framework for AI agent workflows. It records real agent sessions, replays them deterministically, and lets you assert on behavior and correctness, all without making a single API call at test time.

    pip install agenthound        # Core framework
    pip install agenthound[ui]    # + debug UI

AI agents are the fastest-growing category in software, but they're nearly untestable with existing methods:
- Non-deterministic — the same prompt produces different outputs every time
- Multi-step — errors cascade through tool calls and decision chains
- Expensive — every test run burns tokens and money
- Slow — round-trips to LLM APIs add seconds per assertion
The result: 38% of organizations are piloting agents, but only 11% have them in production. The gap is testing.
AgentHound brings the workflow developers already know (write a test, make it pass, ship) to AI agents:

    from agenthound import replay, expect

    @replay("tests/fixtures/refund_flow.json")
    def test_refund_agent(session):
        result = my_agent.run("I want to return order ORD-123")
        expect(session).tools_called(["lookup_order", "process_refund"])
        expect(session).completed_successfully()
        expect(result).contains("refund")

    $ pytest tests/ -v
    tests/test_refund.py::test_refund_agent PASSED [100%]
    ============================== 1 passed in 0.02s ==============================

Zero API calls. Runs in milliseconds.
Run your agent once and capture every LLM call, tool invocation, and token count:

    from agenthound import record_session

    with record_session("tests/fixtures/refund_flow.json") as session:
        result = my_agent.run("I want to return order ORD-123")
        session.tag("happy_path", "refund")

This writes a JSON fixture file. API keys are automatically redacted. Commit it to git alongside your tests.
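As a rough illustration of the redaction step, the sketch below masks credential-bearing headers before a recorded request is serialized. It is not AgentHound's actual code: `SENSITIVE_HEADERS`, `redact_headers`, the `[REDACTED]` placeholder, and the shape of `recorded_request` are all assumptions made for this example, and the real fixture schema may differ.

```python
import json

# Hypothetical list of header names to mask; the real list may differ.
SENSITIVE_HEADERS = {"authorization", "x-api-key", "api-key"}


def redact_headers(headers: dict[str, str]) -> dict[str, str]:
    """Return a copy of headers with sensitive values replaced by a placeholder."""
    return {
        name: "[REDACTED]" if name.lower() in SENSITIVE_HEADERS else value
        for name, value in headers.items()
    }


# Illustrative recorded request (not the real fixture layout).
recorded_request = {
    "method": "POST",
    "url": "https://api.openai.com/v1/chat/completions",
    "headers": {
        "Authorization": "Bearer sk-example",
        "Content-Type": "application/json",
    },
}

# Redact before serializing, so the secret never reaches the file on disk.
safe_request = {**recorded_request, "headers": redact_headers(recorded_request["headers"])}
fixture_text = json.dumps(safe_request, indent=2)
```

Redacting before serialization, rather than scrubbing the file afterwards, means a crash mid-write cannot leave a key in a committed fixture.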
The @replay decorator intercepts all HTTP calls and serves the recorded responses. Your agent thinks it's talking to the real API:

    from agenthound import replay, expect

    @replay("tests/fixtures/refund_flow.json")
    def test_refund_agent(session):
        result = my_agent.run("I want to return order ORD-123")
        expect(session).tools_called(["lookup_order", "process_refund"])
        expect(result).contains("refund")

    $ pytest tests/ -v

No API keys in CI. No network calls. No flaky tests. Just deterministic, sub-second assertions.
Don't want to record first? Define responses inline:

    from agenthound import mock_llm, expect

    @mock_llm(responses=[
        {"tool_call": "search", "args": {"q": "weather in SF"}},
        "It's 65F and sunny in San Francisco today.",
    ])
    def test_weather_agent(session):
        result = my_agent.run("What's the weather?")
        expect(session).tools_called(["search"])
        expect(session).has_llm_calls(2)
        expect(result).contains("sunny")

Works with both providers (OpenAI and Anthropic):

    @mock_llm(responses=["Hello from Claude!"], provider="anthropic")
    def test_with_claude(session):
        ...

Test how your agent handles real-world failures such as timeouts, rate limits, and broken tools:

    from agenthound import replay, inject_failure, expect

    @replay("tests/fixtures/refund_flow.json")
    @inject_failure(tool="process_refund", error="TimeoutError", at_call=1)
    def test_handles_refund_timeout(session):
        result = my_agent.run("I want to return order ORD-123")
        # The agent should retry or degrade gracefully

Prevent runaway loops and token bloat with first-class assertions:

    @replay("tests/fixtures/research_pipeline.json")
    def test_research_stays_within_budget(session):
        result = research_agent.run("Analyze the competitive landscape")
        expect(session).total_tokens_under(50000)  # Token budget
        expect(session).max_turns(10)              # Prevent runaway loops

Assertions come in four layers, from Layer 1 (Schema) to Layer 4 (Content). Most tests never need anything beyond Layer 3:

| Layer | What it checks | Example |
|---|---|---|
| Schema | Structure, counts | has_llm_calls(3), no_errors(), all_calls_have_usage() |
| Constraints | Budgets, limits | total_tokens_under(5000), latency_under(3000), max_turns(5) |
| Trace | Tool behavior | tools_called(["search", "respond"]), tool_called_with("search", {"q": "test"}) |
| Content | Response text | final_response_contains("refund"), final_response_matches(r"REF-\d+") |
Chain them for readable, comprehensive assertions:

    (
        expect(session)
        .has_llm_calls(3)
        .tools_called(["lookup_order", "process_refund"])
        .model_used("gpt-4o-mini")
        .no_errors()
        .final_response_contains("refund")
        .final_response_matches(r"REF-\d+")
    )

AgentHound intercepts at the httpx transport level, so it works with any LLM SDK built on httpx, with no framework-specific adapters needed:

| Framework | How it works |
|---|---|
| OpenAI SDK | Intercepts openai.chat.completions.create() |
| Anthropic SDK | Intercepts anthropic.messages.create() |
| LangGraph / LangChain | Intercepts underlying SDK calls |
| Pydantic AI | Intercepts underlying SDK calls |
| CrewAI | Intercepts underlying SDK calls |
| Any httpx-based client | Intercepted automatically |
AgentHound is a standard pytest plugin. It works everywhere pytest works:

    # .github/workflows/test.yml
    name: Agent Tests
    on: [push, pull_request]
    jobs:
      test:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-python@v5
            with:
              python-version: "3.12"
          - run: pip install -e ".[dev]"
          - run: pytest tests/ -v
          # No API keys needed: tests replay from fixtures

    from agenthound import record_session
    # Record all LLM calls within the block and save to a fixture file
    with record_session("path/to/fixture.json", metadata={"env": "dev"}) as session:
        result = my_agent.run("prompt")
        session.tag("happy_path", "v2")

    from agenthound import replay

    # Replay a recorded fixture: all HTTP calls return recorded responses
    @replay("path/to/fixture.json", strict=True)
    def test_my_agent(session):
        result = my_agent.run("prompt")

    from agenthound import mock_llm, mock_tool

    # Mock LLM responses with a sequence of responses
    @mock_llm(responses=["Hello!", {"tool_call": "search", "args": {"q": "test"}}], provider="openai")
    def test_with_mock(session):
        ...

    # Mock a tool function by import path
    @mock_tool("search", target="myapp.tools.search_fn", returns={"results": []})
    def test_with_mocked_tool():
        ...

    from agenthound import inject_failure

    # Inject an error at the Nth call to a specific tool
    @replay("fixtures/session.json")
    @inject_failure(tool="process_refund", error="TimeoutError", at_call=1)
    def test_failure_handling(session):
        ...

    import agenthound
    # Global: capture all LLM calls, split sessions by 2s idle timeout
    agenthound.auto_record("sessions/", tags=["dev"], metadata={"env": "local"})
    # ... run your agent ...
    agenthound.stop_auto_record()

    # Per-function: each call becomes a fixture
    @agenthound.recorded("sessions/", tags=["support"])
    def handle_request(user_input):
        return agent.run(user_input)

    # Import an OpenTelemetry trace as a fixture (Python API)
    from agenthound.importers.otel import import_otel_trace
    import_otel_trace("trace.json", "fixtures/prod-session.json")

    # The same import from the command line
    $ agenthound-import otel trace.json fixtures/prod-session.json

    from agenthound import expect
    # Schema (Layer 1)
    expect(session).has_llm_calls(3)
    expect(session).has_llm_calls_between(1, 5)
    expect(session).no_errors()
    expect(session).all_calls_have_usage()

    # Constraints (Layer 2)
    expect(session).total_tokens_under(5000)
    expect(session).latency_under(3000)
    expect(session).max_turns(5)

    # Trace (Layer 3)
    expect(session).tools_called(["search", "respond"])
    expect(session).tools_called_unordered({"search", "respond"})
    expect(session).tool_called("search", times=2)
    expect(session).tool_called_with("search", {"q": "test"})
    expect(session).tool_sequence(["search", "respond"])
    expect(session).no_tool_errors()
    expect(session).model_used("gpt-4o-mini")
    expect(session).completed_successfully()

    # Content (Layer 4)
    expect(session).final_response_contains("refund")
    expect(session).any_response_contains("order")
    expect(session).final_response_matches(r"REF-\d+")

    # Assertions on the agent's return value
    expect(result).contains("refund")
    expect(result).matches(r"REF-\d+")
    expect(result).equals("expected value")
    expect(result).is_type(str)
    expect(result).has_field("status", "success")

    # Session properties
    session.llm_calls          # List[LLMCall]: all LLM calls in order
    session.tools_called       # List[str]: ordered tool names
    session.tool_retries       # Dict[str, int]: call count per tool
    session.total_tokens       # int: total tokens across all calls
    session.total_duration_ms  # float: total wall-clock time
    session.tags               # List[str]: tags from recording
    session.metadata           # Dict: metadata from recording

    # pytest command-line flags
    pytest --agenthound-record        # Run in recording mode (real API calls)
    pytest --agenthound-update        # Re-record existing fixtures
    pytest --agenthound-fixtures-dir  # Set fixtures directory (default: tests/fixtures)

AgentHound isn't just for test suites. You can record and debug agent sessions during development and in production.

Automatically capture every agent interaction without changing your agent code:

    import agenthound

    # Enable global auto-recording: every LLM call is captured
    agenthound.auto_record("sessions/")

    # Your existing code runs unchanged
    result = my_agent.run("Hello")           # -> sessions/2026-03-21T10-00-00_001.json
    result = my_agent.run("Return ORD-123")  # -> sessions/2026-03-21T10-00-05_002.json

    # Disable when done
    agenthound.stop_auto_record()

Sessions are split automatically: if no API call happens for 2+ seconds, the current session is flushed to a file and a new one starts.
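The idle-gap rule described above can be sketched as a simple grouping over call timestamps. This is only an illustration of the idea: `split_by_idle_gap` and its parameters are hypothetical names, and the sketch works on a finished list of timestamps, whereas a live recorder would have to decide as calls arrive.

```python
def split_by_idle_gap(timestamps: list[float], idle_seconds: float = 2.0) -> list[list[float]]:
    """Group call timestamps into sessions, starting a new session
    whenever the gap since the previous call is idle_seconds or more."""
    sessions: list[list[float]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < idle_seconds:
            sessions[-1].append(ts)   # still within the current session
        else:
            sessions.append([ts])     # idle gap reached: start a new session
    return sessions


# Seconds since the first call; the gaps of 3.9s and 4.6s start new sessions.
call_times = [0.0, 0.4, 1.1, 5.0, 5.3, 9.9]
groups = split_by_idle_gap(call_times)
# groups == [[0.0, 0.4, 1.1], [5.0, 5.3], [9.9]]
```

One consequence of a purely time-based split is that an agent that pauses for longer than the threshold between two steps of the same task ends up in two fixture files.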
For more control, decorate specific functions so each invocation saves a fixture:

    import agenthound

    @agenthound.recorded("sessions/", tags=["support"])
    def handle_support_request(user_input):
        return agent.run(user_input)

    # Each call saves a separate fixture
    handle_support_request("Return order ORD-100")
    handle_support_request("Where is my order?")

A local web UI for stepping through recorded sessions:

    pip install agenthound[ui]
    agenthound-ui --fixtures-dir sessions/
    # Open http://127.0.0.1:7600

Features:
- Fixture browser — see all recorded sessions with tags, tokens, and step count
- Step-through debugger — click through each LLM call and tool invocation
- Step inspector — see model, tokens, tool arguments, and response text at each step
- Stats dashboard — aggregate totals across all sessions: tokens, models, tags, providers
- Keyboard navigation — arrow keys to step forward/back
- Live mode — toggle live updates to see new sessions appear in real-time as your agent runs
The debug UI includes a built-in HTTP proxy that intercepts live LLM calls from your running application, with no code changes required. Point your app's SDK at the proxy, and every API call is forwarded to the real provider, recorded as a fixture, and shown in the UI in real time.

1. Start the UI (the proxy is included automatically):

       agenthound-ui --fixtures-dir sessions/ --port 7600

2. Point your app at the proxy. The proxy lives at http://127.0.0.1:7600/proxy. Set your SDK's base URL to route through it:

       # Anthropic SDK
       export ANTHROPIC_BASE_URL=http://127.0.0.1:7600/proxy
       # OpenAI SDK
       export OPENAI_BASE_URL=http://127.0.0.1:7600/proxy

   Or set it in code:

       from anthropic import Anthropic
       client = Anthropic(base_url="http://127.0.0.1:7600/proxy")

       from openai import OpenAI
       client = OpenAI(base_url="http://127.0.0.1:7600/proxy")

3. Use your app normally. Every LLM call flows through the proxy to the real API. Responses are returned unchanged. Each group of calls (separated by 5 seconds of idle time) is saved as a fixture file and appears live in the UI.

Docker: If your app runs in Docker, use host.docker.internal to reach the proxy on the host:

    ANTHROPIC_BASE_URL=http://host.docker.internal:7600/proxy

The proxy is transparent: your app behaves exactly as before, but you get full visibility into every prompt, response, and token count.
Convert OpenTelemetry traces from Langfuse, Jaeger, or any OTEL-compatible tool into AgentHound fixtures, then debug them locally:

    agenthound-import otel trace-export.json fixture.json
    agenthound-ui --fixtures-dir .

Or use the Python API:

    from agenthound.importers.otel import import_otel_trace
    import_otel_trace("trace-export.json", "fixtures/prod-session.json")

AgentHound operates at the httpx transport layer. httpx is the HTTP client used internally by both the OpenAI and Anthropic Python SDKs.
Recording: A custom httpx.BaseTransport wraps the real transport. Every HTTP request and response passes through unchanged, but gets captured into a structured log. On exit, the log is serialized to a JSON fixture file with auth headers automatically redacted.
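The pass-through-and-log pattern behind recording can be sketched with the standard library alone. This is a simplified stand-in, not AgentHound's implementation: the real recorder wraps an httpx transport, while here `RecordingTransport`, `fake_api`, and the dict-shaped requests and responses are hypothetical names and shapes chosen to keep the example self-contained.

```python
from typing import Callable

Request = dict   # e.g. {"method": ..., "url": ..., "body": ...}
Response = dict  # e.g. {"status": ..., "body": ...}


class RecordingTransport:
    """Forwards each request to an inner transport and logs the exchange."""

    def __init__(self, inner: Callable[[Request], Response]) -> None:
        self.inner = inner
        self.log: list[dict] = []

    def handle_request(self, request: Request) -> Response:
        response = self.inner(request)  # the caller sees the unmodified response
        self.log.append({"request": request, "response": response})
        return response


def fake_api(request: Request) -> Response:
    """Stand-in for the real network transport."""
    return {"status": 200, "body": {"echo": request["body"]}}


transport = RecordingTransport(fake_api)
reply = transport.handle_request({"method": "POST", "url": "/v1/chat", "body": "hi"})
```

Because the wrapper only observes the exchange, the calling code cannot tell whether recording is switched on.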
Replay: A different custom transport serves pre-recorded responses in sequence. The Nth HTTP call gets the Nth recorded response. Your agent's code runs exactly as it would in production — it has no idea it's talking to a replay.
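The "Nth call gets the Nth recorded response" rule amounts to a cursor over a list. The sketch below shows that idea in isolation; `ReplayTransport` and `ReplayExhausted` are hypothetical names, the responses are plain dicts instead of httpx objects, and raising when the recording runs out is an assumption about how a strict replay might behave.

```python
class ReplayExhausted(Exception):
    """Raised when the code under test makes more calls than were recorded."""


class ReplayTransport:
    """Serves pre-recorded responses strictly in the order they were captured."""

    def __init__(self, recorded_responses: list[dict]) -> None:
        self._responses = list(recorded_responses)
        self._index = 0

    def handle_request(self, request: dict) -> dict:
        # The request is not inspected: matching is purely positional.
        if self._index >= len(self._responses):
            raise ReplayExhausted(f"no recorded response for call #{self._index + 1}")
        response = self._responses[self._index]
        self._index += 1
        return response


transport = ReplayTransport([
    {"status": 200, "body": "first"},
    {"status": 200, "body": "second"},
])
first = transport.handle_request({"url": "/v1/chat"})
second = transport.handle_request({"url": "/v1/chat"})
```

Positional matching keeps replay deterministic, at the cost that a change in the agent's call order makes the recorded responses line up with the wrong requests.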
Assertions: The fixture contains two layers of data: the raw HTTP layer (used by replay) and a semantic layer with parsed LLM calls, tool invocations, and token counts (used by assertions). This separation keeps replay faithful and assertions ergonomic.
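To make the two layers concrete, the sketch below derives a small semantic summary from raw response bodies shaped like OpenAI chat completion responses (`choices[].message.tool_calls[].function.name` and `usage.total_tokens`). The function name `summarize_exchanges` and the keys of the summary dict are hypothetical, and the actual fixture schema is not shown in this document.

```python
def summarize_exchanges(raw_bodies: list[dict]) -> dict:
    """Build a semantic summary (call count, tool names, tokens) from raw response bodies."""
    tools: list[str] = []
    total_tokens = 0
    for body in raw_bodies:
        total_tokens += body.get("usage", {}).get("total_tokens", 0)
        for choice in body.get("choices", []):
            # tool_calls may be missing or None when the model answers in plain text
            for call in choice.get("message", {}).get("tool_calls") or []:
                tools.append(call["function"]["name"])
    return {"llm_calls": len(raw_bodies), "tools_called": tools, "total_tokens": total_tokens}


# Two illustrative raw bodies: a tool call followed by a final text answer.
raw_bodies = [
    {
        "choices": [{"message": {"content": None, "tool_calls": [
            {"function": {"name": "lookup_order", "arguments": '{"order_id": "ORD-123"}'}},
        ]}}],
        "usage": {"total_tokens": 120},
    },
    {
        "choices": [{"message": {"content": "Your refund is on its way."}}],
        "usage": {"total_tokens": 80},
    },
]
summary = summarize_exchanges(raw_bodies)
```

Keeping the raw bodies untouched and computing the summary separately means assertions can change shape without invalidating what replay serves.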

    Your Agent Code
          |
          v
    SDK (OpenAI / Anthropic)
          |
          v
    httpx.Client
          |
          v
    AgentHound Transport  <-- intercepts here
          |
          v
    Real API (recording) or Fixture (replay)

    git clone https://github.com/martinwells/agenthound.git
    cd agenthound
    pip install -e ".[dev]"
    pytest tests/ -v

License: MIT

