# TestThread

**pytest for AI agents.** The open-source testing framework that tells you if your AI agent is actually working, or quietly breaking.
You build an AI agent. It works in testing. You ship it.
Then it starts hallucinating. Returning wrong formats. Calling the wrong tools. Breaking your pipeline.
You find out when something downstream crashes, not before.

TestThread fixes that.

Define what your agent should do. TestThread runs it, checks the output, and tells you exactly what passed and what failed, with AI diagnosis explaining why.
| Feature | Description |
|---|---|
| Test Suites | Group test cases per agent |
| 4 Match Types | `contains`, `exact`, `regex`, `semantic` (AI-powered) |
| AI Diagnosis | When a test fails, AI explains why and suggests a fix |
| Regression Detection | Flags when the pass rate drops vs the previous run |
| PII Detection | Auto-fails if the agent leaks emails, keys, SSNs, or credit cards |
| Latency Tracking | Response time per test, average per run |
| Cost Estimation | Estimated token cost per run |
| Trajectory Assertions | Test agent steps, not just output |
| Scheduled Runs | Run suites hourly, daily, or weekly automatically |
| CI/CD Integration | GitHub Action runs tests on every push |
| Webhook Alerts | Get notified when tests fail or regress |
| CSV Import | Bulk import test cases from a spreadsheet |
| Rate Limiting | API protected from abuse |
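TestThread's PII detector runs server-side, so its exact rules aren't shown here. As an illustrative sketch of the idea, leaked PII can be caught by pattern-matching an agent's output against common formats; the `find_pii` helper and the regexes below are assumptions for illustration, not TestThread's actual detector.

```python
import re

# Illustrative patterns only -- TestThread's real detection rules are internal
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def find_pii(text: str) -> list[str]:
    """Return the PII categories detected in an agent's output."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]

print(find_pii("Contact me at jane@example.com"))  # -> ['email']
print(find_pii("The answer is 42"))                # -> []
```

A test case whose output matches any category would be auto-failed regardless of its match-type result.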
## Quick Start

### Python

```bash
pip install testthread
```

```python
from testthread import TestThread

tt = TestThread(gemini_key="your-gemini-key")

# Create a test suite
suite = tt.create_suite(
    name="My Agent Tests",
    agent_endpoint="https://your-agent.com/run"
)

# Add test cases
tt.add_case(
    suite_id=suite["id"],
    name="Basic response check",
    input="What is 2 + 2?",
    expected_output="4",
    match_type="contains"
)

tt.add_case(
    suite_id=suite["id"],
    name="Semantic check",
    input="Say hello",
    expected_output="a friendly greeting",
    match_type="semantic"
)

# Run the suite
result = tt.run_suite(suite["id"])
print(f"Passed: {result['passed']} | Failed: {result['failed']}")
print(f"Pass Rate: {result['curr_pass_rate']}%")
print(f"Estimated Cost: ${result['estimated_cost_usd']}")
```

### JavaScript

```bash
npm install testthread
```

```javascript
const TestThread = require("testthread");

const tt = new TestThread("https://test-thread-production.up.railway.app", "your-gemini-key");

const suite = await tt.createSuite("My Agent", "https://your-agent.com/run");
await tt.addCase(suite.id, "Hello test", "Say hi", "hello", "contains");

const result = await tt.runSuite(suite.id);
console.log(`Passed: ${result.passed} | Failed: ${result.failed}`);
```

## Match Types

| Type | Description |
|---|---|
| `contains` | Output contains the expected string |
| `exact` | Output matches exactly |
| `regex` | Output matches a regex pattern |
| `semantic` | AI judges if meaning matches (requires Gemini key) |
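The three deterministic match types can be sketched locally to show what each one checks. This is illustrative logic, not TestThread's implementation; `semantic` is omitted because it delegates to the Gemini-backed judge.

```python
import re

def matches(output: str, expected: str, match_type: str) -> bool:
    # Illustrative re-implementation of the deterministic match types;
    # "semantic" is handled by an AI judge in TestThread and is not shown here.
    if match_type == "contains":
        return expected in output
    if match_type == "exact":
        return output == expected
    if match_type == "regex":
        return re.search(expected, output) is not None
    raise ValueError(f"unsupported match type: {match_type}")

print(matches("The answer is 4.", "4", "contains"))        # True
print(matches("4", "4", "exact"))                          # True
print(matches("Total: $12.50", r"\$\d+\.\d{2}", "regex"))  # True
```

In practice `contains` is the most forgiving default, while `regex` is useful for format checks like currency or ID patterns.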
## Trajectory Assertions

Test not just what your agent returned, but how it got there.
```python
import requests

BASE = "https://test-thread-production.up.railway.app"  # TestThread API base URL

# Set assertions on a test case
requests.post(f"{BASE}/suites/{suite_id}/cases/{case_id}/assertions", json=[
    {"type": "tool_called", "value": "search"},
    {"type": "tool_not_called", "value": "delete_user"},
    {"type": "max_steps", "value": 5}
])

# After your agent runs, submit its trajectory
requests.post(f"{BASE}/trajectory", json={
    "run_id": run_id,
    "case_id": case_id,
    "steps": [
        {"tool": "search", "input": "query", "output": "results", "order": 1},
        {"tool": "summarize", "input": "results", "output": "summary", "order": 2}
    ]
})
```

Supported assertion types:
- `tool_called`: assert a tool was used
- `tool_not_called`: assert a tool was NOT used
- `max_steps`: assert the agent completed in N steps or fewer
- `min_steps`: assert the agent took at least N steps
- `tool_order`: assert tools were called in a specific order
- `action_called`: assert a specific action was performed
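Server-side, evaluating these assertions against a submitted trajectory amounts to simple checks over the ordered tool calls. The sketch below is an illustrative evaluator, not TestThread's internals; `action_called` is omitted since its payload shape isn't documented here.

```python
def check_assertions(steps, assertions):
    """Illustrative evaluator for trajectory assertions (not TestThread's internals).

    `steps` is a list of {"tool": ..., "order": ...} dicts as submitted to /trajectory;
    `assertions` is a list of {"type": ..., "value": ...} dicts.
    Returns a list of (assertion_type, passed) pairs.
    """
    tools = [s["tool"] for s in sorted(steps, key=lambda s: s["order"])]
    results = []
    for a in assertions:
        kind, value = a["type"], a["value"]
        if kind == "tool_called":
            passed = value in tools
        elif kind == "tool_not_called":
            passed = value not in tools
        elif kind == "max_steps":
            passed = len(steps) <= value
        elif kind == "min_steps":
            passed = len(steps) >= value
        elif kind == "tool_order":
            # value is a list of tool names that must appear in this relative order
            it = iter(tools)
            passed = all(name in it for name in value)
        else:
            passed = False  # action_called and others omitted from this sketch
        results.append((kind, passed))
    return results

steps = [
    {"tool": "search", "order": 1},
    {"tool": "summarize", "order": 2},
]
print(check_assertions(steps, [
    {"type": "tool_called", "value": "search"},
    {"type": "tool_not_called", "value": "delete_user"},
    {"type": "max_steps", "value": 5},
]))
# [('tool_called', True), ('tool_not_called', True), ('max_steps', True)]
```

The `tool_order` check uses a shared iterator so the expected names only need to appear as a subsequence, with other tool calls allowed in between.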
## CI/CD Integration

Add one file to your repo and TestThread runs on every push:

- Add your suite ID to GitHub Secrets as `TESTTHREAD_SUITE_ID`
- Copy `.github/workflows/testthread.yml` from this repo into your project

Every push to `main` now runs your agent tests automatically, failing the build if tests regress.
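The authoritative workflow is the `.github/workflows/testthread.yml` file in the TestThread repo; copy that, not this. The sketch below only shows the general shape such a workflow takes, and the job name, runner step, and script name here are assumptions.

```yaml
# Illustrative shape only -- copy the real .github/workflows/testthread.yml from the repo
name: TestThread
on:
  push:
    branches: [main]
jobs:
  agent-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run TestThread suite
        env:
          TESTTHREAD_SUITE_ID: ${{ secrets.TESTTHREAD_SUITE_ID }}
        # Hypothetical runner script: triggers the suite and exits non-zero on failure
        run: python run_testthread.py
```

Storing the suite ID as a secret keeps it out of the repo while letting the workflow reference it through the `secrets` context.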
## Scheduled Runs

```python
import requests

requests.post(f"{BASE}/suites/{suite_id}/schedule", json={
    "schedule": "daily",
    "schedule_enabled": True
})
```

Options: `hourly`, `daily`, `weekly`
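Conceptually, each schedule option just maps to an interval between runs. A minimal sketch of that mapping, assuming a fixed-interval scheduler (this is illustrative, not TestThread's implementation):

```python
from datetime import datetime, timedelta

# Illustrative mapping of schedule options to run intervals
INTERVALS = {
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
}

def next_run(schedule: str, last_run: datetime) -> datetime:
    """Return when a suite on the given schedule would run next."""
    return last_run + INTERVALS[schedule]

print(next_run("daily", datetime(2024, 1, 1)))  # 2024-01-02 00:00:00
```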
View all your test results at [test-thread.lovable.app](https://test-thread.lovable.app).

Full docs at [test-thread-production.up.railway.app/docs](https://test-thread-production.up.railway.app/docs).
## Run Locally

```bash
git clone https://github.com/eugene001dayne/test-thread.git
cd test-thread
pip install -r requirements.txt
uvicorn main:app --reload
```

TestThread is part of a suite of open-source reliability tools for AI agents.
| Tool | What it does |
|---|---|
| Iron-Thread | Validates AI output structure before it hits your database |
| TestThread | Tests whether your agent behaves correctly across runs |
| PromptThread (coming soon) | Versions and tracks prompt performance over time |
Apache 2.0: free to use, modify, and distribute.
Built for developers who ship AI agents and need to know they work.