Production readiness benchmarking for AI agents.
Built by Aviskaar Applied AI Research Lab.
Existing benchmarks (GAIA, τ-bench, AgentBench, SWE-bench) measure agent capability — can it do the task? They don't measure production reliability — will it do it consistently, gracefully, and safely under real-world conditions?
Agent Reliability Arena answers: "Is this agent ready for production?"
It stress-tests your agent across five dimensions that matter in deployment: consistency, robustness under input variation, tool failure recovery, multi-turn memory coherence, and enterprise-grade resilience.
| Track | What It Tests | Key Metric |
|---|---|---|
| Consistency | Output stability across identical inputs (N runs) | Mean pairwise semantic similarity |
| Robustness | Performance under paraphrased and noisy inputs | Success rate across input variants |
| Tool Failure | Recovery from injected tool call failures | Tool recovery rate |
| Memory Drift | Factual coherence across multi-turn conversations | LLM-judge coherence score |
| Enterprise Realism | Schema drift, permission denial, audit-log completeness | Enterprise Readiness Index (ERI) |
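As a concrete illustration of the Consistency metric, the sketch below averages pairwise similarity over N run outputs. It substitutes the standard library's `difflib.SequenceMatcher` (a lexical ratio) for Arena's semantic scorer, so only the aggregation pattern — the mean over all unordered pairs — matches the real implementation.

```python
from difflib import SequenceMatcher
from itertools import combinations

def mean_pairwise_similarity(outputs: list[str]) -> float:
    """Average similarity over all unordered pairs of run outputs.
    SequenceMatcher is a lexical stand-in for a semantic scorer;
    the pairwise-mean aggregation is the part being illustrated."""
    if len(outputs) < 2:
        return 1.0  # a single run is trivially consistent
    pairs = list(combinations(outputs, 2))
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

runs = ["The refund was issued.", "The refund was issued.", "Refund issued."]
score = mean_pairwise_similarity(runs)  # high, but below 1.0
```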
```bash
pip install agent-reliability-arena
```

Or install from source for development:

```bash
git clone https://github.com/aviskaar/ara.git
cd ara
pip install -e ".[dev]"
```

Run an evaluation:

```bash
arena run configs/example_agent.yaml
```

Reports are saved to `reports/` as Markdown and JSON.
To launch the dashboard:

```bash
streamlit run dashboard/app.py
```

Requirements:

- Python 3.10+
- An `ANTHROPIC_API_KEY` for paraphrase generation and LLM judging (Robustness, Memory Drift, Enterprise Realism tracks)
- An `OPENAI_API_KEY` if using `langchain_react` or `openai_assistants` agent types
Set your keys in the environment before running:
```bash
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."  # if using OpenAI-backed agents
```

Define your agent in a YAML file:
```yaml
agent:
  name: "My Customer Service Agent"
  type: langchain_react  # langchain_react | openai_assistants | custom
  model: "gpt-4o"
  system_prompt: "You are a helpful customer service agent."
  tools:
    - name: search_orders
      type: api
      description: "Search customer orders by order ID or email"
    - name: refund_tool
      type: api
      description: "Initiate a refund for a given order ID"

eval:
  tracks:
    - consistency
    - robustness
    - tool_failure
    - memory_drift
    - enterprise_realism
  runs_per_track: 10
  task_suite: customer_service_v1  # built-in or path to custom .yaml
  paraphrase_model: claude-sonnet-4-20250514
  paraphrase_provider: anthropic
  timeout_seconds: 60

output:
  report_format: markdown
  leaderboard_submit: false
  export_failure_clips: true
  output_dir: reports
```

`agent` fields:

| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | yes | Human-readable agent name |
| `type` | enum | yes | `langchain_react`, `openai_assistants`, or `custom` |
| `model` | string | no | Model identifier (default: `gpt-4o`) |
| `system_prompt` | string | no | System prompt injected into the agent |
| `tools` | list | no | Tool definitions (name, type, description) |
| `entry_point` | string | no | Required for `custom` type: `"module.path::function_name"` |
`eval` fields:

| Field | Type | Default | Description |
|---|---|---|---|
| `tracks` | list | `[consistency]` | Tracks to run |
| `runs_per_track` | int | `10` | Agent invocations per track |
| `task_suite` | string | `general_v1` | Built-in suite name or path to custom YAML |
| `paraphrase_model` | string | `claude-sonnet-4-20250514` | Model used for paraphrase generation and LLM judging |
| `timeout_seconds` | int | `60` | Per-run timeout |
`output` fields:

| Field | Type | Default | Description |
|---|---|---|---|
| `report_format` | enum | `markdown` | `markdown` or `json` |
| `export_failure_clips` | bool | `true` | Include failure run traces in report |
| `output_dir` | string | `reports` | Directory for generated reports |
Bring any `AgentExecutor`. Arena builds stub tools from your YAML definitions and wraps the executor automatically:

```yaml
agent:
  type: langchain_react
  model: "gpt-4o"
```

Thread-per-run evaluation using the Assistants v2 API. Requires `OPENAI_API_KEY`:

```yaml
agent:
  type: openai_assistants
  model: "gpt-4o"
```

Any `run(prompt: str) -> str` function, sync or async. Point to it with `entry_point`:

```yaml
agent:
  type: custom
  entry_point: "my_package.my_agent::run"
```

The function must accept a single `prompt: str` argument and return a string.
| Suite | Tasks | Description |
|---|---|---|
| `general_v1` | 5 | General reasoning, tool use, and multi-step planning |
| `customer_service_v1` | 5 | Order management, refunds, and product support |
Create a YAML file with the following structure and pass its path as `task_suite`:

```yaml
name: my_custom_suite
description: "My domain-specific evaluation tasks"
tasks:
  - id: task_001
    prompt: "Your main task prompt here"
    expected_keywords: ["keyword1", "keyword2"]
    turns:
      - prompt: "First turn in a multi-turn conversation"
      - prompt: "Second turn — tests memory retention"
      - prompt: "Third turn — tests graceful conclusion"
```

Each task supports:

- `id` — unique identifier
- `prompt` — primary task prompt (used by single-turn tracks)
- `expected_keywords` — list of terms expected in the output
- `turns` — list of prompts for multi-turn tracks (Memory Drift, Enterprise Realism)
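It can be handy to sanity-check a custom suite against these field rules before running. The helper below is not part of Arena; it operates on an already-parsed suite dict (e.g. the result of `yaml.safe_load`) and encodes only the rules listed above.

```python
def validate_suite(suite: dict) -> list[str]:
    """Return a list of problems found in a parsed task-suite dict.
    Checks only the field rules documented above (name, tasks,
    unique ids, prompts); illustrative, not Arena's own validation."""
    problems: list[str] = []
    if "name" not in suite:
        problems.append("suite is missing 'name'")
    tasks = suite.get("tasks", [])
    if not tasks:
        problems.append("suite defines no tasks")
    seen_ids: set[str] = set()
    for i, task in enumerate(tasks):
        task_id = task.get("id")
        if not task_id:
            problems.append(f"task {i} is missing 'id'")
        elif task_id in seen_ids:
            problems.append(f"duplicate task id: {task_id}")
        else:
            seen_ids.add(task_id)
        if "prompt" not in task:
            problems.append(f"task {task_id or i} is missing 'prompt'")
    return problems

suite = {"name": "my_custom_suite",
         "tasks": [{"id": "task_001", "prompt": "Your main task prompt here"}]}
issues = validate_suite(suite)  # [] for a well-formed suite
```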
```python
import asyncio

from arena import ArenaRunner

runner = ArenaRunner("configs/my_agent.yaml")
report = asyncio.run(runner.run())

print(f"Overall score: {report.overall_score:.3f}")
print(f"Enterprise Readiness: {report.enterprise_readiness_index:.3f}")
print(f"Summary: {report.summary}")

for result in report.track_results:
    print(f"  {result.track.value:<22} {result.score:.3f}  {result.label}")
```

Use the `on_track_complete` callback to process results as each track finishes:

```python
def on_complete(result):
    print(f"Track complete: {result.track.value} → {result.score:.3f}")

runner = ArenaRunner("configs/my_agent.yaml", on_track_complete=on_complete)
report = asyncio.run(runner.run())
```

```
ArenaReport
├── agent_name: str
├── timestamp: str                      # ISO 8601 UTC
├── arena_version: str
├── overall_score: float                # computed property, mean across tracks
├── enterprise_readiness_index: float | None
├── summary: str
├── config: ArenaConfig
└── track_results: list[TrackResult]
    ├── track: TrackName
    ├── score: float                    # 0.0 – 1.0
    ├── label: str                      # human-readable verdict
    ├── runs: list[AgentRun]
    ├── failure_clips: list[dict]
    └── metadata: dict
```

```python
from arena.reporter import save_report

paths = save_report(report, output_dir="reports")
# Returns {"markdown": Path(...), "json": Path(...)}
```

| Score | Grade | Verdict |
|---|---|---|
| 0.95 – 1.00 | A | Production Ready |
| 0.85 – 0.94 | B | Conditionally Ready |
| 0.70 – 0.84 | C | Needs Improvement |
| 0.55 – 0.69 | D | Significant Gaps |
| 0.00 – 0.54 | F | Not Production Ready |
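The bands above can be expressed as a small lookup. This is an illustrative sketch of the published thresholds, not Arena's internal labeling code:

```python
def grade(score: float) -> tuple[str, str]:
    """Map an overall score in [0, 1] to the grade bands in the table
    above. Illustrative only; edge handling may differ in Arena."""
    bands = [
        (0.95, "A", "Production Ready"),
        (0.85, "B", "Conditionally Ready"),
        (0.70, "C", "Needs Improvement"),
        (0.55, "D", "Significant Gaps"),
        (0.00, "F", "Not Production Ready"),
    ]
    for cutoff, letter, verdict in bands:
        if score >= cutoff:
            return letter, verdict
    return "F", "Not Production Ready"
```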
The Enterprise Readiness Index (ERI) is a weighted composite of consistency (35%), tool recovery (35%), and audit coverage (30%). It is only produced when the `enterprise_realism` track is included.
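Written out, the composite is a plain weighted sum. The function below is a sketch using hypothetical component names; the shipped scorer lives in `arena/scorers`:

```python
def enterprise_readiness_index(consistency: float,
                               tool_recovery: float,
                               audit_coverage: float) -> float:
    """Weighted composite described above: 35% consistency,
    35% tool recovery, 30% audit coverage. Inputs are in [0, 1]."""
    return 0.35 * consistency + 0.35 * tool_recovery + 0.30 * audit_coverage

eri = enterprise_readiness_index(0.9, 0.8, 1.0)  # ≈ 0.895
```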
Arena ships with a Streamlit dashboard for interactive evaluation and report visualization.
```bash
streamlit run dashboard/app.py
```

Features:

- Configure agent and tracks without editing YAML
- Upload an existing `arena_config.yaml`
- Live progress during evaluation
- Radar chart of track scores
- Failure clip browser
- One-click Markdown and JSON report export
```
ara/
├── arena/
│   ├── __init__.py          # Public API: ArenaRunner, ArenaReport, TrackName
│   ├── cli.py               # `arena run` command
│   ├── runner.py            # Top-level orchestrator
│   ├── models.py            # Pydantic data models
│   ├── config.py            # YAML config loader
│   ├── agent_adapter.py     # Adapters: LangChain, OpenAI Assistants, Custom
│   ├── reporter.py          # Markdown + JSON report generator
│   ├── tracks/
│   │   └── __init__.py      # Five track implementations + TRACK_MAP registry
│   ├── scorers/
│   │   └── __init__.py      # Semantic similarity, LLM judge, tool recovery, ERI
│   └── injectors/
│       └── __init__.py      # Tool failure, paraphrase, schema drift injectors
├── configs/
│   ├── example_agent.yaml
│   └── task_suites/
│       ├── general_v1.yaml
│       └── customer_service_v1.yaml
├── dashboard/
│   └── app.py               # Streamlit dashboard
├── pyproject.toml
└── Makefile
```
```bash
# Install with dev dependencies
make install

# Run linter
make lint

# Run tests
make test

# Clean build artifacts and reports
make clean
```

Contributions are welcome. Please open an issue first to discuss major changes.
- Fork the repository
- Create a feature branch (`git checkout -b feat/my-track`)
- Make your changes and add tests
- Run `make lint` and `make test`
- Open a pull request
To add a new eval track, implement `BaseTrack` in `arena/tracks/__init__.py` and register it in `TRACK_MAP`.
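Since the exact `BaseTrack` interface is not shown here, the sketch below uses stand-in definitions purely to illustrate the subclass-and-register pattern; consult `arena/tracks/__init__.py` for the real signatures.

```python
# Illustrative only: BaseTrack, TRACK_MAP, and the method signature
# below are stand-ins, not Arena's actual API.

class BaseTrack:  # stand-in for arena.tracks.BaseTrack
    name = "base"

    async def run(self, agent, tasks) -> float:
        raise NotImplementedError

class LatencyJitterTrack(BaseTrack):
    """Hypothetical new track: penalize high variance in response time."""
    name = "latency_jitter"

    async def run(self, agent, tasks) -> float:
        # A real implementation would time repeated agent calls and
        # score the spread; return a placeholder score here.
        return 1.0

TRACK_MAP = {}  # stand-in for the registry in arena/tracks/__init__.py
TRACK_MAP[LatencyJitterTrack.name] = LatencyJitterTrack
```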
Apache 2.0 — Aviskaar Applied AI Research Lab