Production readiness benchmarking for AI agents.
Built by Aviskaar Applied AI Research Lab.
Existing benchmarks (GAIA, τ-bench, AgentBench, SWE-bench) measure agent capability — can it do the task? They don't measure production reliability — will it do it consistently, gracefully, and safely under real-world conditions?
Agent Reliability Arena answers: "Is this agent ready for production?"
It stress-tests your agent across five dimensions that matter in deployment: consistency, robustness under input variation, tool failure recovery, multi-turn memory coherence, and enterprise-grade resilience.
| Track | What It Tests | Key Metric |
|---|---|---|
| Consistency | Output stability across identical inputs (N runs) | Mean pairwise semantic similarity |
| Robustness | Performance under paraphrased and noisy inputs | Success rate across input variants |
| Tool Failure | Recovery from injected tool call failures | Tool recovery rate |
| Memory Drift | Factual coherence across multi-turn conversations | LLM-judge coherence score |
| Enterprise Realism | Schema drift, permission denial, audit-log completeness | Enterprise Readiness Index (ERI) |
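As a concrete illustration of the Consistency metric, the sketch below averages pairwise similarity over N run outputs. It substitutes the standard library's `difflib.SequenceMatcher` (a lexical ratio) for Arena's semantic scorer, so only the aggregation pattern — the mean over all unordered pairs — matches the real implementation.

```python
from difflib import SequenceMatcher
from itertools import combinations

def mean_pairwise_similarity(outputs: list[str]) -> float:
    """Average similarity over all unordered pairs of run outputs.
    SequenceMatcher is a lexical stand-in for a semantic scorer;
    the pairwise-mean aggregation is the part being illustrated."""
    if len(outputs) < 2:
        return 1.0  # a single run is trivially consistent
    pairs = list(combinations(outputs, 2))
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

runs = ["The refund was issued.", "The refund was issued.", "Refund issued."]
score = mean_pairwise_similarity(runs)  # high, but below 1.0
```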
```bash
pip install agent-reliability-arena
```

Or install from source for development:

```bash
git clone https://github.com/aviskaar/ara.git
cd ara
pip install -e ".[dev]"
```

Run an evaluation:

```bash
arena run configs/example_agent.yaml
```

Reports are saved to `reports/` as Markdown and JSON.
To launch the dashboard:

```bash
streamlit run dashboard/app.py
```

Requirements:

- Python 3.10+
- An `ANTHROPIC_API_KEY` for paraphrase generation and LLM judging (Robustness, Memory Drift, Enterprise Realism tracks)
- An `OPENAI_API_KEY` if using `langchain_react` or `openai_assistants` agent types
Set your keys in the environment before running:
```bash
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."  # if using OpenAI-backed agents
```

Define your agent in a YAML file:
```yaml
agent:
  name: "My Customer Service Agent"
  type: langchain_react  # langchain_react | openai_assistants | custom
  model: "gpt-4o"
  system_prompt: "You are a helpful customer service agent."
  tools:
    - name: search_orders
      type: api
      description: "Search customer orders by order ID or email"
    - name: refund_tool
      type: api
      description: "Initiate a refund for a given order ID"

eval:
  tracks:
    - consistency
    - robustness
    - tool_failure
    - memory_drift
    - enterprise_realism
  runs_per_track: 10
  task_suite: customer_service_v1  # built-in or path to custom .yaml
  paraphrase_model: claude-sonnet-4-20250514
  paraphrase_provider: anthropic
  timeout_seconds: 60

output:
  report_format: markdown
  leaderboard_submit: false
  export_failure_clips: true
  output_dir: reports
```

`agent` fields:

| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | yes | Human-readable agent name |
| `type` | enum | yes | `langchain_react`, `openai_assistants`, or `custom` |
| `model` | string | no | Model identifier (default: `gpt-4o`) |
| `system_prompt` | string | no | System prompt injected into the agent |
| `tools` | list | no | Tool definitions (name, type, description) |
| `entry_point` | string | no | Required for `custom` type: `"module.path::function_name"` |
`eval` fields:

| Field | Type | Default | Description |
|---|---|---|---|
| `tracks` | list | `[consistency]` | Tracks to run |
| `runs_per_track` | int | `10` | Agent invocations per track |
| `task_suite` | string | `general_v1` | Built-in suite name or path to custom YAML |
| `paraphrase_model` | string | `claude-sonnet-4-20250514` | Model used for paraphrase generation and LLM judging |
| `timeout_seconds` | int | `60` | Per-run timeout |
`output` fields:

| Field | Type | Default | Description |
|---|---|---|---|
| `report_format` | enum | `markdown` | `markdown` or `json` |
| `export_failure_clips` | bool | `true` | Include failure run traces in report |
| `output_dir` | string | `reports` | Directory for generated reports |
Bring any `AgentExecutor`. Arena builds stub tools from your YAML definitions and wraps the executor automatically:

```yaml
agent:
  type: langchain_react
  model: "gpt-4o"
```

Thread-per-run evaluation using the Assistants v2 API. Requires `OPENAI_API_KEY`:

```yaml
agent:
  type: openai_assistants
  model: "gpt-4o"
```

Any `run(prompt: str) -> str` function, sync or async. Point to it with `entry_point`:

```yaml
agent:
  type: custom
  entry_point: "my_package.my_agent::run"
```

The function must accept a single `prompt: str` argument and return a string.
| Suite | Tasks | Description |
|---|---|---|
| `general_v1` | 5 | General reasoning, tool use, and multi-step planning |
| `customer_service_v1` | 5 | Order management, refunds, and product support |
Create a YAML file with the following structure and pass its path as `task_suite`:

```yaml
name: my_custom_suite
description: "My domain-specific evaluation tasks"
tasks:
  - id: task_001
    prompt: "Your main task prompt here"
    expected_keywords: ["keyword1", "keyword2"]
    turns:
      - prompt: "First turn in a multi-turn conversation"
      - prompt: "Second turn — tests memory retention"
      - prompt: "Third turn — tests graceful conclusion"
```

Each task supports:

- `id` — unique identifier
- `prompt` — primary task prompt (used by single-turn tracks)
- `expected_keywords` — list of terms expected in the output
- `turns` — list of prompts for multi-turn tracks (Memory Drift, Enterprise Realism)
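It can be handy to sanity-check a custom suite against these field rules before running. The helper below is not part of Arena; it operates on an already-parsed suite dict (e.g. the result of `yaml.safe_load`) and encodes only the rules listed above.

```python
def validate_suite(suite: dict) -> list[str]:
    """Return a list of problems found in a parsed task-suite dict.
    Checks only the field rules documented above (name, tasks,
    unique ids, prompts); illustrative, not Arena's own validation."""
    problems: list[str] = []
    if "name" not in suite:
        problems.append("suite is missing 'name'")
    tasks = suite.get("tasks", [])
    if not tasks:
        problems.append("suite defines no tasks")
    seen_ids: set[str] = set()
    for i, task in enumerate(tasks):
        task_id = task.get("id")
        if not task_id:
            problems.append(f"task {i} is missing 'id'")
        elif task_id in seen_ids:
            problems.append(f"duplicate task id: {task_id}")
        else:
            seen_ids.add(task_id)
        if "prompt" not in task:
            problems.append(f"task {task_id or i} is missing 'prompt'")
    return problems

suite = {"name": "my_custom_suite",
         "tasks": [{"id": "task_001", "prompt": "Your main task prompt here"}]}
issues = validate_suite(suite)  # [] for a well-formed suite
```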
```python
import asyncio

from arena import ArenaRunner

runner = ArenaRunner("configs/my_agent.yaml")
report = asyncio.run(runner.run())

print(f"Overall score: {report.overall_score:.3f}")
print(f"Enterprise Readiness: {report.enterprise_readiness_index:.3f}")
print(f"Summary: {report.summary}")

for result in report.track_results:
    print(f"  {result.track.value:<22} {result.score:.3f}  {result.label}")
```

Use the `on_track_complete` callback to process results as each track finishes:

```python
def on_complete(result):
    print(f"Track complete: {result.track.value} → {result.score:.3f}")

runner = ArenaRunner("configs/my_agent.yaml", on_track_complete=on_complete)
report = asyncio.run(runner.run())
```

```
ArenaReport
├── agent_name: str
├── timestamp: str                      # ISO 8601 UTC
├── arena_version: str
├── overall_score: float                # computed property, mean across tracks
├── enterprise_readiness_index: float | None
├── summary: str
├── config: ArenaConfig
└── track_results: list[TrackResult]
    ├── track: TrackName
    ├── score: float                    # 0.0 – 1.0
    ├── label: str                      # human-readable verdict
    ├── runs: list[AgentRun]
    ├── failure_clips: list[dict]
    └── metadata: dict
```

```python
from arena.reporter import save_report

paths = save_report(report, output_dir="reports")
# Returns {"markdown": Path(...), "json": Path(...)}
```

| Score | Grade | Verdict |
|---|---|---|
| 0.95 – 1.00 | A | Production Ready |
| 0.85 – 0.94 | B | Conditionally Ready |
| 0.70 – 0.84 | C | Needs Improvement |
| 0.55 – 0.69 | D | Significant Gaps |
| 0.00 – 0.54 | F | Not Production Ready |
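The bands above can be expressed as a small lookup. This is an illustrative sketch of the published thresholds, not Arena's internal labeling code:

```python
def grade(score: float) -> tuple[str, str]:
    """Map an overall score in [0, 1] to the grade bands in the table
    above. Illustrative only; edge handling may differ in Arena."""
    bands = [
        (0.95, "A", "Production Ready"),
        (0.85, "B", "Conditionally Ready"),
        (0.70, "C", "Needs Improvement"),
        (0.55, "D", "Significant Gaps"),
        (0.00, "F", "Not Production Ready"),
    ]
    for cutoff, letter, verdict in bands:
        if score >= cutoff:
            return letter, verdict
    return "F", "Not Production Ready"
```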
The Enterprise Readiness Index (ERI) is a weighted composite of consistency (35%), tool recovery (35%), and audit coverage (30%). It is only produced when the `enterprise_realism` track is included.
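Written out, the composite is a plain weighted sum. The function below is a sketch using hypothetical component names; the shipped scorer lives in `arena/scorers`:

```python
def enterprise_readiness_index(consistency: float,
                               tool_recovery: float,
                               audit_coverage: float) -> float:
    """Weighted composite described above: 35% consistency,
    35% tool recovery, 30% audit coverage. Inputs are in [0, 1]."""
    return 0.35 * consistency + 0.35 * tool_recovery + 0.30 * audit_coverage

eri = enterprise_readiness_index(0.9, 0.8, 1.0)  # ≈ 0.895
```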
Arena ships with a Streamlit dashboard for interactive evaluation and report visualization.
```bash
streamlit run dashboard/app.py
```

Features:

- Configure agent and tracks without editing YAML
- Upload an existing `arena_config.yaml`
- Live progress during evaluation
- Radar chart of track scores
- Failure clip browser
- One-click Markdown and JSON report export
```
ara/
├── arena/
│   ├── __init__.py          # Public API: ArenaRunner, ArenaReport, TrackName
│   ├── cli.py               # `arena run` command
│   ├── runner.py            # Top-level orchestrator
│   ├── models.py            # Pydantic data models
│   ├── config.py            # YAML config loader
│   ├── agent_adapter.py     # Adapters: LangChain, OpenAI Assistants, Custom
│   ├── reporter.py          # Markdown + JSON report generator
│   ├── tracks/
│   │   └── __init__.py      # Five track implementations + TRACK_MAP registry
│   ├── scorers/
│   │   └── __init__.py      # Semantic similarity, LLM judge, tool recovery, ERI
│   └── injectors/
│       └── __init__.py      # Tool failure, paraphrase, schema drift injectors
├── configs/
│   ├── example_agent.yaml
│   └── task_suites/
│       ├── general_v1.yaml
│       └── customer_service_v1.yaml
├── dashboard/
│   └── app.py               # Streamlit dashboard
├── pyproject.toml
└── Makefile
```
```bash
# Install with dev dependencies
make install

# Run linter
make lint

# Run tests
make test

# Clean build artifacts and reports
make clean
```

Contributions are welcome. Please open an issue first to discuss major changes.
- Fork the repository
- Create a feature branch (`git checkout -b feat/my-track`)
- Make your changes and add tests
- Run `make lint` and `make test`
- Open a pull request
To add a new eval track, implement `BaseTrack` in `arena/tracks/__init__.py` and register it in `TRACK_MAP`.
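Since the exact `BaseTrack` interface is not shown here, the sketch below uses stand-in definitions purely to illustrate the subclass-and-register pattern; consult `arena/tracks/__init__.py` for the real signatures.

```python
# Illustrative only: BaseTrack, TRACK_MAP, and the method signature
# below are stand-ins, not Arena's actual API.

class BaseTrack:  # stand-in for arena.tracks.BaseTrack
    name = "base"

    async def run(self, agent, tasks) -> float:
        raise NotImplementedError

class LatencyJitterTrack(BaseTrack):
    """Hypothetical new track: penalize high variance in response time."""
    name = "latency_jitter"

    async def run(self, agent, tasks) -> float:
        # A real implementation would time repeated agent calls and
        # score the spread; return a placeholder score here.
        return 1.0

TRACK_MAP = {}  # stand-in for the registry in arena/tracks/__init__.py
TRACK_MAP[LatencyJitterTrack.name] = LatencyJitterTrack
```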
Apache 2.0 — Aviskaar Applied AI Research Lab