Agent Reliability Arena


Production readiness benchmarking for AI agents.

Built by Aviskaar Applied AI Research Lab.


The Problem

Existing benchmarks (GAIA, τ-bench, AgentBench, SWE-bench) measure agent capability — can it do the task? They don't measure production reliability — will it do it consistently, gracefully, and safely under real-world conditions?

Agent Reliability Arena answers: "Is this agent ready for production?"

It stress-tests your agent across five dimensions that matter in deployment: consistency, robustness under input variation, tool failure recovery, multi-turn memory coherence, and enterprise-grade resilience.


Five Eval Tracks

| Track | What It Tests | Key Metric |
| --- | --- | --- |
| Consistency | Output stability across identical inputs (N runs) | Mean pairwise semantic similarity |
| Robustness | Performance under paraphrased and noisy inputs | Success rate across input variants |
| Tool Failure | Recovery from injected tool call failures | Tool recovery rate |
| Memory Drift | Factual coherence across multi-turn conversations | LLM-judge coherence score |
| Enterprise Realism | Schema drift, permission denial, audit-log completeness | Enterprise Readiness Index (ERI) |
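
To make the Consistency metric concrete, here is a minimal sketch of mean pairwise similarity over N run outputs. Token-level Jaccard similarity stands in for the real semantic scorer (which presumably uses embeddings or an LLM judge); only the averaging structure is the point:

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity -- a crude stand-in for the
    semantic similarity measure the real Consistency scorer uses."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def mean_pairwise_similarity(outputs: list[str]) -> float:
    """Average similarity over all unordered pairs of the N run outputs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:  # zero or one run: trivially consistent
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

runs = ["The order ships Monday", "Your order ships Monday", "Order ships on Monday"]
score = mean_pairwise_similarity(runs)
```

An agent that answers identically on every run scores 1.0; divergent phrasings pull the mean down.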

Quickstart

Install

pip install agent-reliability-arena

Or install from source for development:

git clone https://github.com/aviskaar/ara.git
cd ara
pip install -e ".[dev]"

Run an evaluation

arena run configs/example_agent.yaml

Reports are saved to reports/ as Markdown and JSON.

Launch the dashboard

streamlit run dashboard/app.py

Requirements

  • Python 3.10+
  • An ANTHROPIC_API_KEY for paraphrase generation and LLM judging (Robustness, Memory Drift, Enterprise Realism tracks)
  • An OPENAI_API_KEY if using langchain_react or openai_assistants agent types

Set your keys in the environment before running:

export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."          # if using OpenAI-backed agents

Configuration

Define your agent in a YAML file:

agent:
  name: "My Customer Service Agent"
  type: langchain_react        # langchain_react | openai_assistants | custom
  model: "gpt-4o"
  system_prompt: "You are a helpful customer service agent."
  tools:
    - name: search_orders
      type: api
      description: "Search customer orders by order ID or email"
    - name: refund_tool
      type: api
      description: "Initiate a refund for a given order ID"

eval:
  tracks:
    - consistency
    - robustness
    - tool_failure
    - memory_drift
    - enterprise_realism
  runs_per_track: 10
  task_suite: customer_service_v1      # built-in or path to custom .yaml
  paraphrase_model: claude-sonnet-4-20250514
  paraphrase_provider: anthropic
  timeout_seconds: 60

output:
  report_format: markdown
  leaderboard_submit: false
  export_failure_clips: true
  output_dir: reports

Configuration Reference

agent

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | yes | Human-readable agent name |
| type | enum | yes | langchain_react, openai_assistants, or custom |
| model | string | no | Model identifier (default: gpt-4o) |
| system_prompt | string | no | System prompt injected into the agent |
| tools | list | no | Tool definitions (name, type, description) |
| entry_point | string | no | Required for custom type: "module.path::function_name" |

eval

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| tracks | list | [consistency] | Tracks to run |
| runs_per_track | int | 10 | Agent invocations per track |
| task_suite | string | general_v1 | Built-in suite name or path to custom YAML |
| paraphrase_model | string | claude-sonnet-4-20250514 | Model used for paraphrase generation and LLM judging |
| timeout_seconds | int | 60 | Per-run timeout |

output

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| report_format | enum | markdown | markdown or json |
| export_failure_clips | bool | true | Include failure run traces in report |
| output_dir | string | reports | Directory for generated reports |

Supported Agent Frameworks

LangChain ReAct

Bring any AgentExecutor. Arena builds stub tools from your YAML definitions and wraps the executor automatically:

agent:
  type: langchain_react
  model: "gpt-4o"

OpenAI Assistants API

Thread-per-run evaluation using the Assistants v2 API. Requires OPENAI_API_KEY:

agent:
  type: openai_assistants
  model: "gpt-4o"

Custom Python Agent

Any run(prompt: str) -> str function, sync or async. Point to it with entry_point:

agent:
  type: custom
  entry_point: "my_package.my_agent::run"

The function must accept a single prompt: str argument and return a string.
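
For example, a minimal custom agent satisfying this contract might look like the following (module and logic are illustrative, not part of Arena):

```python
# my_package/my_agent.py -- illustrative custom agent.
# Arena calls run(prompt) once per evaluation run; sync or async both work.

def run(prompt: str) -> str:
    """Toy agent: a stand-in for real LLM- or tool-calling logic."""
    if "refund" in prompt.lower():
        return "I have initiated the refund for your order."
    return f"Received: {prompt}"
```

You would then point the config at it with entry_point: "my_package.my_agent::run".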


Built-in Task Suites

| Suite | Tasks | Description |
| --- | --- | --- |
| general_v1 | 5 | General reasoning, tool use, and multi-step planning |
| customer_service_v1 | 5 | Order management, refunds, and product support |

Custom Task Suites

Create a YAML file with the following structure and pass its path as task_suite:

name: my_custom_suite
description: "My domain-specific evaluation tasks"

tasks:
  - id: task_001
    prompt: "Your main task prompt here"
    expected_keywords: ["keyword1", "keyword2"]
    turns:
      - prompt: "First turn in a multi-turn conversation"
      - prompt: "Second turn — tests memory retention"
      - prompt: "Third turn — tests graceful conclusion"

Each task supports:

  • id — unique identifier
  • prompt — primary task prompt (used by single-turn tracks)
  • expected_keywords — list of terms expected in the output
  • turns — list of prompts for multi-turn tracks (Memory Drift, Enterprise Realism)
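
A small validator for this task schema can catch mistakes before a run. This is a hedged sketch mirroring the fields listed above, not Arena's internal loader:

```python
def validate_task(task: dict) -> list[str]:
    """Return a list of problems with one task entry; empty means valid.
    Checks only the fields documented above (id, prompt,
    expected_keywords, turns) -- not Arena's actual validation."""
    problems = []
    if not task.get("id"):
        problems.append("missing id")
    if not task.get("prompt"):
        problems.append("missing prompt")
    kw = task.get("expected_keywords")
    if kw is not None and not all(isinstance(k, str) for k in kw):
        problems.append("expected_keywords must be a list of strings")
    turns = task.get("turns")
    if turns is not None and not all("prompt" in t for t in turns):
        problems.append("every turn needs a prompt")
    return problems

task = {"id": "task_001", "prompt": "Look up the order",
        "expected_keywords": ["order"],
        "turns": [{"prompt": "And its status?"}]}
```

Running the validator over each entry in your suite's tasks list before invoking arena run saves a failed evaluation later.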

Python API

from arena import ArenaRunner
import asyncio

runner = ArenaRunner("configs/my_agent.yaml")
report = asyncio.run(runner.run())

print(f"Overall score:            {report.overall_score:.3f}")
print(f"Enterprise Readiness:     {report.enterprise_readiness_index:.3f}")
print(f"Summary: {report.summary}")

for result in report.track_results:
    print(f"  {result.track.value:<22}  {result.score:.3f}  {result.label}")

Streaming track results

Use the on_track_complete callback to process results as each track finishes:

def on_complete(result):
    print(f"Track complete: {result.track.value} = {result.score:.3f}")

runner = ArenaRunner("configs/my_agent.yaml", on_track_complete=on_complete)
report = asyncio.run(runner.run())

Report models

ArenaReport
├── agent_name: str
├── timestamp: str               # ISO 8601 UTC
├── arena_version: str
├── overall_score: float         # computed property, mean across tracks
├── enterprise_readiness_index: float | None
├── summary: str
├── config: ArenaConfig
└── track_results: list[TrackResult]
    ├── track: TrackName
    ├── score: float             # 0.0 – 1.0
    ├── label: str               # human-readable verdict
    ├── runs: list[AgentRun]
    ├── failure_clips: list[dict]
    └── metadata: dict

Saving reports manually

from arena.reporter import save_report

paths = save_report(report, output_dir="reports")
# Returns {"markdown": Path(...), "json": Path(...)}

Score Interpretation

| Score | Grade | Verdict |
| --- | --- | --- |
| 0.95 – 1.00 | A | Production Ready |
| 0.85 – 0.94 | B | Conditionally Ready |
| 0.70 – 0.84 | C | Needs Improvement |
| 0.55 – 0.69 | D | Significant Gaps |
| 0.00 – 0.54 | F | Not Production Ready |

The Enterprise Readiness Index (ERI) is a weighted composite of consistency (35%), tool recovery (35%), and audit coverage (30%). It is only produced when the enterprise_realism track is included.
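
The weighting and grade bands above can be written out directly; the function names here are hypothetical, but the weights and thresholds come from this section:

```python
def enterprise_readiness_index(consistency: float,
                               tool_recovery: float,
                               audit_coverage: float) -> float:
    """Weighted ERI composite: 35% consistency, 35% tool recovery,
    30% audit coverage, each component on a 0.0-1.0 scale."""
    return 0.35 * consistency + 0.35 * tool_recovery + 0.30 * audit_coverage

def grade(score: float) -> str:
    """Map an overall score to the letter grades in the table above."""
    for threshold, letter in [(0.95, "A"), (0.85, "B"), (0.70, "C"), (0.55, "D")]:
        if score >= threshold:
            return letter
    return "F"

eri = enterprise_readiness_index(0.92, 0.88, 0.95)  # -> 0.915, grade B
```
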


Dashboard

Arena ships with a Streamlit dashboard for interactive evaluation and report visualization.

streamlit run dashboard/app.py

Features:

  • Configure agent and tracks without editing YAML
  • Upload an existing arena_config.yaml
  • Live progress during evaluation
  • Radar chart of track scores
  • Failure clip browser
  • One-click Markdown and JSON report export

Project Structure

ara/
├── arena/
│   ├── __init__.py          # Public API: ArenaRunner, ArenaReport, TrackName
│   ├── cli.py               # `arena run` command
│   ├── runner.py            # Top-level orchestrator
│   ├── models.py            # Pydantic data models
│   ├── config.py            # YAML config loader
│   ├── agent_adapter.py     # Adapters: LangChain, OpenAI Assistants, Custom
│   ├── reporter.py          # Markdown + JSON report generator
│   ├── tracks/
│   │   └── __init__.py      # Five track implementations + TRACK_MAP registry
│   ├── scorers/
│   │   └── __init__.py      # Semantic similarity, LLM judge, tool recovery, ERI
│   └── injectors/
│       └── __init__.py      # Tool failure, paraphrase, schema drift injectors
├── configs/
│   ├── example_agent.yaml
│   └── task_suites/
│       ├── general_v1.yaml
│       └── customer_service_v1.yaml
├── dashboard/
│   └── app.py               # Streamlit dashboard
├── pyproject.toml
└── Makefile

Development

# Install with dev dependencies
make install

# Run linter
make lint

# Run tests
make test

# Clean build artifacts and reports
make clean

Contributing

Contributions are welcome. Please open an issue first to discuss major changes.

  1. Fork the repository
  2. Create a feature branch (git checkout -b feat/my-track)
  3. Make your changes and add tests
  4. Run make lint and make test
  5. Open a pull request

To add a new eval track, implement BaseTrack in arena/tracks/__init__.py and register it in TRACK_MAP.


License

Apache 2.0 — Aviskaar Applied AI Research Lab
