Benchmark: τ²-bench — tool-agent-user interaction

## Overview
Evaluate OpenSymbolicAI against **τ²-bench** (Sierra Research), a benchmark for tool-agent-user interaction across the retail, airline, telecom, and banking domains.

## Why this benchmark
- Even GPT-4 achieves a **<50% success rate**, and only ~25% consistency when the same task is repeated across trials (the pass^k metric)
- Tests tool use + policy compliance, the area where our deterministic execution is meant to guarantee policy adherence
- Live leaderboard with growing industry attention
- Multi-domain coverage (retail, airline, telecom, banking) demonstrates generalizability
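
The consistency figure above refers to pass^k: the probability that an agent solves the same task in all k of k independent trials. A minimal sketch of how it could be computed from our own runs, assuming we record successes per task over n trials and use the combinatorial estimator C(c, k) / C(n, k) averaged over tasks. The function names and the sample numbers are hypothetical, not part of the benchmark's tooling.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k trials of one task all succeed."""
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    # Fraction of k-sized subsets of the trials that contain only successes.
    return comb(successes, k) / comb(trials, k)


def benchmark_pass_hat_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-task estimate over (successes, trials) pairs."""
    return sum(pass_hat_k(c, n, k) for c, n in results) / len(results)


# Hypothetical results for three tasks, four trials each.
results = [(4, 4), (2, 4), (0, 4)]
pass_1 = benchmark_pass_hat_k(results, 1)  # plain average success rate
pass_2 = benchmark_pass_hat_k(results, 2)  # stricter: two trials must both pass
print(f"pass^1={pass_1:.3f} pass^2={pass_2:.3f}")
```

Reporting pass^k for several values of k alongside the plain success rate would show whether deterministic execution improves consistency and not just average accuracy.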

## References
- [τ-bench](https://taubench.com/)
- [τ²-bench (GitHub)](https://github.com/sierra-research/tau2-bench)
- [Sierra Blog](https://sierra.ai/blog/tau-bench-shaping-development-evaluation-agents)

## Tasks
- [ ] Review τ²-bench dataset and domains
- [ ] Implement domain-specific primitives for each vertical
- [ ] Build agent using DesignExecute or GoalSeeking blueprint
- [ ] Run evaluation and collect results
- [ ] Document findings and comparison
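
As a starting point for the primitives task, here is a rough sketch of what a policy-checked domain primitive might look like in plain Python. Everything in it is an assumption for illustration: `Order`, `PolicyViolation`, `cancel_order`, and the "only pending orders can be cancelled" rule are hypothetical and are not taken from the τ²-bench policy documents or from the OpenSymbolicAI API.

```python
from dataclasses import dataclass


class PolicyViolation(Exception):
    """Raised when a tool call would break a domain policy."""


@dataclass(frozen=True)
class Order:
    order_id: str
    status: str  # hypothetical states, e.g. "pending", "delivered", "cancelled"


def cancel_order(order: Order) -> Order:
    """Cancel an order, enforcing the policy check in code, not in a prompt."""
    # Illustrative rule only: allow cancellation of pending orders.
    if order.status != "pending":
        raise PolicyViolation(
            f"order {order.order_id} is {order.status}, not pending"
        )
    return Order(order.order_id, "cancelled")


# Usage: the allowed call succeeds, the disallowed one is rejected.
cancelled = cancel_order(Order("o1", "pending"))
try:
    cancel_order(Order("o2", "delivered"))
    rejected = False
except PolicyViolation:
    rejected = True
```

The point of the sketch is that the policy check runs on every call regardless of what the model planned, which is the property the benchmark's policy-compliance scoring should reward.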
