Benchmark: τ²-bench — tool-agent-user interaction #25
Open
Description
Overview
Evaluate OpenSymbolicAI against τ²-bench (Sierra Research), a benchmark for tool-agent-user interaction across the retail, airline, telecom, and banking domains.
Why this benchmark
- Even GPT-4 achieves a success rate below 50%, and only ~25% consistency when the same task is repeated
- Tests tool use + policy compliance — our deterministic execution guarantees policy adherence
- Live leaderboard with growing industry attention
- Multi-domain coverage (retail, airline, telecom, banking) demonstrates generalizability
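The "consistency when repeating tasks" figure above can be made concrete. The sketch below assumes consistency is scored with a pass^k-style metric: for a task run `n` times with `c` successes, the chance that `k` trials drawn without replacement all succeed is C(c, k) / C(n, k), averaged over tasks. The function names and the `results` layout are hypothetical and are not taken from the τ²-bench codebase or from OpenSymbolicAI.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate the probability that k trials of one task ALL succeed.

    Uses the combinatorial estimator C(c, k) / C(n, k), i.e. the fraction of
    size-k subsets of the n observed trials that contain only successes.
    """
    if not 0 < k <= num_trials:
        raise ValueError("k must satisfy 1 <= k <= num_trials")
    if not 0 <= num_successes <= num_trials:
        raise ValueError("num_successes must satisfy 0 <= c <= num_trials")
    # math.comb returns 0 when num_successes < k, so the estimate is 0.0 there.
    return comb(num_successes, k) / comb(num_trials, k)


def mean_pass_hat_k(results: dict[str, list[bool]], k: int) -> float:
    """Average the per-task estimate over all tasks.

    `results` maps a (hypothetical) task id to its per-trial outcomes.
    """
    if not results:
        raise ValueError("results must contain at least one task")
    scores = [pass_hat_k(len(trials), sum(trials), k) for trials in results.values()]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Illustrative outcomes only, not benchmark data.
    demo = {
        "retail_001": [True, True, True, True],
        "airline_007": [True, False, True, False],
    }
    for k in (1, 2, 4):
        print(f"pass^{k} = {mean_pass_hat_k(demo, k):.3f}")
```

With `k = 1` this reduces to the plain average success rate; larger `k` penalizes tasks that succeed only some of the time, which is where deterministic execution would be expected to show a difference.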
References
Tasks
- Review τ²-bench dataset and domains
- Implement domain-specific primitives for each vertical
- Build agent using DesignExecute or GoalSeeking blueprint
- Run evaluation and collect results
- Document findings and comparison