Benchmark: τ²-bench — tool-agent-user interaction #25

@rajkumar42

Description

Overview

Evaluate OpenSymbolicAI against τ²-bench (Sierra Research), a benchmark for tool-agent-user interaction across the retail, airline, telecom, and banking domains.

Why this benchmark

  • Even GPT-4 achieves a success rate below 50%, and only about 25% consistency when the same task is repeated
  • Tests tool use and policy compliance, where our deterministic execution guarantees policy adherence
  • Live leaderboard with growing industry attention
  • Multi-domain coverage (retail, airline, telecom, banking) demonstrates generalizability
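
To make the "consistency when repeating tasks" point concrete, here is a minimal sketch of a pass^k-style estimate: the chance that k independent trials of the same task all succeed, averaged over tasks. It assumes each task was run `num_trials` times and that we only have per-task success counts. The function name and input shape are illustrative assumptions, not part of the τ²-bench tooling.

```python
from math import comb
from typing import Sequence


def pass_hat_k(successes_per_task: Sequence[int], num_trials: int, k: int) -> float:
    """Estimate pass^k from per-task success counts.

    For a task with c successes out of n trials, the unbiased estimate of
    "k randomly chosen trials all succeed" is C(c, k) / C(n, k).
    The result is the mean of that estimate over all tasks.
    """
    if not successes_per_task:
        raise ValueError("need at least one task")
    if not 1 <= k <= num_trials:
        raise ValueError("k must satisfy 1 <= k <= num_trials")

    denominator = comb(num_trials, k)
    total = 0.0
    for c in successes_per_task:
        if not 0 <= c <= num_trials:
            raise ValueError("success count out of range")
        # comb(c, k) is 0 when c < k, so such a task contributes nothing.
        total += comb(c, k) / denominator
    return total / len(successes_per_task)


if __name__ == "__main__":
    # Hypothetical counts: three tasks, four trials each.
    counts = [4, 2, 0]
    for k in (1, 2, 4):
        print(k, round(pass_hat_k(counts, num_trials=4, k=k), 4))
```

With k = 1 this reduces to the plain average success rate. Larger k penalises tasks that succeed only some of the time, which is where a deterministic executor should show its advantage if the claim above holds.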

References

Tasks

  • Review τ²-bench dataset and domains
  • Implement domain-specific primitives for each vertical
  • Build agent using DesignExecute or GoalSeeking blueprint
  • Run evaluation and collect results
  • Document findings and comparison
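
As a rough illustration of the "domain-specific primitives" task, the sketch below shows one way a retail primitive could enforce a policy rule in code rather than leaving it to the model. Everything here is hypothetical: the `Order` type, the `cancel_order` primitive, and the "only pending orders can be cancelled" rule are stand-ins for illustration, not the actual τ²-bench tools or the OpenSymbolicAI API.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Order:
    order_id: str
    status: str  # hypothetical values: "pending", "shipped", "cancelled"


class PolicyViolation(Exception):
    """Raised when a primitive is called in a way the domain policy forbids."""


def cancel_order(orders: Dict[str, Order], order_id: str) -> Order:
    """Cancel an order, enforcing an illustrative policy rule deterministically."""
    order = orders[order_id]
    # The policy check lives in the primitive, so no plan can bypass it.
    if order.status != "pending":
        raise PolicyViolation(
            f"order {order_id} is {order.status}; only pending orders can be cancelled"
        )
    order.status = "cancelled"
    return order


if __name__ == "__main__":
    db = {"A1": Order("A1", "pending"), "B2": Order("B2", "shipped")}
    print(cancel_order(db, "A1").status)
    try:
        cancel_order(db, "B2")
    except PolicyViolation as exc:
        print("blocked:", exc)
```

Putting the check inside the primitive means a policy violation surfaces as an explicit error the agent has to handle, rather than a silent wrong action that only shows up in the evaluation score.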
