Benchmark: WildToolBench — real-world tool composition #23

@rajkumar42

Description

Overview

Evaluate OpenSymbolicAI on WildToolBench, an LLM tool-use benchmark grounded in real-world user behavior patterns.

Why this benchmark

  • No model exceeds 15% accuracy across 57 LLMs tested — massive room to demonstrate OpenSymbolicAI's advantage
  • Tests three key challenges that map directly to our architecture:
    1. Compositional tasks demanding efficient orchestration of tool-call topologies → our primitive composition
    2. Implicit intent spread across dialogue turns requiring contextual inference → our decomposition examples
    3. Instruction transitions mixing task queries, clarifications, and conversation → our plan-then-execute separation
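
The plan-then-execute split in point 3 can be sketched roughly as follows. This is a hypothetical illustration only: the names (`ToolCall`, `plan`, `execute`) and the hard-coded tool registry are made up for this sketch and do not come from OpenSymbolicAI or WildToolBench.

```python
# Hypothetical sketch of plan-then-execute separation over a tool-call
# topology. All names and the toy planner logic are illustrative only.
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    tool: str
    args: dict
    depends_on: list = field(default_factory=list)  # indices of prerequisite calls

def plan(utterance: str) -> list[ToolCall]:
    """Toy planner: maps one canned request to a small call topology."""
    if "weather" in utterance and "email" in utterance:
        return [
            ToolCall("get_weather", {"city": "Paris"}),
            ToolCall("send_email", {"body_from": 0}, depends_on=[0]),
        ]
    return []

def execute(calls: list[ToolCall], registry: dict) -> list:
    """Toy executor: runs calls in order, wiring earlier outputs into later args."""
    results = []
    for call in calls:
        args = dict(call.args)
        if "body_from" in args:  # resolve a dependency on a prior call's output
            args["body"] = results[args.pop("body_from")]
        results.append(registry[call.tool](**args))
    return results

# Stub tools standing in for real integrations.
registry = {
    "get_weather": lambda city: f"Sunny in {city}",
    "send_email": lambda body: f"Sent: {body}",
}

results = execute(plan("check the weather in Paris and email me"), registry)
```

The point of the separation is that the planner commits to a full call topology before any tool runs, so clarification turns and conversational filler can be handled at the planning stage without polluting execution.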

References

Tasks

  • Review benchmark paper and dataset format
  • Map WildToolBench tools to OpenSymbolicAI primitives
  • Implement benchmark harness
  • Run evaluation and collect results
  • Document findings
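
For the harness task, a minimal skeleton might look like the following. The JSONL field names (`turns`, `expected_calls`) and the exact-match metric are assumptions for illustration, not the actual WildToolBench schema or scoring rule; both should be replaced once the dataset format is reviewed.

```python
# Hypothetical harness skeleton. Field names "turns" and "expected_calls"
# are placeholders, not the real WildToolBench schema.
import json

def score_example(predicted_calls, expected_calls) -> bool:
    """Placeholder metric: exact match on the predicted call list."""
    return predicted_calls == expected_calls

def run_benchmark(path: str, predict) -> float:
    """Stream a JSONL dataset, run `predict` on each dialogue, return accuracy."""
    total = correct = 0
    with open(path) as f:
        for line in f:
            example = json.loads(line)
            total += 1
            if score_example(predict(example["turns"]), example["expected_calls"]):
                correct += 1
    return correct / max(total, 1)
```

Here `predict` would wrap the OpenSymbolicAI pipeline (plan over the dialogue turns, then emit the tool-call list), keeping the harness agnostic to the model under test.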
