Benchmark: WildToolBench — real-world tool composition #23
Overview
Evaluate OpenSymbolicAI against the WildToolBench benchmark — an LLM tool-use benchmark grounded in real-world user behavior patterns.
Why this benchmark
- None of the 57 LLMs tested exceeds 15% accuracy, leaving massive room to demonstrate OpenSymbolicAI's advantage
- Tests three key challenges that map directly to our architecture:
  - Compositional tasks demanding efficient orchestration of tool-call topologies → our primitive composition
  - Implicit intent spread across dialogue turns, requiring contextual inference → our decomposition examples
  - Instruction transitions mixing task queries, clarifications, and conversation → our plan-then-execute separation
References
Tasks
- Review benchmark paper and dataset format
- Map WildToolBench tools to OpenSymbolicAI primitives
- Implement benchmark harness
- Run evaluation and collect results
- Document findings
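For the harness task, a minimal evaluation loop might look like the sketch below. All names here (`Example`, `evaluate`, `dummy_predict`) are hypothetical; the real record shape and scoring rule must be confirmed against the WildToolBench paper and dataset format during review.

```python
from dataclasses import dataclass

# Hypothetical record shape for a WildToolBench-style example; verify
# against the actual dataset format before implementing the harness.
@dataclass
class Example:
    dialogue: list[str]    # user/assistant turns, in order
    gold_calls: list[str]  # expected tool-call sequence, e.g. 'search("q")'

def exact_match(predicted: list[str], gold: list[str]) -> bool:
    """Strict scoring: predicted tool-call sequence must match the gold one."""
    return predicted == gold

def evaluate(examples: list[Example], predict) -> float:
    """Run the model-under-test `predict` over all examples; return accuracy."""
    if not examples:
        return 0.0
    correct = sum(
        exact_match(predict(ex.dialogue), ex.gold_calls) for ex in examples
    )
    return correct / len(examples)

# Stand-in predictor; the real one would wrap OpenSymbolicAI's
# plan-then-execute pipeline mapped onto the benchmark's tools.
def dummy_predict(dialogue: list[str]) -> list[str]:
    return ["noop()"]
```

A run over a toy split would then be `evaluate(load_split(...), dummy_predict)`, with the loader and scoring rule swapped for the benchmark's official ones once the dataset format is reviewed.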