Benchmark: WildToolBench — real-world tool composition #23

@rajkumar42

Description

Overview

Evaluate OpenSymbolicAI on WildToolBench, an LLM tool-use benchmark grounded in real-world user behavior patterns.

Why this benchmark

  • No model exceeds 15% accuracy across 57 LLMs tested — massive room to demonstrate OpenSymbolicAI's advantage
  • Tests three key challenges that map directly to our architecture:
    1. Compositional tasks demanding efficient orchestration of tool-call topologies → our primitive composition
    2. Implicit intent spread across dialogue turns requiring contextual inference → our decomposition examples
    3. Instruction transitions mixing task queries, clarifications, and conversation → our plan-then-execute separation
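
The plan-then-execute split in point 3 can be sketched roughly as follows. This is a hypothetical illustration only: the names (`ToolCall`, `plan`, `execute`) and the hard-coded tool registry are made up for this sketch and do not come from OpenSymbolicAI or WildToolBench.

```python
# Hypothetical sketch of plan-then-execute separation over a tool-call
# topology. All names and the toy planner logic are illustrative only.
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    tool: str
    args: dict
    depends_on: list = field(default_factory=list)  # indices of prerequisite calls

def plan(utterance: str) -> list[ToolCall]:
    """Toy planner: maps one canned request to a small call topology."""
    if "weather" in utterance and "email" in utterance:
        return [
            ToolCall("get_weather", {"city": "Paris"}),
            ToolCall("send_email", {"body_from": 0}, depends_on=[0]),
        ]
    return []

def execute(calls: list[ToolCall], registry: dict) -> list:
    """Toy executor: runs calls in order, wiring earlier outputs into later args."""
    results = []
    for call in calls:
        args = dict(call.args)
        if "body_from" in args:  # resolve a dependency on a prior call's output
            args["body"] = results[args.pop("body_from")]
        results.append(registry[call.tool](**args))
    return results

# Stub tools standing in for real integrations.
registry = {
    "get_weather": lambda city: f"Sunny in {city}",
    "send_email": lambda body: f"Sent: {body}",
}

results = execute(plan("check the weather in Paris and email me"), registry)
```

The point of the separation is that the planner commits to a full call topology before any tool runs, so clarification turns and conversational filler can be handled at the planning stage without polluting execution.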

References

Tasks

  • Review benchmark paper and dataset format
  • Map WildToolBench tools to OpenSymbolicAI primitives
  • Implement benchmark harness
  • Run evaluation and collect results
  • Document findings
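
For the harness task, a minimal skeleton might look like the following. The JSONL field names (`turns`, `expected_calls`) and the exact-match metric are assumptions for illustration, not the actual WildToolBench schema or scoring rule; both should be replaced once the dataset format is reviewed.

```python
# Hypothetical harness skeleton. Field names "turns" and "expected_calls"
# are placeholders, not the real WildToolBench schema.
import json

def score_example(predicted_calls, expected_calls) -> bool:
    """Placeholder metric: exact match on the predicted call list."""
    return predicted_calls == expected_calls

def run_benchmark(path: str, predict) -> float:
    """Stream a JSONL dataset, run `predict` on each dialogue, return accuracy."""
    total = correct = 0
    with open(path) as f:
        for line in f:
            example = json.loads(line)
            total += 1
            if score_example(predict(example["turns"]), example["expected_calls"]):
                correct += 1
    return correct / max(total, 1)
```

Here `predict` would wrap the OpenSymbolicAI pipeline (plan over the dialogue turns, then emit the tool-call list), keeping the harness agnostic to the model under test.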
