Benchmark: SWE-EVO — long-horizon software evolution #27
Overview
Evaluate OpenSymbolicAI against SWE-EVO (Dec 2025), a benchmark for long-horizon software evolution that requires multi-step modifications spanning an average of 21 files per task.
Why this benchmark
- Tests real-world software evolution: interpreting high-level requirements, coordinating cross-file changes, and evolving a codebase iteratively
- 48 tasks drawn from 7 mature open-source Python projects
- Goes beyond isolated single-issue tasks (unlike SWE-bench)
- The GoalSeeking blueprint could handle the iterative-refinement aspect
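The iterative-refinement aspect mentioned above could look roughly like the loop below. This is a hypothetical sketch, not the OpenSymbolicAI GoalSeeking API: the names `goal_seek`, `propose`, and `evaluate` are illustrative, and the score is assumed to be the fraction of the task's tests that pass.

```python
# Hypothetical sketch of a GoalSeeking-style refinement loop:
# propose a patch, score it against the goal, refine from the best
# attempt so far, and stop when the goal is met or the budget runs out.
# None of these names are OpenSymbolicAI API; they are illustrative.

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Attempt:
    patch: str
    score: float  # assumed: fraction of goal tests passing, in [0, 1]


def goal_seek(
    propose: Callable[[Optional[Attempt]], str],
    evaluate: Callable[[str], float],
    max_steps: int = 10,
    target: float = 1.0,
) -> Optional[Attempt]:
    best: Optional[Attempt] = None
    for _ in range(max_steps):
        patch = propose(best)    # refine from the best attempt so far
        score = evaluate(patch)  # e.g. run the task's test suite
        if best is None or score > best.score:
            best = Attempt(patch, score)
        if best.score >= target:  # goal reached, stop early
            break
    return best
```

A toy run with a `propose` that grows the patch and an `evaluate` that rewards length shows the early-stop behavior once the target score is hit.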
Tasks
- Review SWE-EVO paper and task format
- Design primitive set for code modification operations
- Determine which blueprint (DesignExecute vs GoalSeeking) fits best
- Implement benchmark harness
- Run evaluation and collect results
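For the harness item above, a minimal task record and pass/fail check might look like the following. This is a sketch under assumptions: the field names (`repo_url`, `base_commit`, `requirement`, `test_command`) are guesses at what a SWE-EVO task provides, not the benchmark's actual schema.

```python
# Minimal harness sketch: represent one evolution task and verify a
# candidate solution by running the task's test command in a working
# copy. Field names are assumptions, not the SWE-EVO schema.

import subprocess
from dataclasses import dataclass


@dataclass(frozen=True)
class EvoTask:
    task_id: str
    repo_url: str
    base_commit: str
    requirement: str   # high-level natural-language spec
    test_command: str  # e.g. "pytest -q"


def run_task(task: EvoTask, workdir: str) -> bool:
    """Assumes the agent's changes were already applied in workdir;
    a task passes iff its test command exits with status 0."""
    result = subprocess.run(
        task.test_command.split(),
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

Keeping the record frozen makes tasks safe to hash and deduplicate when collecting results across runs.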