
Benchmark: SWE-EVO — long-horizon software evolution #27

Overview

Evaluate OpenSymbolicAI against SWE-EVO (Dec 2025), a benchmark for long-horizon software evolution whose tasks require multi-step modifications spanning an average of 21 files.
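
For scoping, a rough guess at what a task record might contain. Purely illustrative: the actual SWE-EVO schema is unknown until the paper review in the Tasks below, and every field name here is an assumption modeled loosely on SWE-bench-style records.

```python
from dataclasses import dataclass, field

@dataclass
class SweEvoTask:
    """Hypothetical SWE-EVO task record; actual schema TBD after paper review.

    A long-horizon task presumably bundles a high-level requirement spec
    plus the repo state to evolve, rather than a single isolated issue.
    """
    task_id: str                      # made-up identifier format
    repo: str                         # one of the 7 mature Python projects
    base_commit: str                  # starting codebase state
    requirements: str                 # high-level requirement text to implement
    expected_files: list[str] = field(default_factory=list)  # ~21 files on average
    test_command: str = "pytest"      # assumed pass/fail oracle
```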

Why this benchmark

  • Tests real-world software evolution: interpreting high-level requirements, coordinating cross-file changes, and evolving codebases iteratively
  • 48 tasks from 7 mature open-source Python projects
  • Goes beyond isolated single-issue tasks (unlike SWE-bench)
  • The GoalSeeking blueprint could handle the iterative-refinement aspect (see the sketch after this list)
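
To make the last point concrete, a minimal sketch of the propose-evaluate-refine loop a GoalSeeking run would drive, assuming a pass/fail test oracle per task. None of these callables are OpenSymbolicAI APIs; all names are hypothetical stand-ins.

```python
def goal_seeking_loop(task, propose_patch, apply_patch, run_tests, max_iters=10):
    """Propose a change set, evaluate it against the task's tests, and feed
    failures back as the refinement signal for the next proposal."""
    feedback = None
    for i in range(max_iters):
        patch = propose_patch(task.requirements, feedback)  # may touch many files
        apply_patch(task.repo, patch)
        report = run_tests(task.repo)
        if report.passed:
            return {"iterations": i + 1, "passed": True}
        feedback = report.failures  # drives the next iteration
    return {"iterations": max_iters, "passed": False}
```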

References

Tasks

  • Review the SWE-EVO paper and task format
  • Design a primitive set for code-modification operations (a first sketch follows this list)
  • Determine which blueprint (DesignExecute vs. GoalSeeking) fits best
  • Implement the benchmark harness
  • Run the evaluation and collect results
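
A starting point for the second and fourth tasks, under the same caveat: every function below is a hypothetical stand-in, sketched before the paper review rather than taken from SWE-EVO or OpenSymbolicAI.

```python
from pathlib import Path

# Candidate primitive set for code-modification operations. Cross-file
# coordination would live in the blueprint layer, not down here.

def read_file(repo: Path, rel_path: str) -> str:
    return (repo / rel_path).read_text()

def write_file(repo: Path, rel_path: str, content: str) -> None:
    target = repo / rel_path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)

def replace_span(repo: Path, rel_path: str, old: str, new: str) -> None:
    # Exact-match, single-occurrence replacement: a stale anchor fails
    # loudly instead of silently corrupting the file.
    text = read_file(repo, rel_path)
    if old not in text:
        raise ValueError(f"anchor not found in {rel_path}")
    write_file(repo, rel_path, text.replace(old, new, 1))

def run_benchmark(tasks, checkout, solve, run_tests):
    """Skeleton harness: check out each task's base commit, let the
    blueprint attempt it, then score with the task's test oracle."""
    results = []
    for task in tasks:
        repo_path = checkout(task)   # hypothetical: clone + reset to base_commit
        solve(task, repo_path)       # blueprint under evaluation
        report = run_tests(repo_path)
        results.append({"task_id": task.task_id, "passed": report.passed})
    return results
```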
