
Benchmark: SWE-EVO — long-horizon software evolution #27

Overview

Evaluate OpenSymbolicAI against SWE-EVO (Dec 2025), a benchmark for long-horizon software evolution whose tasks require multi-step modifications spanning an average of 21 files.
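
For scoping, a rough guess at what a task record might contain. Purely illustrative: the actual SWE-EVO schema is unknown until the paper review in the Tasks below, and every field name here is an assumption modeled loosely on SWE-bench-style records.

```python
from dataclasses import dataclass, field

@dataclass
class SweEvoTask:
    """Hypothetical SWE-EVO task record; actual schema TBD after paper review.

    A long-horizon task presumably bundles a high-level requirement spec
    plus the repo state to evolve, rather than a single isolated issue.
    """
    task_id: str                      # made-up identifier format
    repo: str                         # one of the 7 mature Python projects
    base_commit: str                  # starting codebase state
    requirements: str                 # high-level requirement text to implement
    expected_files: list[str] = field(default_factory=list)  # ~21 files on average
    test_command: str = "pytest"      # assumed pass/fail oracle
```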

Why this benchmark

  • Tests real-world software evolution: interpreting high-level requirements, coordinating cross-file changes, and evolving codebases iteratively
  • 48 tasks from 7 mature open-source Python projects
  • Goes beyond isolated single-issue tasks (unlike SWE-bench)
  • The GoalSeeking blueprint could handle the iterative-refinement aspect (see the sketch after this list)
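
To make the last point concrete, a minimal sketch of the propose-evaluate-refine loop a GoalSeeking run would drive, assuming a pass/fail test oracle per task. None of these callables are OpenSymbolicAI APIs; all names are hypothetical stand-ins.

```python
def goal_seeking_loop(task, propose_patch, apply_patch, run_tests, max_iters=10):
    """Propose a change set, evaluate it against the task's tests, and feed
    failures back as the refinement signal for the next proposal."""
    feedback = None
    for i in range(max_iters):
        patch = propose_patch(task.requirements, feedback)  # may touch many files
        apply_patch(task.repo, patch)
        report = run_tests(task.repo)
        if report.passed:
            return {"iterations": i + 1, "passed": True}
        feedback = report.failures  # drives the next iteration
    return {"iterations": max_iters, "passed": False}
```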

References

Tasks

  • Review the SWE-EVO paper and task format
  • Design a primitive set for code-modification operations (a first sketch follows this list)
  • Determine which blueprint (DesignExecute vs. GoalSeeking) fits best
  • Implement the benchmark harness
  • Run the evaluation and collect results
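
A starting point for the second and fourth tasks, under the same caveat: every function below is a hypothetical stand-in, sketched before the paper review rather than taken from SWE-EVO or OpenSymbolicAI.

```python
from pathlib import Path

# Candidate primitive set for code-modification operations. Cross-file
# coordination would live in the blueprint layer, not down here.

def read_file(repo: Path, rel_path: str) -> str:
    return (repo / rel_path).read_text()

def write_file(repo: Path, rel_path: str, content: str) -> None:
    target = repo / rel_path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)

def replace_span(repo: Path, rel_path: str, old: str, new: str) -> None:
    # Exact-match, single-occurrence replacement: a stale anchor fails
    # loudly instead of silently corrupting the file.
    text = read_file(repo, rel_path)
    if old not in text:
        raise ValueError(f"anchor not found in {rel_path}")
    write_file(repo, rel_path, text.replace(old, new, 1))

def run_benchmark(tasks, checkout, solve, run_tests):
    """Skeleton harness: check out each task's base commit, let the
    blueprint attempt it, then score with the task's test oracle."""
    results = []
    for task in tasks:
        repo_path = checkout(task)   # hypothetical: clone + reset to base_commit
        solve(task, repo_path)       # blueprint under evaluation
        report = run_tests(repo_path)
        results.append({"task_id": task.task_id, "passed": report.passed})
    return results
```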
