SCUT-SWE is a community repository for evaluating coding models against real developer experience.
This project grew out of the "Model Coding Hackathon Cup" ladder competition at the School of Software, South China University of Technology, and it exists for a simple reason: coding leaderboard scores are becoming increasingly distorted. Many models look great on paper but feel much weaker in real hands-on use. We want to help ordinary developers tell the difference between models that are genuinely strong and models that mainly score well.
No marketing. Just evidence.
Public coding evaluations are often hard to trust because:
- prompts differ across reports
- agents and toolchains differ across reports
- harnesses differ across reports
- some benchmarks are overfit or selectively reported
- high scores do not always translate into strong day-to-day coding performance
SCUT-SWE tries to put comparisons back on fairer ground by keeping the evaluation setup fixed across models (a minimal sketch follows the list below):
- same prompts
- same coding agent
- same harness
- comparable task setup
- transparent artifacts and write-ups
- real-world coding tasks and stage materials
- a reproducible evaluation setup
- benchmark references for the coding-model landscape
- result summaries that help people choose models, not just read leaderboard screenshots
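To make "same prompts, same coding agent, same harness" concrete, here is a minimal sketch of a per-run manifest. The `RunManifest` class, its field names, and the agent/harness identifiers are hypothetical and not part of any current SCUT-SWE tooling; the point is only that every moving part of a run gets pinned and recorded, so the model is the one thing that changes.

```python
from dataclasses import dataclass, asdict
import hashlib
import json


@dataclass
class RunManifest:
    """Hypothetical record of everything that must stay fixed across models."""
    model: str              # the only variable under test
    prompt_file: str        # shared prompt, identical for every model
    agent: str              # coding agent name + pinned version
    harness: str            # evaluation harness + pinned version
    task_set: str           # e.g. a stage directory such as S0/group-stage
    temperature: float = 0.0
    seed: int = 0

    def fingerprint(self) -> str:
        """Hash the manifest so two runs can be checked for an identical setup."""
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()[:12]


# Two models evaluated under the same pinned setup differ only in `model`.
baseline = RunManifest(
    model="model-a",
    prompt_file="prompts/group_stage.md",
    agent="example-agent==1.2.3",
    harness="example-harness==0.9.0",
    task_set="S0/group-stage",
)
print(baseline.fingerprint())
```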
| Benchmark | Hugging Face | Notes |
|---|---|---|
| Terminal-Bench 2.0 | https://huggingface.co/datasets/harborframework/terminal-bench-2.0 | Evaluates terminal task execution such as CLI workflows |
| Toolathalon | https://huggingface.co/datasets/ServiceNow/toolathalon | Evaluates multi-tool usage ability |
| Expert-SWE (Internal) | N/A | Internal benchmark with no public Hugging Face release |
| SWE-Bench Pro | https://huggingface.co/datasets/ScaleAI/SWE-bench_Pro | Harder, expanded edition from Scale AI, not a subset of the original SWE-Bench |
| SWE-Bench Verified | https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified | Human-verified, higher-quality subset of SWE-Bench |
| Codeforces | https://huggingface.co/datasets/open-r1/codeforces | Competitive programming data, not a full official coding-agent benchmark |
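For the benchmarks that are published on Hugging Face, the references above can be pulled directly with the `datasets` library. A minimal sketch, assuming the `datasets` package is installed and that the split and field names match the current SWE-Bench Verified upload:

```python
from datasets import load_dataset

# SWE-Bench Verified ships a single "test" split of human-verified issues.
ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

# Each row describes one real GitHub issue plus the repository state needed to fix it.
example = ds[0]
print(example["instance_id"])
print(example["problem_statement"][:200])
```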
```
S0/
└── group-stage/
    └── online_markdown.md
```

`S0/group-stage/online_markdown.md` is the requirement document for the current group-stage task: an online Markdown editor.
The repository is still at an early stage. We are currently organizing tasks, benchmark references, and competition materials. More evaluation assets may be added as the workflow matures.
Issues and pull requests are welcome, especially for:
- better task design
- benchmark corrections or additions
- reproducibility improvements
- clearer reporting of model behavior in real use
If you report a result, please include the prompt, agent configuration, harness, and reproducible evidence whenever possible.
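As a rough illustration of what a reproducible report can look like, the sketch below writes one result entry as JSON. The structure, field names, and file paths are hypothetical; the only requirement is that the prompt, agent configuration, harness, and evidence are recorded alongside the score.

```python
from pathlib import Path
import json

# Hypothetical report entry; adapt the fields to whatever your harness emits.
report = {
    "model": "model-a",
    "prompt": "prompts/group_stage.md",
    "agent_config": {"agent": "example-agent==1.2.3", "temperature": 0.0},
    "harness": "example-harness==0.9.0",
    "task_set": "S0/group-stage",
    "score": {"resolved": 7, "total": 10},
    "evidence": ["logs/model-a/run-01/transcript.txt"],  # transcripts, diffs, etc.
}

out = Path("reports/model-a.json")
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(report, indent=2))
```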