SCUT-SWE is a community repository for evaluating coding models against real developer experience.
This project grew out of the "Model Coding Hackathon Cup" ladder competition at the School of Software, South China University of Technology, and it exists for a simple reason: coding leaderboard scores are becoming increasingly distorted. Many models look great on paper but feel much weaker in real hands-on use. We want to help ordinary developers tell the difference between models that are genuinely strong and models that mainly score well.
No marketing. Just evidence.
Public coding evaluations are often hard to trust because:
- prompts differ across reports
- agents and toolchains differ across reports
- harnesses differ across reports
- some benchmarks are overfit or selectively reported
- high scores do not always translate into strong day-to-day coding performance
SCUT-SWE tries to put comparisons back on fairer ground by keeping the evaluation setup fixed across models (a minimal sketch follows the list below):
- same prompts
- same coding agent
- same harness
- comparable task setup
- transparent artifacts and write-ups
- real-world coding tasks and stage materials
- a reproducible evaluation setup
- benchmark references for the coding-model landscape
- result summaries that help people choose models, not just read leaderboard screenshots
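To make "same prompts, same coding agent, same harness" concrete, here is a minimal sketch of a per-run manifest. The `RunManifest` class, its field names, and the agent/harness identifiers are hypothetical and not part of any current SCUT-SWE tooling; the point is only that every moving part of a run gets pinned and recorded, so the model is the one thing that changes.

```python
from dataclasses import dataclass, asdict
import hashlib
import json


@dataclass
class RunManifest:
    """Hypothetical record of everything that must stay fixed across models."""
    model: str              # the only variable under test
    prompt_file: str        # shared prompt, identical for every model
    agent: str              # coding agent name + pinned version
    harness: str            # evaluation harness + pinned version
    task_set: str           # e.g. a stage directory such as S0/group-stage
    temperature: float = 0.0
    seed: int = 0

    def fingerprint(self) -> str:
        """Hash the manifest so two runs can be checked for an identical setup."""
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()[:12]


# Two models evaluated under the same pinned setup differ only in `model`.
baseline = RunManifest(
    model="model-a",
    prompt_file="prompts/group_stage.md",
    agent="example-agent==1.2.3",
    harness="example-harness==0.9.0",
    task_set="S0/group-stage",
)
print(baseline.fingerprint())
```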
| Benchmark | Hugging Face | Notes |
|---|---|---|
| Terminal-Bench 2.0 | https://huggingface.co/datasets/harborframework/terminal-bench-2.0 | Evaluates terminal task execution such as CLI workflows |
| Toolathalon | https://huggingface.co/datasets/ServiceNow/toolathalon | Evaluates multi-tool usage ability |
| Expert-SWE (Internal) | N/A | Internal benchmark with no public Hugging Face release |
| SWE-Bench Pro | https://huggingface.co/datasets/ScaleAI/SWE-bench_Pro | Harder, expanded edition from Scale AI, not a subset of the original SWE-Bench |
| SWE-Bench Verified | https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified | Human-verified, higher-quality subset of SWE-Bench |
| Codeforces | https://huggingface.co/datasets/open-r1/codeforces | Competitive programming data, not a full official coding-agent benchmark |
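For the benchmarks that are published on Hugging Face, the references above can be pulled directly with the `datasets` library. A minimal sketch, assuming the `datasets` package is installed and that the split and field names match the current SWE-Bench Verified upload:

```python
from datasets import load_dataset

# SWE-Bench Verified ships a single "test" split of human-verified issues.
ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

# Each row describes one real GitHub issue plus the repository state needed to fix it.
example = ds[0]
print(example["instance_id"])
print(example["problem_statement"][:200])
```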
```
S0/
└── group-stage/
    └── online_markdown.md
```

`S0/group-stage/online_markdown.md` is the requirement document for the current group-stage task: an online Markdown editor.
The repository is still at an early stage. We are currently organizing tasks, benchmark references, and competition materials. More evaluation assets may be added as the workflow matures.
Issues and pull requests are welcome, especially for:
- better task design
- benchmark corrections or additions
- reproducibility improvements
- clearer reporting of model behavior in real use
If you report a result, please include the prompt, agent configuration, harness, and reproducible evidence whenever possible.
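As a rough illustration of what a reproducible report can look like, the sketch below writes one result entry as JSON. The structure, field names, and file paths are hypothetical; the only requirement is that the prompt, agent configuration, harness, and evidence are recorded alongside the score.

```python
from pathlib import Path
import json

# Hypothetical report entry; adapt the fields to whatever your harness emits.
report = {
    "model": "model-a",
    "prompt": "prompts/group_stage.md",
    "agent_config": {"agent": "example-agent==1.2.3", "temperature": 0.0},
    "harness": "example-harness==0.9.0",
    "task_set": "S0/group-stage",
    "score": {"resolved": 7, "total": 10},
    "evidence": ["logs/model-a/run-01/transcript.txt"],  # transcripts, diffs, etc.
}

out = Path("reports/model-a.json")
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(report, indent=2))
```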