SCUT-SWE

Chinese documentation

SCUT-SWE is a community repository for evaluating coding models against real developer experience.

Originally initiated around the "Model Coding Hackathon Cup" ladder competition at the School of Software, South China University of Technology, this project exists for a simple reason: coding leaderboard scores are becoming increasingly distorted. Many models look great on paper, but feel much weaker in real hands-on use. We want to help ordinary developers tell the difference between models that are genuinely strong and models that mainly score well.

No marketing. Just evidence.

Why this repository exists

Public coding evaluations are often hard to trust because:

  • prompts differ across reports
  • agents and toolchains differ across reports
  • harnesses differ across reports
  • some benchmarks are overfit or selectively reported
  • high scores do not always translate into strong day-to-day coding performance

SCUT-SWE tries to put comparisons back on fairer ground:

  • same prompts
  • same coding agent
  • same harness
  • comparable task setup
  • transparent artifacts and write-ups
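
To make that concrete, the sketch below shows what a fixed-protocol run could look like. It is a hypothetical illustration, not the repository's actual harness: every name in it (run_agent, TASKS, HARNESS_VERSION, and so on) is a placeholder.

```python
# Hypothetical sketch of a fixed-protocol evaluation: everything except
# the model under test is pinned, so score differences reflect the model,
# not the prompt, agent, or harness. All names here are placeholders.
import json

HARNESS_VERSION = "0.1.0"                       # pinned harness build (assumed)
AGENT_CONFIG = {"agent": "example-cli-agent",   # same coding agent for all models
                "temperature": 0.0}             # deterministic decoding
TASKS = ["S0/group-stage/online_markdown.md"]   # same task set for all models


def run_agent(model: str, task: str, config: dict) -> str:
    """Placeholder for the real agent invocation; returns a dummy transcript."""
    return f"[{model}] ran {task} with {config['agent']}"


def evaluate(model: str) -> dict:
    """Run every task under the identical, pinned configuration."""
    transcripts = {task: run_agent(model, task, AGENT_CONFIG) for task in TASKS}
    # Store the full setup next to the outputs so the run can be reproduced.
    return {"harness": HARNESS_VERSION, "config": AGENT_CONFIG,
            "model": model, "transcripts": transcripts}


if __name__ == "__main__":
    print(json.dumps(evaluate("model-a"), indent=2))
```

A real harness would replace run_agent with the actual agent call, but the shape stays the same: one pinned configuration, many models.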

What this repository aims to provide

  • real-world coding tasks and stage materials
  • a reproducible evaluation setup
  • benchmark references for the coding-model landscape
  • result summaries that help people choose models, not just read leaderboard screenshots

Representative benchmarks in this space

| Benchmark | Hugging Face | Notes |
| --- | --- | --- |
| Terminal-Bench 2.0 | https://huggingface.co/datasets/harborframework/terminal-bench-2.0 | Evaluates execution of terminal tasks such as CLI workflows |
| Toolathalon | https://huggingface.co/datasets/ServiceNow/toolathalon | Evaluates multi-tool usage ability |
| Expert-SWE (Internal) | N/A | Internal benchmark with no public Hugging Face release |
| SWE-Bench Pro | https://huggingface.co/datasets/ScaleAI/SWE-bench_Pro | Scale AI's harder "Pro" edition of SWE-Bench |
| SWE-Bench Verified | https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified | Human-verified, higher-quality subset of SWE-Bench |
| Codeforces | https://huggingface.co/datasets/open-r1/codeforces | Competitive programming data, not a full coding-agent benchmark |

Repository layout

S0/
└── group-stage/
    └── online_markdown.md

  • S0/group-stage/online_markdown.md: the requirement document for the current group-stage task, an online Markdown editor.

Project status

The repository is at an early stage. We are currently organizing tasks, benchmark references, and competition materials; more evaluation assets may be added as the workflow matures.

Contributing

Issues and pull requests are welcome, especially for:

  • better task design
  • benchmark corrections or additions
  • reproducibility improvements
  • clearer reporting of model behavior in real use

If you report a result, please include the prompt, agent configuration, harness, and reproducible evidence whenever possible.
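
For example, a submitted result could carry fields like the following. This is only a suggestion; the field names are illustrative, not a required schema.

```python
# Suggested (not mandated) shape for a reported result; all field names
# and values are illustrative placeholders.
report = {
    "model": "example-model-v1",                    # model under test
    "prompt": "S0/group-stage/online_markdown.md",  # exact task/prompt used
    "agent_config": {"agent": "example-cli-agent",
                     "temperature": 0.0},           # full agent settings
    "harness": {"name": "example-harness",
                "version": "0.1.0"},                # pinned harness build
    "evidence": ["<link to run logs>",
                 "<link to resulting diff or artifact>"],
}
```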

