- I checked the SWE-Bench leaderboard: TRAE + Doubao-Seed-Code achieved the best performance on SWE-Bench Verified (09/28/2025) at 78.80%, yet the result reported in Figure 1 is 70.60%. Is this difference due to the test-time scaling setup, or is there another reason?
- The best performance in the paper, achieved by Live-SWE-Agent (77.40%), is based on Gemini 3 Pro Preview (2025-11-18), which is arguably one of the newest and strongest models. It seems only two entries on the leaderboard use Gemini 3 Pro Preview (2025-11-18); the other is mini-SWE-Agent (74.20%). It makes sense that Live-SWE-Agent outperforms mini-SWE-Agent with Gemini 3 Pro Preview, but what do the results look like for SWE-Agent or OpenHands with the same base model? Since mini-SWE-Agent + Gemini 3 Pro Preview already reaches 74.20%, switching from mini-SWE-Agent to SWE-Agent or OpenHands has a good chance of exceeding 74.20% with the same base model. In that case, if these static agent pipelines perform as well as or even better than Live-SWE-Agent, does that mean the "freedom" some agent pipelines already provide is sufficient for a strong model to handle the task, while self-evolving tools on the fly may place an extra burden on the model and hurt performance?
Thanks for your valuable insights!