
Questions for the Main Performance #3

@ZimaBlue307

Description

  1. I checked the SWE-Bench Leaderboard: TRAE + Doubao-Seed-Code achieved the best performance on SWE-Bench Verified (09/28/2025) at 78.80%, yet the result reported in Figure 1 is 70.60%. Is the difference due to the test-time scaling setup, or is there another reason?
  2. The best performance in the paper, 77.40%, is achieved by Live-SWE-Agent on top of Gemini 3 Pro Preview (2025-11-18), which is among the latest and strongest models. Only two entries on the leaderboard appear to use Gemini 3 Pro Preview (2025-11-18); the other is mini-SWE-Agent at 74.20%. It makes sense that Live-SWE-Agent outperforms mini-SWE-Agent with Gemini 3 Pro Preview, but what do the results look like for SWE-Agent or OpenHands with the same base model? Since mini-SWE-Agent + Gemini 3 Pro Preview already reaches 74.20%, swapping mini-SWE-Agent for SWE-Agent or OpenHands would very likely exceed 74.20% with the same base model. If these static agent pipelines perform as well as or even better than Live-SWE-Agent, does that suggest the freedom such pipelines already provide is sufficient for a strong model to handle the task, while tool self-evolution on the fly adds extra burden and hurts performance?

Thanks for your valuable insights!
