Adjusted tests #5

Open
IIEleven11 wants to merge 2 commits into T3-Content:main from IIEleven11:main

@IIEleven11

Not sure how much you wanted to work on this test; I know it's just a small thing you did. I added some questions that challenge the models' spatial understanding a bit more, focusing less on the names of things and more on direction and body placement.

I changed the code so I could run it locally with some 120b models. These models appear to be just guessing: they all score fairly low, and when tested multiple times they choose different answers to the same questions, with no consistency.
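One way to put a number on that inconsistency is to rerun each question several times and report how often the most common answer appears. This is only a rough sketch, not the benchmark's own code: the function name `answer_consistency` and the sample answers are hypothetical, and it assumes each run's answer has already been normalized to a plain string.

```python
from collections import Counter


def answer_consistency(answers):
    """Return the share of runs that agree with the most common answer.

    `answers` is a list of normalized answer strings from repeated runs
    of the same question (hypothetical input format, not from the repo).
    1.0 means every run agreed; with two options, values near 0.5
    suggest the model is flipping between answers.
    """
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    # most_common(1) returns [(answer, count)] for the top answer
    top_count = counts.most_common(1)[0][1]
    return top_count / len(answers)


# Made-up answers from four runs of one left/right question
runs = ["left", "right", "left", "left"]
print(answer_consistency(runs))
```

Reporting this alongside accuracy would separate a model that is consistently wrong from one that changes its answer between runs.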

A potential problem with these new test questions is that there are only two possible answers, left or right. So a correct guess is much more likely than with your original test questions, where there's a large variety of names to choose from.
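To make the guessing concern concrete, here is a small sketch of two standard checks: rescaling accuracy so that pure guessing scores 0, and the binomial probability of reaching a given score by guessing alone. The function names are hypothetical, and it assumes each answer option is equally likely under guessing and that questions are independent.

```python
from math import comb


def chance_corrected(accuracy, n_choices):
    """Rescale accuracy so random guessing maps to 0 and perfect maps to 1."""
    chance = 1 / n_choices
    return (accuracy - chance) / (1 - chance)


def p_at_least(k, n, p=0.5):
    """Probability of at least k correct out of n when each guess is right with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# 60% raw accuracy on left/right questions is only 20% above chance
print(chance_corrected(0.6, 2))
# Chance of guessing at least 4 of 5 binary questions correctly
print(p_at_least(4, 5))
```

With five left/right questions, a guesser gets four or more right 18.75% of the time, so scores on the binary questions would need many repetitions before they say much.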

@vercel

vercel bot commented Dec 24, 2025

Someone is attempting to deploy a commit to the Theo's projects Team on Vercel.

A member of the Team first needs to authorize it.

@greptile-apps

greptile-apps bot commented Dec 24, 2025

Important Files Changed

| Filename | Overview |
| --- | --- |
| bench/tests/skate-trick-test.json | Updated system prompt with stance definitions and added 5 new spatial reasoning questions focusing on directional movement and body positioning |

Confidence score: 3/5

  • This PR introduces a significant change to test methodology that could affect benchmark validity and comparability with previous results
  • Score reflects concerns about the binary answer format potentially inflating success rates through random guessing, and the lack of validation for the correctness of the new spatial reasoning questions
  • Pay close attention to bench/tests/skate-trick-test.json to verify the accuracy of the new questions and consider the impact on benchmark scoring methodology

@greptile-apps greptile-apps bot left a comment

Additional Comments (2)

  1. bench/tests/skate-trick-test.json, line 52 (link)

    syntax: Typo: 'backside 900 on a.' appears incomplete - missing the object (halfpipe, vert ramp, etc.)

  2. bench/tests/skate-trick-test.json, line 60 (link)

    syntax: Typo: 'striaght' should be 'straight'

1 file reviewed, 2 comments

Fixed the typo and added the missing "half pipe"