feat(evals): add resume logic, API validation, Gemini fix, and max_turns #233
Open
akenginorhun wants to merge 1 commit into main from
…rns config

Core Pipeline Improvements:

1. Resume Logic (runner.py):
   - Skip already-completed samples when restarting eval runs
   - Loads existing results from JSONL and filters them out
   - Prevents wasted compute on long-running evals that crash/timeout

2. API Key Validation (eval_cmd.py):
   - Added validation helpers for Anthropic, OpenAI, Google, MiniMax
   - Probe models for testing API connectivity
   - Skip confirmation prompt when resuming existing runs

3. Gemini SDK Compatibility Fix (cloud.py):
   - Fix for Gemini SDK v1.70+ Pydantic validation errors
   - Rewrote message building to use native genai SDK types
   - Fixes multi-turn conversations with function calls

4. Max Turns Config (types.py):
   - Added max_turns as first-class config field
   - Configurable at benchmark, run, and execution levels
   - Enables proper budgeting for thinking/reasoning models

These changes improve eval reliability, debugging, and compatibility without introducing benchmark-specific logic.
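For reviewers, the resume behavior described under "Resume Logic" can be sketched roughly as follows. This is an illustration only, not the code in runner.py: the function names `load_completed_ids` and `filter_pending`, the `sample_id` key, and the choice to ignore a truncated trailing line are all assumptions made for the example.

```python
import json
from pathlib import Path


def load_completed_ids(results_path: Path, id_key: str = "sample_id") -> set[str]:
    """Collect the IDs of samples already recorded in a JSONL results file.

    `id_key` is a hypothetical field name; the real results schema may differ.
    """
    completed: set[str] = set()
    if not results_path.exists():
        # First run: nothing to resume from.
        return completed
    with results_path.open("r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # A crash can leave a half-written last line. Ignoring it
                # means that sample is simply run again.
                continue
            if isinstance(record, dict) and id_key in record:
                completed.add(str(record[id_key]))
    return completed


def filter_pending(
    samples: list[dict], completed: set[str], id_key: str = "sample_id"
) -> list[dict]:
    """Return only the samples whose ID is not in `completed`, in order."""
    return [s for s in samples if str(s[id_key]) not in completed]
```

A restarted run would call `load_completed_ids` once on the existing results file and pass only the output of `filter_pending` to the evaluation loop, appending new results to the same file.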
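One way to read "configurable at benchmark, run, and execution levels" for max_turns is as a layered override. The sketch below assumes the most specific level wins (execution over run over benchmark) and falls back to a default; the PR description does not state the precedence, and the class names, `resolve_max_turns`, and `DEFAULT_MAX_TURNS` are hypothetical rather than taken from types.py.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical fallback used when no level sets max_turns.
DEFAULT_MAX_TURNS = 10


@dataclass(frozen=True)
class BenchmarkConfig:
    max_turns: Optional[int] = None


@dataclass(frozen=True)
class RunConfig:
    max_turns: Optional[int] = None


@dataclass(frozen=True)
class ExecutionConfig:
    max_turns: Optional[int] = None


def resolve_max_turns(
    benchmark: BenchmarkConfig,
    run: RunConfig,
    execution: ExecutionConfig,
    default: int = DEFAULT_MAX_TURNS,
) -> int:
    """Pick the effective turn budget.

    Assumed precedence: execution > run > benchmark > default.
    """
    for value in (execution.max_turns, run.max_turns, benchmark.max_turns):
        if value is not None:
            if value < 1:
                raise ValueError(f"max_turns must be >= 1, got {value}")
            return value
    return default
```

Under this assumed scheme, a benchmark could set a generous budget for reasoning models while a single run or execution lowers it for a quick smoke test.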