feat(evals): add resume logic, API validation, Gemini fix, and max_turns #233

Open
akenginorhun wants to merge 1 commit into main from feat/eval-pipeline-improvements

@akenginorhun
Collaborator

Core Pipeline Improvements:

1. Resume Logic (runner.py):
   - Skips already-completed samples when an eval run is restarted
   - Loads existing results from the JSONL file and filters those samples out
   - Prevents wasted compute when long-running evals crash or time out

2. API Key Validation (eval_cmd.py):
   - Adds validation helpers for Anthropic, OpenAI, Google, and MiniMax
   - Adds probe models for testing API connectivity
   - Skips the confirmation prompt when resuming an existing run

3. Gemini SDK Compatibility Fix (cloud.py):
   - Fixes Pydantic validation errors with Gemini SDK v1.70+
   - Rewrites message building to use native genai SDK types
   - Fixes multi-turn conversations with function calls

4. Max Turns Config (types.py):
   - Adds max_turns as a first-class config field
   - Configurable at the benchmark, run, and execution levels
   - Enables proper budgeting for thinking/reasoning models
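
For reviewers, the resume behavior in item 1 can be pictured roughly as follows. This is a minimal sketch, not the code in runner.py: the function names `load_completed_ids` and `pending_samples` and the `sample_id` field are hypothetical, and it assumes each JSONL line is one result object keyed by a sample ID.

```python
import json
from pathlib import Path


def load_completed_ids(results_path: Path) -> set[str]:
    """Collect IDs of samples already recorded in a JSONL results file.

    The "sample_id" key is an assumed field name for illustration.
    """
    completed: set[str] = set()
    if not results_path.exists():
        # First run: nothing to resume from.
        return completed
    with results_path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # A crash can leave a truncated last line; treat it as not done.
                continue
            if isinstance(record, dict) and "sample_id" in record:
                completed.add(str(record["sample_id"]))
    return completed


def pending_samples(samples: list[dict], results_path: Path) -> list[dict]:
    """Return only the samples that have no recorded result yet."""
    done = load_completed_ids(results_path)
    return [s for s in samples if str(s["sample_id"]) not in done]
```

Skipping unparseable lines means a sample whose result was only partially written before a crash is re-run rather than silently counted as complete.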

These changes improve eval reliability, debugging, and compatibility without introducing benchmark-specific logic.
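
To illustrate how max_turns from item 4 might be resolved across the three levels, here is a small sketch. The class names, the `resolve_max_turns` helper, the default of 10, and the precedence order (execution over run over benchmark) are all assumptions for illustration; the actual fields and precedence live in types.py.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_MAX_TURNS = 10  # hypothetical fallback, not taken from the PR


@dataclass
class BenchmarkConfig:
    # Hypothetical stand-in for the benchmark-level config.
    max_turns: Optional[int] = None


@dataclass
class RunConfig:
    # Hypothetical stand-in for the run-level config.
    max_turns: Optional[int] = None


def resolve_max_turns(
    benchmark: BenchmarkConfig,
    run: RunConfig,
    execution_override: Optional[int] = None,
    default: int = DEFAULT_MAX_TURNS,
) -> int:
    """Pick the most specific max_turns that is set.

    Assumed precedence: execution override, then run, then benchmark.
    """
    for value in (execution_override, run.max_turns, benchmark.max_turns):
        if value is not None:
            if value < 1:
                raise ValueError("max_turns must be a positive integer")
            return value
    return default
```

Under these assumptions, a reasoning model that needs a larger turn budget can be given one for a single run without changing the benchmark's own setting.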
