feat(evals): add resume logic, API validation, Gemini fix, and max_turns by akenginorhun · Pull Request #233 · open-jarvis/OpenJarvis

akenginorhun · 2026-04-10T19:46:40Z

Core Pipeline Improvements:

Resume Logic (runner.py):
- Skip already-completed samples when restarting eval runs
- Load existing results from JSONL and filter the completed samples out of the pending set
- Prevent wasted compute on long-running evals that crash or time out
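
A minimal sketch of the resume step described above, not the actual code in runner.py. The function names `load_completed_ids` and `filter_pending` and the `sample_id` key are hypothetical, chosen for illustration. It assumes each JSONL line is one JSON object per finished sample.

```python
import json
from pathlib import Path


def load_completed_ids(results_path, id_key="sample_id"):
    """Return the set of sample ids already present in a JSONL results file.

    `id_key` is an assumed field name; the real schema may differ.
    """
    path = Path(results_path)
    if not path.exists():
        # Fresh run: nothing has been completed yet.
        return set()

    completed = set()
    with path.open("r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # A crash can leave a truncated last line; skip it so
                # that sample is simply run again.
                continue
            if isinstance(record, dict) and id_key in record:
                completed.add(record[id_key])
    return completed


def filter_pending(samples, completed_ids, id_key="sample_id"):
    """Keep only the samples whose id is not in `completed_ids`."""
    return [s for s in samples if s[id_key] not in completed_ids]
```

Skipping undecodable lines instead of raising is a design choice in this sketch: a partially written record is treated as "not completed", which costs one repeated sample rather than aborting the resume.
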
API Key Validation (eval_cmd.py):
- Add validation helpers for Anthropic, OpenAI, Google, and MiniMax
- Add probe models for testing API connectivity
- Skip confirmation prompt when resuming existing runs
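
One way such a validation helper could look, as a sketch under stated assumptions and not the code in eval_cmd.py. The names `ENV_VARS`, `KeyCheck`, and `validate_key` are hypothetical, and the environment variable names (in particular the Google and MiniMax ones) are assumptions. The connectivity probe is passed in as a callable so the sketch makes no network calls itself.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional

# Assumed environment variable names per provider (illustrative only).
ENV_VARS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "google": "GOOGLE_API_KEY",
    "minimax": "MINIMAX_API_KEY",
}


@dataclass
class KeyCheck:
    provider: str
    ok: bool
    detail: str


def validate_key(
    provider: str,
    env: Mapping[str, str],
    probe: Optional[Callable[[str], object]] = None,
) -> KeyCheck:
    """Check that a provider's key is set and, optionally, that it works.

    `probe` is a hypothetical callable that takes the key, sends one
    cheap request to a probe model, and raises on failure.
    """
    var = ENV_VARS.get(provider)
    if var is None:
        return KeyCheck(provider, False, "unknown provider")

    key = env.get(var, "").strip()
    if not key:
        return KeyCheck(provider, False, f"{var} is not set")

    if probe is None:
        return KeyCheck(provider, True, "key present (not probed)")

    try:
        probe(key)
    except Exception as exc:  # any failure means the key is unusable
        return KeyCheck(provider, False, f"probe failed: {exc}")
    return KeyCheck(provider, True, "probe succeeded")
```

In a CLI, `env` would typically be `os.environ`, and the result could be reported before the run starts so that a bad key fails fast instead of partway through an eval.
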
Gemini SDK Compatibility Fix (cloud.py):
- Fix Pydantic validation errors raised with Gemini SDK v1.70+
- Rewrite message building to use native genai SDK types
- Fix multi-turn conversations with function calls
Max Turns Config (types.py):
- Add max_turns as a first-class config field
- Make it configurable at benchmark, run, and execution levels
- Enable proper budgeting for thinking/reasoning models
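
The three configuration levels could be resolved roughly as follows. This is a sketch only: the function `resolve_max_turns`, the fallback value of 10, and the precedence order (execution over run over benchmark) are assumptions, since the description does not state which level wins.

```python
from typing import Optional

# Hypothetical fallback used when no level sets max_turns.
DEFAULT_MAX_TURNS = 10


def resolve_max_turns(
    execution: Optional[int] = None,
    run: Optional[int] = None,
    benchmark: Optional[int] = None,
    default: int = DEFAULT_MAX_TURNS,
) -> int:
    """Pick the effective max_turns, most specific level first.

    Assumed precedence: execution > run > benchmark > default.
    """
    for value in (execution, run, benchmark):
        if value is not None:
            if value < 1:
                raise ValueError("max_turns must be at least 1")
            return value
    return default
```

For example, `resolve_max_turns(run=5, benchmark=20)` returns 5 under this assumed precedence, so a single run can tighten or loosen the budget that a benchmark sets.
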

These changes improve eval reliability, debugging, and compatibility without introducing benchmark-specific logic.

akenginorhun wants to merge 1 commit into main from feat/eval-pipeline-improvements
