feat: Detect platform-side inference errors #332
statxc wants to merge 4 commits into ridgesai:main from
Conversation
…d for provider failures
@camfairchild Could you please review the PR? I'd appreciate any feedback.
inference_gateway/main.py (Outdated)
    def is_non_halting_error(status_code: int) -> bool:
        return status_code in NON_HALTING_ERROR_CODES
Would be better termed "platform error" or something.
By "halting error" I meant one that isn't caught properly and halts the process
Looks good otherwise. Thank you
@camfairchild Thanks for your feedback. I updated the name to
…d embedding and edge case tests
@ibraheem-abe Could you please review this PR? Any feedback is welcome.
Please give me any feedback.
@statxc We want to change this so only that specific test is retried instead of the entire thing.
Hey @ibraheem-abe, thanks for the feedback! I think a single-run retry is out of scope for this PR. #331 was about detecting platform-side inference errors and stopping early, and this PR addresses that. Implementing retries from our side is difficult: the platform state machine is one-way, so once a run reaches a terminal state it can't be restarted.

As an interim fix, I could update the SQL view so error 3050 doesn't fail the entire evaluation: failed runs would score 0 and the other results would be preserved. But I think the actual retry behavior is better handled in a separate PR.
…ailing entire evaluation

Based on ridgesai#332 by @statxc, which detects platform-side inference errors. Changes the behavior so that when an evaluation run hits the inference error threshold, only that specific run is retried (up to 2 times) instead of marking the entire evaluation as failed.

Flow:
1. Agent finishes → validator checks /api/usage for inference errors
2. If errors >= threshold and retries remain:
   - Reset the error counter via POST /api/reset-inference-errors
   - Re-run only this specific problem (not the whole evaluation)
3. If errors >= threshold and retries are exhausted:
   - Mark the run as PLATFORM_TOO_MANY_INFERENCE_ERRORS (3050)

New additions on top of ridgesai#332:
- ErrorHashMap.reset_inference_errors() method
- POST /api/reset-inference-errors gateway endpoint
- Retry loop in _run_evaluation_run() with MAX_SINGLE_RUN_RETRIES=2
- Tests for reset behavior

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
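The retry flow above can be sketched roughly as follows. This is a hypothetical outline, not the actual validator code: the `gateway` client object, the `execute_run` callable, and `run_with_retries` itself are illustrative stand-ins; the constants come from the commit message.

```python
MAX_SINGLE_RUN_RETRIES = 2
PLATFORM_TOO_MANY_INFERENCE_ERRORS = 3050

def run_with_retries(run_id, gateway, execute_run):
    """Retry one evaluation run when the gateway reports too many
    platform-side inference errors, instead of failing the whole evaluation."""
    for attempt in range(MAX_SINGLE_RUN_RETRIES + 1):
        result = execute_run(run_id)
        usage = gateway.get_usage(run_id)  # GET /api/usage
        if usage["inference_errors"] < usage["max_inference_errors"]:
            return result  # run finished below the threshold: score normally
        if attempt < MAX_SINGLE_RUN_RETRIES:
            # POST /api/reset-inference-errors, then re-run this problem only
            gateway.reset_inference_errors(run_id)
    # Retries exhausted: flag as a platform error rather than a bad patch
    return {"error_code": PLATFORM_TOO_MANY_INFERENCE_ERRORS}
```

If the provider recovers after a reset, the next attempt completes below the threshold and its result is returned; otherwise the run is marked 3050 after two retries.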
@ibraheem-abe Sorry to bother you again. I'd appreciate it if you could approve this PR if there are no problems.
@statxc
Detect platform-side inference errors so agents aren't penalized for provider failures
Closes #331
Problem
When an AI provider goes down or returns server errors (500, 502, etc.), the agent's `inference()` calls return `None`. The agent keeps running but produces a bad or empty patch because it has no LLM to work with. The platform then scores this patch normally, so the agent gets a 0 for something that wasn't its fault. There was no mechanism to distinguish "the agent wrote bad code" from "the providers were broken."
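As a toy illustration of this failure mode (not the actual agent code; `run_agent` and its prompt are made up for the example):

```python
# When the provider is down, inference() yields None and the agent emits
# an empty patch, which the platform scores as if the agent did poorly.
def run_agent(problem: str, inference) -> str:
    response = inference(f"Fix this problem: {problem}")
    if response is None:  # provider returned 500/502/etc.
        return ""  # nothing to work with -> empty patch -> scored 0
    return response

# Healthy provider vs. dead provider:
assert run_agent("bug", lambda p: "diff --git a/f b/f") == "diff --git a/f b/f"
assert run_agent("bug", lambda p: None) == ""
```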
Solution
Track platform-side inference errors per evaluation run and flag the run as a platform error when the count exceeds a configurable threshold.
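A minimal sketch of such per-run tracking, assuming an in-memory counter keyed by `evaluation_run_id`. The class and method names here are illustrative; the real `ErrorHashMap` also auto-cleans stale entries, which this sketch omits.

```python
from collections import defaultdict
from threading import Lock

class ErrorHashMapSketch:
    """Count platform-side inference errors per evaluation run and
    report when a run crosses the configured threshold."""

    def __init__(self, max_errors: int = 5):
        self.max_errors = max_errors
        self._counts = defaultdict(int)
        self._lock = Lock()

    def record_error(self, evaluation_run_id: str) -> int:
        # Called by the gateway each time a provider request fails
        with self._lock:
            self._counts[evaluation_run_id] += 1
            return self._counts[evaluation_run_id]

    def threshold_exceeded(self, evaluation_run_id: str) -> bool:
        return self._counts[evaluation_run_id] >= self.max_errors
```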
Platform errors are provider failures that the agent can't control:

- `500` Internal Server Error
- `502` Bad Gateway
- `503` Service Unavailable
- `504` Gateway Timeout
- `-1` Internal provider error

Non-platform errors (400, 404, 422, 429) are excluded: those are the agent's fault (bad request, wrong model, exceeded cost limit).
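The classification boils down to a membership check on the codes listed above. The helper name below is illustrative (the review thread settled on "platform error" naming):

```python
# Provider-side failures the agent can't control, per the list above.
PLATFORM_ERROR_CODES = {500, 502, 503, 504, -1}

def is_platform_error(status_code: int) -> bool:
    return status_code in PLATFORM_ERROR_CODES

assert is_platform_error(502)      # provider failure
assert not is_platform_error(429)  # agent's fault (rate/cost limits)
```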
What changed
- `inference_gateway/error_hash_map.py` (new): `ErrorHashMap` class that tracks inference error counts per `evaluation_run_id`, with the same auto-cleanup pattern as the existing `CostHashMap`.
- `inference_gateway/config.py`: adds `MAX_INFERENCE_ERRORS_PER_EVALUATION_RUN` (defaults to 5 if not set in `.env`), so existing deployments won't break.
- `inference_gateway/main.py`: returns `503` once the threshold is hit. Extends `/api/usage` to include `inference_errors` and `max_inference_errors`. Adds `logger.warning()` calls when errors are counted and when the threshold blocks a request.
- `models/evaluation_run.py`: adds `PLATFORM_TOO_MANY_INFERENCE_ERRORS = 3050` in the 3xxx platform error range.
- `validator/main.py`: queries `/api/usage` on the inference gateway with a 10s timeout. If errors exceed the limit, marks the run as a platform error (3050) instead of scoring the patch. Also wires up the `extra` field in `EvaluationRunException` handling, which was designed but never passed through; `agent_logs` are now included when reporting platform errors.
- `tests/test_inference_error_tracking.py` (new): covers `ErrorHashMap` unit behavior, platform error classification, error code validation, and integration tests against both inference and embedding gateway endpoints.

How it works end-to-end
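A rough sketch of the validator-side decision, factored as a pure function over the `/api/usage` payload. In the PR, the validator fetches this payload over HTTP with a 10-second timeout; the function name `classify_usage` is illustrative, and the field names mirror the `/api/usage` extension described above.

```python
PLATFORM_TOO_MANY_INFERENCE_ERRORS = 3050

def classify_usage(usage: dict):
    """Given the /api/usage payload, return the platform error code if the
    run hit the inference-error threshold, else None (score the patch)."""
    errors = usage.get("inference_errors", 0)
    limit = usage.get("max_inference_errors", 5)  # gateway default per config
    if errors >= limit:
        return PLATFORM_TOO_MANY_INFERENCE_ERRORS
    return None
```

Keeping the threshold decision separate from the HTTP fetch makes the classification easy to unit-test without a running gateway.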
Config
Add `MAX_INFERENCE_ERRORS_PER_EVALUATION_RUN` to your `.env` if you want to override the default of 5.

Testing