
[codex] Classify network errors and auto-route model fallbacks#368

Draft
furukama wants to merge 2 commits into main from codex/network-error-classification-fallback


What changed

This teaches the container runtime to classify model transport and upstream failures into actionable categories instead of surfacing a generic fetch/network error.

  • add a dedicated model error classifier for DNS, TLS, timeout/network, generic 5xx, and provider-outage failures
  • use that classification in the retry and stream-downgrade path so we only retry or route when the failure class warrants it
  • route eligible failures to configured agent fallback models by passing resolved fallback targets from the gateway into the container runtime
  • preserve provider/model-specific failure summaries when every route is exhausted
  • add focused tests for the classifier and agent fallback resolution helper
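
A minimal sketch of what such a classifier could look like, assuming the runtime sees Node.js-style error codes and upstream HTTP statuses. The type and function names (`ModelErrorClass`, `ModelFailure`, `classifyModelError`) are hypothetical and not taken from this PR, and treating 502/503/504 as "provider outage" is an assumption made for illustration only.

```typescript
// Hypothetical sketch: names and the exact category boundaries are assumptions.
type ModelErrorClass =
  | "dns"
  | "tls"
  | "timeout_network"
  | "upstream_5xx"
  | "provider_outage"
  | "other";

interface ModelFailure {
  code?: string; // Node.js-style error code, e.g. "ENOTFOUND"
  status?: number; // HTTP status returned by the upstream provider, if any
}

// Error codes that Node.js reports for DNS lookups, TLS verification,
// and socket-level timeouts or resets.
const DNS_CODES = new Set<string>(["ENOTFOUND", "EAI_AGAIN"]);
const TLS_CODES = new Set<string>([
  "CERT_HAS_EXPIRED",
  "DEPTH_ZERO_SELF_SIGNED_CERT",
  "UNABLE_TO_VERIFY_LEAF_SIGNATURE",
  "ERR_TLS_CERT_ALTNAME_INVALID",
]);
const NETWORK_CODES = new Set<string>(["ETIMEDOUT", "ECONNRESET", "ECONNREFUSED"]);

function classifyModelError(failure: ModelFailure): ModelErrorClass {
  const code = failure.code ?? "";
  if (DNS_CODES.has(code)) return "dns";
  if (TLS_CODES.has(code)) return "tls";
  if (NETWORK_CODES.has(code)) return "timeout_network";

  const status = failure.status;
  // Assumption: gateway-style statuses indicate the provider itself is down.
  if (status === 502 || status === 503 || status === 504) return "provider_outage";
  if (status !== undefined && status >= 500 && status <= 599) return "upstream_5xx";

  // Auth, rate-limit, bad-request, and anything unrecognized land here.
  return "other";
}
```

Checking structured codes before falling back to status ranges keeps the transport-level categories (DNS, TLS, timeout) from being masked by whatever status a proxy happens to return.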

Why this changed

The runtime previously relied on coarse regex checks for "fetch failed"-style errors. That made different failure modes look the same to users and prevented the system from making better routing decisions when a provider or network path was temporarily unhealthy.

Impact

  • DNS failures now surface as DNS lookup failures instead of generic fetch errors
  • TLS failures now surface with certificate/scheme guidance instead of generic fetch errors
  • upstream 5xx and provider-outage failures can auto-route to configured fallback models
  • non-routeable failures such as auth, rate-limit, and bad-request errors still stop on the active route and surface directly
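
The routing behavior described in this list can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `RouteTarget`, `AttemptResult`, and `runWithFallbacks` are hypothetical names, the attempt callback is synchronous only to keep the sketch short (a real model call would be async), and only upstream 5xx and provider-outage classes are treated as routeable because those are the ones the description names.

```typescript
// Hypothetical sketch: names and shapes are assumptions for illustration.
type FailureClass =
  | "upstream_5xx"
  | "provider_outage"
  | "auth"
  | "rate_limit"
  | "bad_request";

interface RouteTarget {
  provider: string;
  model: string;
}

type AttemptResult =
  | { kind: "ok"; text: string }
  | { kind: "failed"; failureClass: FailureClass; detail: string };

type RouteOutcome =
  | { kind: "ok"; text: string; servedBy: RouteTarget }
  | { kind: "failed"; summary: string };

// Only these classes move on to the next configured fallback.
const ROUTEABLE = new Set<FailureClass>(["upstream_5xx", "provider_outage"]);

function runWithFallbacks(
  targets: RouteTarget[],
  attempt: (target: RouteTarget) => AttemptResult,
): RouteOutcome {
  const failures: string[] = [];
  for (const target of targets) {
    const result = attempt(target);
    if (result.kind === "ok") {
      return { kind: "ok", text: result.text, servedBy: target };
    }
    // Keep a provider/model-specific record of why this route failed.
    failures.push(
      `${target.provider}/${target.model}: ${result.failureClass} (${result.detail})`,
    );
    // Auth, rate-limit, and bad-request failures stop on the active route.
    if (!ROUTEABLE.has(result.failureClass)) break;
  }
  return { kind: "failed", summary: failures.join("; ") };
}
```

Collecting one summary line per attempted route is what lets the final error name each provider and model once every route is exhausted, rather than reporting only the last failure.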

Root cause

Fallback behavior and error reporting were driven by separate coarse heuristics in the container model retry path, and agent model.fallbacks were not being resolved into runtime targets for the container execution path.
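
To make the missing resolution step concrete, here is a hedged sketch of turning an agent's `model.fallbacks` into runtime targets. Everything beyond the `model.fallbacks` name is an assumption: the `provider/model` string format, the `AgentConfig` and `RuntimeTarget` shapes, the provider-to-base-URL map, and the `resolveFallbackTargets` helper are all hypothetical and may not match the gateway's actual config schema.

```typescript
// Hypothetical sketch: config shape and reference format are assumptions.
interface AgentConfig {
  model: {
    primary: string; // assumed "provider/model" reference
    fallbacks?: string[]; // assumed list of "provider/model" references
  };
}

interface RuntimeTarget {
  provider: string;
  model: string;
  baseUrl: string;
}

function resolveFallbackTargets(
  agent: AgentConfig,
  providerBaseUrls: Map<string, string>,
): RuntimeTarget[] {
  // Seed with the primary so it is never listed as its own fallback.
  const seen = new Set<string>([agent.model.primary]);
  const targets: RuntimeTarget[] = [];

  for (const ref of agent.model.fallbacks ?? []) {
    if (seen.has(ref)) continue; // skip duplicates and the primary
    seen.add(ref);

    const slash = ref.indexOf("/");
    if (slash <= 0 || slash === ref.length - 1) continue; // malformed reference

    const provider = ref.slice(0, slash);
    const model = ref.slice(slash + 1);
    const baseUrl = providerBaseUrls.get(provider);
    if (baseUrl === undefined) continue; // provider not configured

    targets.push({ provider, model, baseUrl });
  }
  return targets;
}
```

Resolving on the gateway side and passing plain targets into the container keeps the runtime free of config lookups, which is consistent with the helper test not needing SQLite.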

Validation

  • npm run format
  • npm run typecheck
  • vitest tests/hybridai-retry.test.ts tests/model-error-classification.test.ts tests/agent-model-fallbacks.test.ts

Notes

The focused fallback-helper test emits a startup warning from runtime-config because the shared better-sqlite3 build in this environment targets a different Node ABI, but the test itself passes and the new helper coverage does not depend on SQLite.

furukama force-pushed the codex/network-error-classification-fallback branch from dfde88e to c3e606d on April 20, 2026 16:33