# Capability-aware multi-provider failover with budget gates
## Why now
While dogfooding #225's decompose flow on the OpenRouter free tier, I hit 4 consecutive `429 Provider returned error... is temporarily rate-limited upstream` responses on `deepseek/deepseek-v4-flash` and `deepseek/deepseek-v4-pro`. The agent gave up. There's no auto-failover, so any user who steps away mid-decompose comes back to a dead session with toast errors and partial state.
OpenRouter's free models specifically have aggressive `free-models-per-min` rate limits. This isn't a bug in OpenRouter — it's a contract. But it is a fatal failure mode for hands-off agentic flows unless we fail over.
The plumbing for this already exists (multi-provider config, the model-discovery `config:v1:test-endpoint` we use for "Found 355 models", typed `complete()` abstraction, generation events). What's missing is the routing layer.
## Proposed shape — two layers
OpenRouter and most modern provider gateways have their own native failover (`provider.allow_fallbacks: true` + `provider.order: [...]`). But that only handles transient failures within a single gateway. When the whole gateway dies, or capability requirements aren't met, you need a client-side layer.
**Layer 1: Server-side native routing (free, fixes transient failures).** Set `provider.allow_fallbacks: true` + `provider.order: [...]` in the OpenAI-compatible request body when `baseUrl` is OpenRouter. Handles 429/5xx within OpenRouter's pool with no client-side complexity.

**Layer 2: Client-side cross-provider failover (catches when OpenRouter itself is down).** Capability-aware resolver: given `(currentModel, requiredCapabilities, budgetRemaining)`, return an ordered candidate list. Same-tier-or-better only, never a silent downgrade.
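Layer 1 could be as small as a helper that attaches OpenRouter's provider-routing preferences (which OpenRouter accepts in the JSON request body) to an OpenAI-compatible request. This is a hedged sketch: `withNativeRouting` and the upstream names are illustrative, not existing code.

```typescript
// Sketch of Layer 1: attach OpenRouter's native provider-routing
// preferences to an OpenAI-compatible chat request body.
interface ProviderRouting {
  order: string[];          // preferred upstream providers, tried in sequence
  allow_fallbacks: boolean; // let OpenRouter fail over past the listed ones
}

interface ChatRequestBody {
  model: string;
  messages: { role: "user" | "assistant" | "system"; content: string }[];
  provider?: ProviderRouting;
}

function withNativeRouting(body: ChatRequestBody, order: string[]): ChatRequestBody {
  // Only attach routing preferences when we actually have an order to express.
  return order.length > 0
    ? { ...body, provider: { order, allow_fallbacks: true } }
    : body;
}

// Example: prefer two upstreams, let OpenRouter fall back past them on 429/5xx.
const req = withNativeRouting(
  { model: "deepseek/deepseek-v4-flash", messages: [{ role: "user", content: "hi" }] },
  ["DeepSeek", "Together"],
);
```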
## Anti-pattern to actively avoid
Cascading degradation. A naive failover chain goes `gpt-5 → claude-haiku → gemini-flash → tiny-7B` and silently retries all of them. The user gets a "completed" job with terrible output and no idea why. Three guardrails:

1. **Capability tier preserved or improved, never lowered** — the chain resolver filters candidates by `tier >= currentTier`. Fall sideways or up, never down.
2. **Every switch is visible** — a new `model_switched` chat message kind ("Switched `deepseek-v4-flash` → `kimi-k2.6` because: 429 rate-limit, capability tier preserved. Continuing your task.") makes every degradation user-visible.
3. **Capability mismatch is an honest failure** — if vision is required and no vision-capable fallback exists, return `{ok: false, error: 'no_capable_fallback', tried: [...]}` (HONEST_STATUS rule); don't silently strip the image.
## Implementation outline (8 PRs, sequenced)
| # | PR | Effort | Highest leverage? |
|---|----|--------|-------------------|
| 1 | OpenRouter native routing config — set `provider.allow_fallbacks: true` + `provider.order: [...]` in the OpenAI client request when `baseUrl` is OpenRouter | ~30 min | Ship first, fixes today's failure with one config change |
| 2 | `packages/core/src/fallback/model-capability-registry.ts` — small JSON map of known model IDs → tags `{vision, tools, longContext, tier: 'frontier'\|'strong'\|'fast'\|'tiny'}`, inferred from ID with override list | ~1h | enables rest of plan |
| 3 | `packages/core/src/fallback/chain-resolver.ts` — given `(currentModel, requiredCaps, budgetRemaining)`, returns ordered candidate list. Same-tier-or-better only | ~1.5h | the brain |
| 4 | Wire `complete()` in `packages/providers/src/index.ts` to catch 429/5xx/timeout/upstream-error, invoke resolver, swap model, retry once with new model, emit typed `model_switched` event | | |
| 5 | New `model_switched` chat message kind — grey card with from/to/reason. Visible audit trail, non-blocking | | |
| 6 | Budget gates — hard caps `{perTurnUsd, perSessionUsd, perDesignUsd}`. When exceeded, pause job + push toast "Budget cap hit — resume?" rather than silent runaway | | |
| 8 | `LESSONS.md` integration (cross-link with the rollback Discussion) — append `{model: 'A', failedWith: 429, succeededWith: 'B', count: N}` per failover. Resolver reads this on next turn to upgrade B's priority for this design specifically | ~1h | makes the system learn its own infra |
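The failover wiring sketched for PR-4 — catch a retryable failure from `complete()`, swap to a candidate, retry once, emit a `model_switched` event — could look roughly like this. `CompleteFn`, `pickFallback`, and the event shape are illustrative names, not the actual `packages/providers` API.

```typescript
// Hedged sketch of PR-4: one retry with a fallback model, with the switch
// surfaced as a typed event rather than hidden.
type CompleteFn = (model: string, prompt: string) => Promise<string>;

interface ModelSwitchedEvent {
  kind: "model_switched";
  from: string;
  to: string;
  reason: string;
}

async function completeWithFailover(
  complete: CompleteFn,
  model: string,
  prompt: string,
  pickFallback: (failed: string) => string | undefined,
  emit: (e: ModelSwitchedEvent) => void,
): Promise<string> {
  try {
    return await complete(model, prompt);
  } catch (err) {
    const fallback = pickFallback(model);
    if (fallback === undefined) throw err; // honest failure: no capable fallback
    emit({ kind: "model_switched", from: model, to: fallback, reason: String(err) });
    return complete(fallback, prompt); // retry exactly once with the new model
  }
}
```

A real version would classify errors first (only 429/5xx/timeout are retryable) and respect the per-turn retry budget from the risks table.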
PR-1 alone is the cheapest meaningful ship and would have prevented today's specific failure.
## Risks & Mitigations

| Risk | Mitigation |
|------|------------|
| Silent capability downgrade | Tier-floor enforcement in resolver + visible `model_switched` card |
| Runaway cost | Hard budget caps (turn/session/design); pause-on-exceed, not silent-on-exceed |
| Tool-call format drift across providers | Capability tag includes `toolCallFormat`; only swap within same format family |
| Stale fallback chain references dead models | Validate against discovered models on app start; prune missing ones with notification |
| OpenRouter native fallback hides which model actually answered | Surface `x-router-model` response header in chat audit trail |
| Failover loop (B fails, retry → A fails, retry → B fails…) | Per-turn retry budget = 3; after that, surface honest failure with full chain attempt log |
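The hard budget caps behind the "Runaway cost" mitigation could be a pure check that the job runner consults before each call. The cap names follow the plan (`perTurnUsd`, `perSessionUsd`, `perDesignUsd`); `checkBudget` and the decision shape are hypothetical.

```typescript
// Sketch of pause-on-exceed budget gates: never silently continue past a cap.
interface BudgetCaps { perTurnUsd: number; perSessionUsd: number; perDesignUsd: number; }
interface Spend { turnUsd: number; sessionUsd: number; designUsd: number; }

type BudgetDecision =
  | { action: "continue" }
  | { action: "pause"; exceeded: keyof BudgetCaps }; // caller pauses job + shows toast

function checkBudget(caps: BudgetCaps, spend: Spend): BudgetDecision {
  // Check the tightest scope first so the reported cap is the most specific one.
  if (spend.turnUsd > caps.perTurnUsd) return { action: "pause", exceeded: "perTurnUsd" };
  if (spend.sessionUsd > caps.perSessionUsd) return { action: "pause", exceeded: "perSessionUsd" };
  if (spend.designUsd > caps.perDesignUsd) return { action: "pause", exceeded: "perDesignUsd" };
  return { action: "continue" };
}
```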
## Alignment check before code

- **Capability tag scope** — start with a hand-curated map of ~30 known models, or invest upfront in auto-extraction from provider metadata? My lean: hand-curated for v1, auto-extraction once the map proves the right shape.
- **Budget enforcement strictness** — soft warning + ask, vs hard pause? My lean: hard pause for `perTurnUsd`, soft warning for `perDesignUsd`. Configurable.
- **Resolver determinism** — should the resolver be deterministic, or should it weight by recent success/failure rate? Deterministic is debuggable; weighted is better behavior. My lean: deterministic for v1 with `LESSONS.md` as the explicit "weighting" channel (item 8 above).
- **OpenRouter `provider.order` config** — global default, per-design override, or both? My lean: both, with sensible defaults.
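The "`LESSONS.md` as the explicit weighting channel" lean could keep the resolver itself deterministic: the lessons file holds the per-failover records from the plan (`{model, failedWith, succeededWith, count}`), and a stable re-sort bumps previously successful fallbacks. `upgradePriority` is a hypothetical helper name.

```typescript
// Sketch: deterministic candidate list, reordered by explicit lesson counts.
interface FailoverLesson { model: string; failedWith: number; succeededWith: string; count: number; }

function upgradePriority(candidates: string[], lessons: FailoverLesson[]): string[] {
  // Tally past failover successes per model for this design.
  const wins = new Map<string, number>();
  for (const l of lessons) wins.set(l.succeededWith, (wins.get(l.succeededWith) ?? 0) + l.count);
  // Stable sort: models with more recorded successes first; ties keep the
  // resolver's original deterministic order.
  return [...candidates].sort((a, b) => (wins.get(b) ?? 0) - (wins.get(a) ?? 0));
}
```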
## Cross-references

- Sibling Discussions: "Time-travel chat: snapshot + rollback + lessons subsystem", "Spiral detector: auto-pause after N consecutive non-progressive turns"
- Item 8 (`LESSONS.md` integration) intentionally couples to the rollback Discussion — the same per-design memory that learns from semantic failures should learn from infrastructure failures.
## Reference fork-state evidence
The 4× 429 spiral that motivated this is reproducible by setting OpenRouter as the active provider, picking any free-tier model, and running the decompose flow on a moderately complex artifact. The agent makes 4-6 tool calls in close succession; OpenRouter's `free-models-per-min` cap kicks in around call 3-4; the existing error path doesn't retry-with-fallback, so generation dies.
Looking forward to direction before writing any of the actual code. PR-1 (OpenRouter native routing, ~30 min) could ship as a one-off if there's appetite while we discuss the broader shape.