ProviderGateway (src/agentloom/providers/gateway.py) routes by static priority + fallback chain. #50 (strategy-based provider selection) covers cost/latency/priority strategies and #49 covers multi-key round-robin for rate limit distribution. Even with both of those, the gateway lacks several production-grade routing primitives:
Request hedging. Send the same request to two providers in parallel; use whichever responds first; cancel the other. Standard SRE pattern for tail-latency reduction. Today the gateway tries one provider at a time — if the first hangs near its timeout, the user waits the full timeout before fallback kicks in.
Request coalescing. Two callers with identical (messages, model, params) arriving simultaneously result in two upstream API calls. Coalescing collapses them to one, fanning out the response. For batched evaluation harnesses (PhD's H4 testing the same prompt across many scenarios) this can halve cost.
Proactive health checks. Today the circuit breaker only learns about provider health by attempting real requests. A provider that's been down for 30 minutes only gets retested when the next workflow happens to route to it. Periodic background probes detect recovery faster.
Admission control / backpressure. Under burst load (1000 workflows started in the same second), all 1000 hit the gateway, all 1000 enqueue at the rate limiter, all 1000 wait indefinitely. There is no "reject if queue depth > N" — the system has no graceful degradation under overload.
Proposal
Four independent gateway features that can ship in any order. Each is small enough for a single PR.
1. Request hedging:
```yaml
- id: critical_step
  type: llm_call
  prompt: "..."
  hedge:
    enabled: true
    delay_ms: 500     # wait this long before firing the second request
    max_parallel: 2   # max providers in the race
```
Implementation: in gateway.complete(), when hedge.enabled, fire the primary request immediately and start a timer. After delay_ms, fire the second request to the next candidate. Race them via anyio.move_on_after + task group. First to return wins; the other is cancelled. Cancellation goes through the same GeneratorExit path as #106 — it must NOT count as a circuit breaker failure.
Default off — opt-in per step. Doubles cost whenever the primary outlasts the `delay_ms` timer and the hedge actually fires (rare but real).
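A minimal sketch of the race, assuming a hypothetical `hedged_complete` helper standing in for the hedging branch of `gateway.complete()`; the provider objects and their `complete()` coroutine are illustrative, not the real gateway API. The issue mentions `anyio.move_on_after`, but a plain task-group cancel scope is enough for the sketch:

```python
import anyio


async def hedged_complete(primary, secondary, request, delay_ms: int):
    """Race the primary request against a delayed hedge; first success wins."""
    result, errors = [], []

    async def attempt(provider, delay_s, tg):
        if delay_s:
            await anyio.sleep(delay_s)  # hedge timer before the second request
        try:
            response = await provider.complete(request)
        except Exception as exc:  # real code would also update circuit state here
            errors.append(exc)
            if len(errors) == 2:       # both attempts failed: stop waiting
                tg.cancel_scope.cancel()
            return
        result.append(response)
        tg.cancel_scope.cancel()       # winner cancels the loser; that
                                       # cancellation is NOT a circuit failure

    async with anyio.create_task_group() as tg:
        tg.start_soon(attempt, primary, 0.0, tg)
        tg.start_soon(attempt, secondary, delay_ms / 1000, tg)

    if result:
        return result[0]
    raise errors[0]
```

If the primary returns before the timer fires, the hedge task is cancelled while still sleeping, so only one upstream request is ever issued.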
2. Request coalescing:
```yaml
config:
  coalescing:
    enabled: true
    window_ms: 100   # requests within this window with an identical key are coalesced
```
Implementation: in gateway.complete(), hash (messages, model, temperature, max_tokens, kwargs). If an in-flight request with the same hash exists and started within window_ms, await its result instead of issuing a new request. Both callers receive the same ProviderResponse. Cost is attributed evenly across coalesced callers (or, simpler: to the first caller; document the choice).
Default off — opt-in via config. Useful only when AgentLoom processes are long-lived (servers, batch runs).
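A sketch of the coalescing path; the `coalesced_complete` wrapper, the `upstream` callable, the in-flight registry, and the key derivation are all assumptions for illustration:

```python
import hashlib
import json
import time

import anyio


class _Inflight:
    """Bookkeeping for one upstream call that later arrivals can join."""

    def __init__(self):
        self.started = time.monotonic()
        self.done = anyio.Event()
        self.response = None
        self.error = None


_inflight: dict[str, _Inflight] = {}


def _coalesce_key(messages, model, params: dict) -> str:
    payload = json.dumps([messages, model, params], sort_keys=True, default=str)
    return hashlib.sha256(payload.encode()).hexdigest()


async def coalesced_complete(upstream, window_ms: int, messages, model, **params):
    key = _coalesce_key(messages, model, params)
    entry = _inflight.get(key)
    if entry and (time.monotonic() - entry.started) * 1000 < window_ms:
        await entry.done.wait()        # piggyback on the in-flight request
        if entry.error is not None:
            raise entry.error          # failure propagates to every waiter
        return entry.response          # same ProviderResponse for all callers

    entry = _inflight[key] = _Inflight()
    try:
        entry.response = await upstream(messages, model, **params)
        return entry.response
    except Exception as exc:
        entry.error = exc
        raise
    finally:
        entry.done.set()
        _inflight.pop(key, None)
```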
3. Proactive health checks:
```python
class ProviderEntry:
    health_check_interval_s: float = 60.0
    health_check_endpoint: str = "/health"  # or a cheap "ping" model call
```
A background task per provider periodically probes — sends a minimal request and records latency. On failure, increments a "passive failure" counter that, if it crosses a threshold, opens the circuit even without real workflow traffic. On success in OPEN state, transitions to HALF_OPEN faster than the time-based recovery would allow.
Stops probing if the workflow is paused (no active workflows for N minutes) — don't burn quota when idle.
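One possible shape for the probe loop in the new `health_checker.py`; `provider.ping()`, the breaker methods, and the `is_idle` hook are stand-ins for whatever the real module exposes:

```python
import logging
import time

import anyio

log = logging.getLogger(__name__)


async def health_check_loop(provider, breaker, *, interval_s: float = 60.0,
                            failure_threshold: int = 3, is_idle=lambda: False):
    """Background probe loop for one provider."""
    passive_failures = 0
    while True:
        await anyio.sleep(interval_s)
        if is_idle():                  # no active workflows: skip the probe,
            continue                   # don't burn quota while idle
        start = time.monotonic()
        try:
            await provider.ping()      # minimal request, e.g. a 1-token call
        except Exception:
            passive_failures += 1
            if passive_failures >= failure_threshold:
                breaker.open()         # open without real workflow traffic
            continue
        log.debug("probe ok in %.3fs", time.monotonic() - start)
        passive_failures = 0
        if breaker.is_open():
            breaker.half_open()        # recover faster than the time-based path
```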
4. Admission control:
```yaml
config:
  admission:
    max_queue_depth: 100   # per provider
    on_overflow: reject    # or "shed_oldest" / "block"
```
Implementation: the rate limiter tracks pending acquires. If the count exceeds `max_queue_depth`, the next `acquire()` does one of:
- `reject`: raises `AdmissionRejectedError` immediately. The caller decides what to do (typically: fail fast).
- `shed_oldest`: cancels the oldest waiting request (it gets `AdmissionRejectedError`) and admits the new one.
- `block`: existing behavior (wait indefinitely).
Default is `block` for backward compatibility. Production deployments should configure `reject` to fail fast under overload instead of building up a multi-minute backlog.
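A sketch of admission control inside `acquire()`, assuming an anyio-based rate limiter; the waiter bookkeeping and the `_wait_for_token()` placeholder are illustrative, not the real `rate_limiter.py` internals:

```python
from collections import deque

import anyio


class AdmissionRejectedError(Exception):
    """Raised when the queue is full, or when this waiter was shed."""


class RateLimiter:
    def __init__(self, *, max_queue_depth: int = 100, on_overflow: str = "block"):
        self.max_queue_depth = max_queue_depth
        self.on_overflow = on_overflow
        self._waiters: deque[anyio.CancelScope] = deque()

    async def _wait_for_token(self):
        ...  # existing token-bucket wait, unchanged

    async def acquire(self):
        if len(self._waiters) >= self.max_queue_depth:
            if self.on_overflow == "reject":
                raise AdmissionRejectedError("queue depth exceeded")
            if self.on_overflow == "shed_oldest":
                self._waiters.popleft().cancel()  # oldest waiter is evicted
            # "block": fall through and wait indefinitely (current behavior)
        scope = anyio.CancelScope()
        self._waiters.append(scope)
        try:
            with scope:
                await self._wait_for_token()
        finally:
            try:
                self._waiters.remove(scope)
            except ValueError:
                pass  # already removed by shed_oldest
        if scope.cancelled_caught:  # we were shed by a newer request
            raise AdmissionRejectedError("shed while waiting")
```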
Scope
- `src/agentloom/providers/gateway.py` — hedging logic, coalescing logic, admission control wiring.
- `src/agentloom/resilience/rate_limiter.py` — admission control inside `acquire()`.
- `src/agentloom/resilience/health_checker.py` — new module with the periodic probe loop.
- `src/agentloom/core/models.py` — `HedgeConfig`, `CoalescingConfig`, `AdmissionConfig` on `WorkflowConfig` / `StepDefinition`.
- `src/agentloom/exceptions.py` — `AdmissionRejectedError`.
- `src/agentloom/observability/metrics.py` — counters: `agentloom_hedge_wins_total{primary_provider, winner_provider}`, `agentloom_coalesced_requests_total`, `agentloom_admission_rejections_total{provider, reason}`, `agentloom_health_checks_total{provider, status}`.
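A possible shape for the new config models; the field names and defaults follow the YAML snippets above and are assumptions, not the final schema:

```python
from dataclasses import dataclass


@dataclass
class HedgeConfig:
    enabled: bool = False      # default off: opt-in per step
    delay_ms: int = 500        # wait before firing the second request
    max_parallel: int = 2      # max providers in the race


@dataclass
class CoalescingConfig:
    enabled: bool = False      # default off: opt-in via config
    window_ms: int = 100       # identical requests inside this window coalesce


@dataclass
class AdmissionConfig:
    max_queue_depth: int = 100   # per provider
    on_overflow: str = "block"   # "reject" | "shed_oldest" | "block"
```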
All four are opt-in. Defaults preserve current behavior.
Hedging and coalescing pull in opposite directions: hedging spends more to cut latency; coalescing spends less when load patterns allow. They can coexist (different scopes: per-step vs. gateway-wide).
Health checks burn API quota. Document the cost trade-off — useful for production, may be undesirable for short-lived CLI runs.
Coalescing semantics need care: how to attribute cost? Suggest: to the first caller, log others as "coalesced=true". Iterate on this.
Regression tests
For each feature:
- `test_hedge_returns_first_response`
- `test_hedge_cancels_loser_without_circuit_failure`
- `test_hedge_only_one_request_when_first_succeeds_before_delay`
- `test_coalesce_combines_identical_requests_within_window`
- `test_coalesce_does_not_combine_requests_with_different_kwargs`
- `test_coalesce_propagates_failure_to_all_waiters`
- `test_health_check_runs_periodically_when_workflows_active`
- `test_health_check_stops_when_idle`
- `test_health_check_failure_opens_circuit`
- `test_admission_reject_raises_immediately_when_queue_full`
- `test_admission_shed_oldest_cancels_oldest_waiter`
- `test_admission_block_existing_behavior_unchanged`
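As one example of the shape these could take, a sketch of the hedge-cancellation case against the `hedged_complete` sketch above; `FakeProvider` and the fixture setup are assumptions about the test suite, not existing fixtures:

```python
import anyio
import pytest


class FakeProvider:
    def __init__(self, name: str, delay_s: float):
        self.name, self.delay_s = name, delay_s
        self.cancelled = False

    async def complete(self, request):
        try:
            await anyio.sleep(self.delay_s)
            return f"{self.name}:ok"
        except anyio.get_cancelled_exc_class():
            self.cancelled = True  # record that we lost the race
            raise


@pytest.mark.anyio
async def test_hedge_cancels_loser_without_circuit_failure():
    slow, fast = FakeProvider("slow", 5.0), FakeProvider("fast", 0.01)
    response = await hedged_complete(slow, fast, request={}, delay_ms=100)
    assert response == "fast:ok"
    assert slow.cancelled
    # The real test would also assert the circuit breaker recorded no failure.
```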
Notes

Hedging cancellation must go through the GeneratorExit-aware path that #106 (gateway resilience: CB/RL ordering, stream cancellation, retry jitter, rate-limiter edge cases) fixes, not the current path that erodes circuit health.