ProviderGateway (src/agentloom/providers/gateway.py) routes by static priority + fallback chain. #50 (strategy-based provider selection) covers cost/latency/priority strategies and #49 covers multi-key round-robin for rate limit distribution. Even with both of those, the gateway lacks several production-grade routing primitives:
Request hedging. Send the same request to two providers in parallel; use whichever responds first; cancel the other. Standard SRE pattern for tail-latency reduction. Today the gateway tries one provider at a time — if the first hangs near its timeout, the user waits the full timeout before fallback kicks in.
Request coalescing. Two callers with identical (messages, model, params) arriving simultaneously result in two upstream API calls. Coalescing collapses them to one, fanning out the response. For batched evaluation harnesses (PhD's H4 testing the same prompt across many scenarios) this can halve cost.
Proactive health checks. Today the circuit breaker only learns about provider health by attempting real requests. A provider that's been down for 30 minutes only gets retested when the next workflow happens to route to it. Periodic background probes detect recovery faster.
Admission control / backpressure. Under burst load (1000 workflows started in the same second), all 1000 hit the gateway, all 1000 enqueue at the rate limiter, all 1000 wait indefinitely. There is no "reject if queue depth > N" — the system has no graceful degradation under overload.
Proposal
Four independent gateway features that can ship in any order. Each is small enough for a single PR.
1. Request hedging:
```yaml
- id: critical_step
  type: llm_call
  prompt: "..."
  hedge:
    enabled: true
    delay_ms: 500     # wait this long before firing the second request
    max_parallel: 2   # max providers in the race
```
Implementation: in gateway.complete(), when hedge.enabled, fire the primary request immediately and start a timer. After delay_ms, fire the second request to the next candidate. Race them via anyio.move_on_after + task group. First to return wins; the other is cancelled. Cancellation goes through the same GeneratorExit path as #106 — it must NOT count as a circuit breaker failure.
Default off — opt-in per step. Doubles cost whenever the primary outlasts the `delay_ms` timer and the hedge actually fires (rare but real).
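A minimal sketch of the race, assuming a hypothetical `hedged_complete` helper standing in for the hedging branch of `gateway.complete()`; the provider objects and their `complete()` coroutine are illustrative, not the real gateway API. The issue mentions `anyio.move_on_after`, but a plain task-group cancel scope is enough for the sketch:

```python
import anyio


async def hedged_complete(primary, secondary, request, delay_ms: int):
    """Race the primary request against a delayed hedge; first success wins."""
    result, errors = [], []

    async def attempt(provider, delay_s, tg):
        if delay_s:
            await anyio.sleep(delay_s)  # hedge timer before the second request
        try:
            response = await provider.complete(request)
        except Exception as exc:  # real code would also update circuit state here
            errors.append(exc)
            if len(errors) == 2:       # both attempts failed: stop waiting
                tg.cancel_scope.cancel()
            return
        result.append(response)
        tg.cancel_scope.cancel()       # winner cancels the loser; that
                                       # cancellation is NOT a circuit failure

    async with anyio.create_task_group() as tg:
        tg.start_soon(attempt, primary, 0.0, tg)
        tg.start_soon(attempt, secondary, delay_ms / 1000, tg)

    if result:
        return result[0]
    raise errors[0]
```

If the primary returns before the timer fires, the hedge task is cancelled while still sleeping, so only one upstream request is ever issued.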
2. Request coalescing:
```yaml
config:
  coalescing:
    enabled: true
    window_ms: 100   # requests within this window with an identical key are coalesced
```
Implementation: in gateway.complete(), hash (messages, model, temperature, max_tokens, kwargs). If an in-flight request with the same hash exists and started within window_ms, await its result instead of issuing a new request. Both callers receive the same ProviderResponse. Cost is attributed evenly across coalesced callers (or, simpler: to the first caller; document the choice).
Default off — opt-in via config. Useful only when AgentLoom processes are long-lived (servers, batch runs).
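A sketch of the coalescing path; the `coalesced_complete` wrapper, the `upstream` callable, the in-flight registry, and the key derivation are all assumptions for illustration:

```python
import hashlib
import json
import time

import anyio


class _Inflight:
    """Bookkeeping for one upstream call that later arrivals can join."""

    def __init__(self):
        self.started = time.monotonic()
        self.done = anyio.Event()
        self.response = None
        self.error = None


_inflight: dict[str, _Inflight] = {}


def _coalesce_key(messages, model, params: dict) -> str:
    payload = json.dumps([messages, model, params], sort_keys=True, default=str)
    return hashlib.sha256(payload.encode()).hexdigest()


async def coalesced_complete(upstream, window_ms: int, messages, model, **params):
    key = _coalesce_key(messages, model, params)
    entry = _inflight.get(key)
    if entry and (time.monotonic() - entry.started) * 1000 < window_ms:
        await entry.done.wait()        # piggyback on the in-flight request
        if entry.error is not None:
            raise entry.error          # failure propagates to every waiter
        return entry.response          # same ProviderResponse for all callers

    entry = _inflight[key] = _Inflight()
    try:
        entry.response = await upstream(messages, model, **params)
        return entry.response
    except Exception as exc:
        entry.error = exc
        raise
    finally:
        entry.done.set()
        _inflight.pop(key, None)
```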
3. Proactive health checks:
```python
class ProviderEntry:
    health_check_interval_s: float = 60.0
    health_check_endpoint: str = "/health"  # or a cheap "ping" model call
```
A background task per provider periodically probes — sends a minimal request and records latency. On failure, increments a "passive failure" counter that, if it crosses a threshold, opens the circuit even without real workflow traffic. On success in OPEN state, transitions to HALF_OPEN faster than the time-based recovery would allow.
Stops probing if the workflow is paused (no active workflows for N minutes) — don't burn quota when idle.
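One possible shape for the probe loop in the new `health_checker.py`; `provider.ping()`, the breaker methods, and the `is_idle` hook are stand-ins for whatever the real module exposes:

```python
import logging
import time

import anyio

log = logging.getLogger(__name__)


async def health_check_loop(provider, breaker, *, interval_s: float = 60.0,
                            failure_threshold: int = 3, is_idle=lambda: False):
    """Background probe loop for one provider."""
    passive_failures = 0
    while True:
        await anyio.sleep(interval_s)
        if is_idle():                  # no active workflows: skip the probe,
            continue                   # don't burn quota while idle
        start = time.monotonic()
        try:
            await provider.ping()      # minimal request, e.g. a 1-token call
        except Exception:
            passive_failures += 1
            if passive_failures >= failure_threshold:
                breaker.open()         # open without real workflow traffic
            continue
        log.debug("probe ok in %.3fs", time.monotonic() - start)
        passive_failures = 0
        if breaker.is_open():
            breaker.half_open()        # recover faster than the time-based path
```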
4. Admission control:
```yaml
config:
  admission:
    max_queue_depth: 100   # per provider
    on_overflow: reject    # or "shed_oldest" / "block"
```
Implementation: the rate limiter tracks pending acquires. If the count exceeds `max_queue_depth`, the next `acquire()` does one of:
- `reject`: raises `AdmissionRejectedError` immediately. The caller decides what to do (typically: fail fast).
- `shed_oldest`: cancels the oldest waiting request (it gets `AdmissionRejectedError`) and admits the new one.
- `block`: existing behavior (wait indefinitely).
Default is `block` for backward compatibility. Production deployments should configure `reject` to fail fast under overload instead of building up a multi-minute backlog.
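A sketch of admission control inside `acquire()`, assuming an anyio-based rate limiter; the waiter bookkeeping and the `_wait_for_token()` placeholder are illustrative, not the real `rate_limiter.py` internals:

```python
from collections import deque

import anyio


class AdmissionRejectedError(Exception):
    """Raised when the queue is full, or when this waiter was shed."""


class RateLimiter:
    def __init__(self, *, max_queue_depth: int = 100, on_overflow: str = "block"):
        self.max_queue_depth = max_queue_depth
        self.on_overflow = on_overflow
        self._waiters: deque[anyio.CancelScope] = deque()

    async def _wait_for_token(self):
        ...  # existing token-bucket wait, unchanged

    async def acquire(self):
        if len(self._waiters) >= self.max_queue_depth:
            if self.on_overflow == "reject":
                raise AdmissionRejectedError("queue depth exceeded")
            if self.on_overflow == "shed_oldest":
                self._waiters.popleft().cancel()  # oldest waiter is evicted
            # "block": fall through and wait indefinitely (current behavior)
        scope = anyio.CancelScope()
        self._waiters.append(scope)
        try:
            with scope:
                await self._wait_for_token()
        finally:
            try:
                self._waiters.remove(scope)
            except ValueError:
                pass  # already removed by shed_oldest
        if scope.cancelled_caught:  # we were shed by a newer request
            raise AdmissionRejectedError("shed while waiting")
```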
Scope
- `src/agentloom/providers/gateway.py` — hedging logic, coalescing logic, admission control wiring.
- `src/agentloom/resilience/rate_limiter.py` — admission control inside `acquire()`.
- `src/agentloom/resilience/health_checker.py` — new module with the periodic probe loop.
- `src/agentloom/core/models.py` — `HedgeConfig`, `CoalescingConfig`, `AdmissionConfig` on `WorkflowConfig` / `StepDefinition`.
- `src/agentloom/exceptions.py` — `AdmissionRejectedError`.
- `src/agentloom/observability/metrics.py` — counters: `agentloom_hedge_wins_total{primary_provider, winner_provider}`, `agentloom_coalesced_requests_total`, `agentloom_admission_rejections_total{provider, reason}`, `agentloom_health_checks_total{provider, status}`.
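A possible shape for the new config models; the field names and defaults follow the YAML snippets above and are assumptions, not the final schema:

```python
from dataclasses import dataclass


@dataclass
class HedgeConfig:
    enabled: bool = False      # default off: opt-in per step
    delay_ms: int = 500        # wait before firing the second request
    max_parallel: int = 2      # max providers in the race


@dataclass
class CoalescingConfig:
    enabled: bool = False      # default off: opt-in via config
    window_ms: int = 100       # identical requests inside this window coalesce


@dataclass
class AdmissionConfig:
    max_queue_depth: int = 100   # per provider
    on_overflow: str = "block"   # "reject" | "shed_oldest" | "block"
```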
All four are opt-in. Defaults preserve current behavior.
Hedging and coalescing pull in opposite directions: hedging spends more to cut latency; coalescing spends less when load patterns allow. They can coexist (different scopes: per-step vs. gateway-wide).
Health checks burn API quota. Document the cost trade-off — useful for production, may be undesirable for short-lived CLI runs.
Coalescing semantics need care: how to attribute cost? Suggest: to the first caller, log others as "coalesced=true". Iterate on this.
Regression tests
For each feature:
- `test_hedge_returns_first_response`
- `test_hedge_cancels_loser_without_circuit_failure`
- `test_hedge_only_one_request_when_first_succeeds_before_delay`
- `test_coalesce_combines_identical_requests_within_window`
- `test_coalesce_does_not_combine_requests_with_different_kwargs`
- `test_coalesce_propagates_failure_to_all_waiters`
- `test_health_check_runs_periodically_when_workflows_active`
- `test_health_check_stops_when_idle`
- `test_health_check_failure_opens_circuit`
- `test_admission_reject_raises_immediately_when_queue_full`
- `test_admission_shed_oldest_cancels_oldest_waiter`
- `test_admission_block_existing_behavior_unchanged`
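As one example of the shape these could take, a sketch of the hedge-cancellation case against the `hedged_complete` sketch above; `FakeProvider` and the fixture setup are assumptions about the test suite, not existing fixtures:

```python
import anyio
import pytest


class FakeProvider:
    def __init__(self, name: str, delay_s: float):
        self.name, self.delay_s = name, delay_s
        self.cancelled = False

    async def complete(self, request):
        try:
            await anyio.sleep(self.delay_s)
            return f"{self.name}:ok"
        except anyio.get_cancelled_exc_class():
            self.cancelled = True  # record that we lost the race
            raise


@pytest.mark.anyio
async def test_hedge_cancels_loser_without_circuit_failure():
    slow, fast = FakeProvider("slow", 5.0), FakeProvider("fast", 0.01)
    response = await hedged_complete(slow, fast, request={}, delay_ms=100)
    assert response == "fast:ok"
    assert slow.cancelled
    # The real test would also assert the circuit breaker recorded no failure.
```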
Notes

Hedging cancellation must go through the GeneratorExit-aware path that #106 (gateway resilience: CB/RL ordering, stream cancellation, retry jitter, rate-limiter edge cases) fixes, not the current path that erodes circuit health.