add MockProvider fault modes for deterministic failure replay #123

### Description

`MockProvider` (`src/agentloom/providers/mock.py`) replays pre-recorded successful responses. There is no way to make it return a failure (timeout, 5xx, rate-limit response, malformed JSON, partial stream). #62 (chaos/fault injection testing mode) addresses fault injection at the gateway level — random/probabilistic failure injection for stress tests. That is complementary but does not solve the same problem.

The PhD's Simulator (per `agenttest-planteamiento.md`) includes a **fault injection mode** that simulates an agent failing — timeout, invalid response, crash — to measure cascading failures and recovery. For this to work in a reproducible test harness, faults must be **deterministic and replayable**, not probabilistic. A failed scenario must produce the same failure on every run, just like a successful scenario produces the same response on every run.

Today the only way to test failure handling deterministically is to mock at the httpx layer in each test. That approach sits outside AgentLoom, bypasses the gateway, and is brittle.

### Proposal

Extend `MockProvider`'s response file format to support fault declarations, and add gateway-level integration so faults flow through the same code paths as real failures (circuit breaker, retry, fallback chain).

**1. Extended response file format:**

```json
{
  "step_classify": {
    "content": "question",
    "model": "gpt-4o-mini",
    "usage": {"prompt_tokens": 10, "completion_tokens": 1, "total_tokens": 11},
    "cost_usd": 0.0001
  },
  "step_answer": {
    "fault": {
      "type": "timeout",
      "after_ms": 5000
    }
  },
  "step_summarize": {
    "fault": {
      "type": "http_error",
      "status_code": 429,
      "headers": {"Retry-After": "30"},
      "body": "Rate limit exceeded"
    }
  },
  "step_explain": {
    "fault": {
      "type": "http_error",
      "status_code": 500,
      "body": "Internal server error"
    }
  },
  "step_translate": {
    "fault": {
      "type": "malformed_response",
      "raw": "not valid json"
    }
  },
  "step_stream_long": {
    "fault": {
      "type": "stream_truncate",
      "after_chunks": 3,
      "raise": "ConnectionError"
    }
  }
}
```
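As a rough illustration of how a loader might tell fault entries from success entries in this format, here is a minimal sketch. `FaultSpec`, `parse_entry`, and `KNOWN_FAULT_TYPES` are hypothetical names invented for this example; they are not part of the current `MockProvider`.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any

# Fault types listed in this proposal.
KNOWN_FAULT_TYPES = {
    "timeout",
    "http_error",
    "malformed_response",
    "stream_truncate",
    "connection_reset",
    "partial_response",
}


@dataclass(frozen=True)
class FaultSpec:
    """Hypothetical in-memory form of one "fault" object from the response file."""

    type: str
    params: dict[str, Any] = field(default_factory=dict)


def parse_entry(step_id: str, entry: dict[str, Any]) -> FaultSpec | dict[str, Any]:
    """Return a FaultSpec for fault entries, or the entry unchanged for successes."""
    fault = entry.get("fault")
    if fault is None:
        # No "fault" key: treat the entry as a recorded success response.
        return entry
    fault_type = fault.get("type")
    if fault_type not in KNOWN_FAULT_TYPES:
        raise ValueError(f"{step_id}: unknown fault type {fault_type!r}")
    # Everything except "type" is a type-specific parameter (after_ms, status_code, ...).
    params = {key: value for key, value in fault.items() if key != "type"}
    return FaultSpec(type=fault_type, params=params)
```

Rejecting unknown fault types at load time, rather than when the step runs, would surface typos in a response file before any workflow executes; whether the real loader should be that strict is an open design choice.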

**2. Fault types:**

| Type | Effect |
|---|---|
| `timeout` | `await anyio.sleep(after_ms / 1000)` (`anyio.sleep` takes seconds), then raise `TimeoutError` |
| `http_error` | Raise `ProviderError(status_code=N)` mimicking provider HTTP error; `RateLimitError` when 429 |
| `malformed_response` | Return invalid bytes that fail to parse — exercises adapter error paths |
| `stream_truncate` | Yield N chunks then raise specified exception mid-stream — exercises gateway stream cancellation logic (#106) |
| `connection_reset` | Immediate `httpx.ConnectError` |
| `partial_response` | Returns content but with `usage.completion_tokens=0` and `finish_reason="length"` — exercises usage parsing |
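The exception-raising rows of the table could be dispatched roughly as follows. This is a sketch under stated assumptions: `ProviderError` and `RateLimitError` here are local stand-ins (AgentLoom's real classes and their signatures are not shown in this issue), `raise_fault` is a hypothetical helper, `asyncio` replaces `anyio` to keep the example stdlib-only, and the builtin `ConnectionResetError` stands in for `httpx.ConnectError`.

```python
import asyncio


class ProviderError(Exception):
    """Stand-in for AgentLoom's ProviderError; the real signature may differ."""

    def __init__(self, message: str, status_code: int) -> None:
        super().__init__(message)
        self.status_code = status_code


class RateLimitError(ProviderError):
    """Stand-in; assumed here to subclass ProviderError."""


async def raise_fault(fault: dict) -> None:
    """Raise the exception described by a timeout, http_error, or connection_reset fault."""
    kind = fault["type"]
    if kind == "timeout":
        # after_ms is in milliseconds, while sleep takes seconds.
        await asyncio.sleep(fault["after_ms"] / 1000)
        raise TimeoutError(f"mock timeout after {fault['after_ms']} ms")
    if kind == "http_error":
        status = fault["status_code"]
        body = fault.get("body", "")
        # 429 maps to the more specific rate-limit error.
        error_cls = RateLimitError if status == 429 else ProviderError
        raise error_cls(body, status_code=status)
    if kind == "connection_reset":
        # The proposal raises httpx.ConnectError; a builtin is used here instead.
        raise ConnectionResetError("mock connection reset")
    raise ValueError(f"fault type {kind!r} is not handled in this sketch")
```

The remaining types (`malformed_response`, `stream_truncate`, `partial_response`) return or yield data rather than raising immediately, so they would need separate handling on the response and streaming paths.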

**3. Determinism:**

Each step is keyed by `step_id`, exactly as in success replay, so the same fault fires every time the workflow runs that step. There is no probability involved; probabilistic injection is #62's domain. This issue covers cases like "scenario X always times out at iteration 2 of the agent loop."

**4. Composition with success replay:**

A workflow can mix faults and successes in the same run, because each entry in the response file is keyed by `step_id`:

```json
{
  "first_attempt": {"fault": {"type": "http_error", "status_code": 500}},
  "second_attempt": {"content": "success after retry"}
}
```

Combined with the workflow's retry policy, this exercises the full retry-after-failure path deterministically.
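To make the retry path concrete, here is a minimal sketch that replays the two entries above. It assumes each attempt is looked up under its own key, as in the example; `replay`, `run_with_retry`, and the local `ProviderError` are hypothetical and do not reflect how AgentLoom's retry policy is actually implemented.

```python
from __future__ import annotations


class ProviderError(Exception):
    """Stand-in for AgentLoom's ProviderError; the real signature may differ."""

    def __init__(self, message: str, status_code: int) -> None:
        super().__init__(message)
        self.status_code = status_code


RESPONSES = {
    "first_attempt": {"fault": {"type": "http_error", "status_code": 500}},
    "second_attempt": {"content": "success after retry"},
}


def replay(step_id: str) -> str:
    """Return the recorded content for a step, or raise its declared http_error fault."""
    entry = RESPONSES[step_id]
    fault = entry.get("fault")
    if fault is not None:
        status = fault["status_code"]
        raise ProviderError(f"HTTP {status}", status_code=status)
    return entry["content"]


def run_with_retry(attempt_ids: list[str]) -> tuple[str, list[int]]:
    """Try each attempt key in order; return the first success and the status codes seen."""
    failures: list[int] = []
    for step_id in attempt_ids:
        try:
            return replay(step_id), failures
        except ProviderError as exc:
            failures.append(exc.status_code)
    raise RuntimeError(f"all attempts failed: {failures}")


content, failures = run_with_retry(["first_attempt", "second_attempt"])
```

Because nothing in the lookup is random, every run sees one 500 followed by the recorded success, which is the property a regression test for the retry policy would rely on.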

**5. Stream fault support:**

For `stream_truncate`, MockProvider's `stream()` method (today missing — see #107) yields N chunks from the recorded `content` then raises. This validates that the gateway's `_wrapped_iter` correctly distinguishes consumer cancellation from provider failure (#106's fix).

**6. Observability:**

When a fault fires, MockProvider increments a counter:

```
agentloom_mock_fault_total{workflow, step_id, fault_type}
```

It also sets a span attribute, `mock.fault_type`, so that injected faults are visible in traces of test runs.

### Scope

- `src/agentloom/providers/mock.py` — extended response file format, fault dispatch.
- `src/agentloom/providers/mock.py::stream` — implement streaming with fault support (depends on / coordinates with #107).
- `src/agentloom/observability/metrics.py` — new mock fault counter.
- `tests/providers/test_mock.py` — comprehensive fault coverage.
- `examples/fault_replay.yaml` — example workflow + response file showing each fault type.
- `docs/` — fault scenarios chapter in record/replay docs.

### Regression tests

- `test_mock_fault_timeout_raises_timeout_error`
- `test_mock_fault_http_error_429_raises_rate_limit_error`
- `test_mock_fault_http_error_500_raises_provider_error`
- `test_mock_fault_malformed_response_fails_adapter_parsing`
- `test_mock_fault_stream_truncate_raises_mid_stream`
- `test_mock_fault_connection_reset_immediate`
- `test_mock_fault_partial_response_finish_reason_length`
- `test_mock_fault_in_workflow_triggers_retry_policy`
- `test_mock_fault_propagates_to_circuit_breaker`
- `test_mock_fault_metric_recorded`

### Notes

- Complements #62 — that issue covers probabilistic chaos for stress testing; this one covers deterministic faults for reproducible regression testing.
- Coordinates with #107 (record/replay streaming) — both touch MockProvider's streaming path.
- Coordinates with #106 (gateway resilience) — fault types are designed to exercise the specific failure paths that #106 fixes.
- Once #116 (tool calling) lands, add `tool_call_fault` — model returns a tool call that the workflow can mock as failing.
- This unblocks the Simulator's "Fault injection" mode by giving it a deterministic failure source to record/replay scenarios against.

