add soft budget alerts and semantic-aware retry classification

### Description

Two related improvements to budget and retry handling that #12 (pre-flight budget estimation) does not address:

**1. Soft budget alerts.**

Today, `BudgetExceededError` fires only at 100% — an abrupt hard stop with no advance warning. Production workflows benefit from progressive signals:

- 50%: nice to know, log only.
- 80%: notify monitoring system, prepare for degraded operation.
- 100%: hard stop (current behavior).

Without soft alerts, a workflow that's burning through budget faster than expected has no chance to react — switch to a cheaper provider, abort early, page the operator.

**2. Semantic-aware retry.**

`core/engine.py:535-549` retries on **any** exception. Better behavior depends on the exception class:

| Exception | Should retry? | Why |
|---|---|---|
| `RateLimitError` (429) | Yes, with `Retry-After` | Provider explicitly says retry |
| `ProviderError(status_code=5xx)` | Yes, exponential backoff | Transient server error |
| `ProviderError(status_code=4xx)` (non-429) | **No** | Client error; retry won't help |
| `httpx.ConnectError` | Yes | Network blip |
| `TimeoutError` | Yes | Could be transient |
| `BudgetExceededError` | **No** | Retrying just spends more money |
| `CircuitOpenError` | **No** (by current provider) | But fallback may try another |
| Content policy violation (varies by provider) | **No** | Same prompt → same violation |
| `ValidationError` (per #117) | **Yes** with feedback | Prompt may be retryable with error context |

Today, content policy violations and 4xx errors silently retry up to `max_retries` times — wasting cost and time, sometimes hitting the model's same refusal repeatedly. Same for budget errors.

### Proposal

**1. Soft budget alerts:**

Extend `WorkflowConfig`:

```yaml
config:
  budget_usd: 10.00
  budget_alerts:
    - threshold: 0.5
      action: log
      level: info
    - threshold: 0.8
      action: webhook
      url: "{env.BUDGET_ALERT_WEBHOOK}"
    - threshold: 0.95
      action: hook
      callback: examples.callbacks.switch_to_cheap_provider
```

Actions:

- `log` — emit a structured log entry (warning).
- `webhook` — POST to a URL with `{workflow_name, run_id, threshold, spent_usd, budget_usd, remaining_usd}`.
- `hook` — invoke a Python callback (registered like tools) with the same payload — can mutate state, switch provider, abort.
- `metric` — emit a counter (always done regardless of action).

Threshold check fires after each step's cost commits. Each threshold fires at most once per workflow run.

**2. Semantic-aware retry:**

Add classification layer on top of the existing retry policy:

```yaml
config:
  retry:
    max_retries: 3
    backoff_base: 2.0
    backoff_max: 60.0
    classification:
      retryable: [RateLimitError, ProviderError5xx, ConnectError, TimeoutError]
      non_retryable: [BudgetExceededError, ProviderError4xxNon429, ContentPolicyError]
      retry_with_feedback: [ValidationError, JSONParseError]
```

Engine retry loop classifies each exception:

- **Retryable**: backoff + retry as today.
- **Non-retryable**: fail fast, no retries even if `max_retries > 0`.
- **Retry with feedback**: append the error message to the prompt as system feedback, retry. (For schema validation failures, this is "your previous response failed validation: <error>; please retry conforming to the schema.")

For `RateLimitError` specifically, honor `Retry-After` header (per #109 semantics):

```python
if isinstance(e, RateLimitError) and e.retry_after_s:
    backoff = e.retry_after_s
    # but cap at backoff_max to avoid hour-long waits
    backoff = min(backoff, policy.backoff_max)
```

**3. Default classification (sane defaults, no config required):**

Built into the framework; users override only if they need custom behavior:

```python
DEFAULT_CLASSIFICATION = RetryClassification(
    retryable={RateLimitError, ConnectError, TimeoutError},
    retryable_status_codes={500, 502, 503, 504},
    non_retryable={BudgetExceededError, CircuitOpenError, ContentPolicyError},
    non_retryable_status_codes={400, 401, 403, 404, 422},  # 429 is in retryable
    retry_with_feedback={ValidationError, JSONParseError},
)
```

**4. Observability:**

- New counter: `agentloom_budget_alerts_total{workflow, threshold, action}`.
- New counter: `agentloom_retry_classified_total{exception_class, classification}`.
- New webhook event type for budget alerts (extends existing webhook infrastructure).

### Scope

- `src/agentloom/core/models.py` — `BudgetAlert`, `RetryClassification` configs.
- `src/agentloom/core/engine.py` — alert dispatch after each commit; retry classification in the retry loop (or in `retry_with_policy` after #106 wires it up).
- `src/agentloom/resilience/retry.py` — extend `RetryPolicy` with classification; default classifications.
- `src/agentloom/exceptions.py` — `ContentPolicyError`, `JSONParseError`, `BudgetExceededError` classification metadata.
- `src/agentloom/webhooks/sender.py` — budget alert payload type.
- `src/agentloom/observability/metrics.py` — alert + retry classification counters.

### Regression tests

For alerts:

- `test_budget_alert_log_fires_at_threshold`
- `test_budget_alert_webhook_posts_payload`
- `test_budget_alert_hook_invokes_callback`
- `test_budget_alert_each_threshold_fires_once`
- `test_budget_alert_metric_recorded`

For retry classification:

- `test_rate_limit_error_honored_retry_after_header`
- `test_4xx_non_429_not_retried`
- `test_budget_exceeded_not_retried`
- `test_5xx_retried_with_backoff`
- `test_validation_error_retried_with_feedback_in_prompt`
- `test_content_policy_violation_not_retried`
- `test_custom_classification_overrides_defaults`

### Notes

- Pairs with #12 (pre-flight estimation): forecasted cost can drive earlier alerts ("at this rate you'll exceed budget in N steps").
- Pairs with #106 (retry jitter): once `retry_with_policy` is wired in (per #106), classification fits naturally inside it.
- Pairs with #109 (provider 429 normalization): `RateLimitError` with `retry_after_s` is the bridge.
- Pairs with #117 (structured output): `ValidationError` for failed schema parsing is the canonical "retry with feedback" case.
- Default classification must be conservative — current users implicitly retry everything; switching some classes to non-retryable is technically a behavior change but the change is "stop wasting time on errors that won't recover," which is improvement, not regression.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add soft budget alerts and semantic-aware retry classification #132

Description

Proposal

Scope

Regression tests

Notes

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Exception	Should retry?	Why
`RateLimitError` (429)	Yes, with `Retry-After`	Provider explicitly says retry
`ProviderError(status_code=5xx)`	Yes, exponential backoff	Transient server error
`ProviderError(status_code=4xx)` (non-429)	No	Client error; retry won't help
`httpx.ConnectError`	Yes	Network blip
`TimeoutError`	Yes	Could be transient
`BudgetExceededError`	No	Retrying just spends more money
`CircuitOpenError`	No (by current provider)	But fallback may try another
Content policy violation (varies by provider)	No	Same prompt → same violation
`ValidationError` (per #117)	Yes with feedback	Prompt may be retryable with error context

add soft budget alerts and semantic-aware retry classification #132

Description

Description

Proposal

Scope

Regression tests

Notes

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions