add soft budget alerts and semantic-aware retry classification #132

@cchinchilla-dev

Description

Two related improvements to budget and retry handling that #12 (pre-flight budget estimation) does not address:

1. Soft budget alerts.

Today, BudgetExceededError fires only at 100% — an abrupt hard stop with no advance warning. Production workflows benefit from progressive signals:

  • 50%: nice to know, log only.
  • 80%: notify monitoring system, prepare for degraded operation.
  • 100%: hard stop (current behavior).

Without soft alerts, a workflow that is burning through budget faster than expected has no chance to react, for example by switching to a cheaper provider, aborting early, or paging the operator.

2. Semantic-aware retry.

core/engine.py:535-549 retries on any exception. Better behavior depends on the exception class:

| Exception | Should retry? | Why |
|---|---|---|
| RateLimitError (429) | Yes, with Retry-After | Provider explicitly says retry |
| ProviderError(status_code=5xx) | Yes, exponential backoff | Transient server error |
| ProviderError(status_code=4xx) (non-429) | No | Client error; retry won't help |
| httpx.ConnectError | Yes | Network blip |
| TimeoutError | Yes | Could be transient |
| BudgetExceededError | No | Retrying just spends more money |
| CircuitOpenError | No (by current provider) | But fallback may try another |
| Content policy violation (varies by provider) | No | Same prompt → same violation |
| ValidationError (per #117) | Yes, with feedback | Prompt may be retryable with error context |

Today, content policy violations, non-429 4xx errors, and budget errors are all silently retried up to max_retries times. That wastes cost and time, and for refusals it often just hits the same refusal on every attempt.

Proposal

1. Soft budget alerts:

Extend WorkflowConfig:

config:
  budget_usd: 10.00
  budget_alerts:
    - threshold: 0.5
      action: log
      level: info
    - threshold: 0.8
      action: webhook
      url: "{env.BUDGET_ALERT_WEBHOOK}"
    - threshold: 0.95
      action: hook
      callback: examples.callbacks.switch_to_cheap_provider

Actions:

  • log: emit a structured log entry at the configured level (warning if none is given).
  • webhook: POST to a URL with {workflow_name, run_id, threshold, spent_usd, budget_usd, remaining_usd}.
  • hook: invoke a Python callback (registered like tools) with the same payload; the callback can mutate state, switch provider, or abort.
  • metric: emit a counter (always done regardless of action).

Threshold check fires after each step's cost commits. Each threshold fires at most once per workflow run.
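
The fire-once-per-threshold rule can be sketched roughly as follows. This is a minimal illustration, not the proposed implementation: `BudgetAlertTracker`, `on_cost_committed`, and the `dispatch` callable are hypothetical names, and the payload keys simply mirror the webhook fields listed above.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class BudgetAlert:
    threshold: float  # fraction of the budget, e.g. 0.8 for 80%
    action: str       # "log", "webhook", or "hook"


@dataclass
class BudgetAlertTracker:
    """Hypothetical helper: remembers which thresholds already fired in this run."""
    budget_usd: float
    alerts: list
    dispatch: Callable[[BudgetAlert, dict], None]
    workflow_name: str = "example-workflow"
    run_id: str = "run-0"
    _fired: set = field(default_factory=set)

    def on_cost_committed(self, spent_usd: float) -> None:
        """Call after each step's cost commits; fires newly crossed thresholds."""
        if self.budget_usd <= 0:
            return  # no budget configured, nothing to compare against
        ratio = spent_usd / self.budget_usd
        # Ascending order, so a large jump fires 0.5 before 0.8.
        for alert in sorted(self.alerts, key=lambda a: a.threshold):
            if ratio < alert.threshold or alert.threshold in self._fired:
                continue
            self._fired.add(alert.threshold)  # at most once per workflow run
            self.dispatch(alert, {
                "workflow_name": self.workflow_name,
                "run_id": self.run_id,
                "threshold": alert.threshold,
                "spent_usd": spent_usd,
                "budget_usd": self.budget_usd,
                "remaining_usd": max(self.budget_usd - spent_usd, 0.0),
            })
```

Keying the fired set on the threshold value assumes thresholds are unique within a workflow; if two alerts could share a threshold, the set would need to key on the alert itself.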

2. Semantic-aware retry:

Add a classification layer on top of the existing retry policy:

config:
  retry:
    max_retries: 3
    backoff_base: 2.0
    backoff_max: 60.0
    classification:
      retryable: [RateLimitError, ProviderError5xx, ConnectError, TimeoutError]
      non_retryable: [BudgetExceededError, ProviderError4xxNon429, ContentPolicyError]
      retry_with_feedback: [ValidationError, JSONParseError]

Engine retry loop classifies each exception:

  • Retryable: backoff + retry as today.
  • Non-retryable: fail fast, no retries even if max_retries > 0.
  • Retry with feedback: append the error message to the prompt as system feedback, then retry. (For schema validation failures, this is "your previous response failed validation: <error message>; please retry conforming to the schema.")
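
A rough sketch of how the three outcomes could drive the loop. The names `Classification` and `run_with_classified_retry` are hypothetical, the feedback is appended to a plain prompt string rather than injected as a system message, and skipping the backoff sleep on feedback retries is an assumption the proposal does not settle.

```python
import time
from enum import Enum
from typing import Callable


class Classification(Enum):
    RETRYABLE = "retryable"
    NON_RETRYABLE = "non_retryable"
    RETRY_WITH_FEEDBACK = "retry_with_feedback"


def run_with_classified_retry(
    call: Callable[[str], str],
    prompt: str,
    classify: Callable[[BaseException], Classification],
    max_retries: int = 3,
    backoff_base: float = 2.0,
    backoff_max: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    attempt = 0
    while True:
        try:
            return call(prompt)
        except Exception as exc:
            kind = classify(exc)
            # Fail fast on non-retryable errors, or when retries are exhausted.
            if kind is Classification.NON_RETRYABLE or attempt >= max_retries:
                raise
            if kind is Classification.RETRY_WITH_FEEDBACK:
                # Give the model the error so the next attempt can correct it.
                prompt = (
                    f"{prompt}\n\nYour previous response failed validation: "
                    f"{exc}; please retry conforming to the schema."
                )
            else:
                # Plain exponential backoff, capped at backoff_max.
                sleep(min(backoff_base ** attempt, backoff_max))
            attempt += 1
```

Passing `sleep` as a parameter keeps the loop testable without real delays.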

For RateLimitError specifically, honor the Retry-After header (per #109 semantics):

if isinstance(e, RateLimitError) and e.retry_after_s:
    backoff = e.retry_after_s
    # but cap at backoff_max to avoid hour-long waits
    backoff = min(backoff, policy.backoff_max)
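
For context, a Retry-After value can be either a number of seconds or an HTTP date, so something has to normalize it before the cap above is applied. The sketch below shows one way to do that and to fold it into the backoff calculation; `parse_retry_after` and `compute_backoff` are hypothetical helpers, and how #109 actually populates `retry_after_s` is not assumed here.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: str, now: Optional[datetime] = None) -> Optional[float]:
    """Return the delay in seconds, or None if the header is unparseable."""
    value = value.strip()
    if value.isdigit():
        return float(value)  # delta-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max((when - now).total_seconds(), 0.0)


def compute_backoff(
    attempt: int,
    backoff_base: float = 2.0,
    backoff_max: float = 60.0,
    retry_after_s: Optional[float] = None,
) -> float:
    """Prefer the provider's hint when present; always cap at backoff_max."""
    if retry_after_s is not None and retry_after_s > 0:
        return min(retry_after_s, backoff_max)
    return min(backoff_base ** attempt, backoff_max)
```

One design question this leaves open: when the provider asks for longer than backoff_max, retrying at the cap will likely hit the limit again, so failing fast may be preferable in that case.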

3. Default classification (sane defaults, no config required):

Built into the framework; users override only if they need custom behavior:

DEFAULT_CLASSIFICATION = RetryClassification(
    retryable={RateLimitError, ConnectError, TimeoutError},
    retryable_status_codes={500, 502, 503, 504},
    non_retryable={BudgetExceededError, CircuitOpenError, ContentPolicyError},
    non_retryable_status_codes={400, 401, 403, 404, 422},  # 429 is retried via RateLimitError
    retry_with_feedback={ValidationError, JSONParseError},
)
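
One possible shape for RetryClassification and its lookup is sketched below. The exception classes are local stand-ins (the real ones live in agentloom and httpx), `ConnectionError` substitutes for httpx.ConnectError, and treating unclassified exceptions as retryable is an assumption chosen to match today's retry-on-anything behavior.

```python
from dataclasses import dataclass


class ProviderError(Exception):
    """Stand-in for a provider error carrying an HTTP status code."""
    def __init__(self, message: str, status_code=None):
        super().__init__(message)
        self.status_code = status_code


class RateLimitError(ProviderError):
    """Stand-in for the 429 error."""


class BudgetExceededError(Exception):
    """Stand-in for the budget hard-stop error."""


class ValidationError(Exception):
    """Stand-in for a schema validation failure."""


@dataclass(frozen=True)
class RetryClassification:
    retryable: frozenset = frozenset()
    retryable_status_codes: frozenset = frozenset()
    non_retryable: frozenset = frozenset()
    non_retryable_status_codes: frozenset = frozenset()
    retry_with_feedback: frozenset = frozenset()

    def classify(self, exc: BaseException) -> str:
        # Exception classes are checked before status codes, so a
        # RateLimitError is retryable no matter what status it carries.
        if isinstance(exc, tuple(self.non_retryable)):
            return "non_retryable"
        if isinstance(exc, tuple(self.retry_with_feedback)):
            return "retry_with_feedback"
        if isinstance(exc, tuple(self.retryable)):
            return "retryable"
        status = getattr(exc, "status_code", None)
        if status in self.retryable_status_codes:
            return "retryable"
        if status in self.non_retryable_status_codes:
            return "non_retryable"
        return "retryable"  # assumption: unknown errors keep today's behavior


DEFAULTS = RetryClassification(
    retryable=frozenset({RateLimitError, ConnectionError, TimeoutError}),
    retryable_status_codes=frozenset({500, 502, 503, 504}),
    non_retryable=frozenset({BudgetExceededError}),
    non_retryable_status_codes=frozenset({400, 401, 403, 404, 422}),
    retry_with_feedback=frozenset({ValidationError}),
)
```

A user override is then just another RetryClassification instance passed in place of the defaults.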

4. Observability:

  • New counter: agentloom_budget_alerts_total{workflow, threshold, action}.
  • New counter: agentloom_retry_classified_total{exception_class, classification}.
  • New webhook event type for budget alerts (extends existing webhook infrastructure).

Scope

  • src/agentloom/core/models.py: BudgetAlert, RetryClassification configs.
  • src/agentloom/core/engine.py: alert dispatch after each commit; retry classification in the retry loop (or in retry_with_policy once #106, "fix gateway resilience: CB/RL ordering, stream cancellation, retry jitter, rate-limiter edge cases", wires it up).
  • src/agentloom/resilience/retry.py: extend RetryPolicy with classification; default classifications.
  • src/agentloom/exceptions.py: ContentPolicyError, JSONParseError, BudgetExceededError classification metadata.
  • src/agentloom/webhooks/sender.py: budget alert payload type.
  • src/agentloom/observability/metrics.py: alert + retry classification counters.

Regression tests

For alerts:

  • test_budget_alert_log_fires_at_threshold
  • test_budget_alert_webhook_posts_payload
  • test_budget_alert_hook_invokes_callback
  • test_budget_alert_each_threshold_fires_once
  • test_budget_alert_metric_recorded

For retry classification:

  • test_rate_limit_error_honored_retry_after_header
  • test_4xx_non_429_not_retried
  • test_budget_exceeded_not_retried
  • test_5xx_retried_with_backoff
  • test_validation_error_retried_with_feedback_in_prompt
  • test_content_policy_violation_not_retried
  • test_custom_classification_overrides_defaults

Labels: enhancement (New feature or request), resilience (Circuit breaker, retry, rate limiter)
