Two related improvements to budget and retry handling that #12 (pre-flight budget estimation) does not address:
1. Soft budget alerts.
Today, BudgetExceededError fires only at 100% — an abrupt hard stop with no advance warning. Production workflows benefit from progressive signals:
- 50%: nice to know, log only.
- 80%: notify monitoring system, prepare for degraded operation.
- 100%: hard stop (current behavior).
Without soft alerts, a workflow that's burning through budget faster than expected has no chance to react, for example by switching to a cheaper provider, aborting early, or paging the operator.
2. Semantic-aware retry.
`core/engine.py:535-549` retries on any exception. Better behavior depends on the exception class.
Today, content policy violations and 4xx errors silently retry up to `max_retries` times, wasting cost and time and sometimes hitting the model's same refusal repeatedly. The same applies to budget errors.
- Non-retryable: fail fast, no retries even if `max_retries > 0`.
- Retry with feedback: append the error message to the prompt as system feedback, then retry. (For schema validation failures, this is "your previous response failed validation: ; please retry conforming to the schema.")
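The fail-fast and retry-with-feedback behaviors above could look roughly like the following. This is a minimal sketch, not the engine's actual loop: `Classification`, `classify`, `run_with_classified_retry`, and the two stand-in exception classes are hypothetical names introduced for illustration, and backoff is omitted for brevity.

```python
from enum import Enum
from typing import Callable


class Classification(Enum):
    RETRYABLE = "retryable"
    NON_RETRYABLE = "non_retryable"
    RETRY_WITH_FEEDBACK = "retry_with_feedback"


class ContentPolicyError(Exception):
    """Stand-in for a provider refusal; retrying will not help."""


class SchemaValidationError(Exception):
    """Stand-in for a response that failed schema validation."""


def classify(exc: Exception) -> Classification:
    # Hypothetical classifier covering only the two stand-in classes.
    if isinstance(exc, ContentPolicyError):
        return Classification.NON_RETRYABLE
    if isinstance(exc, SchemaValidationError):
        return Classification.RETRY_WITH_FEEDBACK
    return Classification.RETRYABLE


def run_with_classified_retry(
    call: Callable[[str], str], prompt: str, max_retries: int
) -> str:
    attempt = 0
    while True:
        try:
            return call(prompt)
        except Exception as exc:
            kind = classify(exc)
            # Non-retryable errors fail fast, even if max_retries > 0.
            if kind is Classification.NON_RETRYABLE or attempt >= max_retries:
                raise
            if kind is Classification.RETRY_WITH_FEEDBACK:
                # Feed the error back to the model before the next attempt.
                prompt = (
                    f"{prompt}\n\nYour previous response failed validation: "
                    f"{exc}; please retry conforming to the schema."
                )
            attempt += 1
```

With this shape, a content policy refusal costs exactly one provider call, while a schema failure gets a second attempt whose prompt carries the validation error.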
For RateLimitError specifically, honor Retry-After header (per #109 semantics):
```python
if isinstance(e, RateLimitError) and e.retry_after_s:
    backoff = e.retry_after_s
    # but cap at backoff_max to avoid hour-long waits
    backoff = min(backoff, policy.backoff_max)
```
3. Default classification (sane defaults, no config required):
Built into the framework; users override only if they need custom behavior:
Default classification must be conservative. Current users implicitly retry everything, so switching some classes to non-retryable is technically a behavior change, but the change is "stop wasting time on errors that won't recover," which is an improvement, not a regression.
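One possible shape for built-in defaults with user overrides is sketched below. The table entries are assumptions inferred from the regression test names in this issue, not a confirmed mapping; `DEFAULT_CLASSIFICATION`, `resolve_classification`, and the stand-in `ProviderError` are hypothetical. Unknown exception classes fall back to retryable, which preserves today's retry-everything behavior.

```python
from typing import Optional

# Assumed defaults, keyed by exception class name.
DEFAULT_CLASSIFICATION = {
    "RateLimitError": "retryable",
    "BudgetExceededError": "non_retryable",
    "ContentPolicyError": "non_retryable",
    "ValidationError": "retry_with_feedback",
}


class ProviderError(Exception):
    """Stand-in for a provider error carrying an HTTP status code."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"provider returned {status_code}")
        self.status_code = status_code


def resolve_classification(
    exc: Exception, overrides: Optional[dict] = None
) -> str:
    # User overrides win over the built-in defaults.
    table = {**DEFAULT_CLASSIFICATION, **(overrides or {})}
    # Walk the MRO so subclasses inherit their parent's classification.
    for cls in type(exc).__mro__:
        if cls.__name__ in table:
            return table[cls.__name__]
    # 4xx other than 429 will not recover on retry.
    status = getattr(exc, "status_code", None)
    if isinstance(status, int) and 400 <= status < 500 and status != 429:
        return "non_retryable"
    # Conservative fallback: keep retrying anything unclassified.
    return "retryable"
```

A user who wants timeouts to fail fast would pass `overrides={"TimeoutError": "non_retryable"}` and leave everything else at the defaults.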
Description
Two related improvements to budget and retry handling that #12 (pre-flight budget estimation) does not address:
1. Soft budget alerts.
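Today, `BudgetExceededError` fires only at 100%, an abrupt hard stop with no advance warning. Production workflows benefit from progressive signals:
- 50%: nice to know, log only.
- 80%: notify monitoring system, prepare for degraded operation.
- 100%: hard stop (current behavior).
Without soft alerts, a workflow that's burning through budget faster than expected has no chance to react, for example by switching to a cheaper provider, aborting early, or paging the operator.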
2. Semantic-aware retry.
`core/engine.py:535-549` retries on any exception. Better behavior depends on the exception class:
- `RateLimitError` (429), honoring `Retry-After`
- `ProviderError(status_code=5xx)`
- `ProviderError(status_code=4xx)` (non-429)
- `httpx.ConnectError`
- `TimeoutError`
- `BudgetExceededError`
- `CircuitOpenError`
- `ValidationError` (per #117)
Today, content policy violations and 4xx errors silently retry up to `max_retries` times, wasting cost and time and sometimes hitting the model's same refusal repeatedly. The same applies to budget errors.
Proposal
1. Soft budget alerts:
Extend `WorkflowConfig`:
Actions:
- `log`: emit a structured log entry (warning).
- `webhook`: POST to a URL with `{workflow_name, run_id, threshold, spent_usd, budget_usd, remaining_usd}`.
- `hook`: invoke a Python callback (registered like tools) with the same payload; it can mutate state, switch provider, or abort.
- `metric`: emit a counter (always done regardless of action).
Threshold check fires after each step's cost commits. Each threshold fires at most once per workflow run.
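The fire-once threshold check could be sketched as follows. This is an illustration under stated assumptions, not the proposed `WorkflowConfig` schema: `BudgetAlertTracker`, its fields, and the single `on_alert` callback (standing in for the `log`, `webhook`, and `hook` actions) are hypothetical. The payload keys are the ones listed above.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class BudgetAlertTracker:
    budget_usd: float
    # Fractions of the budget at which a soft alert fires.
    thresholds: tuple = (0.5, 0.8)
    # Hypothetical dispatch point for the configured action.
    on_alert: Callable[[dict], None] = print
    _fired: set = field(default_factory=set, init=False)

    def check(self, spent_usd: float, workflow_name: str, run_id: str) -> None:
        """Call after each step's cost commits."""
        for threshold in sorted(self.thresholds):
            if threshold in self._fired:
                continue  # each threshold fires at most once per run
            if spent_usd >= threshold * self.budget_usd:
                self._fired.add(threshold)
                self.on_alert(
                    {
                        "workflow_name": workflow_name,
                        "run_id": run_id,
                        "threshold": threshold,
                        "spent_usd": spent_usd,
                        "budget_usd": self.budget_usd,
                        "remaining_usd": max(self.budget_usd - spent_usd, 0.0),
                    }
                )
```

If a single expensive step jumps past several thresholds at once, this sketch fires each crossed threshold in ascending order during the same check, so a monitoring system still sees the 50% signal before the 80% one.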
2. Semantic-aware retry:
Add classification layer on top of the existing retry policy:
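Engine retry loop classifies each exception:
- Non-retryable: fail fast, no retries even if `max_retries > 0`.
- Retry with feedback: append the error message to the prompt as system feedback, then retry. (For schema validation failures, this is "your previous response failed validation: ; please retry conforming to the schema.")
For `RateLimitError` specifically, honor the `Retry-After` header (per #109 semantics):
3. Default classification (sane defaults, no config required):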
Built into the framework; users override only if they need custom behavior:
4. Observability:
- `agentloom_budget_alerts_total{workflow, threshold, action}`.
- `agentloom_retry_classified_total{exception_class, classification}`.
Scope
- `src/agentloom/core/models.py`: `BudgetAlert`, `RetryClassification` configs.
- `src/agentloom/core/engine.py`: alert dispatch after each commit; retry classification in the retry loop (or in `retry_with_policy` after #106, "fix gateway resilience: CB/RL ordering, stream cancellation, retry jitter, rate-limiter edge cases", wires it up).
- `src/agentloom/resilience/retry.py`: extend `RetryPolicy` with classification; default classifications.
- `src/agentloom/exceptions.py`: `ContentPolicyError`, `JSONParseError`, `BudgetExceededError` classification metadata.
- `src/agentloom/webhooks/sender.py`: budget alert payload type.
- `src/agentloom/observability/metrics.py`: alert + retry classification counters.
Regression tests
For alerts:
- `test_budget_alert_log_fires_at_threshold`
- `test_budget_alert_webhook_posts_payload`
- `test_budget_alert_hook_invokes_callback`
- `test_budget_alert_each_threshold_fires_once`
- `test_budget_alert_metric_recorded`
For retry classification:
- `test_rate_limit_error_honored_retry_after_header`
- `test_4xx_non_429_not_retried`
- `test_budget_exceeded_not_retried`
- `test_5xx_retried_with_backoff`
- `test_validation_error_retried_with_feedback_in_prompt`
- `test_content_policy_violation_not_retried`
- `test_custom_classification_overrides_defaults`
Notes
- Once `retry_with_policy` is wired in (per #106, "fix gateway resilience: CB/RL ordering, stream cancellation, retry jitter, rate-limiter edge cases"), classification fits naturally inside it.
- `RateLimitError` with `retry_after_s` is the bridge.
- `ValidationError` for failed schema parsing is the canonical "retry with feedback" case.