feat: add retry and circuit breaker utilities for LLM calls (mitigates #172) #216

Open

Jah-yee wants to merge 2 commits into HKUDS:main from Jah-yee:feat-resilience-llm-calls

Conversation

Jah-yee commented Mar 2, 2026

Summary

This PR adds a small raganything.resilience module with reusable retry and circuit breaker helpers for LLM API calls, so long-running document processing can better tolerate transient network issues.

Motivation

As discussed in #172, process_document_complete and similar flows can get stuck when LLM calls intermittently fail: there is no retry/backoff strategy and no circuit breaker to prevent cascading failures. A focused resilience layer makes it easier to harden these call sites without pulling in extra dependencies.

Changes

  • raganything/resilience.py
    • @retry decorator for synchronous functions: exponential backoff with optional jitter, detection of common transient exceptions (httpx, OpenAI clients, generic network errors), configurable attempts/delays, optional on_retry callback.
    • @async_retry decorator with the same semantics for async functions.
    • CircuitBreaker class: tracks consecutive failures, opens at a configurable threshold, uses half-open trials to recover automatically when the upstream stabilizes. In-memory and dependency-free.
  • tests/test_resilience.py
    • Tests for sync/async retry behavior, jitter/backoff ranges, on_retry callback invocation, circuit breaker state transitions (closed → open → half-open → closed), and error handling.
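For reviewers skimming the description, here is a minimal sketch of what such a sync retry helper can look like. The signature and parameter names below are illustrative (inferred from the bullet points above), not necessarily the PR's exact API:

```python
import random
import time


def retry(max_attempts=3, base_delay=0.5, max_delay=10.0,
          exponential_base=2.0, jitter=True, on_retry=None,
          exceptions=(ConnectionError, TimeoutError)):
    """Retry a sync function on transient errors with exponential backoff.

    Illustrative sketch: the real module also detects httpx/OpenAI
    client exceptions; here we only catch the stdlib network errors.
    """
    def decorator(func):
        def wrapper(*args, **kwargs):
            last_exception = None
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except exceptions as exc:
                    last_exception = exc
                    if attempt == max_attempts - 1:
                        break  # out of attempts, re-raise below
                    # Exponential backoff, capped at max_delay.
                    delay = min(base_delay * exponential_base ** attempt,
                                max_delay)
                    if jitter:
                        delay *= random.uniform(0.5, 1.5)
                    if on_retry is not None:
                        on_retry(attempt + 1, exc, delay)
                    time.sleep(delay)
            raise last_exception
        return wrapper
    return decorator


@retry(max_attempts=4, base_delay=0.1)
def flaky_llm_call():
    ...  # would wrap an actual LLM API request
```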

Testing

  • Ran pytest locally including tests/test_resilience.py; all tests passed.
  • Manually simulated intermittent failures to confirm retries, backoff, and breaker recovery behave as expected.

Thanks for your work on RAG-Anything—if you’d like different defaults or naming for these helpers, I’m happy to revise the PR to match your preferences.

LarFii (Collaborator) commented Mar 4, 2026

Thanks for the resilience utilities work. I did a Codex-assisted review and found two blocking behavior issues to fix before merge:

  1. Half-open state does not enforce a single trial call (High)

    • In CircuitBreaker, rejection only happens when state == "open". After timeout, state becomes "half-open", but there is no gate to limit admission.
    • Result: multiple concurrent calls can pass through during half-open, which conflicts with the docstring (“allows one trial call through”) and reduces protection during recovery.
    • Suggested fix: add half-open single-flight control (e.g., lock + trial_in_flight flag, or semaphore=1). While one probe is in flight, additional calls should be rejected with CircuitBreakerOpen.
    • Please add a concurrency test to verify only one half-open trial is admitted.
  2. Invalid max_attempts produces misleading runtime failure (Medium)

    • In both retry and async_retry, if max_attempts <= 0, the retry loop is skipped and execution reaches raise last_exception with last_exception is None, causing TypeError: exceptions must derive from BaseException.
    • Suggested fix: validate parameters at decorator creation (max_attempts >= 1, non-negative delays, sensible bounds) and raise clear ValueError for invalid configs.
    • Please add tests for invalid parameter values.
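The single-flight gate suggested in point 1 can be sketched roughly as follows. The state names and CircuitBreakerOpen follow the review's wording; the lock/flag internals are illustrative, not the PR's actual code:

```python
import threading
import time


class CircuitBreakerOpen(Exception):
    """Raised when the breaker rejects a call."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._lock = threading.Lock()
        self._state = "closed"
        self._failures = 0
        self._opened_at = 0.0
        self._trial_in_flight = False

    def call(self, func, *args, **kwargs):
        with self._lock:
            if self._state == "open":
                if time.monotonic() - self._opened_at < self.recovery_timeout:
                    raise CircuitBreakerOpen("breaker is open")
                self._state = "half-open"
            if self._state == "half-open":
                # Single-flight gate: only one probe call is admitted
                # while half-open; concurrent callers are rejected.
                if self._trial_in_flight:
                    raise CircuitBreakerOpen("half-open trial in flight")
                self._trial_in_flight = True
        try:
            result = func(*args, **kwargs)
        except Exception:
            with self._lock:
                self._failures += 1
                self._trial_in_flight = False
                # A failed half-open probe (or too many failures
                # while closed) re-opens the breaker.
                if (self._state == "half-open"
                        or self._failures >= self.failure_threshold):
                    self._state = "open"
                    self._opened_at = time.monotonic()
            raise
        with self._lock:
            self._state = "closed"
            self._failures = 0
            self._trial_in_flight = False
        return result
```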

Jah-yee (Author) commented Mar 4, 2026

Thanks a lot for the detailed Codex-assisted review on the resilience utilities — both issues you highlighted were very helpful.

Half-open single-flight behaviour

  • CircuitBreaker now enforces a true single trial call when transitioning to half-open:
  • state and counters are guarded by a threading.Lock, with a _trial_in_flight flag.
  • once the breaker is half-open, the first caller is admitted and marked as the in-flight trial, and any concurrent calls while that trial is running are rejected with CircuitBreakerOpen.
  • on success or failure, the trial flag and state are reset appropriately (closed on success, open on failure).
  • I added a concurrency test (test_half_open_allows_single_trial_call) that spawns several threads in the half-open window and asserts exactly one successful trial execution and that the remaining callers receive CircuitBreakerOpen.

Retry parameter validation

  • Both retry and async_retry now validate their configuration at decorator creation time:
    • max_attempts >= 1 is required.
    • base_delay and max_delay must be non-negative.
    • exponential_base must be strictly greater than 0.
  • This prevents the raise last_exception path with last_exception is None and instead raises a clear ValueError when the configuration is invalid.
  • tests/test_resilience.py has been extended with tests covering invalid max_attempts and delay values for both sync and async decorators.
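The creation-time validation described above might look like the following sketch. Only the validation is shown; the actual retry wrapper body is elided, and the parameter names match those listed above:

```python
def retry(max_attempts=3, base_delay=0.5, max_delay=10.0,
          exponential_base=2.0):
    # Validate at decorator creation so misconfiguration fails fast
    # with a clear ValueError, instead of a confusing TypeError
    # ("exceptions must derive from BaseException") at call time.
    if max_attempts < 1:
        raise ValueError(f"max_attempts must be >= 1, got {max_attempts}")
    if base_delay < 0 or max_delay < 0:
        raise ValueError("base_delay and max_delay must be non-negative")
    if exponential_base <= 0:
        raise ValueError("exponential_base must be strictly greater than 0")

    def decorator(func):
        # Retry/backoff wrapper elided; validation is the point here.
        return func
    return decorator
```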

If you would like different defaults or error types for misconfiguration, I am happy to tweak them to match your preferred style.

