fix(queue): add circuit breaker to prevent API retry storms #772

Open
deepakdevp wants to merge 4 commits into volcengine:main from deepakdevp:fix/vlm-retry-storm-circuit-breaker

@deepakdevp
Contributor

Summary

  • Adds a thread-safe CircuitBreaker utility that trips after consecutive failures (or immediately on permanent errors like 403/401) and blocks further API calls until a cooldown period elapses
  • Integrates the circuit breaker into SemanticProcessor and TextEmbeddingHandler queue handlers with proper error classification: permanent errors (403/401) drop the message, transient errors (429/5xx/timeout) re-enqueue for later retry
  • Prevents the retry storm described in #729 ([Bug]: Abnormal VLM usage: a retry storm after the account went overdue caused 5405 calls within 5 seconds), where 5405 VLM API calls were made in 5 seconds after a 403 AccountOverdue error
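
For readers unfamiliar with the pattern, the breaker behaviour summarized above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: only the names `CircuitBreaker`, `CircuitBreakerOpen`, and `report_success()` appear in the PR text, while `State`, `check()`, `report_failure()`, `failure_threshold`, `reset_timeout`, and the injectable `clock` are assumptions made for the example.

```python
import threading
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # calls flow normally
    OPEN = "open"            # calls are blocked
    HALF_OPEN = "half_open"  # cooldown elapsed, a probe call is allowed


class CircuitBreakerOpen(Exception):
    """Raised when a call is attempted while the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=60.0, clock=time.monotonic):
        # All three parameters are illustrative, not taken from the PR.
        self._threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._lock = threading.Lock()
        self._state = State.CLOSED
        self._failures = 0
        self._opened_at = 0.0

    def _maybe_half_open(self):
        # Caller must hold the lock. OPEN becomes HALF_OPEN after the cooldown.
        if self._state is State.OPEN and self._clock() - self._opened_at >= self._reset_timeout:
            self._state = State.HALF_OPEN

    @property
    def state(self):
        with self._lock:
            self._maybe_half_open()
            return self._state

    def check(self):
        """Raise CircuitBreakerOpen if calls are currently blocked."""
        with self._lock:
            self._maybe_half_open()
            if self._state is State.OPEN:
                raise CircuitBreakerOpen("API calls blocked until the cooldown elapses")

    def report_success(self):
        with self._lock:
            self._state = State.CLOSED
            self._failures = 0

    def report_failure(self, permanent=False):
        with self._lock:
            self._failures += 1
            # Trip on a permanent error, a failed probe, or too many consecutive failures.
            if permanent or self._state is State.HALF_OPEN or self._failures >= self._threshold:
                self._state = State.OPEN
                self._opened_at = self._clock()
```

A caller would invoke `check()` before each API call and then report the outcome; passing a fake `clock` makes the cooldown testable without sleeping.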

Fixes #729

Root Cause

When the VLM/embedding API returns an unrecoverable error (e.g., 403 AccountOverdue), queue handlers had no error classification or circuit breaking. Many distinct messages in the queue each independently hit the same broken API endpoint, generating thousands of wasted calls before the queue drained.

Changes Made

  • openviking/utils/circuit_breaker.py (new): CircuitBreaker class with CLOSED/OPEN/HALF_OPEN states + classify_api_error() that distinguishes permanent vs transient errors + CircuitBreakerOpen exception
  • openviking/storage/queuefs/semantic_processor.py: Added breaker guard before processing, error classification in the except block, and throttled re-enqueue for transient errors
  • openviking/storage/collection_schemas.py: Replaced is_429_error with classify_api_error(), added breaker guard before embedding, extended re-enqueue from 429-only to all transient errors
  • tests/utils/test_circuit_breaker.py (new): 14 unit tests covering state transitions, error classification, thread safety, and edge cases
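
To make the permanent/transient split concrete, here is one way such a classifier could look. It is a sketch under assumptions: the PR names `classify_api_error()` but does not show its signature or return values, so the string labels, the `status_code` attribute lookup, and the message-parsing fallback below are all hypothetical.

```python
import re

PERMANENT_STATUS = {401, 403}  # credentials or billing problems: retrying cannot help


def classify_api_error(exc):
    """Return "permanent", "transient", or "unknown" for an API exception."""
    # Assumption: some SDK errors expose the HTTP status as `status_code`.
    status = getattr(exc, "status_code", None)
    message = str(exc)
    if status is None:
        # Fall back to the first standalone 3-digit HTTP-like code in the message.
        match = re.search(r"\b([1-5]\d{2})\b", message)
        status = int(match.group(1)) if match else None
    if status in PERMANENT_STATUS:
        return "permanent"
    if status == 429 or (status is not None and 500 <= status <= 599):
        return "transient"
    lowered = message.lower()
    if isinstance(exc, TimeoutError) or "timeout" in lowered or "timed out" in lowered:
        return "transient"
    return "unknown"
```

Keeping an explicit "unknown" bucket lets the caller decide separately what to do with errors that are neither clearly retryable nor clearly fatal.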

Type of Change

  • Bug fix (non-breaking change which fixes an issue)

Testing

  • 14 unit tests pass locally
  • Ruff lint + format clean
  • Tested on macOS with Python 3.14

🤖 Generated with Claude Code

deepakdevp and others added 4 commits March 19, 2026 18:55
…ection

Adds a thread-safe CircuitBreaker with three states (CLOSED/OPEN/HALF_OPEN)
and classify_api_error() that distinguishes permanent (403/401) from transient
(429/5xx/timeout) errors. Permanent errors trip the breaker immediately.

Part of fix for volcengine#729.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
on_dequeue() now checks a circuit breaker before processing. Permanent
API errors (403/401) trip the breaker immediately and drop the message.
Transient errors (429/5xx/timeout) re-enqueue the message for later
retry. When the breaker is open, messages are re-enqueued with a
throttled sleep to prevent re-enqueue storms.

Part of fix for volcengine#729.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces 429-only error check with classify_api_error() that handles
permanent (403/401), transient (429/5xx/timeout), and unknown errors.
Permanent errors trip the breaker and drop the message. Transient errors
re-enqueue for retry (extending existing 429 behavior to all transient
errors). Circuit breaker check before embedding prevents calling a
known-broken API.

Part of fix for volcengine#729.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…integration

- Restore top-level `import asyncio` in collection_schemas.py (was
  accidentally removed, breaking all embedding operations)
- Return error instead of falling through when breaker is open and no
  queue manager is available
- Log warning when queue_manager is None in _reenqueue_semantic_msg
- Only call report_success() when msg was actually re-enqueued, not
  when msg is None

Part of fix for volcengine#729.
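
The handler flow described in the commit messages above (guard, classify, then drop or re-enqueue) can be sketched as below. This is an illustrative outline, not the PR's `on_dequeue()`: `StubBreaker`, `ApiError`, `classify`, the `call_api` and `reenqueue` callables, the `throttle_seconds` parameter, and the choice to re-raise unknown errors are all assumptions made so the example is self-contained.

```python
import asyncio


class CircuitBreakerOpen(Exception):
    pass


class StubBreaker:
    """Stand-in exposing only what this sketch needs; it never auto-resets."""

    def __init__(self):
        self.open = False

    def check(self):
        if self.open:
            raise CircuitBreakerOpen()

    def report_success(self):
        self.open = False

    def report_failure(self, permanent=False):
        if permanent:
            self.open = True


class ApiError(Exception):
    def __init__(self, status_code):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def classify(exc):
    status = getattr(exc, "status_code", None)
    if status in (401, 403):
        return "permanent"
    if status == 429 or (status is not None and 500 <= status <= 599):
        return "transient"
    return "unknown"


async def on_dequeue(msg, breaker, call_api, reenqueue, throttle_seconds=1.0):
    """Process one message; return "done", "dropped", or "reenqueued"."""
    try:
        breaker.check()
    except CircuitBreakerOpen:
        # Sleep first so re-enqueueing cannot itself become a tight loop.
        await asyncio.sleep(throttle_seconds)
        reenqueue(msg)
        return "reenqueued"
    try:
        await call_api(msg)
    except Exception as exc:
        kind = classify(exc)
        if kind == "permanent":
            breaker.report_failure(permanent=True)  # trip immediately, drop the message
            return "dropped"
        if kind == "transient":
            breaker.report_failure()
            reenqueue(msg)
            return "reenqueued"
        raise  # unknown errors are left to the caller in this sketch
    breaker.report_success()
    return "done"
```

With an API that always answers 403, only the first message reaches the endpoint; every later message is re-enqueued without a call, which is the property that prevents the storm reported in #729.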
Labels

None yet

Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

[Bug]: Abnormal VLM usage: a retry storm after the account went overdue caused 5405 calls within 5 seconds

2 participants