name: Bug Report
description: Report a bug or unexpected behavior
title: "[Bug]: Abnormal VLM usage: retry storm after account overdue caused 5405 calls in 5 seconds"
labels: ["bug", "retry-storm", "circuit-breaker", "cost-optimization"]
Bug Description
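While using the OpenViking Memory system, I hit a severe retry storm. After the Volcengine VLM model API returned 403 AccountOverdueError (account overdue), the system did not trip any circuit breaker and kept retrying, which produced a burst of 5405 API calls within 5 seconds and caused financial loss and wasted resources.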
Steps to Reproduce
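- Configure OpenViking to use a Volcengine VLM model (doubao-seed-2-0-pro-260215)
- Once the account is overdue, the API returns 403 AccountOverdueError
- OpenViking keeps retrying with no circuit breaker, and requests accumulate in the retry queue
- The queue is flushed in one burst, producing 5405 calls within 5 seconds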
Expected Behavior
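- Circuit breaker: after N consecutive 403 errors (for example 5), pause calls and raise an alert
- Exponential backoff: retry intervals should grow exponentially (1s → 2s → 4s → 8s...)
- Queue limit: cap the retry queue size so requests cannot pile up and then burst
- Error classification: 403 overdue errors should not be retried indefinitely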
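The error classification and backoff schedule described above could look roughly like the sketch below. The helper names and the set of retryable status codes are assumptions made for illustration; they are not taken from OpenViking or the Volcengine SDK.

```python
# Hypothetical helpers; none of these names exist in OpenViking.
# Status codes that a retry can plausibly fix (assumed set, adjust as needed).
RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}


def is_retryable(status_code: int) -> bool:
    """Return True only for status codes that are usually transient."""
    return status_code in RETRYABLE_STATUS


def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   max_delay: float = 60.0, attempts: int = 5) -> list[float]:
    """Delays between attempts: base, base*factor, ... capped at max_delay."""
    return [min(base * factor ** i, max_delay) for i in range(attempts)]


print(is_retryable(403))           # False: an overdue account needs a top-up, not a retry
print(is_retryable(429))           # True
print(backoff_delays(attempts=4))  # [1.0, 2.0, 4.0, 8.0]
```

With this split, a 403 fails fast and surfaces to the operator, while only transient errors enter the backoff schedule.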
Actual Behavior
Timeline (2026-03-18):
- 00:00-03:58: about 10 normal calls
- 03:58:49-03:58:53: burst of 5405 calls within 5 seconds (Volcengine's statistics show 5716)
- Total: 5416 calls
Error log:
AccountOverdueError: 403 Forbidden
Request ID: 0217738016782801549c1a8259f4ab1e99092b4a0e368272c9c39
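Call sources:
- memory_deduplicator (memory deduplication)
- collection_schemas (Embedding processing)
- semantic_dag (semantic summarization)
Impact:
- Financial loss: 5716 VLM calls pushed the account into arrears
- Wasted resources: invalid requests consumed API quota
- Service instability: the retry storm may affect other, normal requests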
Minimal Reproducible Example
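# Example configuration
{
  "vlm": {
    "provider": "volcengine",
    "api_key": "your-api-key",
    "model": "doubao-seed-2-0-pro-260215",
    "api_base": "https://ark.cn-beijing.volces.com/api/coding/v3"
  }
}
# Trigger scenario
# 1. The account becomes overdue
# 2. A VLM API call is made (memory deduplication / semantic processing)
# 3. The API returns a 403 error
# 4. The system retries indefinitely, with no circuit breaker
Error Logs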
2026-03-18 03:58:49 | ERROR | openviking.vlm.client:call_api:123 - AccountOverdueError: 403 Forbidden
2026-03-18 03:58:49 | WARNING | openviking.retry:enqueue:45 - Retry queue size: 1000
2026-03-18 03:58:50 | WARNING | openviking.retry:enqueue:45 - Retry queue size: 2500
2026-03-18 03:58:51 | WARNING | openviking.retry:enqueue:45 - Retry queue size: 4200
2026-03-18 03:58:52 | ERROR | openviking.vlm.client:call_api:123 - AccountOverdueError: 403 Forbidden
2026-03-18 03:58:53 | CRITICAL | openviking.retry:flush:89 - Flushing 5405 retry requests
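The log shows the retry queue growing without limit before it is flushed in one go. A minimal sketch of a capped queue with batched draining follows; BoundedRetryQueue, max_size, and drain are hypothetical names, not OpenViking's actual retry module.

```python
from collections import deque


class BoundedRetryQueue:
    """Retry queue that refuses new items once max_size is reached."""

    def __init__(self, max_size: int = 100):
        self.max_size = max_size
        self._items = deque()
        self.dropped = 0  # requests refused because the queue was full

    def enqueue(self, request) -> bool:
        """Add a request; return False (and count a drop) if the queue is full."""
        if len(self._items) >= self.max_size:
            self.dropped += 1
            return False
        self._items.append(request)
        return True

    def drain(self, batch_size: int) -> list:
        """Remove and return at most batch_size items, oldest first."""
        batch = []
        while self._items and len(batch) < batch_size:
            batch.append(self._items.popleft())
        return batch

    def __len__(self) -> int:
        return len(self._items)


queue = BoundedRetryQueue(max_size=100)
accepted = sum(queue.enqueue(i) for i in range(5405))
print(accepted, queue.dropped, len(queue))  # 100 5305 100
print(len(queue.drain(batch_size=10)))      # 10
```

Dropping (or rejecting) excess retries and draining in small batches keeps a failure from turning into a burst of thousands of calls.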
OpenViking Version
0.2.6
Python Version
3.10.14 (OpenViking venv: /opt/openviking/venv)
Operating System
Linux
Model Backend
Volcengine (Doubao)
Additional Context
Suggested Fixes
1. Circuit breaker
import time


class CircuitBreakerOpenError(Exception):
    """Raised when the circuit breaker opens."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=300):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.last_failure_time = None
        self.state = "CLOSED"  # CLOSED, OPEN, HALF_OPEN

    def record_failure(self, error_code):
        # A 403 overdue error opens the circuit breaker immediately
        if error_code == 403:
            self.state = "OPEN"
            self.last_failure_time = time.time()
            raise CircuitBreakerOpenError("Account overdue, circuit breaker opened")
        # Other errors open the breaker only after failure_threshold failures
        self.failure_count += 1
        if self.failure_count >= self.failure_threshold:
            self.state = "OPEN"
            self.last_failure_time = time.time()
2. Exponential backoff retry
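from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type


class RateLimitError(Exception):
    """Transient rate-limit failure; safe to retry."""


class AccountOverdueError(Exception):
    """Account is overdue; retrying cannot help."""


@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=60),
    # Only transient errors are retried; AccountOverdueError is not listed
    retry=retry_if_exception_type((RateLimitError, TimeoutError)),
    reraise=True,
)
def call_vlm_api(send_request):
    # send_request is a placeholder callable that performs the HTTP call
    response = send_request()
    # Do not retry 403 errors
    if response.status_code == 403:
        raise AccountOverdueError(response.text)
    return response
Temporary workarounds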
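- Top up the account: recharge immediately to restore service
- Restart the service: sudo systemctl restart openviking to clear the retry queue
- Configure alerts: set a balance alert in the Volcengine console
- Limit VLM usage: enable the VLM only for complex tasks and use Embedding alone for routine search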
Reference implementations
- Hystrix circuit breaker: https://github.com/Netflix/Hystrix
- resilience4j: https://github.com/resilience4j/resilience4j
- Volcengine documentation: https://www.volcengine.com/docs/82379/1330310
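The CLOSED, OPEN, and HALF_OPEN states named in the CircuitBreaker proposal follow the pattern these libraries implement. A simplified sketch of that state machine is shown below, assuming a single-threaded caller; SimpleBreaker, CircuitOpenError, and the injectable clock parameter are hypothetical and are not the Hystrix or resilience4j API.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is attempted while the breaker is open."""


class SimpleBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=300, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so the timeout can be tested without sleeping
        self.failure_count = 0
        self.opened_at = None
        self.state = "CLOSED"

    def call(self, func):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, call skipped")
            self.state = "HALF_OPEN"  # timeout elapsed: allow one trial call
        try:
            result = func()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failure_count += 1
        # A failed trial call, or too many failures, (re)opens the breaker
        if self.state == "HALF_OPEN" or self.failure_count >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()

    def _on_success(self):
        self.failure_count = 0
        self.state = "CLOSED"


def failing():
    raise RuntimeError("403 AccountOverdueError")


now = [0.0]  # fake clock, in seconds
breaker = SimpleBreaker(failure_threshold=2, reset_timeout=300, clock=lambda: now[0])
for _ in range(2):
    try:
        breaker.call(failing)
    except RuntimeError:
        pass
print(breaker.state)  # OPEN

now[0] = 301.0  # reset_timeout has elapsed, so one trial call is allowed
print(breaker.call(lambda: "ok"), breaker.state)  # ok CLOSED
```

While the breaker is open, every call fails locally with CircuitOpenError and no request reaches the API, which is the behavior missing during the incident.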
Priority: 🔴 P0 - Critical (causes financial loss)