Reduce database write amplification for ops_error_logs and scheduler_outbox #936
Merged
Wei-Shaw merged 2 commits into Wei-Shaw:main on Mar 12, 2026
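Background
During production troubleshooting, the main database pressure turned out to come not only from the core business tables but also from two high-frequency write paths:
- ops_error_logs is written row by row during error storms. The write is already asynchronous, but it is still one INSERT per entry.
- scheduler_outbox is written repeatedly by many idempotent state changes. When the same account or group triggers the same kind of event several times within a short period, this amplifies the load on the outbox table and on downstream consumption.
The goal of this PR is to first bring down the write amplification on these two paths while, as far as possible, leaving business semantics, API contracts, and the scheduling logic unchanged.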
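Changes
1. Batch writes for ops_error_logs
- The OpsErrorLogger worker no longer calls RecordError once per entry; it calls RecordErrorBatch instead.
- OpsService gains batch-prepare and batch-persist entry points that reuse the existing per-entry sanitization logic.
- opsRepository gains BatchInsertErrorLogs, which writes the batch in a single transaction.
2. Short-window deduplication of idempotent events in scheduler_outbox
Deduplication within a 1-second window is enabled only for these three idempotent event types:
- account_changed
- group_changed
- full_rebuild
The deduplication is not an in-process cache. It is done on the database side with INSERT ... SELECT ... WHERE NOT EXISTS (...), which avoids incorrect local dedup decisions across multiple instances or after a transaction rollback. Events that are explicitly not deduplicated keep their existing behavior:
- account_last_used
- account_groups_changed
- account_bulk_changed
In other words, only events that are "written repeatedly but equivalent in final effect" are tightened; events whose payload carries meaning are left alone.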
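The two changes above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: ErrorEntry, flushBatch, shouldDedup, and errBatchFailed are hypothetical names; the column names and casts in dedupInsertSQL are assumptions (PostgreSQL syntax is assumed as well); and falling back to per-entry writes when a batch fails is only one plausible reading of the "failure fallback" covered by the tests.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrorEntry is a hypothetical stand-in for one ops_error_logs row.
type ErrorEntry struct {
	Message string
}

// errBatchFailed is a hypothetical error used to simulate a failed batch write.
var errBatchFailed = errors.New("batch write failed")

// flushBatch tries a single batched write for all entries. If the batch
// fails, it falls back to writing entries one by one (an assumed fallback
// policy) and returns how many entries were written.
func flushBatch(entries []ErrorEntry, batch func([]ErrorEntry) error, single func(ErrorEntry) error) int {
	if len(entries) == 0 {
		return 0
	}
	if err := batch(entries); err == nil {
		return len(entries)
	}
	written := 0
	for _, e := range entries {
		if single(e) == nil {
			written++
		}
	}
	return written
}

// dedupableEvents lists the three event types named in the PR description.
var dedupableEvents = map[string]bool{
	"account_changed": true,
	"group_changed":   true,
	"full_rebuild":    true,
}

// shouldDedup reports whether an event type uses the dedup INSERT.
func shouldDedup(eventType string) bool {
	return dedupableEvents[eventType]
}

// dedupInsertSQL is an assumed shape of the database-side dedup statement.
// Column names and types are hypothetical; it is not executed in this sketch.
const dedupInsertSQL = `
INSERT INTO scheduler_outbox (event_type, account_id, group_id, created_at)
SELECT $1::text, $2::bigint, $3::bigint, NOW()
WHERE NOT EXISTS (
    SELECT 1 FROM scheduler_outbox
    WHERE event_type = $1::text
      AND account_id IS NOT DISTINCT FROM $2::bigint
      AND group_id IS NOT DISTINCT FROM $3::bigint
      AND created_at >= NOW() - INTERVAL '1 second'
)`

func main() {
	entries := []ErrorEntry{{"a"}, {"b"}, {"c"}}
	failingBatch := func([]ErrorEntry) error { return errBatchFailed }
	okSingle := func(ErrorEntry) error { return nil }

	// The batch fails, so all three entries go through the per-entry fallback.
	fmt.Println(flushBatch(entries, failingBatch, okSingle))
	fmt.Println(shouldDedup("account_changed"), shouldDedup("account_last_used"))
}
```

Because the existence check runs inside the INSERT statement itself, every instance sees the same table state, which is the property the description relies on when it rules out an in-process cache.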
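Why this approach
- Batch writes for ops_error_logs reduce the number of transactions and the WAL/fsync pressure without reducing the number of log entries, so error troubleshooting and statistics semantics are unaffected.
- An idempotent scheduler_outbox event already means "tell the scheduler to read the latest state and rebuild the snapshot". When the same account or group writes the same kind of event repeatedly within a 1-second window, keeping the first one is enough for the consumer to pick up the final state.
- The business rules for accounts.extra are not changed and the usage_logs main path is not touched, in order to limit the regression surface.
Risk control
- The per-entry sanitization rules for ops_error_logs are unchanged.
Testing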
Local static checks and unit tests
- docker run --rm -v /home/ius/sub2api/backend:/src -w /src golangci/golangci-lint:v2.9.0 golangci-lint run --timeout=30m
- docker run --rm -v /home/ius/sub2api/backend:/src -w /src golang:1.26.1 /usr/local/go/bin/go test -tags=unit ./...
Integration
- docker run --rm -v /var/run/docker.sock:/var/run/docker.sock -v /home/ius/sub2api/backend:/src -w /src golang:1.26.1 /usr/local/go/bin/go test -tags=integration ./internal/repository -count=1
New coverage:
- ops_error_logs batch insert
- scheduler_outbox idempotent event deduplication
- account_last_used must not be deduplicated
- RecordErrorBatch sanitization and failure fallback
Local Docker smoke test
- docker compose -f /tmp/sub2api-prtest-compose.yml up -d --build
- GET /health returned 200
- GET /api/v1/settings/public returned 200
- POST /api/v1/auth/login returned 200
- GET /api/v1/admin/dashboard/snapshot-v2 returned 200
- GET /api/v1/admin/dashboard/users-trend returned 200
Local low-concurrency check
- settings/public, 30 concurrent requests: all 200
- snapshot-v2, 20 concurrent requests: all 200
Fork CI
- CI: passed
- Security Scan: passed
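For reference, the local low-concurrency check listed earlier could be reproduced along the following lines. This is a sketch under assumptions, not the script used for this PR: countOK, stubHandler, and callStub are hypothetical names, and an in-memory stub handler stands in for the running service so the example needs no network. Against a real deployment, the do function would issue an HTTP GET to the endpoint instead.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// countOK runs n calls of do concurrently and returns how many of them
// reported HTTP 200.
func countOK(n int, do func() int) int {
	var wg sync.WaitGroup
	var mu sync.Mutex
	ok := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if do() == http.StatusOK {
				mu.Lock()
				ok++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return ok
}

// stubHandler is a hypothetical stand-in for the real service endpoint.
var stubHandler = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
})

// callStub sends one in-memory GET to the stub and returns its status code.
func callStub() int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/api/v1/settings/public", nil)
	stubHandler.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	// 30 concurrent requests, mirroring the settings/public check.
	fmt.Println(countOK(30, callStub))
}
```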