Skip to content

feat(BA-5578): refactor ValkeyClient to async with acquire() pattern#10766

Draft
jopemachine wants to merge 4 commits intomainfrom
feat/BA-5578-valkey-client-acquire-pattern
Draft

feat(BA-5578): refactor ValkeyClient to async with acquire() pattern#10766
jopemachine wants to merge 4 commits intomainfrom
feat/BA-5578-valkey-client-acquire-pattern

Conversation

@jopemachine
Copy link
Copy Markdown
Member

@jopemachine jopemachine commented Apr 3, 2026

Resolves BA-5578.

Summary

  • Add acquire() async context manager to AbstractValkeyClient and MonitoringValkeyClient that wraps operations and tracks consecutive connection failures
  • When the failure threshold is exceeded, the connection is torn down and reconnected rather than continuing to retry on a broken connection
  • Migrate all 13 domain-specific Valkey clients to use the new async with self._client.acquire() as conn: pattern
  • Add tests covering the retry-based disconnection logic

Changes

  • Core (client.py): Define _VALKEY_CONNECTION_ERRORS tuple, add acquire() base implementation to AbstractValkeyClient, override in MonitoringValkeyClient with _operation_failure_count tracking and threshold-based _reconnect() trigger
  • Domain clients (13 files): Replace all self._client.client.xxx() calls with async with self._client.acquire() as conn: pattern
  • Tests: Add TestMonitoringValkeyClientAcquire class with 8 test cases covering success/failure counting, threshold detection, reconnection, and error type tracking

Test plan

  • pants fmt/fix/lint --changed-since=HEAD passes
  • pants check --changed-since=HEAD (mypy) passes
  • pants test --changed-since=HEAD passes
  • Verify in Standalone mode
  • Verify in Sentinel mode

🤖 Generated with Claude Code

…ttern

Add an acquire() async context manager to AbstractValkeyClient that
wraps operations and tracks consecutive connection failures. When the
failure threshold is exceeded, the connection is torn down and
reconnected rather than continuing to retry on a broken connection.

- Add _VALKEY_CONNECTION_ERRORS tuple for connection error detection
- Add acquire() base implementation to AbstractValkeyClient
- Override acquire() in MonitoringValkeyClient with failure tracking
  and threshold-based reconnection
- Migrate all 13 domain-specific Valkey clients to use the new pattern
- Add tests for retry-based disconnection logic

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 3, 2026 02:39
@github-actions github-actions bot added size:XL 500~ LoC comp:common Related to Common component labels Apr 3, 2026
@jopemachine jopemachine marked this pull request as draft April 3, 2026 02:39
@jopemachine jopemachine added this to the 26.4 milestone Apr 3, 2026
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors the Valkey client abstraction to standardize operations behind an async with ... acquire() context manager, enabling operation-level failure tracking and reconnect-on-threshold behavior, and migrates all domain Valkey clients to use the new pattern.

Changes:

  • Added AbstractValkeyClient.acquire() async context manager and a MonitoringValkeyClient.acquire() override that counts consecutive operation connection failures and triggers _reconnect() at a threshold.
  • Migrated domain-specific Valkey clients to use async with self._client.acquire() as conn: for all Valkey operations.
  • Added unit tests validating failure counting, threshold behavior, and reconnection outcomes for MonitoringValkeyClient.acquire().

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/ai/backend/common/clients/valkey_client/client.py Introduces acquire() and implements operation failure tracking + threshold-based reconnection.
tests/unit/common/clients/valkey_client/test_monitoring_valkey_client.py Adds test coverage for the new acquire() behavior and reconnection triggering.
src/ai/backend/common/clients/valkey_client/valkey_volume_stats/client.py Migrates volume stats operations to the acquire() pattern.
src/ai/backend/common/clients/valkey_client/valkey_stream/client.py Migrates stream operations (xgroup/xreadgroup/etc.) to the acquire() pattern.
src/ai/backend/common/clients/valkey_client/valkey_stat/client.py Migrates stat/cache operations and batched execs to the acquire() pattern.
src/ai/backend/common/clients/valkey_client/valkey_session/client.py Migrates session CRUD and time/ping to the acquire() pattern.
src/ai/backend/common/clients/valkey_client/valkey_schedule/client.py Migrates scheduling state operations and batch execs to the acquire() pattern.
src/ai/backend/common/clients/valkey_client/valkey_rate_limit/client.py Migrates rate-limit scripts and sorted-set operations to the acquire() pattern.
src/ai/backend/common/clients/valkey_client/valkey_live/client.py Migrates live data operations, scans, and pipelines to the acquire() pattern.
src/ai/backend/common/clients/valkey_client/valkey_leader/client.py Migrates leadership Lua script invocations to the acquire() pattern.
src/ai/backend/common/clients/valkey_client/valkey_image/client.py Migrates image membership operations and scans to the acquire() pattern.
src/ai/backend/common/clients/valkey_client/valkey_container_log/client.py Migrates list/log operations and batch execs to the acquire() pattern.
src/ai/backend/common/clients/valkey_client/valkey_bgtask/client.py Migrates task metadata operations, scripts, and execs to the acquire() pattern.
src/ai/backend/common/clients/valkey_client/valkey_artifact/client.py Migrates artifact progress tracking (get/exec/hincrby/hset/scan) to the acquire() pattern.
src/ai/backend/common/clients/valkey_client/valkey_artifact_registries/client.py Migrates registry CRUD operations to the acquire() pattern.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +526 to +533
if self._operation_failure_count >= self._consecutive_failure_threshold:
log.warning(
"Operation failure threshold reached ({}), triggering reconnection...",
self._consecutive_failure_threshold,
)
self._operation_failure_count = 0
await self._reconnect()
raise
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MonitoringValkeyClient.acquire() can call await self._reconnect() while the background monitor task may also call _reconnect() (and multiple concurrent operations can hit the threshold together). Because _reconnect() disconnects/connects both clients, concurrent invocations can race and lead to flapping reconnects or disconnecting a client while another reconnect is in progress. Consider serializing reconnects with an asyncio.Lock (e.g., self._reconnect_lock) and using it in both acquire() and the monitor loop (or inside _reconnect() itself).

Copilot uses AI. Check for mistakes.
Comment on lines 10 to 14
from ai.backend.common.clients.valkey_client.client import (
_VALKEY_CONNECTION_ERRORS,
MonitoringValkeyClient,
create_valkey_client,
)
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test imports _VALKEY_CONNECTION_ERRORS (a private module constant). This makes the test suite tightly coupled to an internal name that may change during refactors. Prefer either (a) exposing a public constant (no leading underscore) intended for reuse, or (b) asserting on the public behavior (e.g., raising ClientNotConnectedError / ClosingError etc.) without importing private internals.

Copilot uses AI. Check for mistakes.
jopemachine and others added 2 commits April 3, 2026 16:11
The ValkeyLeaderClient was refactored to use `async with self._client.acquire()`,
but the test mock didn't support the async context manager protocol.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add MonitoringValkeyClientSpec dataclass to group configuration fields
  (reconnectable_exceptions, consecutive_failure_threshold, monitor_interval)
- Make AbstractValkeyClient.acquire() a proper abstract method
- Add acquire() implementations to ValkeyStandaloneClient and ValkeySentinelClient

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp:common Related to Common component size:XL 500~ LoC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants