
[coordination/manager/integration_test] add leader query interface #73

Open
wangxiyu191 wants to merge 10 commits into main from feature/leader-query-interface

Conversation

@wangxiyu191
Collaborator

@wangxiyu191 wangxiyu191 commented Mar 24, 2026

Leader Discovery Interface Implementation Plan

Context

In a primary-standby HA deployment, all nodes sit behind the same load balancer. Currently, clients cannot discover the Leader's connection details through an arbitrary node and must maintain failover logic on their own. This change includes two parts:

Rename DistributedLockBackend to CoordinationBackend: the original name only conveyed lock semantics, but the backend now also needs to carry KV storage (node connection info). Renaming it to CoordinationBackend more accurately reflects its role as a "distributed coordination backend."

Add Leader discovery: introduce SetValue/GetValue methods on CoordinationBackend to store node connection info, add a GetClusterInfo RPC to Meta Service, and enhance GetManagerClusterInfo in Admin, so that clients can query the Leader's ID and 4 connection ports from any node.

Scope of this change: only Leader info query is implemented; cluster node infos (the full node list) is deferred to a future iteration.

Design Decisions

  • Backend rename: DistributedLockBackend → CoordinationBackend. All implementation classes and files are renamed accordingly, and variable names are unified to coordination_backend_. A new config key kvcm.coordination.uri is introduced, while the legacy key kvcm.distributed_lock.uri remains supported for backward compatibility.
  • Connection info storage: on startup, each node writes an independent KV entry to CoordinationBackend whose value is a JSON object containing the node's own host and 4 ports. On query, the corresponding key is read based on leader_node_id.
  • Key pattern: _TAIR_KVCM_NODE_INFO_{node_id}
  • Meta RPC naming: GetClusterInfo, which conveys the cluster concept and can later be extended to return additional cluster information.
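As a rough illustration of the key pattern and connection-info value described above, the sketch below builds the key and a JSON value by hand. The JSON field names are assumptions for illustration (the real schema is defined in the PR), and a real implementation would use the project's JSON library rather than string concatenation:

```cpp
#include <cassert>
#include <string>

// Build the coordination-backend key for a node's connection info,
// following the _TAIR_KVCM_NODE_INFO_{node_id} pattern from the design notes.
std::string NodeInfoKey(const std::string& node_id) {
  return "_TAIR_KVCM_NODE_INFO_" + node_id;
}

// Build an illustrative JSON value with the node's host and 4 ports.
// Field names here are hypothetical; the PR defines the actual schema.
std::string NodeInfoValue(const std::string& host, int meta_port, int admin_port,
                          int service_port, int metrics_port) {
  return "{\"host\":\"" + host + "\","
         "\"meta_port\":" + std::to_string(meta_port) + ","
         "\"admin_port\":" + std::to_string(admin_port) + ","
         "\"service_port\":" + std::to_string(service_port) + ","
         "\"metrics_port\":" + std::to_string(metrics_port) + "}";
}
```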

Implementation Steps

Part 1: Rename DistributedLockBackend → CoordinationBackend

Part 2: Add Leader discovery

  1. Add SetValue/GetValue to CoordinationBackend
  2. Implement SetValue/GetValue in the three Coordination backend implementations
  3. Encapsulate NodeInfo read/write in LeaderElector
  4. Write connection info on node startup
  5. Proto definition changes
  6. MetaServiceImpl and AdminServiceImpl changes
  7. Add related tests
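The new KV surface on CoordinationBackend might look like the following sketch, shown with an in-memory implementation of the kind a unit test could use. The exact signatures are assumptions (the real interface likely uses the project's status types rather than bool/optional), and the other backends would persist to the shared coordination store instead of a local map:

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <optional>
#include <string>

// Hypothetical shape of the new KV methods added alongside the lock API.
class CoordinationBackend {
 public:
  virtual ~CoordinationBackend() = default;
  virtual bool SetValue(const std::string& key, const std::string& value) = 0;
  virtual std::optional<std::string> GetValue(const std::string& key) = 0;
};

// In-memory implementation: handy for tests and single-node setups,
// but not shared across processes.
class InMemoryCoordinationBackend : public CoordinationBackend {
 public:
  bool SetValue(const std::string& key, const std::string& value) override {
    std::lock_guard<std::mutex> lock(mu_);
    kv_[key] = value;
    return true;
  }
  std::optional<std::string> GetValue(const std::string& key) override {
    std::lock_guard<std::mutex> lock(mu_);
    auto it = kv_.find(key);
    if (it == kv_.end()) return std::nullopt;  // key absent, e.g. info not yet written
    return it->second;
  }

 private:
  std::mutex mu_;
  std::map<std::string, std::string> kv_;
};
```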

Part 3: Implement the advertised_host configuration

Part 4: Add the high-availability design document

Part 5: Add support for the new interface in GrpcStub
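The advertised_host resolution in Part 3 could be sketched as below: prefer the explicit config value, otherwise fall back to a detected local IP. NetUtil::GetLocalIp is stubbed here (the real helper, mentioned in the review, scans network interfaces), and the config key name follows the reviewer's kvcm.service.advertised_host reference:

```cpp
#include <cassert>
#include <string>

// Stand-in for NetUtil::GetLocalIp; the real helper inspects network interfaces.
std::string GetLocalIpStub() { return "10.0.0.5"; }

// Resolve the host other nodes and clients should use to reach this node:
// an explicitly configured advertised host wins over auto-detection.
std::string ResolveAdvertisedHost(const std::string& advertised_host_config) {
  if (!advertised_host_config.empty()) {
    return advertised_host_config;
  }
  return GetLocalIpStub();
}
```

On hosts with multiple non-loopback interfaces, auto-detection may pick an IP that is unreachable from client networks, which is why the explicit config exists.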


@wangxiyu191 wangxiyu191 requested a review from oldsharp March 24, 2026 13:28

@qoderai qoderai bot left a comment


👋 Review Summary

The coordination backend refactor and new leader discovery plumbing are well thought through: CoordinationBackend unifies lock + KV semantics, leader endpoint info is propagated out to both meta/admin services, and you’ve added solid tests and docs for the HA flows.

🛡️ Key Risks & Issues

  • GetClusterInfo currently reports OK even when the leader’s NodeEndpointInfo cannot be read from the coordination backend, leaving leader_endpoint with default values while clients believe the call succeeded. In error or misconfig scenarios this can surface as confusing connection failures on the client side rather than a clear retriable error indicating that leader metadata is unavailable.
  • MetaServiceImpl explicitly handles a null leader_elector_ as SERVICE_NOT_READY, while AdminServiceImpl’s HA endpoints assume leader_elector_ is always non-null and dereference it directly. In production this is probably fine because Server::Init wires a leader elector, but it does create a behavioral asymmetry compared with MetaServiceImpl and could surprise non-HA or custom test setups.
  • The new configuration semantics make kvcm.coordination.uri primary and fall back to an in-memory backend when empty. This is convenient for single-node setups but makes it easier to accidentally run multiple nodes against independent in-memory coordination rather than a shared backend if operators forget to configure a URI.
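A minimal sketch of the kind of startup guard this last point suggests: the config key names follow the PR, but the check itself is hypothetical and the HA-enabled signal would come from whatever the project's real config exposes:

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical startup check: warn loudly when HA is enabled but no shared
// coordination URI is configured, since each node would then fall back to its
// own in-memory backend and form an isolated "single-node" cluster.
bool CheckCoordinationConfig(const std::string& coordination_uri, bool ha_enabled,
                             std::ostream& log) {
  if (coordination_uri.empty() && ha_enabled) {
    log << "WARNING: kvcm.coordination.uri is empty but HA is enabled; "
           "falling back to an in-memory backend that is NOT shared across nodes\n";
    return false;
  }
  return true;
}
```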

🧪 Verification Advice

  • In addition to the existing integration tests, I’d recommend adding a targeted case where node endpoint info for the leader is missing or corrupt in CoordinationBackend, and asserting the behavior of both GetClusterInfo and GetManagerClusterInfo (status code and presence/shape of leader_endpoint) so clients can rely on a clear contract in this situation.
  • If you plan to rely on the NetUtil::GetLocalIp path in real deployments (without configuring kvcm.service.advertised_host), it’s worth running and/or adding tests on a host with multiple non-loopback interfaces to confirm that the chosen IP is actually reachable from your client networks.

💡 Thoughts & Suggestions

  • Consider tightening the contract around leader endpoint availability: either return a non-OK status when leader_node_id is populated but NodeEndpointInfo cannot be read, or clearly document a convention (for example, leader_endpoint.host empty means “endpoint unavailable”) that clients can check and handle explicitly.
  • For configuration, you might want a small admin or startup check that warns loudly when multiple nodes start with an empty coordination URI but non-default HA settings, to reduce the risk of unintentionally running multiple “single-node” clusters.
  • Overall, the direction of unifying coordination concerns under CoordinationBackend and exposing leader discovery via well-typed protobufs and client structs looks good and should make HA behavior easier to reason about going forward.

🤖 Generated by QoderView workflow run

```cpp
  }
}

status->set_code(proto::meta::OK);
```

In GetClusterInfo you always return OK as long as leader_elector_ exists, even if leader_elector_->ReadNodeInfo(leader_node_id, node_info) fails (for example EC_NOENT when node info hasn’t been written yet or backend/JSON errors). In that case leader_endpoint stays at its default values (empty host and zero ports), but clients see a successful response and GrpcStub::GetClusterInfo maps it to ER_OK, giving them unusable connection data with no signal that leader metadata is missing or stale.

Would it be better to surface this more explicitly to callers? For example, we could return a non-OK status (e.g. SERVICE_NOT_READY or a dedicated error) when leader_node_id is set but node info can’t be read, or at least define a contract that clients should treat an empty leader_endpoint.host as “leader endpoint unavailable” and retry/fallback accordingly.
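The suggested tightening can be modeled in miniature as follows. The real types (LeaderElector, the proto status codes, NodeEndpointInfo) are stood in by simple structs and an enum for illustration only; the point is the contract of returning a non-OK status when the leader's endpoint info cannot be read, instead of OK with default values:

```cpp
#include <cassert>
#include <optional>
#include <string>

// Illustrative stand-ins for the service's real types.
enum class Code { OK, SERVICE_NOT_READY };

struct NodeInfo {
  std::string host;
  int meta_port = 0;
};

struct FakeElector {
  std::optional<NodeInfo> info;  // empty models a ReadNodeInfo failure (e.g. EC_NOENT)
  bool ReadNodeInfo(NodeInfo* out) const {
    if (!info) return false;
    *out = *info;
    return true;
  }
};

// Proposed contract: if the leader is known but its endpoint info cannot be
// read from the coordination backend, surface SERVICE_NOT_READY so clients
// can retry, rather than returning OK with an empty leader_endpoint.
Code GetClusterInfoSketch(const FakeElector& elector, NodeInfo* leader_endpoint) {
  NodeInfo info;
  if (!elector.ReadNodeInfo(&info)) {
    return Code::SERVICE_NOT_READY;
  }
  *leader_endpoint = info;
  return Code::OK;
}
```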


🤖 Generated by QoderFix in Qoder

@github-actions github-actions bot added the ai reviewed AI has reviewed this PR label Mar 24, 2026