
[coordination/manager/integration_test] add leader query interface #73

Open
wangxiyu191 wants to merge 10 commits into main from feature/leader-query-interface

Conversation

@wangxiyu191
Collaborator

@wangxiyu191 wangxiyu191 commented Mar 24, 2026

Leader Discovery Interface Implementation Plan

Context

In a primary-standby HA deployment, all nodes sit behind the same load balancer. Currently, clients cannot discover the Leader's connection details through an arbitrary node and must maintain failover logic on their own. This change includes two parts:

Rename DistributedLockBackend to CoordinationBackend: the original name only conveyed lock semantics, but the backend now also needs to carry KV storage (node connection info). Renaming it to CoordinationBackend more accurately reflects its role as a "distributed coordination backend."

Add Leader discovery: introduce SetValue/GetValue methods on CoordinationBackend to store node connection info, add a GetClusterInfo RPC to Meta Service, and enhance GetManagerClusterInfo in Admin, so that clients can query the Leader's ID and 4 connection ports from any node.

Scope of this change: only Leader info query is implemented; cluster node infos (the full node list) is deferred to a future iteration.

Design Decisions

  • Backend rename: DistributedLockBackend → CoordinationBackend. All implementation classes and files are renamed accordingly, and variable names are unified to coordination_backend_. A new config key kvcm.coordination.uri is introduced, while the legacy key kvcm.distributed_lock.uri remains supported for backward compatibility.
  • Connection info storage: on startup, each node writes an independent KV entry to CoordinationBackend whose value is a JSON object containing the node's own host and 4 ports. On query, the corresponding key is read based on leader_node_id.
  • Key pattern: _TAIR_KVCM_NODE_INFO_{node_id}
  • Meta RPC naming: GetClusterInfo, which conveys the cluster concept and can later be extended to return additional cluster information.
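As a rough illustration of the key pattern and connection-info value described above, the sketch below builds the key and a JSON value by hand. The JSON field names are assumptions for illustration (the real schema is defined in the PR), and a real implementation would use the project's JSON library rather than string concatenation:

```cpp
#include <cassert>
#include <string>

// Build the coordination-backend key for a node's connection info,
// following the _TAIR_KVCM_NODE_INFO_{node_id} pattern from the design notes.
std::string NodeInfoKey(const std::string& node_id) {
  return "_TAIR_KVCM_NODE_INFO_" + node_id;
}

// Build an illustrative JSON value with the node's host and 4 ports.
// Field names here are hypothetical; the PR defines the actual schema.
std::string NodeInfoValue(const std::string& host, int meta_port, int admin_port,
                          int service_port, int metrics_port) {
  return "{\"host\":\"" + host + "\","
         "\"meta_port\":" + std::to_string(meta_port) + ","
         "\"admin_port\":" + std::to_string(admin_port) + ","
         "\"service_port\":" + std::to_string(service_port) + ","
         "\"metrics_port\":" + std::to_string(metrics_port) + "}";
}
```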

Implementation Steps

Part 1: Rename DistributedLockBackend → CoordinationBackend

Part 2: Add Leader discovery

  1. Add SetValue/GetValue to CoordinationBackend
  2. Implement SetValue/GetValue in the three Coordination backend implementations
  3. Encapsulate NodeInfo read/write in LeaderElector
  4. Write connection info on node startup
  5. Proto definition changes
  6. MetaServiceImpl and AdminServiceImpl changes
  7. Add related tests
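The new KV surface on CoordinationBackend might look like the following sketch, shown with an in-memory implementation of the kind a unit test could use. The exact signatures are assumptions (the real interface likely uses the project's status types rather than bool/optional), and the other backends would persist to the shared coordination store instead of a local map:

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <optional>
#include <string>

// Hypothetical shape of the new KV methods added alongside the lock API.
class CoordinationBackend {
 public:
  virtual ~CoordinationBackend() = default;
  virtual bool SetValue(const std::string& key, const std::string& value) = 0;
  virtual std::optional<std::string> GetValue(const std::string& key) = 0;
};

// In-memory implementation: handy for tests and single-node setups,
// but not shared across processes.
class InMemoryCoordinationBackend : public CoordinationBackend {
 public:
  bool SetValue(const std::string& key, const std::string& value) override {
    std::lock_guard<std::mutex> lock(mu_);
    kv_[key] = value;
    return true;
  }
  std::optional<std::string> GetValue(const std::string& key) override {
    std::lock_guard<std::mutex> lock(mu_);
    auto it = kv_.find(key);
    if (it == kv_.end()) return std::nullopt;  // key absent, e.g. info not yet written
    return it->second;
  }

 private:
  std::mutex mu_;
  std::map<std::string, std::string> kv_;
};
```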

Part 3: Implement the advertised_host configuration

Part 4: Add the high-availability design document

Part 5: Add support for the new interface in GrpcStub
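The advertised_host resolution in Part 3 could be sketched as below: prefer the explicit config value, otherwise fall back to a detected local IP. NetUtil::GetLocalIp is stubbed here (the real helper, mentioned in the review, scans network interfaces), and the config key name follows the reviewer's kvcm.service.advertised_host reference:

```cpp
#include <cassert>
#include <string>

// Stand-in for NetUtil::GetLocalIp; the real helper inspects network interfaces.
std::string GetLocalIpStub() { return "10.0.0.5"; }

// Resolve the host other nodes and clients should use to reach this node:
// an explicitly configured advertised host wins over auto-detection.
std::string ResolveAdvertisedHost(const std::string& advertised_host_config) {
  if (!advertised_host_config.empty()) {
    return advertised_host_config;
  }
  return GetLocalIpStub();
}
```

On hosts with multiple non-loopback interfaces, auto-detection may pick an IP that is unreachable from client networks, which is why the explicit config exists.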


@wangxiyu191 wangxiyu191 requested a review from oldsharp March 24, 2026 13:28

@qoderai qoderai bot left a comment


👋 Review Summary

The coordination backend refactor and new leader discovery plumbing are well thought through: CoordinationBackend unifies lock + KV semantics, leader endpoint info is propagated out to both meta/admin services, and you’ve added solid tests and docs for the HA flows.

🛡️ Key Risks & Issues

  • GetClusterInfo currently reports OK even when the leader’s NodeEndpointInfo cannot be read from the coordination backend, leaving leader_endpoint with default values while clients believe the call succeeded. In error or misconfig scenarios this can surface as confusing connection failures on the client side rather than a clear retriable error indicating that leader metadata is unavailable.
  • MetaServiceImpl explicitly handles a null leader_elector_ as SERVICE_NOT_READY, while AdminServiceImpl’s HA endpoints assume leader_elector_ is always non-null and dereference it directly. In production this is probably fine because Server::Init wires a leader elector, but it does create a behavioral asymmetry compared with MetaServiceImpl and could surprise non-HA or custom test setups.
  • The new configuration semantics make kvcm.coordination.uri primary and fall back to an in-memory backend when empty. This is convenient for single-node setups but makes it easier to accidentally run multiple nodes against independent in-memory coordination rather than a shared backend if operators forget to configure a URI.
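A minimal sketch of the kind of startup guard this last point suggests: the config key names follow the PR, but the check itself is hypothetical and the HA-enabled signal would come from whatever the project's real config exposes:

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical startup check: warn loudly when HA is enabled but no shared
// coordination URI is configured, since each node would then fall back to its
// own in-memory backend and form an isolated "single-node" cluster.
bool CheckCoordinationConfig(const std::string& coordination_uri, bool ha_enabled,
                             std::ostream& log) {
  if (coordination_uri.empty() && ha_enabled) {
    log << "WARNING: kvcm.coordination.uri is empty but HA is enabled; "
           "falling back to an in-memory backend that is NOT shared across nodes\n";
    return false;
  }
  return true;
}
```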

🧪 Verification Advice

  • In addition to the existing integration tests, I’d recommend adding a targeted case where node endpoint info for the leader is missing or corrupt in CoordinationBackend, and asserting the behavior of both GetClusterInfo and GetManagerClusterInfo (status code and presence/shape of leader_endpoint) so clients can rely on a clear contract in this situation.
  • If you plan to rely on the NetUtil::GetLocalIp path in real deployments (without configuring kvcm.service.advertised_host), it’s worth running and/or adding tests on a host with multiple non-loopback interfaces to confirm that the chosen IP is actually reachable from your client networks.

💡 Thoughts & Suggestions

  • Consider tightening the contract around leader endpoint availability: either return a non-OK status when leader_node_id is populated but NodeEndpointInfo cannot be read, or clearly document a convention (for example, leader_endpoint.host empty means “endpoint unavailable”) that clients can check and handle explicitly.
  • For configuration, you might want a small admin or startup check that warns loudly when multiple nodes start with an empty coordination URI but non-default HA settings, to reduce the risk of unintentionally running multiple “single-node” clusters.
  • Overall, the direction of unifying coordination concerns under CoordinationBackend and exposing leader discovery via well-typed protobufs and client structs looks good and should make HA behavior easier to reason about going forward.

🤖 Generated by QoderView workflow run

```cpp
  }
}

status->set_code(proto::meta::OK);
```

In GetClusterInfo you always return OK as long as leader_elector_ exists, even if leader_elector_->ReadNodeInfo(leader_node_id, node_info) fails (for example EC_NOENT when node info hasn’t been written yet or backend/JSON errors). In that case leader_endpoint stays at its default values (empty host and zero ports), but clients see a successful response and GrpcStub::GetClusterInfo maps it to ER_OK, giving them unusable connection data with no signal that leader metadata is missing or stale.

Would it be better to surface this more explicitly to callers? For example, we could return a non-OK status (e.g. SERVICE_NOT_READY or a dedicated error) when leader_node_id is set but node info can’t be read, or at least define a contract that clients should treat an empty leader_endpoint.host as “leader endpoint unavailable” and retry/fallback accordingly.
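The suggested tightening can be modeled in miniature as follows. The real types (LeaderElector, the proto status codes, NodeEndpointInfo) are stood in by simple structs and an enum for illustration only; the point is the contract of returning a non-OK status when the leader's endpoint info cannot be read, instead of OK with default values:

```cpp
#include <cassert>
#include <optional>
#include <string>

// Illustrative stand-ins for the service's real types.
enum class Code { OK, SERVICE_NOT_READY };

struct NodeInfo {
  std::string host;
  int meta_port = 0;
};

struct FakeElector {
  std::optional<NodeInfo> info;  // empty models a ReadNodeInfo failure (e.g. EC_NOENT)
  bool ReadNodeInfo(NodeInfo* out) const {
    if (!info) return false;
    *out = *info;
    return true;
  }
};

// Proposed contract: if the leader is known but its endpoint info cannot be
// read from the coordination backend, surface SERVICE_NOT_READY so clients
// can retry, rather than returning OK with an empty leader_endpoint.
Code GetClusterInfoSketch(const FakeElector& elector, NodeInfo* leader_endpoint) {
  NodeInfo info;
  if (!elector.ReadNodeInfo(&info)) {
    return Code::SERVICE_NOT_READY;
  }
  *leader_endpoint = info;
  return Code::OK;
}
```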


🤖 Generated by QoderFix in Qoder

@github-actions github-actions bot added the ai reviewed AI has reviewed this PR label Mar 24, 2026