[coordination/manager/integration_test] add leader query interface #73
wangxiyu191 wants to merge 10 commits into main from
Conversation
Leader Discovery Interface Implementation

Context

In a primary-standby HA deployment, all nodes sit behind the same load balancer. Currently, clients cannot discover the Leader's connection details through an arbitrary node and must maintain failover logic on their own. This change has two parts:

- Rename DistributedLockBackend to CoordinationBackend: the original name only conveyed lock semantics, but the backend now also needs to carry KV storage (node connection info). Renaming it to CoordinationBackend more accurately reflects its role as a "distributed coordination backend."
- Add Leader discovery: introduce SetValue/GetValue methods on CoordinationBackend to store node connection info, add a GetClusterInfo RPC to Meta Service, and enhance GetManagerClusterInfo in Admin, so that clients can query the Leader's ID and four connection ports from any node.

Scope of this change: only the Leader info query is implemented; the full cluster node list is deferred to a future iteration.

Design Decisions
Implementation Steps

- Part 1: Rename DistributedLockBackend → CoordinationBackend
- Part 2: Add Leader discovery
- Part 3: Implement the advertised_host configuration
- Part 4: Add HA design documentation
- Part 5: Add support for the new interfaces in GrpcStub
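The unified backend described in the rename can be sketched as follows. This is a minimal, illustrative model only — the class names match the PR, but every signature here is an assumption, and the in-memory variant simply mirrors the described fallback when no coordination URI is configured:

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical sketch of CoordinationBackend after the rename: the
// existing lock surface plus the new SetValue/GetValue KV methods.
// All signatures are illustrative assumptions, not the project's API.
class CoordinationBackend {
 public:
  virtual ~CoordinationBackend() = default;
  virtual bool TryLock(const std::string& key, const std::string& owner) = 0;
  virtual bool SetValue(const std::string& key, const std::string& value) = 0;
  virtual bool GetValue(const std::string& key, std::string* value) = 0;
};

// Minimal in-memory implementation, mirroring the fallback backend
// used when no coordination URI is configured.
class InMemoryCoordinationBackend : public CoordinationBackend {
 public:
  bool TryLock(const std::string& key, const std::string& owner) override {
    auto result = locks_.emplace(key, owner);
    return result.second || result.first->second == owner;
  }
  bool SetValue(const std::string& key, const std::string& value) override {
    kv_[key] = value;
    return true;
  }
  bool GetValue(const std::string& key, std::string* value) override {
    auto it = kv_.find(key);
    if (it == kv_.end()) return false;
    *value = it->second;
    return true;
  }

 private:
  std::map<std::string, std::string> locks_;  // lock key -> owner
  std::map<std::string, std::string> kv_;     // node-info key -> JSON value
};
```

The point of the unification is that leader election (lock) and node-info publication (KV) share one backend, so a node that holds the lock can always be looked up in the same store.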
👋 Review Summary
The coordination backend refactor and new leader discovery plumbing are well thought through: CoordinationBackend unifies lock + KV semantics, leader endpoint info is propagated out to both meta/admin services, and you’ve added solid tests and docs for the HA flows.
🛡️ Key Risks & Issues
- GetClusterInfo currently reports OK even when the leader's NodeEndpointInfo cannot be read from the coordination backend, leaving leader_endpoint with default values while clients believe the call succeeded. In error or misconfiguration scenarios this can surface as confusing connection failures on the client side rather than a clear, retriable error indicating that leader metadata is unavailable.
- MetaServiceImpl explicitly handles a null leader_elector_ as SERVICE_NOT_READY, while AdminServiceImpl's HA endpoints assume leader_elector_ is always non-null and dereference it directly. In production this is probably fine because Server::Init wires up a leader elector, but it does create a behavioral asymmetry compared with MetaServiceImpl and could surprise non-HA or custom test setups.
- The new configuration semantics make kvcm.coordination.uri primary and fall back to an in-memory backend when it is empty. This is convenient for single-node setups, but it makes it easier to accidentally run multiple nodes against independent in-memory coordination rather than a shared backend if operators forget to configure a URI.
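The null-check asymmetry flagged above could be closed by funneling both service impls through one guard. The sketch below is illustrative only — the types are stand-ins, not the project's actual classes:

```cpp
#include <cassert>
#include <string>

// Illustrative sketch (not project code) of a symmetric guard:
// AdminServiceImpl's HA endpoints checking leader_elector_ for null
// the same way MetaServiceImpl already does.
enum class Code { OK, SERVICE_NOT_READY };

struct LeaderElector {
  std::string leader_id;  // stand-in for the real elector's state
};

Code GuardedLeaderId(const LeaderElector* elector, std::string* leader_id) {
  if (elector == nullptr) {
    // Non-HA or custom test setups without a wired elector fail
    // consistently instead of dereferencing a null pointer.
    return Code::SERVICE_NOT_READY;
  }
  *leader_id = elector->leader_id;
  return Code::OK;
}
```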
🧪 Verification Advice
- In addition to the existing integration tests, I’d recommend adding a targeted case where node endpoint info for the leader is missing or corrupt in CoordinationBackend, and asserting the behavior of both GetClusterInfo and GetManagerClusterInfo (status code and presence/shape of leader_endpoint) so clients can rely on a clear contract in this situation.
- If you plan to rely on the NetUtil::GetLocalIp path in real deployments (without configuring kvcm.service.advertised_host), it's worth running and/or adding tests on a host with multiple non-loopback interfaces to confirm that the chosen IP is actually reachable from your client networks.
💡 Thoughts & Suggestions
- Consider tightening the contract around leader endpoint availability: either return a non-OK status when leader_node_id is populated but NodeEndpointInfo cannot be read, or clearly document a convention (for example, an empty leader_endpoint.host means "endpoint unavailable") that clients can check and handle explicitly.
- For configuration, you might want a small admin or startup check that warns loudly when multiple nodes start with an empty coordination URI but non-default HA settings, to reduce the risk of unintentionally running multiple "single-node" clusters.
- Overall, the direction of unifying coordination concerns under CoordinationBackend and exposing leader discovery via well-typed protobufs and client structs looks good and should make HA behavior easier to reason about going forward.
🤖 Generated by Qoder • View workflow run
```cpp
    }
}

status->set_code(proto::meta::OK);
```
In GetClusterInfo you always return OK as long as leader_elector_ exists, even if leader_elector_->ReadNodeInfo(leader_node_id, node_info) fails (for example EC_NOENT when node info hasn’t been written yet or backend/JSON errors). In that case leader_endpoint stays at its default values (empty host and zero ports), but clients see a successful response and GrpcStub::GetClusterInfo maps it to ER_OK, giving them unusable connection data with no signal that leader metadata is missing or stale.
Would it be better to surface this more explicitly to callers? For example, we could return a non-OK status (e.g. SERVICE_NOT_READY or a dedicated error) when leader_node_id is set but node info can’t be read, or at least define a contract that clients should treat an empty leader_endpoint.host as “leader endpoint unavailable” and retry/fallback accordingly.
🤖 Generated by Qoder • Fix in Qoder
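The suggestion above — fail explicitly when a leader is known but its endpoint cannot be read — can be sketched like this. The status codes, struct fields, and callback shape are assumptions for illustration, not the project's actual types:

```cpp
#include <cassert>
#include <functional>
#include <string>

// Sketch of the suggested contract: return a non-OK status when a
// leader is known but its endpoint info cannot be read, instead of
// OK with a default-initialized endpoint. Names are illustrative.
enum class StatusCode { OK, SERVICE_NOT_READY };

struct NodeEndpointInfo {
  std::string host;
  int meta_port = 0;
};

using ReadNodeInfoFn =
    std::function<bool(const std::string&, NodeEndpointInfo*)>;

StatusCode FillLeaderEndpoint(const std::string& leader_node_id,
                              const ReadNodeInfoFn& read_node_info,
                              NodeEndpointInfo* out) {
  if (leader_node_id.empty()) {
    return StatusCode::SERVICE_NOT_READY;  // no leader elected yet
  }
  if (!read_node_info(leader_node_id, out)) {
    return StatusCode::SERVICE_NOT_READY;  // node info missing or corrupt
  }
  return StatusCode::OK;
}
```

With this shape, a client that sees SERVICE_NOT_READY can retry or fall back, rather than attempting to connect to an empty host and zero ports.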
Leader Discovery Interface Implementation Plan
Context
In a primary-standby HA deployment, all nodes sit behind the same load balancer. Currently, clients cannot discover the Leader's connection details through an arbitrary node and must maintain failover logic on their own. This change has two parts:
Rename DistributedLockBackend to CoordinationBackend: the original name only conveyed lock semantics. The backend now also needs to carry KV storage (node connection info), and renaming it to CoordinationBackend more accurately reflects its role as a "distributed coordination backend."
Add Leader discovery: introduce SetValue/GetValue methods on CoordinationBackend to store node connection info, add a GetClusterInfo RPC to Meta Service, and enhance GetManagerClusterInfo in Admin, so that clients can query the Leader's ID and four connection ports from any node.
Scope of this change: only the Leader info query is implemented; the full cluster node list is deferred to a future iteration.
Design Decisions
Backend rename: DistributedLockBackend → CoordinationBackend. All implementation classes and files are renamed accordingly, and variables are uniformly renamed to coordination_backend_. A new config key kvcm.coordination.uri is added, while the legacy key kvcm.distributed_lock.uri remains supported for compatibility.
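The config-key compatibility (new kvcm.coordination.uri preferred, legacy kvcm.distributed_lock.uri still honored, empty meaning "in-memory backend") can be sketched as below. The lookup mechanics and function name are assumptions for illustration:

```cpp
#include <cassert>
#include <map>
#include <string>

// Sketch of the config-key fallback chain: prefer the new key, fall
// back to the legacy key, and treat an empty result as a signal to
// use the in-memory backend. Illustrative only, not project code.
std::string ResolveCoordinationUri(
    const std::map<std::string, std::string>& conf) {
  auto it = conf.find("kvcm.coordination.uri");
  if (it != conf.end() && !it->second.empty()) return it->second;
  it = conf.find("kvcm.distributed_lock.uri");  // legacy key, kept for compat
  if (it != conf.end() && !it->second.empty()) return it->second;
  return "";  // empty -> fall back to an in-memory backend
}
```

Note that an empty result silently yields a per-node "cluster", which is the operational risk the review flags: a startup warning when HA settings are enabled but the URI is empty would make that misconfiguration visible.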
Connection info storage: on startup, each node writes an independent KV entry to CoordinationBackend, whose value is a JSON document containing the node's own host and 4 ports. On query, the key corresponding to leader_node_id is read.
Key pattern: _TAIR_KVCM_NODE_INFO_{node_id}
Meta RPC naming: GetClusterInfo — it expresses the cluster concept and can later be extended to return more cluster information.
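The per-node key/value convention above can be sketched as follows. The key prefix matches the documented pattern; the JSON field names for the four ports are placeholders, since the actual layout is not specified here:

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Sketch of the per-node KV convention: key built from the node id,
// value a JSON document with host + 4 ports. The port field names
// (port1..port4) are placeholders; the real layout may differ.
std::string NodeInfoKey(const std::string& node_id) {
  return "_TAIR_KVCM_NODE_INFO_" + node_id;
}

std::string NodeInfoJson(const std::string& host, int port1, int port2,
                         int port3, int port4) {
  std::ostringstream os;
  os << "{\"host\":\"" << host << "\","
     << "\"port1\":" << port1 << ",\"port2\":" << port2 << ","
     << "\"port3\":" << port3 << ",\"port4\":" << port4 << "}";
  return os.str();
}
```

On query, a node resolves leader_node_id via the elector, builds the key with NodeInfoKey, and parses the stored JSON back into the endpoint struct.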
Implementation Steps
Part 1: Rename DistributedLockBackend → CoordinationBackend
Part 2: Add Leader discovery
Part 3: Implement the advertised_host configuration
Part 4: Add HA design documentation
Part 5: Add support for the new interfaces in GrpcStub