Reduce MCP discovery timeouts by utafrali · Pull Request #121 · DingTalk-Real-AI/dingtalk-workspace-cli

utafrali · 2026-04-17T02:15:01Z

Fixes #119

When MCP plugins are unreachable, the discovery takes forever and blocks everything. Brought down the overall timeout from 10s to 4s and per-server timeout from 5s to 2s. Added a PerServerTimeout field to make it configurable too.

Added tests for the timeout behavior. Build, lint, test, and policy checks all pass. Verified with check-generated-drift.sh and check-command-surface.sh.

When MCP plugins are unreachable, the discovery times out too slowly and blocks commands. Reduced the overall discovery timeout from 10s to 4s and per-server timeout from 5s to 2s. Made the per-server timeout configurable.

PeterGuy326

验证总结 + 建议的补充修复（不 approve，请作者评估）

感谢修复！方向完全正确——DialContext.Timeout 10s→3s 确实命中了超时的一个重要来源。本地复现 + 实测后有两点想和你对齐，希望本 PR 能一次闭环。

一、实测数据

用一个不可达的第三方 streamable-http 插件（endpoint: http://192.0.2.1:8443/mcp，带 headers: {Authorization: ...}）触发 registerHTTPServer 的 AuthHeaders 路径：

场景	`dws --help` 耗时（3 轮均值）
main（10s dial）	10.16s
本 PR（3s dial）	9.12s ← 只省 ~1s
本 PR + 本评论建议的两行补丁	4.11s ← 节省 ~6s

二、根因（为什么本 PR 单独只省 1s）

transport.Client.Initialize（internal/transport/client.go:300）按序尝试 3 个协议版本 [2025-03-26, 2024-11-05, 2024-06-18]。对于不可达端点：

main：第 1 版本 dial 就卡满 10s（外层 ctx 也是 10s），等于 1 × 10s = 10s
本 PR：每个版本 dial 3s 失败 → 走下一版本 → 3 × 3s = 9s，刚好没撞到外层 ctx（10s）

Dial 时间减了，但"dial 次数"和"外层 ctx 上限"没动，所以节省被版本循环放大器稀释了。

三、建议补充两处小改动（+20/-3 行）

改动 A：`Initialize` 在传输层错误时短路（`internal/transport/client.go:300`）

仅当 callJSONRPC 返回的是 JSON-RPC 协议错误（server 回了 -32600 unsupported 这种结构化错误，CallError.Stage == CallStageJSONRPC）时才降级到下一个版本；dial/HTTP/read 失败直接返回，不可能通过换协议版本修好。

改动 B：`registerHTTPServer` 外层 ctx `10s → 4s`（`internal/app/root.go:1181`）

配合 A + PR 的 3s dial，不可达插件 ctx 上限 4s 就够；健康的第三方端点（百炼等）正常 <2s 响应，余量充足。

可直接 cherry-pick 的 diff

diff --git a/internal/app/root.go b/internal/app/root.go
--- a/internal/app/root.go
+++ b/internal/app/root.go
@@ -1178,7 +1178,10 @@ func registerHTTPServer(p *plugin.Plugin, srv market.ServerDescriptor, tc *trans
 	// services may have higher latency than local/DingTalk endpoints).
 	timeout := 2 * time.Second
 	if len(srv.AuthHeaders) > 0 {
-		timeout = 10 * time.Second
+		// Combined with DialContext.Timeout=3s and Initialize short-circuit on
+		// transport errors, 4s bounds startup for unreachable third-party MCP
+		// servers (e.g. Bailian) while still allowing a healthy response.
+		timeout = 4 * time.Second
 	}
 	ctx, cancel := context.WithTimeout(context.Background(), timeout)
 	defer cancel()

diff --git a/internal/transport/client.go b/internal/transport/client.go
--- a/internal/transport/client.go
+++ b/internal/transport/client.go
@@ -298,6 +298,7 @@ func SupportedProtocolVersions() []string {
 }
 
 func (c *Client) Initialize(ctx context.Context, endpoint string) (InitializeResult, error) {
+	var lastErr error
 	for _, version := range SupportedProtocolVersions() {
 		params := map[string]any{
 			"capabilities": map[string]any{},
@@ -309,18 +310,31 @@ func (c *Client) Initialize(ctx context.Context, endpoint string) (InitializeRes
 		}
 
 		var payload InitializeResult
-		if err := c.callJSONRPC(ctx, endpoint, requestEnvelope{
+		err := c.callJSONRPC(ctx, endpoint, requestEnvelope{
 			JSONRPC: "2.0",
 			ID:      1,
 			Method:  "initialize",
 			Params:  params,
-		}, true, &payload); err == nil {
+		}, true, &payload)
+		if err == nil {
 			if payload.ProtocolVersion == "" {
 				payload.ProtocolVersion = version
 			}
 			payload.RequestedProtocolVersion = version
 			return payload, nil
 		}
+		lastErr = err
+		// Only protocol-level JSON-RPC errors justify trying another version.
+		// Transport/HTTP failures will not be fixed by renegotiating, so short
+		// circuit to avoid multiplying dial timeouts (e.g. 3 × 3s on an
+		// unreachable endpoint).
+		var callErr *CallError
+		if !errors.As(err, &callErr) || callErr.Stage != CallStageJSONRPC {
+			return InitializeResult{}, err
+		}
+	}
+	if lastErr != nil {
+		return InitializeResult{}, lastErr
 	}
 	return InitializeResult{}, apperrors.NewDiscovery(fmt.Sprintf(\"initialize failed for all supported protocol versions at %s\", RedactURL(endpoint)))
 }

共 2 个文件，+20/-3 行。本地已验证：

go test ./internal/transport/... ./internal/discovery/... 全绿（含现有的 TestInitializeNegotiatesProtocolVersion，因为它走的是 JSON-RPC -32600 错误路径，Stage=jsonrpc，不受短路影响）
实测 dws --help 从 9.1s 降到 4.1s

四、另一个提醒：rebase

PR 当前 base (31eb109) 已落后 main 4 个 merge（包括 #120 的 bugfix/plugin-system、#117 等），建议合并前 rebase 到最新 main。我本地 cherry-pick 到 origin/main 时无冲突，但为了 CI 绿灯和 history 干净还是建议你 rebase 一下。

五、结论

PR 本身方向 100% 正确，但收益被版本循环稀释到 ~1s。希望能在本 PR 里顺手把上面两个小改动合进来——一次闭环，~7s 的真实提升。如果你更希望保持本 PR 的最小颗粒度先合入，我也完全理解，那边的补丁我可以另开 PR。

你定。🙏

(Reviewed by @PeterGuy326 / 修雨 — cherry-picked onto latest main locally, measured end-to-end on darwin/arm64)

utafrali · 2026-04-17T08:03:41Z

Done, pushed the fix.

huyz · 2026-04-17T08:45:39Z

@PeterGuy326 I did NOT review this. I don't know this project. You've got the wrong personl

@utafrali

…achable Issue #119: an unreachable third-party MCP plugin (e.g. blocked/firewalled endpoint) blocks `dws --help` for ~10s on every CLI invocation, because plugin discovery happens eagerly during command-tree construction. Root cause was a stack of timeouts that multiplied under transport failure: * `transport.Client.Initialize` loops three supported protocol versions on every error — including dial timeouts and HTTP 5xx — even though those failure modes are protocol-version-independent. Three loops × the dial budget = 3× the worst-case startup cost. * Default `DialContext.Timeout` was 10s, so a single dial against an unroutable address (e.g. TEST-NET-1) burned the full plugin context. * `registerHTTPServer` granted plugins with `AuthHeaders` a 10s outer context (intended for slow third-party services), and `registerStdioServer` granted every stdio plugin 10s. Either single misbehaving plugin therefore stalled the entire CLI. * Registry-side `defaultDiscoveryTimeout` (10s) and `perServerDiscoveryTimeout` (5s) had the same shape on the LoadCatalog path. Changes: * `transport.Client.Initialize`: short-circuit when the underlying `*CallError.Stage` is anything other than `CallStageJSONRPC`. Protocol version negotiation is the only justification for retrying with another version; transport/HTTP failures fail identically and should surrender. * `transport.defaultTransport`: `DialContext.Timeout` 10s → 3s. * `app.registerHTTPServer`: AuthHeaders timeout 10s → 4s. * `app.registerStdioServer`: outer ctx timeout 10s → 4s. * `cli.defaultDiscoveryTimeout`: 10s → 4s. * `discovery.Service`: rename `perServerDiscoveryTimeout` 5s → 2s (`defaultPerServerDiscoveryTimeout`); add `Service.PerServerTimeout` override field for tests/callers needing a tighter or looser bound. Tests: * `TestInitializeShortCircuitsOnHTTPError` — asserts only ONE protocol version is attempted on HTTP 5xx. * `TestInitializeShortCircuitsOnDialFailure` — asserts Initialize returns in <2s on a refused-connection address. * Existing `TestInitializeNegotiatesProtocolVersion` continues to pass — JSON-RPC-stage errors still trigger version fallback. End-to-end measurement with a blackhole plugin (endpoint=192.0.2.1, AuthHeaders set): | Variant | median `dws --help` | | ------------- | ------------------- | | main | 10.2s | | PR #121 alone | 9.1s | | this branch | 3.72s (-63%) | The `loader.go` and `discovery/service.go` timeout reductions overlap with PR #121 (#121) by @utafrali; crediting via Co-authored-by. Refs: #119 Supersedes: #121 Co-authored-by: utafrali <8383373+utafrali@users.noreply.github.com>

utafrali · 2026-04-17T09:39:17Z

Done, pushed the fix.

PeterGuy326 · 2026-04-17T09:41:33Z

@utafrali 同步一下：#119 已经在 #125 修好了。

#121 单独把 timeout 改小（10s→4s）只能省 ~1s，根因在 transport.Client.Initialize 会对所有错误（包括网络层）循环 3 个协议版本，把每次 dial 的预算 ×3。所以 #125 的核心改动是给 Initialize 加短路——只有 JSON-RPC 协议错误才换版本重试，传输层错误立即返回。loader.go 和 service.go 的 timeout 调整你的方向是对的，#125 也合并了。

端到端实测（blackhole plugin，AuthHeaders 路径）：

版本	`dws --help` 中位数
main	10.2s
#121 单独	9.1s
#125	3.72s

感谢你 raise 这个问题，给了排查方向 🙏 #121 怎么处理你来定。

@utafrali

…achable Issue #119: an unreachable third-party MCP plugin (e.g. blocked/firewalled endpoint) blocks `dws --help` for ~10s on every CLI invocation, because plugin discovery happens eagerly during command-tree construction. Root cause was a stack of timeouts that multiplied under transport failure: * `transport.Client.Initialize` loops three supported protocol versions on every error — including dial timeouts and HTTP 5xx — even though those failure modes are protocol-version-independent. Three loops × the dial budget = 3× the worst-case startup cost. * Default `DialContext.Timeout` was 10s, so a single dial against an unroutable address (e.g. TEST-NET-1) burned the full plugin context. * `registerHTTPServer` granted plugins with `AuthHeaders` a 10s outer context (intended for slow third-party services), and `registerStdioServer` granted every stdio plugin 10s. Either single misbehaving plugin therefore stalled the entire CLI. * Registry-side `defaultDiscoveryTimeout` (10s) and `perServerDiscoveryTimeout` (5s) had the same shape on the LoadCatalog path. Changes: * `transport.Client.Initialize`: short-circuit when the underlying `*CallError.Stage` is anything other than `CallStageJSONRPC`. Protocol version negotiation is the only justification for retrying with another version; transport/HTTP failures fail identically and should surrender. * `transport.defaultTransport`: `DialContext.Timeout` 10s → 3s. * `app.registerHTTPServer`: AuthHeaders timeout 10s → 4s. * `app.registerStdioServer`: outer ctx timeout 10s → 4s. * `cli.defaultDiscoveryTimeout`: 10s → 4s. * `discovery.Service`: rename `perServerDiscoveryTimeout` 5s → 2s (`defaultPerServerDiscoveryTimeout`); add `Service.PerServerTimeout` override field for tests/callers needing a tighter or looser bound. Tests: * `TestInitializeShortCircuitsOnHTTPError` — asserts only ONE protocol version is attempted on HTTP 5xx. * `TestInitializeShortCircuitsOnDialFailure` — asserts Initialize returns in <2s on a refused-connection address. * Existing `TestInitializeNegotiatesProtocolVersion` continues to pass — JSON-RPC-stage errors still trigger version fallback. End-to-end measurement with a blackhole plugin (endpoint=192.0.2.1, AuthHeaders set): | Variant | median `dws --help` | | ------------- | ------------------- | | main | 10.2s | | PR #121 alone | 9.1s | | this branch | 3.72s (-63%) | The `loader.go` and `discovery/service.go` timeout reductions overlap with PR #121 by @utafrali; see PR description for attribution. Refs: #119 Supersedes: #121

utafrali · 2026-04-17T10:45:46Z

Done, pushed the fix.

Reduce MCP discovery timeouts

a078599

When MCP plugins are unreachable, the discovery times out too slowly and blocks commands. Reduced the overall discovery timeout from 10s to 4s and per-server timeout from 5s to 2s. Made the per-server timeout configurable.

PeterGuy326 reviewed Apr 17, 2026

View reviewed changes

PeterGuy326 mentioned this pull request Apr 17, 2026

perf(plugin): fast plugin MCP startup via cache + parallel discovery (#119) #125

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce MCP discovery timeouts#121

Reduce MCP discovery timeouts#121
utafrali wants to merge 1 commit intoDingTalk-Real-AI:mainfrom
utafrali:fix/issue-119-mcp

utafrali commented Apr 17, 2026

Uh oh!

PeterGuy326 left a comment •

edited

Loading

Uh oh!

utafrali commented Apr 17, 2026

Uh oh!

huyz commented Apr 17, 2026

Uh oh!

utafrali commented Apr 17, 2026

Uh oh!

PeterGuy326 commented Apr 17, 2026 •

edited

Loading

Uh oh!

utafrali commented Apr 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

utafrali commented Apr 17, 2026

Uh oh!

PeterGuy326 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

验证总结 + 建议的补充修复（不 approve，请作者评估）

一、实测数据

二、根因（为什么本 PR 单独只省 1s）

三、建议补充两处小改动（+20/-3 行）

改动 A：Initialize 在传输层错误时短路（internal/transport/client.go:300）

改动 B：registerHTTPServer 外层 ctx 10s → 4s（internal/app/root.go:1181）

可直接 cherry-pick 的 diff

四、另一个提醒：rebase

五、结论

Uh oh!

utafrali commented Apr 17, 2026

Uh oh!

huyz commented Apr 17, 2026

Uh oh!

utafrali commented Apr 17, 2026

Uh oh!

PeterGuy326 commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

utafrali commented Apr 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

PeterGuy326 left a comment •

edited

Loading

改动 A：`Initialize` 在传输层错误时短路（`internal/transport/client.go:300`）

改动 B：`registerHTTPServer` 外层 ctx `10s → 4s`（`internal/app/root.go:1181`）

PeterGuy326 commented Apr 17, 2026 •

edited

Loading