Skip to content

Reduce MCP discovery timeouts#121

Open
utafrali wants to merge 1 commit intoDingTalk-Real-AI:mainfrom
utafrali:fix/issue-119-mcp
Open

Reduce MCP discovery timeouts#121
utafrali wants to merge 1 commit intoDingTalk-Real-AI:mainfrom
utafrali:fix/issue-119-mcp

Conversation

@utafrali
Copy link
Copy Markdown

Fixes #119

When MCP plugins are unreachable, the discovery takes forever and blocks everything. Brought down the overall timeout from 10s to 4s and per-server timeout from 5s to 2s. Added a PerServerTimeout field to make it configurable too.

Added tests for the timeout behavior. Build, lint, test, and policy checks all pass. Verified with check-generated-drift.sh and check-command-surface.sh.

When MCP plugins are unreachable, the discovery times out too slowly and blocks commands. Reduced the overall discovery timeout from 10s to 4s and per-server timeout from 5s to 2s. Made the per-server timeout configurable.
Copy link
Copy Markdown
Collaborator

@PeterGuy326 PeterGuy326 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

验证总结 + 建议的补充修复(不 approve,请作者评估)

感谢修复!方向完全正确——DialContext.Timeout 10s→3s 确实命中了超时的一个重要来源。本地复现 + 实测后有两点想和你对齐,希望本 PR 能一次闭环。

一、实测数据

用一个不可达的第三方 streamable-http 插件(endpoint: http://192.0.2.1:8443/mcp,带 headers: {Authorization: ...})触发 registerHTTPServer 的 AuthHeaders 路径:

场景 dws --help 耗时(3 轮均值)
main(10s dial) 10.16s
本 PR(3s dial) 9.12s ← 只省 ~1s
本 PR + 本评论建议的两行补丁 4.11s ← 节省 ~6s

二、根因(为什么本 PR 单独只省 1s)

transport.Client.Initializeinternal/transport/client.go:300)按序尝试 3 个协议版本 [2025-03-26, 2024-11-05, 2024-06-18]。对于不可达端点:

  • main:第 1 版本 dial 就卡满 10s(外层 ctx 也是 10s),等于 1 × 10s = 10s
  • 本 PR:每个版本 dial 3s 失败 → 走下一版本 → 3 × 3s = 9s,刚好没撞到外层 ctx(10s)

Dial 时间减了,但"dial 次数"和"外层 ctx 上限"没动,所以节省被版本循环放大器稀释了。

三、建议补充两处小改动(+20/-3 行)

改动 A:Initialize 在传输层错误时短路(internal/transport/client.go:300

仅当 callJSONRPC 返回的是 JSON-RPC 协议错误(server 回了 -32600 unsupported 这种结构化错误,CallError.Stage == CallStageJSONRPC)时才降级到下一个版本;dial/HTTP/read 失败直接返回,不可能通过换协议版本修好。

改动 B:registerHTTPServer 外层 ctx 10s → 4sinternal/app/root.go:1181

配合 A + PR 的 3s dial,不可达插件 ctx 上限 4s 就够;健康的第三方端点(百炼等)正常 <2s 响应,余量充足。

可直接 cherry-pick 的 diff

diff --git a/internal/app/root.go b/internal/app/root.go
--- a/internal/app/root.go
+++ b/internal/app/root.go
@@ -1178,7 +1178,10 @@ func registerHTTPServer(p *plugin.Plugin, srv market.ServerDescriptor, tc *trans
 	// services may have higher latency than local/DingTalk endpoints).
 	timeout := 2 * time.Second
 	if len(srv.AuthHeaders) > 0 {
-		timeout = 10 * time.Second
+		// Combined with DialContext.Timeout=3s and Initialize short-circuit on
+		// transport errors, 4s bounds startup for unreachable third-party MCP
+		// servers (e.g. Bailian) while still allowing a healthy response.
+		timeout = 4 * time.Second
 	}
 	ctx, cancel := context.WithTimeout(context.Background(), timeout)
 	defer cancel()

diff --git a/internal/transport/client.go b/internal/transport/client.go
--- a/internal/transport/client.go
+++ b/internal/transport/client.go
@@ -298,6 +298,7 @@ func SupportedProtocolVersions() []string {
 }
 
 func (c *Client) Initialize(ctx context.Context, endpoint string) (InitializeResult, error) {
+	var lastErr error
 	for _, version := range SupportedProtocolVersions() {
 		params := map[string]any{
 			"capabilities": map[string]any{},
@@ -309,18 +310,31 @@ func (c *Client) Initialize(ctx context.Context, endpoint string) (InitializeRes
 		}
 
 		var payload InitializeResult
-		if err := c.callJSONRPC(ctx, endpoint, requestEnvelope{
+		err := c.callJSONRPC(ctx, endpoint, requestEnvelope{
 			JSONRPC: "2.0",
 			ID:      1,
 			Method:  "initialize",
 			Params:  params,
-		}, true, &payload); err == nil {
+		}, true, &payload)
+		if err == nil {
 			if payload.ProtocolVersion == "" {
 				payload.ProtocolVersion = version
 			}
 			payload.RequestedProtocolVersion = version
 			return payload, nil
 		}
+		lastErr = err
+		// Only protocol-level JSON-RPC errors justify trying another version.
+		// Transport/HTTP failures will not be fixed by renegotiating, so short
+		// circuit to avoid multiplying dial timeouts (e.g. 3 × 3s on an
+		// unreachable endpoint).
+		var callErr *CallError
+		if !errors.As(err, &callErr) || callErr.Stage != CallStageJSONRPC {
+			return InitializeResult{}, err
+		}
+	}
+	if lastErr != nil {
+		return InitializeResult{}, lastErr
 	}
 	return InitializeResult{}, apperrors.NewDiscovery(fmt.Sprintf(\"initialize failed for all supported protocol versions at %s\", RedactURL(endpoint)))
 }

共 2 个文件,+20/-3 行。本地已验证:

  • go test ./internal/transport/... ./internal/discovery/... 全绿(含现有的 TestInitializeNegotiatesProtocolVersion,因为它走的是 JSON-RPC -32600 错误路径,Stage=jsonrpc,不受短路影响)
  • 实测 dws --help 从 9.1s 降到 4.1s

四、另一个提醒:rebase

PR 当前 base (31eb109) 已落后 main 4 个 merge(包括 #120 的 bugfix/plugin-system、#117 等),建议合并前 rebase 到最新 main。我本地 cherry-pick 到 origin/main 时无冲突,但为了 CI 绿灯和 history 干净还是建议你 rebase 一下。

五、结论

PR 本身方向 100% 正确,但收益被版本循环稀释到 ~1s。希望能在本 PR 里顺手把上面两个小改动合进来——一次闭环,~7s 的真实提升。如果你更希望保持本 PR 的最小颗粒度先合入,我也完全理解,那边的补丁我可以另开 PR。

你定。🙏

(Reviewed by @PeterGuy326 / 修雨 — cherry-picked onto latest main locally, measured end-to-end on darwin/arm64)

@utafrali
Copy link
Copy Markdown
Author

Done, pushed the fix.

@huyz
Copy link
Copy Markdown

huyz commented Apr 17, 2026

@PeterGuy326 I did NOT review this. I don't know this project. You've got the wrong personl

PeterGuy326 pushed a commit that referenced this pull request Apr 17, 2026
…achable

Issue #119: an unreachable third-party MCP plugin (e.g. blocked/firewalled
endpoint) blocks `dws --help` for ~10s on every CLI invocation, because
plugin discovery happens eagerly during command-tree construction.

Root cause was a stack of timeouts that multiplied under transport failure:

* `transport.Client.Initialize` loops three supported protocol versions on
  every error — including dial timeouts and HTTP 5xx — even though those
  failure modes are protocol-version-independent. Three loops × the dial
  budget = 3× the worst-case startup cost.
* Default `DialContext.Timeout` was 10s, so a single dial against an
  unroutable address (e.g. TEST-NET-1) burned the full plugin context.
* `registerHTTPServer` granted plugins with `AuthHeaders` a 10s outer
  context (intended for slow third-party services), and `registerStdioServer`
  granted every stdio plugin 10s. Either single misbehaving plugin therefore
  stalled the entire CLI.
* Registry-side `defaultDiscoveryTimeout` (10s) and `perServerDiscoveryTimeout`
  (5s) had the same shape on the LoadCatalog path.

Changes:

* `transport.Client.Initialize`: short-circuit when the underlying
  `*CallError.Stage` is anything other than `CallStageJSONRPC`. Protocol
  version negotiation is the only justification for retrying with another
  version; transport/HTTP failures fail identically and should surrender.
* `transport.defaultTransport`: `DialContext.Timeout` 10s → 3s.
* `app.registerHTTPServer`: AuthHeaders timeout 10s → 4s.
* `app.registerStdioServer`: outer ctx timeout 10s → 4s.
* `cli.defaultDiscoveryTimeout`: 10s → 4s.
* `discovery.Service`: rename `perServerDiscoveryTimeout` 5s → 2s
  (`defaultPerServerDiscoveryTimeout`); add `Service.PerServerTimeout`
  override field for tests/callers needing a tighter or looser bound.

Tests:

* `TestInitializeShortCircuitsOnHTTPError` — asserts only ONE protocol
  version is attempted on HTTP 5xx.
* `TestInitializeShortCircuitsOnDialFailure` — asserts Initialize returns
  in <2s on a refused-connection address.
* Existing `TestInitializeNegotiatesProtocolVersion` continues to pass —
  JSON-RPC-stage errors still trigger version fallback.

End-to-end measurement with a blackhole plugin
(endpoint=192.0.2.1, AuthHeaders set):

  | Variant       | median `dws --help` |
  | ------------- | ------------------- |
  | main          | 10.2s               |
  | PR #121 alone | 9.1s                |
  | this branch   | 3.72s (-63%)        |

The `loader.go` and `discovery/service.go` timeout reductions overlap with
PR #121 (#121) by @utafrali; crediting via Co-authored-by.

Refs: #119
Supersedes: #121

Co-authored-by: utafrali <8383373+utafrali@users.noreply.github.com>
@utafrali
Copy link
Copy Markdown
Author

Done, pushed the fix.

@PeterGuy326
Copy link
Copy Markdown
Collaborator

PeterGuy326 commented Apr 17, 2026

@utafrali 同步一下:#119 已经在 #125 修好了。

#121 单独把 timeout 改小(10s→4s)只能省 ~1s,根因在 transport.Client.Initialize 会对所有错误(包括网络层)循环 3 个协议版本,把每次 dial 的预算 ×3。所以 #125 的核心改动是给 Initialize 加短路——只有 JSON-RPC 协议错误才换版本重试,传输层错误立即返回。loader.goservice.go 的 timeout 调整你的方向是对的,#125 也合并了。

端到端实测(blackhole plugin,AuthHeaders 路径):

版本 dws --help 中位数
main 10.2s
#121 单独 9.1s
#125 3.72s

感谢你 raise 这个问题,给了排查方向 🙏 #121 怎么处理你来定。

PeterGuy326 pushed a commit that referenced this pull request Apr 17, 2026
…achable

Issue #119: an unreachable third-party MCP plugin (e.g. blocked/firewalled
endpoint) blocks `dws --help` for ~10s on every CLI invocation, because
plugin discovery happens eagerly during command-tree construction.

Root cause was a stack of timeouts that multiplied under transport failure:

* `transport.Client.Initialize` loops three supported protocol versions on
  every error — including dial timeouts and HTTP 5xx — even though those
  failure modes are protocol-version-independent. Three loops × the dial
  budget = 3× the worst-case startup cost.
* Default `DialContext.Timeout` was 10s, so a single dial against an
  unroutable address (e.g. TEST-NET-1) burned the full plugin context.
* `registerHTTPServer` granted plugins with `AuthHeaders` a 10s outer
  context (intended for slow third-party services), and `registerStdioServer`
  granted every stdio plugin 10s. Either single misbehaving plugin therefore
  stalled the entire CLI.
* Registry-side `defaultDiscoveryTimeout` (10s) and `perServerDiscoveryTimeout`
  (5s) had the same shape on the LoadCatalog path.

Changes:

* `transport.Client.Initialize`: short-circuit when the underlying
  `*CallError.Stage` is anything other than `CallStageJSONRPC`. Protocol
  version negotiation is the only justification for retrying with another
  version; transport/HTTP failures fail identically and should surrender.
* `transport.defaultTransport`: `DialContext.Timeout` 10s → 3s.
* `app.registerHTTPServer`: AuthHeaders timeout 10s → 4s.
* `app.registerStdioServer`: outer ctx timeout 10s → 4s.
* `cli.defaultDiscoveryTimeout`: 10s → 4s.
* `discovery.Service`: rename `perServerDiscoveryTimeout` 5s → 2s
  (`defaultPerServerDiscoveryTimeout`); add `Service.PerServerTimeout`
  override field for tests/callers needing a tighter or looser bound.

Tests:

* `TestInitializeShortCircuitsOnHTTPError` — asserts only ONE protocol
  version is attempted on HTTP 5xx.
* `TestInitializeShortCircuitsOnDialFailure` — asserts Initialize returns
  in <2s on a refused-connection address.
* Existing `TestInitializeNegotiatesProtocolVersion` continues to pass —
  JSON-RPC-stage errors still trigger version fallback.

End-to-end measurement with a blackhole plugin
(endpoint=192.0.2.1, AuthHeaders set):

  | Variant       | median `dws --help` |
  | ------------- | ------------------- |
  | main          | 10.2s               |
  | PR #121 alone | 9.1s                |
  | this branch   | 3.72s (-63%)        |

The `loader.go` and `discovery/service.go` timeout reductions overlap with
PR #121 by @utafrali; see PR description for attribution.

Refs: #119
Supersedes: #121
@utafrali
Copy link
Copy Markdown
Author

Done, pushed the fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

插件mcp连接不上的话 命令都会等很久

3 participants