Conversation
When MCP plugins are unreachable, discovery times out too slowly and blocks commands. Reduced the overall discovery timeout from 10s to 4s and the per-server timeout from 5s to 2s, and made the per-server timeout configurable.
Verification summary + suggested follow-up fixes (not approving; please evaluate, author)
Thanks for the fix! The direction is exactly right: DialContext.Timeout 10s→3s does hit one important source of the timeout. After reproducing locally and measuring, I have two points to align on, in the hope that this PR can close the loop in one pass.
1. Measurements
Using an unreachable third-party streamable-http plugin (endpoint: http://192.0.2.1:8443/mcp, with headers: {Authorization: ...}) to trigger the AuthHeaders path in registerHTTPServer:
| Scenario | `dws --help` wall time (mean of 3 runs) |
|---|---|
| main (10s dial) | 10.16s |
| this PR (3s dial) | 9.12s ← saves only ~1s |
| this PR + the two-line patch suggested in this comment | 4.11s ← saves ~6s |
2. Root cause (why this PR alone saves only ~1s)
transport.Client.Initialize (internal/transport/client.go:300) tries 3 protocol versions in order: [2025-03-26, 2024-11-05, 2024-06-18]. Against an unreachable endpoint:
- main: the dial for the 1st version already burns the full 10s (the outer ctx is also 10s), so 1 × 10s = 10s
- this PR: each version's dial fails after 3s and falls through to the next version, so 3 × 3s = 9s, just short of the outer ctx (10s)
Dial time went down, but the number of dials and the outer ctx ceiling did not, so the savings were diluted by the version-loop multiplier.
3. Two suggested follow-up changes (+20/−3 lines)
Change A: short-circuit Initialize on transport-level errors (internal/transport/client.go:300)
Fall back to the next protocol version only when callJSONRPC returns a JSON-RPC protocol error (the server replied with a structured error such as -32600 unsupported, i.e. CallError.Stage == CallStageJSONRPC); dial/HTTP/read failures should return immediately, since switching protocol versions cannot fix them.
Change B: registerHTTPServer outer ctx 10s → 4s (internal/app/root.go:1181)
Combined with A and this PR's 3s dial, a 4s ctx ceiling is enough for unreachable plugins; healthy third-party endpoints (Bailian etc.) normally respond in <2s, which leaves comfortable headroom.
Cherry-pickable diff:
diff --git a/internal/app/root.go b/internal/app/root.go
--- a/internal/app/root.go
+++ b/internal/app/root.go
@@ -1178,7 +1178,10 @@ func registerHTTPServer(p *plugin.Plugin, srv market.ServerDescriptor, tc *trans
// services may have higher latency than local/DingTalk endpoints).
timeout := 2 * time.Second
if len(srv.AuthHeaders) > 0 {
- timeout = 10 * time.Second
+ // Combined with DialContext.Timeout=3s and Initialize short-circuit on
+ // transport errors, 4s bounds startup for unreachable third-party MCP
+ // servers (e.g. Bailian) while still allowing a healthy response.
+ timeout = 4 * time.Second
}
ctx, cancel := context.WithTimeout(context.Background(), timeout)
defer cancel()
diff --git a/internal/transport/client.go b/internal/transport/client.go
--- a/internal/transport/client.go
+++ b/internal/transport/client.go
@@ -298,6 +298,7 @@ func SupportedProtocolVersions() []string {
}
func (c *Client) Initialize(ctx context.Context, endpoint string) (InitializeResult, error) {
+ var lastErr error
for _, version := range SupportedProtocolVersions() {
params := map[string]any{
"capabilities": map[string]any{},
@@ -309,18 +310,31 @@ func (c *Client) Initialize(ctx context.Context, endpoint string) (InitializeRes
}
var payload InitializeResult
- if err := c.callJSONRPC(ctx, endpoint, requestEnvelope{
+ err := c.callJSONRPC(ctx, endpoint, requestEnvelope{
JSONRPC: "2.0",
ID: 1,
Method: "initialize",
Params: params,
- }, true, &payload); err == nil {
+ }, true, &payload)
+ if err == nil {
if payload.ProtocolVersion == "" {
payload.ProtocolVersion = version
}
payload.RequestedProtocolVersion = version
return payload, nil
}
+ lastErr = err
+ // Only protocol-level JSON-RPC errors justify trying another version.
+ // Transport/HTTP failures will not be fixed by renegotiating, so short
+ // circuit to avoid multiplying dial timeouts (e.g. 3 × 3s on an
+ // unreachable endpoint).
+ var callErr *CallError
+ if !errors.As(err, &callErr) || callErr.Stage != CallStageJSONRPC {
+ return InitializeResult{}, err
+ }
+ }
+ if lastErr != nil {
+ return InitializeResult{}, lastErr
}
 	return InitializeResult{}, apperrors.NewDiscovery(fmt.Sprintf("initialize failed for all supported protocol versions at %s", RedactURL(endpoint)))
 }

2 files, +20/−3 lines. Verified locally:
- `go test ./internal/transport/... ./internal/discovery/...` is all green (including the existing TestInitializeNegotiatesProtocolVersion: it exercises the JSON-RPC -32600 error path, Stage=jsonrpc, so the short circuit does not affect it)
- Measured: `dws --help` drops from 9.1s to 4.1s
4. One more reminder: rebase
The PR's current base (31eb109) is 4 merges behind main (including #120's bugfix/plugin-system, #117, etc.); please rebase onto the latest main before merging. My local cherry-pick onto origin/main hit no conflicts, but a rebase is still worth it for a green CI and a clean history.
5. Conclusion
The PR's direction is 100% correct, but the gain is diluted to ~1s by the version loop. I hope this PR can fold in the two small changes above: one pass, a real ~7s improvement. If you would rather keep this PR minimal and merge it first, I completely understand; I can open a separate PR for the patch.
Your call. 🙏
(Reviewed by @PeterGuy326 / 修雨 — cherry-picked onto latest main locally, measured end-to-end on darwin/arm64)
Done, pushed the fix.
@PeterGuy326 I did NOT review this. I don't know this project. You've got the wrong person!
…achable

Issue #119: an unreachable third-party MCP plugin (e.g. blocked/firewalled endpoint) blocks `dws --help` for ~10s on every CLI invocation, because plugin discovery happens eagerly during command-tree construction.

Root cause was a stack of timeouts that multiplied under transport failure:

* `transport.Client.Initialize` loops three supported protocol versions on every error — including dial timeouts and HTTP 5xx — even though those failure modes are protocol-version-independent. Three loops × the dial budget = 3× the worst-case startup cost.
* Default `DialContext.Timeout` was 10s, so a single dial against an unroutable address (e.g. TEST-NET-1) burned the full plugin context.
* `registerHTTPServer` granted plugins with `AuthHeaders` a 10s outer context (intended for slow third-party services), and `registerStdioServer` granted every stdio plugin 10s. Either single misbehaving plugin therefore stalled the entire CLI.
* Registry-side `defaultDiscoveryTimeout` (10s) and `perServerDiscoveryTimeout` (5s) had the same shape on the LoadCatalog path.

Changes:

* `transport.Client.Initialize`: short-circuit when the underlying `*CallError.Stage` is anything other than `CallStageJSONRPC`. Protocol version negotiation is the only justification for retrying with another version; transport/HTTP failures fail identically and should surrender.
* `transport.defaultTransport`: `DialContext.Timeout` 10s → 3s.
* `app.registerHTTPServer`: AuthHeaders timeout 10s → 4s.
* `app.registerStdioServer`: outer ctx timeout 10s → 4s.
* `cli.defaultDiscoveryTimeout`: 10s → 4s.
* `discovery.Service`: rename `perServerDiscoveryTimeout` 5s → 2s (`defaultPerServerDiscoveryTimeout`); add `Service.PerServerTimeout` override field for tests/callers needing a tighter or looser bound.

Tests:

* `TestInitializeShortCircuitsOnHTTPError` — asserts only ONE protocol version is attempted on HTTP 5xx.
* `TestInitializeShortCircuitsOnDialFailure` — asserts Initialize returns in <2s on a refused-connection address.
* Existing `TestInitializeNegotiatesProtocolVersion` continues to pass — JSON-RPC-stage errors still trigger version fallback.

End-to-end measurement with a blackhole plugin (endpoint=192.0.2.1, AuthHeaders set):

| Variant | median `dws --help` |
| ------------- | ------------------- |
| main | 10.2s |
| PR #121 alone | 9.1s |
| this branch | 3.72s (-63%) |

The `loader.go` and `discovery/service.go` timeout reductions overlap with PR #121 (#121) by @utafrali; crediting via Co-authored-by.

Refs: #119
Supersedes: #121

Co-authored-by: utafrali <8383373+utafrali@users.noreply.github.com>
Done, pushed the fix.
@utafrali syncing up: #119 has been fixed in #125. #121's timeout reduction alone (10s→4s) only saves ~1s; the root cause is the Initialize version loop, per the end-to-end measurements (blackhole plugin, AuthHeaders path).
Thanks for raising this issue, it pointed the investigation in the right direction 🙏 How to handle #121 is your call.
Done, pushed the fix.
Fixes #119
When MCP plugins are unreachable, discovery blocks every CLI invocation for ~10s. Brought the overall discovery timeout down from 10s to 4s and the per-server timeout from 5s to 2s. Added a PerServerTimeout field to make the per-server bound configurable too.
Added tests for the timeout behavior. Build, lint, test, and policy checks all pass. Verified with check-generated-drift.sh and check-command-surface.sh.