feat: thread-safe SSH/NETCONF connection pooling#27
h-network wants to merge 3 commits into Juniper:main
Conversation
Reuse SSH sessions across tool invocations instead of connect/teardown per call. Thread-safe ConnectionPool with per-router locking, stale detection, idle cleanup, and graceful shutdown. Benchmarked: 6.2x sequential, 20.7x parallel (5 routers, 35 ops).
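A rough sketch of the shape of the pool described above (illustrative only; class, method, and field names here are assumptions, not the exact code in this PR):

```python
import threading
import time
from contextlib import contextmanager

from jnpr.junos import Device  # Junos PyEZ


class ConnectionPool:
    """Sketch: per-router locking, stale detection, idle cleanup."""

    def __init__(self, idle_timeout: int = 300):
        self._pool = {}                     # router -> {"device", "lock", "last_used"}
        self._pool_lock = threading.Lock()  # guards the pool dict itself
        self._idle_timeout = idle_timeout

    @contextmanager
    def connection(self, router, **connect_kwargs):
        with self._pool_lock:
            entry = self._pool.setdefault(
                router, {"device": None, "lock": threading.Lock(), "last_used": 0.0}
            )
        with entry["lock"]:  # serialize all work on one router
            dev = entry["device"]
            if dev is None or not dev.connected:  # stale detection: reconnect
                dev = Device(host=router, **connect_kwargs)
                dev.open()
                entry["device"] = dev
            try:
                yield dev
            finally:
                entry["last_used"] = time.monotonic()

    def close_idle(self):
        """Periodic cleanup: close sessions idle longer than the timeout."""
        now = time.monotonic()
        with self._pool_lock:
            entries = list(self._pool.values())
        for entry in entries:
            if not entry["lock"].acquire(blocking=False):
                continue  # router currently in use; skip it this pass
            try:
                dev = entry["device"]
                if dev is not None and now - entry["last_used"] > self._idle_timeout:
                    dev.close()
                    entry["device"] = None
            finally:
                entry["lock"].release()
```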
```python
self._pool_lock = threading.Lock()
if idle_timeout is None:
    env_val = os.getenv("JMCP_POOL_IDLE_TIMEOUT")
    self._idle_timeout = int(env_val) if env_val else 300
```
Can you please wrap this in a try/except to catch ValueError? Look at get_timeout_with_fallback.
@nileshsimaria Updated per your review: added error handling for invalid environment variable values in the latest commit.
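The parsing now looks roughly like this, mirroring get_timeout_with_fallback (sketch; the helper and logger names here are illustrative, not the exact code):

```python
import logging
import os

logger = logging.getLogger(__name__)  # assumed logger setup

DEFAULT_IDLE_TIMEOUT = 300  # seconds


def _idle_timeout_from_env() -> int:
    env_val = os.getenv("JMCP_POOL_IDLE_TIMEOUT")
    if not env_val:
        return DEFAULT_IDLE_TIMEOUT
    try:
        return int(env_val)
    except ValueError:
        # Invalid value: warn and fall back to the default instead of crashing.
        logger.warning(
            "Invalid JMCP_POOL_IDLE_TIMEOUT=%r, falling back to %ds",
            env_val, DEFAULT_IDLE_TIMEOUT,
        )
        return DEFAULT_IDLE_TIMEOUT
```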
Ready for re-review.
Hi @h-network Thanks. I am re-reviewing it, and it looks like we need to fix one more thing before merge. The issue I see is using threading.Lock in an async codebase. As you may have noticed, our codebase (jmcp.py) is fully async: the MCP event loop is a single async thread that handles all tool calls. When PyEZ needs to do blocking SSH work, we use anyio.to_thread.run_sync() to push it to a background thread (you can see this in execute_junos_command_batch).
The connection pool introduces threading.Lock, and I think that will cause issues: the main async thread will block on it, and the server won't be able to do anything until it gets the lock.
Here's a concrete scenario:
Step 1: A batch command is running on Router A. It called get_connection("routerA"), acquired the per-router threading.Lock, and is now doing SSH work in a background thread.
Step 2: While that's running, a single execute_junos_command call arrives for the same Router A. This runs directly on the async event loop. It calls get_connection("routerA"), which hits entry["lock"].acquire().
Step 3: That acquire() is a blocking call. The event loop thread is now frozen — waiting for the background thread from Step 1 to finish and release the lock.
Step 4: While the event loop is frozen, nothing else works. No other routers can be contacted, no progress notifications are sent, no new requests are handled. The entire server is hung until that one SSH command finishes.
If we deploy the Junos MCP server with the streamable-http transport, multiple MCP clients can connect to it simultaneously, and that's where I think we have a high chance of hitting the scenario above.
IMO, to fix this we will have to replace the threading primitives with asyncio equivalents so the main event loop thread does not block: while a task is async-waiting on a lock, the server can still handle other MCP calls.
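Roughly what I have in mind (sketch only; `open_fn` is a placeholder for whatever sync PyEZ open helper we settle on):

```python
import asyncio
from contextlib import asynccontextmanager

import anyio


class AsyncConnectionPool:
    """Sketch: asyncio locks, so a task waiting on a busy router
    yields to the event loop instead of freezing it."""

    def __init__(self):
        self._pool = {}
        self._pool_lock = asyncio.Lock()

    @asynccontextmanager
    async def connection(self, router, open_fn):
        async with self._pool_lock:
            entry = self._pool.setdefault(
                router, {"device": None, "lock": asyncio.Lock()}
            )
        async with entry["lock"]:  # awaiting here suspends, not blocks
            if entry["device"] is None:
                # The blocking PyEZ open() still runs in a worker thread.
                entry["device"] = await anyio.to_thread.run_sync(open_fn, router)
            yield entry["device"]
```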
What do you think?
@nileshsimaria Thanks for the review. You're right, I traced the call sites and some of them do call get_connection() directly from async context without anyio.to_thread.run_sync(), which would block the event loop.
I'll try to match the pattern used in the command execution paths, and run benchmarks for both scenarios to validate.
@nileshsimaria Following up with the fix and validation. I traced the call sites and several handlers were calling connection_pool.get_connection() directly from async context without anyio.to_thread.run_sync(), which blocked the event loop. Patched four sites to match the pattern used in the command-batch path; left render_and_apply_j2_template for a separate PR (it interleaves awaits inside the connection block and needs a small refactor first).
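The pattern applied at each site looks roughly like this (sketch: `connection_pool` and `_run_cli_sync` stand in for the real module-level pool and sync helper):

```python
import anyio


async def handle_execute_junos_command(router_name: str, command: str) -> str:
    # Before: connection_pool.get_connection(router_name) was called directly
    # on the event loop, blocking it while the per-router lock or the SSH
    # handshake was pending.
    dev = await anyio.to_thread.run_sync(
        connection_pool.get_connection, router_name
    )
    # Command execution itself also runs in a worker thread, matching
    # handle_execute_junos_command_batch.
    return await anyio.to_thread.run_sync(_run_cli_sync, dev, command)
```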
To validate, I ran the change against a 5-router vMX lab (OSPF area 0, ~1250 advertised statics) across four categories:
| # | Scenario | Cat. | Bad-path | Good-path / health | Verdict |
|---|----------|------|----------|--------------------|---------|
| 1 | ruff / bandit / mypy on changed lines | Static | — | — | PASS |
| 2 | Reproduction: 5× concurrent `gather_device_facts` | Stress | — | total 9.75s → 2.27s, probe 6ms | PASS |
| 3 | Same-router stampede (20× cli to r1) | Stress | — | 20/20 ok, probe median 6ms | PASS |
| 4 | Mixed-tool storm (50–100 concurrent) | Stress | — | 100/100 ok, probe median 5–6ms | PASS |
| 5 | Sustained 20s burst load | Stress | — | 50/50 across 10 bursts, probe 6ms | PASS |
| 6 | Heavy real cmd fanout (5× `show route extensive`) | Stress | — | 2.97s (would be ~12.5s serialized) | PASS |
| 7 | In-block exception (bad CLI cmd) | Chaos | error returned, lock released | 10/10 follow-ups ok | PASS |
| 8 | SIGTERM with no in-flight load | Chaos | server exits in 0.31s | — | PASS |
| 9 | Stale connection (kill pool's sshd) | Chaos | pool detects not `device.connected` | reconnects in 0.67s, valid data | PASS |
| 10 | F1: connection refused (port 23) | Fault | error in 6s | parallel good call returns valid | PASS |
| 11 | F2: auth failure (missing key) | Fault | error in 0.14s | parallel good call returns valid | PASS |
| 12 | F3: host unreachable (30s TCP timeout) | Fault | error in 30s | parallel good call returns valid | PASS |
The probe in the stress rows fires get_router_list every 50ms during the workload to measure event-loop responsiveness. Pre-fix, those probes saw ~5s median latency (event loop starved); post-fix they stay at ~6ms. F3 is the strongest fault-path evidence: a worker thread sat blocked on a 30-second TCP connect while parallel calls were completely unaffected.
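For reference, the probe amounts to something like this (sketch; `call_get_router_list` is an assumed async client wrapper around the tool, not a name from the codebase):

```python
import asyncio
import time


async def probe_event_loop(call_get_router_list, samples, stop: asyncio.Event):
    """Fire a cheap MCP call every 50ms and record its latency.

    When the event loop is starved, even this trivial call's latency
    balloons to roughly the duration of whatever is blocking the loop.
    """
    while not stop.is_set():
        t0 = time.perf_counter()
        await call_get_router_list()
        samples.append(time.perf_counter() - t0)
        await asyncio.sleep(0.05)
```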
One out-of-scope finding worth flagging separately: with an active streamable-http client session, the server doesn't honor SIGTERM gracefully — uvicorn's lifespan-shutdown waits for the persistent HTTP session that never drains, falling back to SIGKILL. Reproduces against the unmodified PR code, so this is pre-existing. Suggest a follow-up to set timeout_graceful_shutdown in the uvicorn config or force-close the session manager in the SIGTERM handler before sys.exit(0) (jmcp.py:2517-2524).
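For the uvicorn option, something along these lines (untested sketch; `app`, the port, and the 5s value are placeholders):

```python
import uvicorn

# Cap lifespan shutdown so a persistent streamable-http session can't
# stall SIGTERM handling indefinitely.
config = uvicorn.Config(app, host="0.0.0.0", port=8080,
                        timeout_graceful_shutdown=5)
server = uvicorn.Server(config)
```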
Happy to share the validation scripts if useful for adding to the project's pre-merge checks.
Wrap int() parsing of JMCP_POOL_IDLE_TIMEOUT in try/except ValueError, log a warning, and fall back to the 300s default — mirrors the pattern in get_timeout_with_fallback. Addresses review feedback on Juniper#27.
Addresses the event-loop-blocking concern raised on this PR. Four handlers patched to dispatch the sync ConnectionPool.get_connection() call through anyio.to_thread.run_sync, the same pattern already used in handle_execute_junos_command_batch:

- handle_execute_junos_command (jmcp.py:1163)
- handle_gather_device_facts (jmcp.py:1810, _gather_facts_sync)
- handle_execute_junos_pfe_command (jmcp.py:1963)
- handle_load_and_commit_config (jmcp.py:2023, _load_and_commit_sync)

handle_render_and_apply_j2_template (line 1632) intentionally not included: it has interleaved `await context.info(...)` inside the connection block; deferred to a separate PR.

Validated against a 5-router vMX lab. Detail in the PR comment.
feat: thread-safe SSH/NETCONF connection pooling
Summary
Reuse SSH/NETCONF sessions across tool invocations instead of opening
and tearing down a connection per call.
- `ConnectionPool` with per-router locking

Performance
Tested on 5 Junos routers, 7 commands each (35 operations): 6.2x speedup sequential, 20.7x parallel.
Benchmark methodology and reproduction steps: https://github.com/h-network/junos-mcp-server/tree/h-dev/benchmark
Closes #26