feat: thread-safe SSH/NETCONF connection pooling#27
h-network wants to merge 3 commits into Juniper:main
Conversation
Reuse SSH sessions across tool invocations instead of connect/teardown per call. Thread-safe ConnectionPool with per-router locking, stale detection, idle cleanup, and graceful shutdown. Benchmarked: 6.2x sequential, 20.7x parallel (5 routers, 35 ops).
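A rough sketch of the shape of the pool described above (illustrative only; class, method, and field names here are assumptions, not the exact code in this PR):

```python
import threading
import time
from contextlib import contextmanager

from jnpr.junos import Device  # Junos PyEZ


class ConnectionPool:
    """Sketch: per-router locking, stale detection, idle cleanup."""

    def __init__(self, idle_timeout: int = 300):
        self._pool = {}                     # router -> {"device", "lock", "last_used"}
        self._pool_lock = threading.Lock()  # guards the pool dict itself
        self._idle_timeout = idle_timeout

    @contextmanager
    def connection(self, router, **connect_kwargs):
        with self._pool_lock:
            entry = self._pool.setdefault(
                router, {"device": None, "lock": threading.Lock(), "last_used": 0.0}
            )
        with entry["lock"]:  # serialize all work on one router
            dev = entry["device"]
            if dev is None or not dev.connected:  # stale detection: reconnect
                dev = Device(host=router, **connect_kwargs)
                dev.open()
                entry["device"] = dev
            try:
                yield dev
            finally:
                entry["last_used"] = time.monotonic()

    def close_idle(self):
        """Periodic cleanup: close sessions idle longer than the timeout."""
        now = time.monotonic()
        with self._pool_lock:
            entries = list(self._pool.values())
        for entry in entries:
            if not entry["lock"].acquire(blocking=False):
                continue  # router currently in use; skip it this pass
            try:
                dev = entry["device"]
                if dev is not None and now - entry["last_used"] > self._idle_timeout:
                    dev.close()
                    entry["device"] = None
            finally:
                entry["lock"].release()
```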
```python
self._pool_lock = threading.Lock()
if idle_timeout is None:
    env_val = os.getenv("JMCP_POOL_IDLE_TIMEOUT")
    self._idle_timeout = int(env_val) if env_val else 300
```
Can you please wrap this in a try/except to catch ValueError? Look at get_timeout_with_fallback.
@nileshsimaria Updated per your review: added error handling for invalid environment variable values in the latest commit.
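The parsing now looks roughly like this, mirroring get_timeout_with_fallback (sketch; the helper and logger names here are illustrative, not the exact code):

```python
import logging
import os

logger = logging.getLogger(__name__)  # assumed logger setup

DEFAULT_IDLE_TIMEOUT = 300  # seconds


def _idle_timeout_from_env() -> int:
    env_val = os.getenv("JMCP_POOL_IDLE_TIMEOUT")
    if not env_val:
        return DEFAULT_IDLE_TIMEOUT
    try:
        return int(env_val)
    except ValueError:
        # Invalid value: warn and fall back to the default instead of crashing.
        logger.warning(
            "Invalid JMCP_POOL_IDLE_TIMEOUT=%r, falling back to %ds",
            env_val, DEFAULT_IDLE_TIMEOUT,
        )
        return DEFAULT_IDLE_TIMEOUT
```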
Ready for re-review.
Hi @h-network Thanks. I am re-reviewing it, and it looks like we need to fix one more thing before merge. The issue I see is using threading.Lock in an async codebase. As you may have noticed, our codebase (jmcp.py) is fully async: the MCP event loop is a single async thread that handles all tool calls. When PyEZ needs to do blocking SSH work, we use anyio.to_thread.run_sync() to push it to a background thread (you can see this in execute_junos_command_batch).
The connection pool introduces threading.Lock, and I think that will cause issues: the main async thread will block on it, and the server won't be able to do anything until it gets the lock.
Here's a concrete scenario:
Step 1: A batch command is running on Router A. It called get_connection("routerA"), acquired the per-router threading.Lock, and is now doing SSH work in a background thread.
Step 2: While that's running, a single execute_junos_command call arrives for the same Router A. This runs directly on the async event loop. It calls get_connection("routerA"), which hits entry["lock"].acquire().
Step 3: That acquire() is a blocking call. The event loop thread is now frozen — waiting for the background thread from Step 1 to finish and release the lock.
Step 4: While the event loop is frozen, nothing else works. No other routers can be contacted, no progress notifications are sent, no new requests are handled. The entire server is hung until that one SSH command finishes.
If we deploy the Junos MCP server with the streamable-http transport, multiple MCP clients can connect to it simultaneously, and that's where I think we have a high chance of hitting the scenario above.
IMO, to fix this we will have to replace the threading primitives with asyncio equivalents so the main event loop thread does not block: while a task is async-waiting on a lock, the server can still handle other MCP calls.
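Roughly what I have in mind (sketch only; `open_fn` is a placeholder for whatever sync PyEZ open helper we settle on):

```python
import asyncio
from contextlib import asynccontextmanager

import anyio


class AsyncConnectionPool:
    """Sketch: asyncio locks, so a task waiting on a busy router
    yields to the event loop instead of freezing it."""

    def __init__(self):
        self._pool = {}
        self._pool_lock = asyncio.Lock()

    @asynccontextmanager
    async def connection(self, router, open_fn):
        async with self._pool_lock:
            entry = self._pool.setdefault(
                router, {"device": None, "lock": asyncio.Lock()}
            )
        async with entry["lock"]:  # awaiting here suspends, not blocks
            if entry["device"] is None:
                # The blocking PyEZ open() still runs in a worker thread.
                entry["device"] = await anyio.to_thread.run_sync(open_fn, router)
            yield entry["device"]
```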
What do you think?
@nileshsimaria Thanks for the review. You're right, I traced the call sites and some of them do call get_connection() directly from async context without anyio.to_thread.run_sync(), which would block the event loop.
I'll try to match the pattern used in the command execution paths, and run benchmarks for both scenarios to validate.
@nileshsimaria Following up with the fix and validation. I traced the call sites and several handlers were calling connection_pool.get_connection() directly from async context without anyio.to_thread.run_sync(), which blocked the event loop. Patched four sites to match the pattern used in the command-batch path; left render_and_apply_j2_template for a separate PR (it interleaves awaits inside the connection block and needs a small refactor first).
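The pattern applied at each site looks roughly like this (sketch: `connection_pool` and `_run_cli_sync` stand in for the real module-level pool and sync helper):

```python
import anyio


async def handle_execute_junos_command(router_name: str, command: str) -> str:
    # Before: connection_pool.get_connection(router_name) was called directly
    # on the event loop, blocking it while the per-router lock or the SSH
    # handshake was pending.
    dev = await anyio.to_thread.run_sync(
        connection_pool.get_connection, router_name
    )
    # Command execution itself also runs in a worker thread, matching
    # handle_execute_junos_command_batch.
    return await anyio.to_thread.run_sync(_run_cli_sync, dev, command)
```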
To validate, I ran the change against a 5-router vMX lab (OSPF area 0, ~1250 advertised statics) across four categories:
| # | Scenario | Cat. | Bad-path | Good-path / health | Verdict |
|---|----------|------|----------|--------------------|---------|
| 1 | ruff / bandit / mypy on changed lines | Static | — | — | PASS |
| 2 | Reproduction: 5× concurrent `gather_device_facts` | Stress | — | total 9.75s → 2.27s, probe 6ms | PASS |
| 3 | Same-router stampede (20× cli to r1) | Stress | — | 20/20 ok, probe median 6ms | PASS |
| 4 | Mixed-tool storm (50–100 concurrent) | Stress | — | 100/100 ok, probe median 5–6ms | PASS |
| 5 | Sustained 20s burst load | Stress | — | 50/50 across 10 bursts, probe 6ms | PASS |
| 6 | Heavy real cmd fanout (5× `show route extensive`) | Stress | — | 2.97s (would be ~12.5s serialized) | PASS |
| 7 | In-block exception (bad CLI cmd) | Chaos | error returned, lock released | 10/10 follow-ups ok | PASS |
| 8 | SIGTERM with no in-flight load | Chaos | server exits in 0.31s | — | PASS |
| 9 | Stale connection (kill pool's sshd) | Chaos | pool detects not `device.connected` | reconnects in 0.67s, valid data | PASS |
| 10 | F1: connection refused (port 23) | Fault | error in 6s | parallel good call returns valid | PASS |
| 11 | F2: auth failure (missing key) | Fault | error in 0.14s | parallel good call returns valid | PASS |
| 12 | F3: host unreachable (30s TCP timeout) | Fault | error in 30s | parallel good call returns valid | PASS |
The probe in the stress rows fires get_router_list every 50ms during the workload to measure event-loop responsiveness. Pre-fix, those probes saw ~5s median latency (event loop starved); post-fix they stay at ~6ms. F3 is the strongest fault-path evidence: a worker thread sat blocked on a 30-second TCP connect while parallel calls were completely unaffected.
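For reference, the probe amounts to something like this (sketch; `call_get_router_list` is an assumed async client wrapper around the tool, not a name from the codebase):

```python
import asyncio
import time


async def probe_event_loop(call_get_router_list, samples, stop: asyncio.Event):
    """Fire a cheap MCP call every 50ms and record its latency.

    When the event loop is starved, even this trivial call's latency
    balloons to roughly the duration of whatever is blocking the loop.
    """
    while not stop.is_set():
        t0 = time.perf_counter()
        await call_get_router_list()
        samples.append(time.perf_counter() - t0)
        await asyncio.sleep(0.05)
```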
One out-of-scope finding worth flagging separately: with an active streamable-http client session, the server doesn't honor SIGTERM gracefully — uvicorn's lifespan-shutdown waits for the persistent HTTP session that never drains, falling back to SIGKILL. Reproduces against the unmodified PR code, so this is pre-existing. Suggest a follow-up to set timeout_graceful_shutdown in the uvicorn config or force-close the session manager in the SIGTERM handler before sys.exit(0) (jmcp.py:2517-2524).
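For the uvicorn option, something along these lines (untested sketch; `app`, the port, and the 5s value are placeholders):

```python
import uvicorn

# Cap lifespan shutdown so a persistent streamable-http session can't
# stall SIGTERM handling indefinitely.
config = uvicorn.Config(app, host="0.0.0.0", port=8080,
                        timeout_graceful_shutdown=5)
server = uvicorn.Server(config)
```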
Happy to share the validation scripts if useful for adding to the project's pre-merge checks.
Wrap int() parsing of JMCP_POOL_IDLE_TIMEOUT in try/except ValueError, log a warning, and fall back to the 300s default — mirrors the pattern in get_timeout_with_fallback. Addresses review feedback on Juniper#27.
Addresses the event-loop-blocking concern raised on this PR. Four handlers patched to dispatch the sync ConnectionPool.get_connection() call through anyio.to_thread.run_sync, the same pattern already used in handle_execute_junos_command_batch:

- handle_execute_junos_command (jmcp.py:1163)
- handle_gather_device_facts (jmcp.py:1810, _gather_facts_sync)
- handle_execute_junos_pfe_command (jmcp.py:1963)
- handle_load_and_commit_config (jmcp.py:2023, _load_and_commit_sync)

handle_render_and_apply_j2_template (line 1632) intentionally not included: it has interleaved `await context.info(...)` inside the connection block; deferred to a separate PR.

Validated against a 5-router vMX lab. Detail in the PR comment.
feat: thread-safe SSH/NETCONF connection pooling
Summary
Reuse SSH/NETCONF sessions across tool invocations instead of opening
and tearing down a connection per call.
- `ConnectionPool` with per-router locking

Performance
Tested on 5 Junos routers, 7 commands each (35 operations): 6.2x speedup sequential, 20.7x parallel.
Benchmark methodology and reproduction steps: https://github.com/h-network/junos-mcp-server/tree/h-dev/benchmark
Closes #26