
feat: thread-safe SSH/NETCONF connection pooling #27

Open

h-network wants to merge 3 commits into Juniper:main from h-network:connection-pooling

Conversation

@h-network


Summary

Reuse SSH/NETCONF sessions across tool invocations instead of opening
and tearing down a connection per call.

  • Thread-safe ConnectionPool with per-router locking
  • Auto-detection and reconnection of stale sessions
  • Background cleanup of idle connections (configurable timeout)
  • Graceful shutdown on server exit
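The core idea can be sketched as follows. This is a minimal illustration, not the PR's actual code: the class layout, field names, and the `connect_fn` callback are assumptions.

```python
import threading
import time
from contextlib import contextmanager

class ConnectionPool:
    """Sketch of per-router pooling: one cached session per router,
    a pool-wide lock guarding the dict, and a per-router lock
    serializing use of each session."""

    def __init__(self, connect_fn, idle_timeout=300):
        self._connect_fn = connect_fn        # opens a session, e.g. a PyEZ Device
        self._idle_timeout = idle_timeout    # seconds before idle cleanup
        self._pool = {}                      # router -> {"conn", "lock", "last_used"}
        self._pool_lock = threading.Lock()   # guards the dict itself

    def _entry(self, router):
        with self._pool_lock:
            if router not in self._pool:
                self._pool[router] = {"conn": None,
                                      "lock": threading.Lock(),
                                      "last_used": 0.0}
            return self._pool[router]

    @contextmanager
    def connection(self, router):
        entry = self._entry(router)
        with entry["lock"]:                  # one caller per router at a time
            if entry["conn"] is None:        # first use or evicted: (re)connect
                entry["conn"] = self._connect_fn(router)
            entry["last_used"] = time.monotonic()
            yield entry["conn"]

    def evict_idle(self):
        # Background-cleanup sketch; real code would also close the session
        # and coordinate with in-flight users of the entry.
        now = time.monotonic()
        with self._pool_lock:
            for entry in self._pool.values():
                if entry["conn"] and now - entry["last_used"] > self._idle_timeout:
                    entry["conn"] = None
```

Reuse shows up as the connect callback firing once per router rather than once per call.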

Performance

Tested on 5 Junos routers, 7 commands each (35 operations):

| Mode | Time | Speedup |
| --- | --- | --- |
| Original | 24.5s | baseline |
| Pooled + sequential | 4.0s | 6.2x |
| Pooled + parallel | 1.2s | 20.7x |

Benchmark methodology and reproduction steps: https://github.com/h-network/junos-mcp-server/tree/h-dev/benchmark

Closes #26

  Reuse SSH sessions across tool invocations instead of connect/teardown
  per call. Thread-safe ConnectionPool with per-router locking, stale
  detection, idle cleanup, and graceful shutdown.

  Benchmarked: 6.2x sequential, 20.7x parallel (5 routers, 35 ops).
Comment thread on jmcp.py (Outdated)
self._pool_lock = threading.Lock()
if idle_timeout is None:
    env_val = os.getenv("JMCP_POOL_IDLE_TIMEOUT")
    self._idle_timeout = int(env_val) if env_val else 300
Collaborator
Can you please wrap this in a try/except to catch ValueError? Look at get_timeout_with_fallback.

Author
@nileshsimaria Updated per your review: the latest commit adds error handling for invalid environment variable values.
Ready for re-review.

Collaborator

Hi @h-network, thanks. I am re-reviewing it and it looks like we need to fix one more thing before merge. The issue I see is the use of threading.Lock in an async codebase. As you may have noticed, our codebase (jmcp.py) is fully async: the MCP event loop is a single async thread that handles all tool calls. When PyEZ needs to do blocking SSH work, we use anyio.to_thread.run_sync() to push it to a background thread (you can see this in execute_junos_command_batch).

The connection pool introduces threading.Lock, and I think that will cause problems: the main async thread will block on it, and the server won't be able to do anything until it gets the lock.

Here's a concrete scenario:

Step 1: A batch command is running on Router A. It called get_connection("routerA"), acquired the per-router threading.Lock, and is now doing SSH work in a background thread.

Step 2: While that's running, a single execute_junos_command call arrives for the same Router A. This runs directly on the async event loop. It calls get_connection("routerA"), which hits entry["lock"].acquire().

Step 3: That acquire() is a blocking call. The event loop thread is now frozen — waiting for the background thread from Step 1 to finish and release the lock.

Step 4: While the event loop is frozen, nothing else works. No other routers can be contacted, no progress notifications are sent, no new requests are handled. The entire server is hung until that one SSH command finishes.

If we deploy the Junos MCP server with the streamable-http transport, multiple MCP clients can connect to it simultaneously, and that's where I think we have a high chance of hitting the scenario above.

IMO, to fix this we have to replace the threading primitives with their asyncio equivalents so the main event-loop thread never blocks; it can then handle other MCP calls while it waits asynchronously on the lock.

What do you think?
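The difference can be demonstrated in a few lines (an illustrative stand-alone sketch, not code from jmcp.py): `asyncio.Lock` suspends the waiting coroutine instead of blocking the thread, so the event loop keeps serving other tasks while one task waits for the per-router lock.

```python
import asyncio

async def main():
    lock = asyncio.Lock()
    events = []

    async def holder():
        async with lock:
            events.append("holder acquired")
            await asyncio.sleep(0.05)  # stands in for SSH work on Router A

    async def waiter():
        async with lock:               # awaits cooperatively; loop stays free
            events.append("waiter acquired")

    async def heartbeat():
        await asyncio.sleep(0.01)
        events.append("heartbeat")     # proves the loop was not frozen

    await asyncio.gather(holder(), waiter(), heartbeat())
    return events

events = asyncio.run(main())
# → ["holder acquired", "heartbeat", "waiter acquired"]
```

Had `waiter` called `threading.Lock.acquire()` instead, the heartbeat could not have fired until the holder finished, which is exactly the frozen-loop scenario in Steps 2–4.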

Author

@nileshsimaria Thanks for the review. You're right: I traced the call sites, and some of them call get_connection() directly from an async context without anyio.to_thread.run_sync(), which would block the event loop.

I'll try to match the pattern used in the command execution paths, and run benchmarks for both scenarios to validate.

Author

@nileshsimaria As promised: several handlers were calling connection_pool.get_connection() directly from an async context without anyio.to_thread.run_sync(), which blocked the event loop. I patched four sites to match the pattern used in the command-batch path and left render_and_apply_j2_template for a separate PR (it interleaves awaits inside the connection block and needs a small refactor first).
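The applied pattern, in sketch form. The project uses `anyio.to_thread.run_sync(get_connection, router)`; the stdlib equivalent `asyncio.to_thread` is used here to keep the example dependency-free, and `get_connection` is a stand-in, not the pool's real implementation.

```python
import asyncio
import time

def get_connection(router):
    # Stand-in for the pool's blocking get_connection(): it may perform SSH
    # I/O or wait on a per-router lock, so it must not run on the event loop.
    time.sleep(0.02)
    return f"conn-{router}"

async def handler(router):
    # The patched pattern: dispatch the blocking call to a worker thread so
    # the event loop stays free while the connection is acquired.
    return await asyncio.to_thread(get_connection, router)

async def main():
    # Handlers for different routers proceed concurrently because neither
    # connection attempt ever blocks the event loop itself.
    return await asyncio.gather(handler("r1"), handler("r2"))

conns = asyncio.run(main())
# → ["conn-r1", "conn-r2"]
```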

To validate, I ran the change against a 5-router vMX lab (OSPF area 0, ~1250 advertised statics) across four test categories:

| # | Scenario | Cat. | Bad-path | Good-path / health | Verdict |
| --- | --- | --- | --- | --- | --- |
| 1 | ruff / bandit / mypy on changed lines | Static | — | — | PASS |
| 2 | Reproduction: 5× concurrent gather_device_facts | Stress | total 9.75s → 2.27s | probe 6ms | PASS |
| 3 | Same-router stampede (20× cli to r1) | Stress | 20/20 ok | probe median 6ms | PASS |
| 4 | Mixed-tool storm (50–100 concurrent) | Stress | 100/100 ok | probe median 5–6ms | PASS |
| 5 | Sustained 20s burst load | Stress | 50/50 across 10 bursts | probe 6ms | PASS |
| 6 | Heavy real cmd fanout (5× show route extensive) | Stress | 2.97s (would be ~12.5s serialized) | — | PASS |
| 7 | In-block exception (bad CLI cmd) | Chaos | error returned, lock released | 10/10 follow-ups ok | PASS |
| 8 | SIGTERM with no in-flight load | Chaos | server exits in 0.31s | — | PASS |
| 9 | Stale connection (kill pool's sshd) | Chaos | pool detects not device.connected | reconnects in 0.67s, valid data | PASS |
| 10 | F1: connection refused (port 23) | Fault | error in 6s | parallel good call returns valid | PASS |
| 11 | F2: auth failure (missing key) | Fault | error in 0.14s | parallel good call returns valid | PASS |
| 12 | F3: host unreachable (30s TCP timeout) | Fault | error in 30s | parallel good call returns valid | PASS |
The probe in the stress rows fires get_router_list every 50ms during the workload to measure event-loop responsiveness. Pre-fix, those probes saw ~5s median latency (the event loop was starved); post-fix they stay at ~6ms. F3 is the strongest fault-path evidence: a worker thread was blocked on a 30-second TCP connect while parallel calls were completely unaffected.
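For context, the probe idea reduces to a coroutine that measures event-loop lag. This is a hypothetical reimplementation (the real probe calls get_router_list; this one just times scheduled wake-ups):

```python
import asyncio
import time

async def loop_lag_probe(samples, interval=0.05, count=5):
    # Wake up every `interval` seconds and record how late the wake-up
    # actually fired. If the event loop is blocked (e.g. on
    # threading.Lock.acquire()), the recorded lag jumps from milliseconds
    # to the full length of the blockage.
    for _ in range(count):
        start = time.monotonic()
        await asyncio.sleep(interval)
        samples.append(time.monotonic() - start - interval)

samples = []
asyncio.run(loop_lag_probe(samples, interval=0.01, count=3))
```

On an unloaded loop every sample is a few milliseconds at most; a frozen loop produces one sample roughly the size of the freeze.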

One out-of-scope finding worth flagging separately: with an active streamable-http client session, the server doesn't honor SIGTERM gracefully. uvicorn's lifespan shutdown waits for the persistent HTTP session, which never drains, and falls back to SIGKILL. This reproduces against the unmodified PR code, so it is pre-existing. I suggest a follow-up to set timeout_graceful_shutdown in the uvicorn config or to force-close the session manager in the SIGTERM handler before sys.exit(0) (jmcp.py:2517-2524).
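A minimal sketch of the first suggested fix, assuming `app` is the server's ASGI application. `timeout_graceful_shutdown` is a real uvicorn Config option; the host/port values are illustrative.

```python
import uvicorn

# Cap how long lifespan shutdown waits for open streamable-http sessions
# to drain, so SIGTERM completes promptly instead of escalating to SIGKILL.
config = uvicorn.Config(app, host="127.0.0.1", port=8080,
                        timeout_graceful_shutdown=5)  # seconds
server = uvicorn.Server(config)
```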

Happy to share the validation scripts if useful for adding to the project's pre-merge checks.

Wrap int() parsing of JMCP_POOL_IDLE_TIMEOUT in try/except ValueError, log a warning, and fall back to the 300s default — mirrors the pattern in get_timeout_with_fallback.
Addresses review feedback on Juniper#27.
Addresses the event-loop-blocking concern raised on this PR. Four
handlers patched to dispatch the sync ConnectionPool.get_connection()
call through anyio.to_thread.run_sync — same pattern already used in
handle_execute_junos_command_batch:

- handle_execute_junos_command          (jmcp.py:1163)
- handle_gather_device_facts            (jmcp.py:1810, _gather_facts_sync)
- handle_execute_junos_pfe_command      (jmcp.py:1963)
- handle_load_and_commit_config         (jmcp.py:2023, _load_and_commit_sync)

handle_render_and_apply_j2_template (line 1632) intentionally not
included — has interleaved `await context.info(...)` inside the
connection block; deferred to a separate PR.

Validated against a 5-router vMX lab. Detail in the PR comment.


Development

Successfully merging this pull request may close these issues.

Connection pooling: performance improvement for SSH/NETCONF sessions
