fix: queue telegram bridge messages per chat to prevent session lock contention#860
Junior00619 wants to merge 1 commit into NVIDIA:main
Conversation
📝 Walkthrough: Startup config and env validation were moved into a new init(); per-chat Promise-based queues now serialize agent calls for each chat ID.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant User as Telegram User
    participant TG as Telegram Bot API
    participant Bridge as telegram-bridge.js
    participant Agent as OpenClaw Sandbox
    participant Store as Session Store
    User->>TG: send message
    TG->>Bridge: deliver update (chatId, message_id, text)
    Bridge->>Bridge: enqueue job in chatQueues[chatId]
    Bridge->>TG: send typing action
    Bridge->>Agent: runAgentInSandbox(text, sessionId_with_epoch)
    Agent->>Store: read/write session file
    Agent-->>Bridge: agent response / error
    Bridge->>TG: send reply (using original message_id)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Force-pushed from 2a61512 to 1f1be1a
@cv friendly ping, this is ready for review whenever you have a moment 🙏
@Junior00619 Correct fix — Promise-chain queuing per chat prevents concurrent `runAgentInSandbox` calls and eliminates the session-lock root cause. The `prev.then(job, job)` pattern and the `finally` cleanup are both correct.
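For readers outside the diff, the pattern being endorsed can be sketched roughly like this (a minimal sketch; `chatQueues` and `enqueue` are illustrative names, not necessarily the exact ones in `scripts/telegram-bridge.js`):

```javascript
// Per-chat Promise-chain queue: each chat ID maps to the tail of its
// chain, and new jobs are appended so they run strictly one at a time.
const chatQueues = new Map();

function enqueue(chatId, job) {
  const prev = chatQueues.get(chatId) || Promise.resolve();
  // .then(job, job): run the next job whether the previous one resolved
  // or rejected, so one failing job never stalls the rest of the queue.
  const next = prev.then(job, job);
  chatQueues.set(chatId, next);
  return next;
}

// Demo: two "concurrent" messages on the same chat run in order.
const order = [];
enqueue("chat-1", async () => {
  await new Promise((r) => setTimeout(r, 20)); // slow first job
  order.push("first");
});
const done = enqueue("chat-1", async () => {
  order.push("second");
});
done.then(() => console.log(order.join(","))); // → first,second
```

Because each chat has its own chain, messages from different chats still run concurrently; only same-chat messages serialize.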
Three things to address:
1. **No tests.** This is the more critical fix (root cause vs. aftermath in #862) and should have at least a unit test for queue serialization — e.g., two concurrent messages on the same chatId execute sequentially, not in parallel.
2. **Unbounded queue growth.** If a user sends 50 messages while the agent is processing, all 50 queue up with no backpressure. Consider capping queue depth per chat (e.g., 5) and responding with "Still processing, please wait" for messages beyond the limit.
3. **Poll interval.** The current 100ms poll is aggressive. Consider raising it to 1000ms — Telegram's `getUpdates` long-poll already handles latency, and 100ms just burns CPU between polls.
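The queue-depth cap suggested in point 2 could look roughly like this (a hedged sketch; `chatQueueDepths`, `MAX_QUEUE_DEPTH`, and `tryEnqueue` are names assumed from the thread, not confirmed API):

```javascript
// Backpressure: track how many jobs are queued per chat and reject
// new ones past a fixed cap instead of queuing them unboundedly.
const MAX_QUEUE_DEPTH = 5;
const chatQueueDepths = new Map();

function tryEnqueue(chatId) {
  const depth = chatQueueDepths.get(chatId) || 0;
  if (depth >= MAX_QUEUE_DEPTH) {
    // Caller would send this text back instead of queuing the job.
    return { accepted: false, reply: "Still processing, please wait" };
  }
  chatQueueDepths.set(chatId, depth + 1);
  return { accepted: true };
}

// Five messages fill the queue; the sixth is rejected.
for (let i = 0; i < 5; i++) tryEnqueue("chat-1");
console.log(tryEnqueue("chat-1").accepted); // → false
```

The depth counter would be decremented in the job's `finally` so slots free up as jobs complete.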
✨ Thanks for submitting this PR with a detailed summary. It addresses a bug with the Telegram bridge messages and proposes a fix to improve the performance of NemoClaw, which could enhance the user experience.
Force-pushed from 1f1be1a to 6de4b04
Actionable comments posted: 1
🧹 Nitpick comments (1)
`test/telegram-bridge-queue.test.js` (1)
**13-113:** Add a regression test for `/reset` during an active queued job. Current coverage validates queue mechanics, but it does not protect the reset-path race (queue eviction while a prior job is still running). A dedicated test for that sequence would prevent regressions of the lock-contention fix.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `test/telegram-bridge-queue.test.js` around lines 13-113: add a new test that enqueues a blocked job for a chatId (use chatQueues to chain a Promise that awaits a resolver), then simulate a /reset while that job is still running by deleting the chat's entries from chatQueues and chatQueueDepths (`chatQueues.delete(chatId); chatQueueDepths.delete(chatId)`); then resolve the blocked job and await its chain to ensure it completes without throwing and that the maps no longer contain the chatId. Reference chatQueues, chatQueueDepths and MAX_QUEUE_DEPTH in the test to locate where to hook into the existing tests.
📒 Files selected for processing (2)
- scripts/telegram-bridge.js
- test/telegram-bridge-queue.test.js
…contention

Replace per-chat cooldown + busyChats rejection with Promise-chain queuing that serializes agent calls per chat ID, preventing concurrent session-file lock collisions.

Address reviewer feedback:
- Add queue depth cap (MAX_QUEUE_DEPTH=5) with backpressure response
- Keep upstream 1000ms poll interval (not 100ms)
- Add unit tests for queue serialization and backpressure
- Wrap startup validation in init() so the module is import-safe for tests

Fixes NVIDIA#831
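The "import-safe for tests" change from the commit message could be shaped roughly like this (a sketch under assumptions — the env var name and `init()` signature are illustrative, not taken from the actual diff):

```javascript
// Import-safe startup validation: the module defines init() but does
// not run it at load time, so tests can require() it without real
// credentials and inject a controlled environment instead.
function init(env = process.env) {
  if (!env.TELEGRAM_BOT_TOKEN) {
    throw new Error("TELEGRAM_BOT_TOKEN is required");
  }
  return { token: env.TELEGRAM_BOT_TOKEN };
}

// A test supplies its own environment object:
console.log(init({ TELEGRAM_BOT_TOKEN: "dummy" }).token); // → dummy
```

The real bridge would presumably call `init()` from its entry point (e.g. behind a `require.main === module` check) so direct execution still validates config.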
Force-pushed from 6de4b04 to f350b1b
🧹 Nitpick comments (1)
scripts/telegram-bridge.js (1)
**224-243:** Consider moving `clearInterval` to a single location in `finally`. The `typingInterval` is cleared in both the `try` block (line 233) and the `catch` block (line 237). While calling `clearInterval` twice is harmless, consolidating cleanup in `finally` would be cleaner and ensure the interval is always cleared regardless of control flow.

♻️ Suggested refactor

```diff
 const job = async () => {
   // If the session was reset since this job was enqueued, skip it
   // so the old and new session identities never overlap.
   if ((chatEpochs.get(chatId) || 0) !== epoch) return;
   await sendTyping(chatId);
   const typingInterval = setInterval(() => sendTyping(chatId), 4000);
   try {
     const sessionId = epoch > 0 ? `${chatId}-e${epoch}` : chatId;
     const response = await runAgentInSandbox(text, sessionId);
-    clearInterval(typingInterval);
     console.log(`[${chatId}] agent: ${response.slice(0, 100)}...`);
     await sendMessage(chatId, response, messageId);
   } catch (err) {
-    clearInterval(typingInterval);
     await sendMessage(chatId, `Error: ${err.message}`, messageId);
   } finally {
+    clearInterval(typingInterval);
     chatQueueDepths.set(chatId, (chatQueueDepths.get(chatId) || 1) - 1);
     if (chatQueueDepths.get(chatId) <= 0) chatQueueDepths.delete(chatId);
   }
 };
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/telegram-bridge.js` around lines 224 - 243, The typingInterval is cleared in both the try and catch blocks within the job async function; move the clearInterval(typingInterval) call into the finally block so the interval is always cleaned exactly once. Modify the job function: remove clearInterval from the try and catch, and add clearInterval(typingInterval) at the start of the existing finally block that updates chatQueueDepths; keep references to typingInterval, job, runAgentInSandbox, and sendMessage to locate the change.
📒 Files selected for processing (2)
- scripts/telegram-bridge.js
- test/telegram-bridge-queue.test.js
🚧 Files skipped from review as they are similar to previous changes (1)
- test/telegram-bridge-queue.test.js
Fixes #831
Problem
Concurrent inbound messages for the same Telegram chat each trigger an independent `runAgentInSandbox` call keyed on the same session ID (`tg-`). Because the underlying session store uses file-level locking, overlapping writes from parallel SSH+agent processes race on the lock and surface as `session file locked (timeout 10000ms)` errors.
Root Cause
The poll loop dispatches agent calls inline with `await`, but the `for...of` iteration over updates only serializes messages within a single polling batch. Under sustained load, the next `getUpdates` batch can begin processing before prior agent calls resolve, producing concurrent sandbox processes for the same chat.
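The race is easy to reproduce in isolation (a minimal illustration — `handleBatch` and the stubbed `runAgentInSandbox` are hypothetical stand-ins for the bridge's poll loop):

```javascript
// Within a batch, messages are awaited serially; but if a second poll
// batch is dispatched before the first finishes, two agent calls for
// the same chat overlap.
let inFlight = 0;
let maxInFlight = 0;

async function runAgentInSandbox() {
  inFlight++;
  maxInFlight = Math.max(maxInFlight, inFlight);
  await new Promise((r) => setTimeout(r, 30)); // slow agent call
  inFlight--;
}

async function handleBatch(updates) {
  for (const u of updates) await runAgentInSandbox(u); // serial per batch only
}

// Two batches for the same chat started without awaiting the first.
const racing = Promise.all([handleBatch(["msg1"]), handleBatch(["msg2"])]);
racing.then(() => console.log(`max concurrent agent calls: ${maxInFlight}`)); // → 2
```

With the per-chat queue, the second batch's job would instead append to the first chain and `maxInFlight` would stay at 1.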
Fix
A per-chat promise chain (`chatQueues: Map<string, Promise>`) gates `runAgentInSandbox` so at most one invocation is in flight per chat at any time. Subsequent messages for the same chat are appended to the chain via `.then(job, job)` — the rejection handler ensures the queue drains even if an individual job throws. Cross-chat concurrency is unaffected.
Cleanup is handled in `.finally()`: the map entry is removed only when the stored reference matches the completing promise, preventing a late-settling chain from stomping a freshly enqueued one. The `/reset` command now also evicts the queue entry so a user reset doesn't block behind a stale in-flight call.
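The reference-matching cleanup described above can be sketched like this (illustrative names, not the exact bridge code):

```javascript
// Delete the map entry only if it still points at the chain that just
// settled; if another message was enqueued meanwhile, the entry now
// references a newer chain and must be left alone.
const chatQueues = new Map();

function enqueue(chatId, job) {
  const prev = chatQueues.get(chatId) || Promise.resolve();
  const next = prev.then(job, job).finally(() => {
    if (chatQueues.get(chatId) === next) chatQueues.delete(chatId);
  });
  chatQueues.set(chatId, next);
  return next;
}

// The first job's cleanup does not evict the second job's chain; once
// the last chain settles, the entry is gone.
enqueue("chat-1", async () => "a");
const second = enqueue("chat-1", async () => "b");
second.then(() => console.log(chatQueues.has("chat-1"))); // → false
```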