fix(onboard): add timeout to spawnSync calls to prevent hung processes by latenighthackathon · Pull Request #1069 · NVIDIA/NemoClaw

latenighthackathon · 2026-03-29T19:52:52Z

Summary

All spawnSync calls in onboard.js lacked a timeout option. A hung curl or stalled download would freeze the entire wizard indefinitely. Added process-level timeouts and timeout-specific error messaging.

Related Issue

Closes #1017

Changes

30s timeout on curl endpoint probes (10s buffer over curl --max-time 20)
10min timeout on ollama model downloads
5min timeout on install-openshell.sh
SIGTERM detection in pullOllamaModel() for timeout-specific error message
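
The pattern behind the items above can be sketched in a few lines. This is a minimal illustration, not the code in onboard.js: `runStep` and its arguments are hypothetical, and a short-lived Node child process stands in for curl, ollama, or the installer so the snippet runs anywhere. It relies on the `timeout` option of `spawnSync`, which sends the kill signal (SIGTERM by default) once the limit expires.

```javascript
const { spawnSync } = require("node:child_process");

// Hypothetical stand-in for a long-running onboarding step.
// Returns true on success, false on failure or timeout.
function runStep(label, args, timeoutMs) {
  const result = spawnSync(process.execPath, args, {
    encoding: "utf8",
    stdio: "pipe",
    timeout: timeoutMs, // the child is sent SIGTERM when this expires
  });
  // Same check the PR describes: a SIGTERM exit is treated as a timeout.
  if (result.signal === "SIGTERM") {
    console.error(`${label} timed out after ${timeoutMs} ms`);
    return false;
  }
  return result.status === 0;
}

// A child that would run for 5s but is only allowed 200ms.
const slowOk = runStep("slow step", ["-e", "setTimeout(() => {}, 5000)"], 200);
// A child that exits immediately, well inside its limit.
const fastOk = runStep("fast step", ["-e", ""], 5000);
console.log({ slowOk, fastOk });
```

The boolean return mirrors the contract described for pullOllamaModel(); callers that need to tell a timeout from other failures would need a richer return value.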

Type of Change

Code change for a new feature, bug fix, or refactor.
Code change with doc updates.
Doc only. Prose changes without code sample modifications.
Doc only. Includes code sample changes.

Testing

npx prek run --all-files passes (or equivalently make check).
npm test passes.
make docs builds without warnings. (for doc-only changes)

Checklist

General

I have read and followed the contributing guide.

Code Changes

Formatters applied.
No secrets, API keys, or credentials committed.

Summary by CodeRabbit

Bug Fixes
- Added execution time limits to remote probes and model list retrievals (short limits ~30s) to prevent hangs.
- Added a medium timeout (~5min) for installer-like operations and a longer timeout (~10min) for large model downloads.
- Improved timeout detection and error reporting so timed-out operations fail gracefully and return clear failure status.

coderabbitai · 2026-03-29T19:53:04Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

📝 Walkthrough

Added explicit timeouts to several synchronous shell invocations in the onboarding script: remote probes and model-list fetches now time out at 30s; local long-running subprocesses use longer timeouts (ollama pull: 10m, OpenShell install: 5m). pullOllamaModel now treats SIGTERM as a timeout and returns false.

Changes

Cohort / File(s)	Summary
Onboarding timeouts `bin/lib/onboard.js`	Added `timeout: 30_000` to remote endpoint probes and model-list fetches (OpenAI-like, Anthropic, NVIDIA). Increased timeouts for local long-running operations: `pullOllamaModel()` uses `timeout: 600_000` and detects `result.signal === "SIGTERM"` to emit a timeout-specific error and return `false`; `installOpenshell()` uses `timeout: 300_000`. No other logic or return-value changes.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

A rabbit taps a stopwatch bright, 🐇
Bash calls now end within their rhyme,
Long pulls get patience, short probes get light,
No frozen wizards wasting time,
Hops continue — tidy, right! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately and specifically describes the main change: adding timeout constraints to spawnSync calls to prevent hung processes.
Linked Issues check	✅ Passed	The changes fully address the requirements in issue `#1017`: timeout properties added to all identified spawnSync calls with appropriate durations (30s for probes, 10-min for ollama, 5-min for installer), SIGTERM detection implemented for timeout-specific error messaging, and the change prevents indefinite process hangs.
Out of Scope Changes check	✅ Passed	All changes are scoped to adding timeout constraints and timeout detection in spawnSync calls as specified in issue `#1017`; no unrelated modifications are present.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

coderabbitai

🧹 Nitpick comments (1)

bin/lib/onboard.js (1)

1227-1234: 10-minute timeout added for model pulls.

The 600-second timeout prevents indefinite hanging during ollama pull operations. When a timeout occurs, the function returns false, and the caller displays an error message.

Consider: The error message at Line 1245 doesn't distinguish between timeout and other failures (network errors, invalid model name, etc.). Users might benefit from timeout-specific guidance.

💡 Optional: Improve timeout diagnostics

Add a check for result.signal to provide timeout-specific error messages:

 function pullOllamaModel(model) {
   const result = spawnSync("bash", ["-c", `ollama pull ${shellQuote(model)}`], {
     cwd: ROOT,
     encoding: "utf8",
     stdio: "inherit",
     timeout: 600_000,
     env: { ...process.env },
   });
-  return result.status === 0;
+  if (result.signal === 'SIGTERM' && result.status === null) {
+    return { ok: false, timeout: true };
+  }
+  return { ok: result.status === 0 };
 }

Then update the caller (Line 1241) to check for timeout and provide specific guidance:

-    if (!pullOllamaModel(model)) {
+    const pullResult = pullOllamaModel(model);
+    if (!pullResult.ok) {
+      if (pullResult.timeout) {
+        return {
+          ok: false,
+          message: `Timed out pulling Ollama model '${model}' after 10 minutes. ` +
+            `Large models may need more time. Try running 'ollama pull ${model}' manually, or choose a smaller model.`
+        };
+      }
       return {

Note: For very large models (>20GB) on slower connections, the 10-minute timeout might be tight, though this is an edge case.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 1227 - 1234, The current spawnSync("bash",
["-c", `ollama pull ${shellQuote(model)}`]) invocation only returns a boolean
via result.status === 0, which loses timeout vs other-failure diagnostics;
change the function to detect and propagate timeout-specific info by inspecting
the spawnSync result (check result.signal and result.error if present, as well
as result.status) and return or throw a value that distinguishes a timeout
(e.g., result.signal === 'SIGTERM' or a custom status). Then update the caller
that currently treats a false return from this function to check for that
timeout indicator and show a timeout-specific error/help message (suggest
increasing timeout or checking network) while preserving the existing generic
error path for other failures; reference spawnSync, result.signal,
result.status, and the "ollama pull" command to locate the relevant code.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2b26a543-25d5-4344-8342-eef25264cdba

📥 Commits

Reviewing files that changed from the base of the PR and between eb4ba8c and 0f928f7.

📒 Files selected for processing (1)

bin/lib/onboard.js

All spawnSync calls in the onboarding wizard lacked a timeout option, meaning a hung curl or stalled download would freeze the entire wizard with no recovery path. Add process-level timeouts as a safety net:
- 30s for curl endpoint probes (10s buffer over curl --max-time 20)
- 10min for ollama model downloads
- 5min for install-openshell.sh execution

Closes NVIDIA#1017

When spawnSync kills the process due to timeout, result.signal is SIGTERM. Surface a specific error message so users know the 10-minute limit was hit, rather than seeing the generic pull failure message. Addresses CodeRabbit review nitpick.
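
One caveat worth illustrating: a SIGTERM exit can also come from something other than the timeout, such as the child being terminated externally. Node additionally sets `result.error` with code `ETIMEDOUT` when the `timeout` option fires, so a stricter check is possible. The sketch below is an assumption-labeled example, not the merged code: `classifyExit` is a hypothetical helper, and Node child processes stand in for `ollama pull`. It assumes a POSIX platform for the self-signal case.

```javascript
const { spawnSync } = require("node:child_process");

// Hypothetical classifier for a spawnSync result (not part of onboard.js).
function classifyExit(result) {
  // Set by Node only when the `timeout` option expired.
  if (result.error && result.error.code === "ETIMEDOUT") return "timeout";
  // Killed by a signal for some other reason.
  if (result.signal) return `killed:${result.signal}`;
  return result.status === 0 ? "ok" : `exit:${result.status}`;
}

const node = process.execPath;

// Killed by the spawnSync timeout.
const timedOut = spawnSync(node, ["-e", "setTimeout(() => {}, 5000)"], { timeout: 200 });

// Receives SIGTERM for a reason unrelated to the timeout (here it signals itself).
const selfKilled = spawnSync(
  node,
  ["-e", "process.kill(process.pid, 'SIGTERM'); setTimeout(() => {}, 5000)"],
  { timeout: 5000 }
);

// Ordinary non-zero exit.
const failed = spawnSync(node, ["-e", "process.exit(3)"], { timeout: 5000 });

console.log(classifyExit(timedOut), classifyExit(selfKilled), classifyExit(failed));
```

Both of the first two results carry `signal === "SIGTERM"`; only the `error.code` check separates them.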

kjw3 · 2026-03-30T17:51:53Z

This looks like a real improvement over main.

I’m comfortable with the core fix here: the missing spawnSync timeouts in onboard.js are a real problem, and this change prevents the onboarding flow from hanging indefinitely on stalled probes, ollama pull, or the OpenShell install path.

One follow-up I’d suggest before or after merge is around the Ollama timeout UX. A 10-minute cap is a reasonable safety valve for onboarding, but it will be too short for some larger models even on decent connections. I don’t think the onboarding flow should be responsible for arbitrarily large model downloads anyway, so the user guidance should make that clearer.

Two reasonable options:

- Bump the 10-minute timeout if the goal is to support larger in-onboarding pulls.
- Keep the timeout as-is, but improve the message to explicitly suggest pre-pulling large models outside the NemoClaw install/onboard path and then retrying.

If we keep the current timeout, I’d prefer the second option, since it matches a cleaner product story for onboarding.
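
As a rough illustration of the second option, the timeout message could point users at pulling the model outside onboarding. The function name and wording below are hypothetical and are not taken from the merged change; only the `ollama pull` command itself comes from this PR.

```javascript
// Hypothetical message builder; the name and wording are illustrative only.
function ollamaPullTimeoutMessage(model, timeoutMs) {
  const minutes = Math.round(timeoutMs / 60_000);
  return [
    `Timed out pulling Ollama model '${model}' after ${minutes} minutes.`,
    "Large models can take longer than the onboarding flow allows.",
    `Pull the model separately first: ollama pull ${model}`,
    "Then re-run onboarding, or choose a smaller model.",
  ].join("\n");
}

console.log(ollamaPullTimeoutMessage("example-model", 600_000));
```

Deriving the minute count from the timeout value keeps the message accurate if the limit is later changed.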

NVIDIA#1069)

* fix(onboard): add timeout to spawnSync calls to prevent hung processes

All spawnSync calls in the onboarding wizard lacked a timeout option, meaning a hung curl or stalled download would freeze the entire wizard with no recovery path. Add process-level timeouts as a safety net:
- 30s for curl endpoint probes (10s buffer over curl --max-time 20)
- 10min for ollama model downloads
- 5min for install-openshell.sh execution

Closes NVIDIA#1017

* fix(onboard): distinguish timeout from other failures in ollama pull

When spawnSync kills the process due to timeout, result.signal is SIGTERM. Surface a specific error message so users know the 10-minute limit was hit, rather than seeing the generic pull failure message.

Addresses CodeRabbit review nitpick.

---------

Co-authored-by: latenighthackathon <latenighthackathon@users.noreply.github.com>
Co-authored-by: KJ <kejones@nvidia.com>

coderabbitai bot reviewed Mar 29, 2026

latenighthackathon added 2 commits March 29, 2026 17:11

latenighthackathon force-pushed the fix/onboard-spawnsync-timeout branch from 5624521 to f3460d7 Compare March 29, 2026 22:12

Merge remote-tracking branch 'origin/main' into fix/onboard-spawnsync-timeout

7f8a710

kjw3 approved these changes Mar 30, 2026

kjw3 self-assigned this Mar 30, 2026

Merge branch 'main' into fix/onboard-spawnsync-timeout

1fb728e

kjw3 added 6 commits March 30, 2026 13:52

Merge branch 'main' into fix/onboard-spawnsync-timeout

e20d868

Merge branch 'main' into fix/onboard-spawnsync-timeout

2ac21d6

Merge branch 'main' into fix/onboard-spawnsync-timeout

5217fc5

Merge branch 'main' into fix/onboard-spawnsync-timeout

f4bf684

Merge branch 'main' into fix/onboard-spawnsync-timeout

fa870f9

Merge branch 'main' into fix/onboard-spawnsync-timeout

888868c

kjw3 merged commit 950b9db into NVIDIA:main Mar 30, 2026
8 checks passed

latenighthackathon deleted the fix/onboard-spawnsync-timeout branch March 30, 2026 18:32

kjw3 merged 10 commits into NVIDIA:main from latenighthackathon:fix/onboard-spawnsync-timeout
