fix(onboard): add timeout to spawnSync calls to prevent hung processes #1069

Merged
kjw3 merged 10 commits into NVIDIA:main from latenighthackathon:fix/onboard-spawnsync-timeout
Mar 30, 2026

Conversation

@latenighthackathon
Contributor

@latenighthackathon latenighthackathon commented Mar 29, 2026

Summary

All spawnSync calls in onboard.js lacked a timeout option. A hung curl or stalled download would freeze the entire wizard indefinitely. Added process-level timeouts and timeout-specific error messaging.

Related Issue

Closes #1017

Changes

  • 30s timeout on curl endpoint probes (10s buffer over curl --max-time 20)
  • 10min timeout on ollama model downloads
  • 5min timeout on install-openshell.sh
  • SIGTERM detection in pullOllamaModel() for timeout-specific error message
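
A minimal sketch of the layered limits listed above, not the code from onboard.js: `runWithTimeout` is a hypothetical helper, and a short-lived Node child stands in for a stalled curl so the example runs without network access. The `timeout` option of `spawnSync` is the process-level safety net; curl's own `--max-time` stays as the inner limit.

```javascript
const { spawnSync } = require("node:child_process");

// Hypothetical helper: run a command synchronously, giving up after timeoutMs.
function runWithTimeout(command, args, timeoutMs) {
  const result = spawnSync(command, args, {
    encoding: "utf8",
    timeout: timeoutMs, // the child is killed (SIGTERM by default) once this elapses
  });
  return {
    ok: result.status === 0,
    // Node reports a timeout through result.error with code "ETIMEDOUT".
    timedOut: result.error !== undefined && result.error.code === "ETIMEDOUT",
    stdout: result.stdout || "",
  };
}

// Stand-in for a hung probe: a child that would otherwise run for 60 seconds.
const hung = runWithTimeout(process.execPath, ["-e", "setTimeout(() => {}, 60000)"], 500);
console.log(hung.ok, hung.timedOut);

// Layered limits for a real probe would look like this (placeholder URL):
// runWithTimeout("curl", ["-sS", "--max-time", "20", "https://example.invalid/v1/models"], 30_000);
```

Keeping the outer timeout larger than `--max-time` lets curl report its own error first in the common case, so the process-level kill only fires when curl itself is stuck.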

Type of Change

  • Code change for a new feature, bug fix, or refactor.
  • Code change with doc updates.
  • Doc only. Prose changes without code sample modifications.
  • Doc only. Includes code sample changes.

Testing

  • npx prek run --all-files passes (or equivalently make check).
  • npm test passes.
  • make docs builds without warnings. (for doc-only changes)

Checklist

General

Code Changes

  • Formatters applied.
  • No secrets, API keys, or credentials committed.

Summary by CodeRabbit

  • Bug Fixes
    • Added execution time limits to remote probes and model list retrievals (short limits ~30s) to prevent hangs.
    • Added a medium timeout (~5min) for installer-like operations and a longer timeout (~10min) for large model downloads.
    • Improved timeout detection and error reporting so timed-out operations fail gracefully and return clear failure status.

@coderabbitai
Contributor

coderabbitai bot commented Mar 29, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Added explicit timeouts to several synchronous shell invocations in the onboarding script: remote probes and model-list fetches now time out at 30s; local long-running subprocesses use longer timeouts (ollama pull: 10m, OpenShell install: 5m). pullOllamaModel now treats SIGTERM as a timeout and returns false.
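
The timeout detection described in this walkthrough can be sketched in isolation. `classifyResult` is a hypothetical name, not a function from the PR, and the extra `ETIMEDOUT` check is an assumption added here for robustness: a bare SIGTERM can also come from an external `kill`, whereas Node sets `result.error.code` to `"ETIMEDOUT"` only when its own timeout fired.

```javascript
const { spawnSync } = require("node:child_process");

// Hypothetical classifier for a spawnSync result; not the project's actual code.
function classifyResult(result) {
  if (result.status === 0) return "ok";
  // On timeout, status is null and signal is the kill signal
  // (SIGTERM unless the killSignal option is overridden).
  const timedOut =
    (result.error !== undefined && result.error.code === "ETIMEDOUT") ||
    (result.status === null && result.signal === "SIGTERM");
  return timedOut ? "timeout" : "failed";
}

const node = process.execPath;
const outcomes = {
  ok: classifyResult(spawnSync(node, ["-e", "process.exit(0)"], { timeout: 10_000 })),
  failed: classifyResult(spawnSync(node, ["-e", "process.exit(3)"], { timeout: 10_000 })),
  timeout: classifyResult(spawnSync(node, ["-e", "setTimeout(() => {}, 60000)"], { timeout: 500 })),
};
console.log(outcomes);
```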

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Onboarding timeouts: bin/lib/onboard.js | Added timeout: 30_000 to remote endpoint probes and model-list fetches (OpenAI-like, Anthropic, NVIDIA). Increased timeouts for local long-running operations: pullOllamaModel() uses timeout: 600_000 and detects result.signal === "SIGTERM" to emit a timeout-specific error and return false; installOpenshell() uses timeout: 300_000. No other logic or return-value changes. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

A rabbit taps a stopwatch bright, 🐇
Bash calls now end within their rhyme,
Long pulls get patience, short probes get light,
No frozen wizards wasting time,
Hops continue — tidy, right! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and specifically describes the main change: adding timeout constraints to spawnSync calls to prevent hung processes. |
| Linked Issues check | ✅ Passed | The changes fully address the requirements in issue #1017: timeout properties added to all identified spawnSync calls with appropriate durations (30s for probes, 10-min for ollama, 5-min for installer), SIGTERM detection implemented for timeout-specific error messaging, and the change prevents indefinite process hangs. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to adding timeout constraints and timeout detection in spawnSync calls as specified in issue #1017; no unrelated modifications are present. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
bin/lib/onboard.js (1)

1227-1234: 10-minute timeout added for model pulls.

The 600-second timeout prevents indefinite hanging during ollama pull operations. When a timeout occurs, the function returns false, and the caller displays an error message.

Consider: The error message at Line 1245 doesn't distinguish between timeout and other failures (network errors, invalid model name, etc.). Users might benefit from timeout-specific guidance.

💡 Optional: Improve timeout diagnostics

Add a check for result.signal to provide timeout-specific error messages:

 function pullOllamaModel(model) {
   const result = spawnSync("bash", ["-c", `ollama pull ${shellQuote(model)}`], {
     cwd: ROOT,
     encoding: "utf8",
     stdio: "inherit",
     timeout: 600_000,
     env: { ...process.env },
   });
-  return result.status === 0;
+  if (result.signal === 'SIGTERM' && result.status === null) {
+    return { ok: false, timeout: true };
+  }
+  return { ok: result.status === 0 };
 }

Then update the caller (Line 1241) to check for timeout and provide specific guidance:

-    if (!pullOllamaModel(model)) {
+    const pullResult = pullOllamaModel(model);
+    if (!pullResult.ok) {
+      if (pullResult.timeout) {
+        return {
+          ok: false,
+          message: `Timed out pulling Ollama model '${model}' after 10 minutes. ` +
+            `Large models may need more time. Try running 'ollama pull ${model}' manually, or choose a smaller model.`
+        };
+      }
       return {

Note: For very large models (>20GB) on slower connections, the 10-minute timeout might be tight, though this is an edge case.
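
To put a number on that edge case, the sustained throughput needed to finish a download inside a given timeout is simply size divided by time. The sketch below is illustrative arithmetic only; the function name and the sample sizes are assumptions, and sizes use decimal gigabytes (1 GB = 1e9 bytes).

```javascript
// Minimum sustained throughput (in Mbit/s) needed to download sizeGB
// before timeoutMs elapses. Ignores protocol overhead and slow starts.
function requiredMegabitsPerSecond(sizeGB, timeoutMs) {
  const bits = sizeGB * 1e9 * 8;
  const seconds = timeoutMs / 1000;
  return bits / seconds / 1e6;
}

for (const sizeGB of [5, 20, 40]) {
  const mbps = requiredMegabitsPerSecond(sizeGB, 600_000);
  console.log(`${sizeGB} GB in 10 min needs about ${mbps.toFixed(0)} Mbit/s`);
}
```

A 20 GB model therefore needs roughly 267 Mbit/s sustained for the full ten minutes, which is why larger pulls can hit the limit even on a reasonably fast connection.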

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 1227 - 1234, The current spawnSync("bash",
["-c", `ollama pull ${shellQuote(model)}`]) invocation only returns a boolean
via result.status === 0, which loses timeout vs other-failure diagnostics;
change the function to detect and propagate timeout-specific info by inspecting
the spawnSync result (check result.signal and result.error if present, as well
as result.status) and return or throw a value that distinguishes a timeout
(e.g., result.signal === 'SIGTERM' or a custom status). Then update the caller
that currently treats a false return from this function to check for that
timeout indicator and show a timeout-specific error/help message (suggest
increasing timeout or checking network) while preserving the existing generic
error path for other failures; reference spawnSync, result.signal,
result.status, and the "ollama pull" command to locate the relevant code.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2b26a543-25d5-4344-8342-eef25264cdba

📥 Commits

Reviewing files that changed from the base of the PR and between eb4ba8c and 0f928f7.

📒 Files selected for processing (1)
  • bin/lib/onboard.js

All spawnSync calls in the onboarding wizard lacked a timeout option,
meaning a hung curl or stalled download would freeze the entire wizard
with no recovery path.

Add process-level timeouts as a safety net:
- 30s for curl endpoint probes (10s buffer over curl --max-time 20)
- 10min for ollama model downloads
- 5min for install-openshell.sh execution

Closes NVIDIA#1017

When spawnSync kills the process due to timeout, result.signal is
SIGTERM. Surface a specific error message so users know the 10-minute
limit was hit, rather than seeing the generic pull failure message.

Addresses CodeRabbit review nitpick.
@latenighthackathon latenighthackathon force-pushed the fix/onboard-spawnsync-timeout branch from 5624521 to f3460d7 Compare March 29, 2026 22:12
@kjw3 kjw3 self-assigned this Mar 30, 2026
@kjw3
Contributor

kjw3 commented Mar 30, 2026

This looks like a real improvement over main.

I’m comfortable with the core fix here: the missing spawnSync timeouts in onboard.js are a real problem, and this change prevents the onboarding flow from hanging indefinitely on stalled probes, ollama pull, or the OpenShell install path.

One follow-up I’d suggest before or after merge is around the Ollama timeout UX. A 10-minute cap is a reasonable safety valve for onboarding, but it will be too short for some larger models even on decent connections. I don’t think the onboarding flow should be responsible for arbitrarily large model downloads anyway, so the user guidance should make that clearer.

Two reasonable options:

  • bump the 10-minute timeout if the goal is to support larger in-onboarding pulls
  • keep the timeout as-is, but improve the message to explicitly suggest pre-pulling large models outside the NemoClaw install/onboard path and then retrying

If we keep the current timeout, I’d prefer the second option, since it matches a cleaner product story for onboarding.
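
One way the second option could look is a message builder that names the limit and points at pre-pulling. This is a sketch only: `formatPullTimeoutMessage`, the wording, and the sample model name are hypothetical, and nothing here is taken from the merged code.

```javascript
// Hypothetical message builder; the wording and function name are illustrative.
function formatPullTimeoutMessage(model, timeoutMs) {
  const minutes = Math.round(timeoutMs / 60_000);
  return [
    `Timed out pulling Ollama model '${model}' after ${minutes} minutes.`,
    `Large models are best pulled ahead of time: run 'ollama pull ${model}' in a separate terminal,`,
    "then re-run onboarding once the download has finished.",
  ].join(" ");
}

console.log(formatPullTimeoutMessage("example-model:70b", 600_000));
```

Deriving the minute count from the timeout value keeps the message accurate if the limit is changed later.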

@kjw3 kjw3 merged commit 950b9db into NVIDIA:main Mar 30, 2026
8 checks passed
@latenighthackathon latenighthackathon deleted the fix/onboard-spawnsync-timeout branch March 30, 2026 18:32
quanticsoul4772 pushed a commit to quanticsoul4772/NemoClaw that referenced this pull request Mar 30, 2026
NVIDIA#1069)

* fix(onboard): add timeout to spawnSync calls to prevent hung processes

All spawnSync calls in the onboarding wizard lacked a timeout option,
meaning a hung curl or stalled download would freeze the entire wizard
with no recovery path.

Add process-level timeouts as a safety net:
- 30s for curl endpoint probes (10s buffer over curl --max-time 20)
- 10min for ollama model downloads
- 5min for install-openshell.sh execution

Closes NVIDIA#1017

* fix(onboard): distinguish timeout from other failures in ollama pull

When spawnSync kills the process due to timeout, result.signal is
SIGTERM. Surface a specific error message so users know the 10-minute
limit was hit, rather than seeing the generic pull failure message.

Addresses CodeRabbit review nitpick.

---------

Co-authored-by: latenighthackathon <latenighthackathon@users.noreply.github.com>
Co-authored-by: KJ <kejones@nvidia.com>

Development

Successfully merging this pull request may close these issues.

[NemoClaw] spawnSync calls in onboard.js have no timeout — hung curl freezes entire onboarding wizard

2 participants