fix(onboard): add timeout to spawnSync calls to prevent hung processes#1069
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdded explicit timeouts to several synchronous shell invocations in the onboarding script: remote probes and model-list fetches now time out at 30s; local long-running subprocesses use longer timeouts (ollama pull: 10m, OpenShell install: 5m). pullOllamaModel now treats SIGTERM as a timeout and returns false. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
bin/lib/onboard.js (1)
1227-1234: 10-minute timeout added for model pulls.The 600-second timeout prevents indefinite hanging during
ollama pulloperations. When a timeout occurs, the function returnsfalse, and the caller displays an error message.Consider: The error message at Line 1245 doesn't distinguish between timeout and other failures (network errors, invalid model name, etc.). Users might benefit from timeout-specific guidance.
💡 Optional: Improve timeout diagnostics
Add a check for
result.signalto provide timeout-specific error messages:function pullOllamaModel(model) { const result = spawnSync("bash", ["-c", `ollama pull ${shellQuote(model)}`], { cwd: ROOT, encoding: "utf8", stdio: "inherit", timeout: 600_000, env: { ...process.env }, }); - return result.status === 0; + if (result.signal === 'SIGTERM' && result.status === null) { + return { ok: false, timeout: true }; + } + return { ok: result.status === 0 }; }Then update the caller (Line 1241) to check for timeout and provide specific guidance:
- if (!pullOllamaModel(model)) { + const pullResult = pullOllamaModel(model); + if (!pullResult.ok) { + if (pullResult.timeout) { + return { + ok: false, + message: `Timed out pulling Ollama model '${model}' after 10 minutes. ` + + "Large models may need more time. Try: ollama pull ${model} manually, or choose a smaller model." + }; + } return {Note: For very large models (>20GB) on slower connections, the 10-minute timeout might be tight, though this is an edge case.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/lib/onboard.js` around lines 1227 - 1234, The current spawnSync("bash", ["-c", `ollama pull ${shellQuote(model)}`]) invocation only returns a boolean via result.status === 0, which loses timeout vs other-failure diagnostics; change the function to detect and propagate timeout-specific info by inspecting the spawnSync result (check result.signal and result.error if present, as well as result.status) and return or throw a value that distinguishes a timeout (e.g., result.signal === 'SIGTERM' or a custom status). Then update the caller that currently treats a false return from this function to check for that timeout indicator and show a timeout-specific error/help message (suggest increasing timeout or checking network) while preserving the existing generic error path for other failures; reference spawnSync, result.signal, result.status, and the "ollama pull" command to locate the relevant code.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@bin/lib/onboard.js`:
- Around line 1227-1234: The current spawnSync("bash", ["-c", `ollama pull
${shellQuote(model)}`]) invocation only returns a boolean via result.status ===
0, which loses timeout vs other-failure diagnostics; change the function to
detect and propagate timeout-specific info by inspecting the spawnSync result
(check result.signal and result.error if present, as well as result.status) and
return or throw a value that distinguishes a timeout (e.g., result.signal ===
'SIGTERM' or a custom status). Then update the caller that currently treats a
false return from this function to check for that timeout indicator and show a
timeout-specific error/help message (suggest increasing timeout or checking
network) while preserving the existing generic error path for other failures;
reference spawnSync, result.signal, result.status, and the "ollama pull" command
to locate the relevant code.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2b26a543-25d5-4344-8342-eef25264cdba
📒 Files selected for processing (1)
bin/lib/onboard.js
All spawnSync calls in the onboarding wizard lacked a timeout option, meaning a hung curl or stalled download would freeze the entire wizard with no recovery path. Add process-level timeouts as a safety net: - 30s for curl endpoint probes (10s buffer over curl --max-time 20) - 10min for ollama model downloads - 5min for install-openshell.sh execution Closes NVIDIA#1017
When spawnSync kills the process due to timeout, result.signal is SIGTERM. Surface a specific error message so users know the 10-minute limit was hit, rather than seeing the generic pull failure message. Addresses CodeRabbit review nitpick.
5624521 to
f3460d7
Compare
|
This looks like a real improvement over I’m comfortable with the core fix here: the missing One follow-up I’d suggest before or after merge is around the Ollama timeout UX. A 10-minute cap is a reasonable safety valve for onboarding, but it will be too short for some larger models even on decent connections. I don’t think the onboarding flow should be responsible for arbitrarily large model downloads anyway, so the user guidance should make that clearer. Two reasonable options:
If we keep the current timeout, I’d prefer the second option, since it matches a cleaner product story for onboarding. |
NVIDIA#1069) * fix(onboard): add timeout to spawnSync calls to prevent hung processes All spawnSync calls in the onboarding wizard lacked a timeout option, meaning a hung curl or stalled download would freeze the entire wizard with no recovery path. Add process-level timeouts as a safety net: - 30s for curl endpoint probes (10s buffer over curl --max-time 20) - 10min for ollama model downloads - 5min for install-openshell.sh execution Closes NVIDIA#1017 * fix(onboard): distinguish timeout from other failures in ollama pull When spawnSync kills the process due to timeout, result.signal is SIGTERM. Surface a specific error message so users know the 10-minute limit was hit, rather than seeing the generic pull failure message. Addresses CodeRabbit review nitpick. --------- Co-authored-by: latenighthackathon <latenighthackathon@users.noreply.github.com> Co-authored-by: KJ <kejones@nvidia.com>
Summary
All spawnSync calls in onboard.js lacked a timeout option. A hung curl or stalled download would freeze the entire wizard indefinitely. Added process-level timeouts and timeout-specific error messaging.
Related Issue
Closes #1017
Changes
Type of Change
Testing
npx prek run --all-filespasses (or equivalentlymake check).npm testpasses.make docsbuilds without warnings. (for doc-only changes)Checklist
General
Code Changes
Summary by CodeRabbit