fix: wait for cloud-init completion before connecting to new VMs #930
Merged
Add all development tools that were in the Python vm_provisioning.py but missing from the Rust cloud_init implementation:
- GitHub CLI (gh) via official apt repo
- Azure CLI via InstallAzureCLIDeb script
- Node.js 22.x via NodeSource
- Claude Code AI assistant
- Go 1.24.1
- Python 3.13 + python-is-python3
- uv package manager
- tmux configuration (status bar, socket permissions)
- Docker post-install (add user to docker group)
- npm global prefix configuration
- .bashrc PATH additions (Go, Cargo, npm)

Updates both cloud-init code paths:
- cloud_init.rs: YAML-based cloud-init config (packages + runcmd)
- vm.rs: shell-script based cloud-init provisioning

Removes broken amplihack make install (target doesn't exist).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Quality audit findings:
- default_dev_setup_commands() now takes a username parameter instead of hardcoding 'azureuser' (HIGH: would fail for non-default usernames)
- tmux socket dir uses dynamic UID via id -u instead of hardcoded 1000 (MEDIUM: would fail if user UID != 1000)
- Added version verification step to shell-script cloud-init path (matches YAML path's existing verification)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
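The first two audit findings can be illustrated with a minimal sketch. It assumes the function returns a list of shell commands for cloud-init's runcmd; the actual command list in cloud_init.rs is not shown in this PR, so the commands below are illustrative stand-ins. The only points it makes are that the username is a parameter and that the UID is resolved on the VM with `id -u` rather than written as 1000.

```rust
/// Hypothetical sketch: build per-user setup commands without hardcoding
/// the account name or its UID. The real command list may differ.
fn default_dev_setup_commands(username: &str) -> Vec<String> {
    vec![
        // Docker post-install: let the user run docker without sudo.
        format!("usermod -aG docker {username}"),
        // tmux keeps its socket in /tmp/tmux-<UID>; resolve the UID on the
        // VM at run time instead of assuming 1000.
        format!("mkdir -p /tmp/tmux-$(id -u {username})"),
        format!("chown {username}:{username} /tmp/tmux-$(id -u {username})"),
        format!("chmod 700 /tmp/tmux-$(id -u {username})"),
    ]
}

fn main() {
    // Works for any account name, not only the Azure default.
    for cmd in default_dev_setup_commands("alice") {
        println!("{cmd}");
    }
}
```

Because the username is interpolated into shell commands, a real implementation would presumably validate it before use.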
Quality audit cycle 2 findings:
- rustc --version ran as root but Rust is installed in the user's home directory (HIGH: verification always failed even when install succeeded)
- Standardize on apt-get upgrade instead of full-upgrade to match shell-script path and avoid unexpected package removal (MEDIUM)
- Remove unnecessary ripgrep reinstall (only needed with full-upgrade)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Quality audit cycle 3: eliminate predictable /tmp path for GitHub CLI GPG keyring download. Download directly to /etc/apt/keyrings/ in both cloud-init code paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
After VM creation, azlin now waits for cloud-init provisioning to finish before forwarding credentials or auto-connecting the user. Previously, the code only waited for SSH to become reachable (~2 min), but cloud-init takes 5-10 min to install all tools (gh, az, node, rustc, etc.).

Changes:
- Increase SSH wait timeout from 120s to 300s
- Add wait_for_cloud_init() that polls cloud-init status over SSH every 10s until done/disabled/error (600s timeout, best-effort)
- Add ssh_output() helper that captures remote command stdout
- Add resolve_ssh_key() + base_ssh_args() to inject identity key (~/.ssh/azlin_key) into all SSH/SCP operations
- Handle cloud-init "disabled" state as terminal (no 600s hang)
- Add ConnectTimeout=10 to ssh_output to prevent hung connections
- Update credential-forwarding docs with cloud-init wait behavior

Fixes #929

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
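The polling behaviour described above can be sketched roughly as follows. This is not the PR's code: the enum, the parser, and the closure-based signature are assumptions made so the loop can run without a real VM. In azlin the `fetch` closure would presumably wrap `ssh_output()` running `cloud-init status`, with a 10s interval and a 600s timeout.

```rust
use std::time::{Duration, Instant};

/// States reported by `cloud-init status` (hypothetical type for this sketch).
#[derive(Debug, PartialEq, Clone, Copy)]
enum CloudInitState {
    Running,
    Done,
    Disabled,
    Error,
    Unknown,
}

/// Parse output such as "status: done" into a state.
fn parse_cloud_init_status(output: &str) -> CloudInitState {
    for line in output.lines() {
        if let Some(value) = line.trim().strip_prefix("status:") {
            return match value.trim() {
                "running" => CloudInitState::Running,
                "done" => CloudInitState::Done,
                "disabled" => CloudInitState::Disabled,
                "error" => CloudInitState::Error,
                _ => CloudInitState::Unknown,
            };
        }
    }
    CloudInitState::Unknown
}

/// Poll `fetch` until a terminal state (done/disabled/error) or the timeout.
/// Returns None on timeout so the caller can continue best-effort.
fn wait_for_cloud_init<F>(
    mut fetch: F,
    interval: Duration,
    timeout: Duration,
) -> Option<CloudInitState>
where
    F: FnMut() -> Option<String>,
{
    let deadline = Instant::now() + timeout;
    loop {
        // A failed fetch (None) is treated like "not ready yet".
        if let Some(output) = fetch() {
            let state = parse_cloud_init_status(&output);
            if matches!(
                state,
                CloudInitState::Done | CloudInitState::Disabled | CloudInitState::Error
            ) {
                return Some(state);
            }
        }
        if Instant::now() >= deadline {
            return None;
        }
        std::thread::sleep(interval);
    }
}

fn main() {
    // Canned replies stand in for the SSH call in this demo.
    let mut replies = vec!["status: running", "status: running", "status: done"].into_iter();
    let result = wait_for_cloud_init(
        || replies.next().map(String::from),
        Duration::from_millis(10),
        Duration::from_secs(5),
    );
    println!("{:?}", result);
}
```

Treating "disabled" as terminal is what avoids the full 600s wait on VMs that do not run cloud-init; returning `None` on timeout rather than an error matches the "best-effort" wording.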
Review Findings (Steps 10 + 16)
- Reviewer Agent
- Security Agent
- Philosophy Guardian (Zen-Architect)

All three reviews pass. No blocking issues remain.
… into fix/issue-929-cloud-init-wait
The cloud-init script had three issues. The first two caused set -euo pipefail to abort before installing gh, az, node, and other tools; the third broke VM creation itself:
1. update-alternatives --set python3 python3.13 breaks apt tools because apt_pkg is built for system Python 3.12, causing apt-get update to fail with "No module named 'apt_pkg'". Fixed by installing python3.13 without changing the system python3 default.
2. get-pip.py fails on Ubuntu 24.04 because pip 24.0 is already installed as a debian package. Removed the pip reinstall entirely.
3. Em dash (U+2014) in a shell comment caused an Azure CLI latin-1 encoding error when passing --custom-data. Replaced with an ASCII hyphen.

Fixes #929

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
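The PR fixes the third issue by replacing the offending character. A guard against regressions could look like the sketch below, which scans a generated script for non-ASCII characters before it is passed as --custom-data. The helper name `find_non_ascii` is hypothetical and nothing in this PR indicates such a check exists in azlin.

```rust
/// Return (1-based line number, character) for every non-ASCII character
/// in a script. Hypothetical helper, not part of the PR.
fn find_non_ascii(script: &str) -> Vec<(usize, char)> {
    script
        .lines()
        .enumerate()
        .flat_map(|(i, line)| {
            line.chars()
                .filter(|c| !c.is_ascii())
                .map(move |c| (i + 1, c))
        })
        .collect()
}

fn main() {
    // The second line contains an em dash (U+2014) in a comment.
    let script = "#!/bin/bash\n# install tools \u{2014} step 1\napt-get update\n";
    for (line, ch) in find_non_ascii(script) {
        println!("line {line}: U+{:04X}", ch as u32);
    }
}
```

A unit test asserting that this returns an empty list for each generated cloud-init script would catch a stray typographic character before it reaches the Azure CLI.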
Summary
Fixes #929 — VM creation now waits for cloud-init provisioning to complete before forwarding credentials or connecting the user.
- `wait_for_cloud_init()`: polls `cloud-init status` over SSH every 10s until done/disabled/error (600s max)
- `resolve_ssh_key()` + `base_ssh_args()`: inject identity key (~/.ssh/azlin_key) into all SSH/SCP operations
- Handle `status: disabled` as terminal (prevents 600s hang on non-cloud-init VMs)
- Add `ConnectTimeout=10` to `ssh_output` to prevent hung connections

Step 13: Local Testing Results
Test Environment: fix/issue-929-cloud-init-wait branch, release binary, 2026-04-01
Tests Executed:
- `azlin new --name test-929-v3 --yes --no-auto-connect` → SSH connected, cloud-init waited, credentials forwarded ✅
- `cloud-init status` reports "done" before proceeding, identity key used for SSH/SCP ✅

Regressions: All 2180 unit tests pass, clippy clean ✅
Issues Found: Cloud-init script itself has pre-existing issues installing some tools (gh, node, az) — separate from this PR's wait-for-completion fix.
Test plan
- `azlin new --name <name>` and verify:
- `azlin connect` to already-provisioned VMs is unaffected
- `cargo test --package azlin`: all tests pass

🤖 Generated with Claude Code