fix(ci): improve LXD container setup reliability#17
Conversation
Waiting for a nameserver entry in resolv.conf is not sufficient on some distros (e.g. Slackware) where the network stack takes longer to become fully functional. Pinging 1.1.1.1 directly gives a reliable signal that the network is actually reachable. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Up to standards ✅🟢 Issues
|
There was a problem hiding this comment.
Code Review
This pull request updates the Makefile to use a ping command to verify network connectivity in LXD containers before proceeding with package installation. Feedback indicates that the ping command may not be available in all minimal images, which could cause the CI to hang, and suggests implementing a fallback mechanism to ensure the loop can terminate.
| fi | ||
| $(LXC) exec $(LXD_CONTAINER) -- sh -c '\ | ||
| until grep -q ^nameserver /etc/resolv.conf 2>/dev/null; do sleep 1; done; \ | ||
| until ping -c1 -W2 1.1.1.1 > /dev/null 2>&1; do sleep 1; done; \ |
There was a problem hiding this comment.
The ping command is not guaranteed to be present in minimal LXD images (e.g., the official Arch Linux or minimal Debian images often do not include iputils by default). Since this check runs before any packages are installed, if ping is missing, the until loop will fail with exit code 127 and loop indefinitely, causing the CI to hang.
Consider adding a check for the existence of ping or providing a fallback to the previous /etc/resolv.conf logic to ensure the loop can terminate even if ping is unavailable.
until ping -c1 -W2 1.1.1.1 > /dev/null 2>&1 || { ! command -v ping >/dev/null 2>&1 && grep -q ^nameserver /etc/resolv.conf 2>/dev/null; }; do sleep 1; done; \
ping requires ICMP which is blocked in many CI environments and can hang indefinitely. Using bash's built-in /dev/tcp to probe TCP port 53 on 1.1.1.1 is reliable, requires no external tools, and respects connection timeouts. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
/dev/tcp is a bash-only feature. The containers run 'sh -c ...' which on Alpine is busybox ash and on Debian/Ubuntu is dash — neither supports /dev/tcp, so the until loop spins forever. /proc/net/route is a Linux kernel file always present, requires no external tools, and works in any POSIX sh. The awk check succeeds once a default route with a non-zero gateway is established (fields 2 and 3 in hex: 00000000 = destination 0.0.0.0, non-zero = actual gateway). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The awk approach had two problems: - $2/$3 were eaten by Make (needed $$2/$$3) - awk single quotes broke the outer sh -c '...' quoting 'ip route | grep -q "^default"' avoids both issues: no dollar signs, no nested single quotes. 'ip' is available pre-install on all LXD images (busybox on Alpine, iproute2 on everything else). Verified locally with alpine/3.21 (busybox sh): 29 passed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… .venv lxc file push has no exclude option, so switch to a tar pipe. Exclude .venv, .git, and __pycache__ to avoid transferring large/ platform-specific files to containers. Also rm -rf .venv before setup so reused containers don't have a stale host-built venv with shebangs pointing to host paths, which caused 'uv run pytest' to fail with 'No such file or directory' on Arch Linux and ubuntu-minimal-daily:26.04. Verified locally: 21/21 distros pass (plus alpine/3.21 from earlier). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
0ebb7e3 to
584c9c2
Compare
ip route confirms routing is up but DNS may still be initialising. Add a nslookup check (available via busybox on Alpine and widely elsewhere) after the default-route check to avoid transient DNS failures during apk/apt-get/dnf/etc. Fixes arm64 CI failure where alpine/3.23 got a DNS transient error when fetching dl-cdn.alpinelinux.org immediately after network came up. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
584c9c2 to
ad0e8c3
Compare
Fixes several issues with the
test-lxdMakefile target that caused containers to hang or fail in CI.Changes
Network readiness check
Waits for both a default route and DNS to be functional before running package installs:
Previous attempts all failed in CI:
pinghangs — ICMP is blocked in the GitHub Actions environment/dev/tcpis bash-only; containers usingdash(Debian/Ubuntu) orbusybox sh(Alpine) loop foreverip routealone is not sufficient — routing comes up before DNS is ready, causingapk/pacmanto fail with "Could not resolve host" on Alpine and Arch Linux ARMnslookup cloudflare.comalone hangs on AlmaLinux (and other RHEL-based distros) —nslookupis not pre-installedip routeis available everywhere (iproute2 or busybox).getentis available on all glibc/musl distros and resolves via the system resolver.nslookupis kept as a fallback for any distro wheregetentis unavailable.Package install retry logic
Wraps all package manager invocations in a
retry()function (3 attempts, 5 s backoff):DNS on freshly launched LXD containers can be flaky even after the
getentcheck passes — observed on Arch Linux ARM wheremirror.archlinuxarm.orgfails to resolve on the first attempt despitecloudflare.comresolving. Retry handles this without needing to perfectly probe every possible mirror hostname.Container file transfer
Replaces
lxc file push --recursivewith atarpipe to exclude platform-specific files:tar -C $(dir $(PWD)) \ --exclude='$(notdir $(PWD))/.venv' \ --exclude='$(notdir $(PWD))/.git' \ --exclude='*/__pycache__' \ -c $(notdir $(PWD)) \ | lxc exec ... -- tar -C /root -x.venv,.git, and__pycache__from the transfer (lxc file pushhas no--excludeoption)rm -rf .venvbefore setup so reused containers do not keep a stale host-built venvPushing the host
.venvcauseduv run pytestto fail withNo such file or directory— the venv scripts had shebangs pointing to the host Python path.Verified
All 22 distros pass locally (x86-64). CI arm64 runner failures traced to transient DNS on Alpine and Arch Linux ARM, addressed by the DNS wait and retry changes above.