fix(ci): improve LXD container setup reliability#17

Merged: lengau merged 6 commits into main from work/ci-more-distros on Apr 22, 2026

@lengau (Owner) commented Apr 22, 2026

Fixes several issues with the test-lxd Makefile target that caused containers to hang or fail in CI.

Changes

Network readiness check

Waits for both a default route and DNS to be functional before running package installs:

until ip route 2>/dev/null | grep -q "^default"; do sleep 1; done
until getent hosts cloudflare.com >/dev/null 2>&1 || nslookup cloudflare.com >/dev/null 2>&1; do sleep 1; done

Previous attempts all failed in CI:

  • ping hangs: ICMP is blocked in the GitHub Actions environment
  • /dev/tcp is bash-only; containers using dash (Debian/Ubuntu) or busybox sh (Alpine) loop forever
  • ip route alone is not sufficient: routing comes up before DNS is ready, causing apk/pacman to fail with "Could not resolve host" on Alpine and Arch Linux ARM
  • nslookup cloudflare.com alone loops forever on AlmaLinux (and other RHEL-based distros), where nslookup is not pre-installed, so the until condition can never succeed

ip route is available everywhere (iproute2 or busybox). getent is available on all glibc/musl distros and resolves via the system resolver. nslookup is kept as a fallback for any distro where getent is unavailable.
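The two until loops above have no upper bound, so a probe that can never succeed still hangs the job. A minimal sketch of a bounded variant follows, assuming a 60-second budget is acceptable; wait_for, has_default_route, and can_resolve are hypothetical helper names, not part of this PR.

```shell
#!/bin/sh
# Hypothetical helper: run a probe once per second until it succeeds
# or the given number of seconds has elapsed.
# Usage: wait_for SECONDS COMMAND [ARGS...]
wait_for() {
    timeout_s=$1; shift
    elapsed=0
    until "$@" >/dev/null 2>&1; do
        # Give up once the time budget is used up.
        [ "$elapsed" -ge "$timeout_s" ] && return 1
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# The same probes as in the PR description, wrapped as functions.
has_default_route() { ip route 2>/dev/null | grep -q "^default"; }
can_resolve() {
    getent hosts "$1" >/dev/null 2>&1 || nslookup "$1" >/dev/null 2>&1
}

# Example use inside a container (commented out: it needs a network):
#   wait_for 60 has_default_route || { echo "no default route" >&2; exit 1; }
#   wait_for 60 can_resolve cloudflare.com || { echo "DNS not ready" >&2; exit 1; }
```

With a bound in place, a missing probe tool turns into a failed step with a message rather than a job that runs until the CI timeout.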

Package install retry logic

Wraps all package manager invocations in a retry() function (up to 3 attempts, with a fixed 5 s delay after each failed attempt):

retry() { n=0; until [ $n -ge 3 ]; do "$@" && return 0; n=$((n+1)); sleep 5; done; return 1; }

DNS on freshly launched LXD containers can be flaky even after the getent check passes — observed on Arch Linux ARM where mirror.archlinuxarm.org fails to resolve on the first attempt despite cloudflare.com resolving. Retry handles this without needing to perfectly probe every possible mirror hostname.
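For readers who want to try the retry pattern outside the Makefile, here is a minimal sketch with the attempt count and delay passed as parameters. It is an illustration rather than the PR's code: the parameterised signature and the flaky demo command are assumptions, and unlike the one-liner above it does not sleep after the final failed attempt.

```shell
#!/bin/sh
# Usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS tries have been made.
retry() {
    attempts=$1; delay=$2; shift 2
    n=1
    while :; do
        "$@" && return 0
        # Stop without sleeping once the last attempt has failed.
        [ "$n" -ge "$attempts" ] && return 1
        n=$((n + 1))
        sleep "$delay"
    done
}

# Demo: a hypothetical flaky command that fails twice, then succeeds.
count_file=$(mktemp)
echo 0 > "$count_file"
flaky() {
    c=$(($(cat "$count_file") + 1))
    echo "$c" > "$count_file"
    [ "$c" -ge 3 ]
}

retry 3 0 flaky && echo "succeeded after $(cat "$count_file") attempts"
rm -f "$count_file"
```

In the Makefile context the equivalent call would wrap a package manager command, for example retry 3 5 followed by the install command.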

Container file transfer

Replaces lxc file push --recursive with a tar pipe to exclude platform-specific files:

tar -C $(dir $(PWD)) \
    --exclude='$(notdir $(PWD))/.venv' \
    --exclude='$(notdir $(PWD))/.git' \
    --exclude='*/__pycache__' \
    -c $(notdir $(PWD)) \
    | lxc exec ... -- tar -C /root -x

  • Excludes .venv, .git, and __pycache__ from the transfer (lxc file push has no --exclude option)
  • Adds rm -rf .venv before setup so reused containers do not keep a stale host-built venv

Pushing the host .venv caused uv run pytest to fail with No such file or directory — the venv scripts had shebangs pointing to the host Python path.
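The exclude patterns can be checked locally without LXD. The sketch below is an assumption-laden stand-in: a second local tar replaces the lxc exec side of the pipe, the directory name project is made up, and -f - is spelled out so the archive goes through the pipe regardless of tar's compiled-in default.

```shell
#!/bin/sh
set -eu

# Scratch directories standing in for the host checkout and the container.
src_parent=$(mktemp -d)
dest=$(mktemp -d)
trap 'rm -rf "$src_parent" "$dest"' EXIT

# Build a tiny project tree with the kinds of files the PR excludes.
mkdir -p "$src_parent/project/.venv/bin" \
         "$src_parent/project/.git" \
         "$src_parent/project/src/__pycache__"
echo 'print("hi")' > "$src_parent/project/src/app.py"
touch "$src_parent/project/.venv/bin/python" \
      "$src_parent/project/src/__pycache__/app.cpython-312.pyc"

# Same shape as the Makefile recipe: archive from the parent directory,
# skip the excluded paths, and unpack on the receiving side.
tar -C "$src_parent" \
    --exclude='project/.venv' \
    --exclude='project/.git' \
    --exclude='*/__pycache__' \
    -cf - project \
    | tar -C "$dest" -xf -

# List what arrived; only src/app.py and its parent directories should remain.
find "$dest" | sort
```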

Verified

All 22 distros pass locally (x86-64). The CI arm64 runner failures were traced to transient DNS errors on Alpine and Arch Linux ARM, which the DNS wait and retry changes above address.

Waiting for a nameserver entry in resolv.conf is not sufficient on some
distros (e.g. Slackware) where the network stack takes longer to become
fully functional. Pinging 1.1.1.1 directly gives a reliable signal that
the network is actually reachable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

codacy-production (Bot) commented Apr 22, 2026

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues


@lengau changed the title from "ci: add archlinux, debian/12, openSUSE, and Rocky Linux to test matrix" to "ci: use ping to wait for network in LXD containers" on Apr 22, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the Makefile to use a ping command to verify network connectivity in LXD containers before proceeding with package installation. Feedback indicates that the ping command may not be available in all minimal images, which could cause the CI to hang, and suggests implementing a fallback mechanism to ensure the loop can terminate.

Comment thread Makefile Outdated
fi
$(LXC) exec $(LXD_CONTAINER) -- sh -c '\
until grep -q ^nameserver /etc/resolv.conf 2>/dev/null; do sleep 1; done; \
until ping -c1 -W2 1.1.1.1 > /dev/null 2>&1; do sleep 1; done; \

high

The ping command is not guaranteed to be present in minimal LXD images (e.g., the official Arch Linux or minimal Debian images often do not include iputils by default). Since this check runs before any packages are installed, if ping is missing, the until loop will fail with exit code 127 and loop indefinitely, causing the CI to hang.

Consider adding a check for the existence of ping or providing a fallback to the previous /etc/resolv.conf logic to ensure the loop can terminate even if ping is unavailable.

		until ping -c1 -W2 1.1.1.1 > /dev/null 2>&1 || { ! command -v ping >/dev/null 2>&1 && grep -q ^nameserver /etc/resolv.conf 2>/dev/null; }; do sleep 1; done; \
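The failure mode described in this comment can be reproduced in any POSIX shell: a missing command yields exit status 127, which an until loop treats like any other failure. The sketch below shows that status and one possible guard, picking the first probe tool that is actually installed; first_available is a hypothetical helper, not something from the Makefile.

```shell
#!/bin/sh
# A command that does not exist exits with status 127 in POSIX shells,
# so "until missing-tool; do sleep 1; done" would never terminate.
status=0
sh -c 'definitely-not-installed-tool' 2>/dev/null || status=$?
echo "missing command exit status: $status"

# Hypothetical helper: print the first tool from the argument list that
# is installed, or return non-zero if none of them is.
first_available() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            return 0
        fi
    done
    return 1
}

# Decide on a probe before entering any wait loop.
probe=$(first_available getent nslookup) || probe=none
echo "DNS probe tool: $probe"
```

Checking availability once, before the loop, keeps the loop condition simple and makes a missing tool an explicit error path.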

lengau and others added 4 commits April 21, 2026 22:06
ping requires ICMP which is blocked in many CI environments and can
hang indefinitely. Using bash's built-in /dev/tcp to probe TCP port 53
on 1.1.1.1 is reliable, requires no external tools, and respects
connection timeouts.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
/dev/tcp is a bash-only feature. The containers run 'sh -c ...' which on
Alpine is busybox ash and on Debian/Ubuntu is dash — neither supports
/dev/tcp, so the until loop spins forever.

/proc/net/route is a Linux kernel file always present, requires no
external tools, and works in any POSIX sh. The awk check succeeds once
a default route with a non-zero gateway is established (fields 2 and 3
in hex: 00000000 = destination 0.0.0.0, non-zero = actual gateway).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
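
The /proc/net/route check described in this commit message can be sketched as a standalone function. This is a reconstruction from the description, not the commit's actual awk program: the function name, the header-skipping NR > 1 guard, and the truncated sample table are assumptions.

```shell
#!/bin/sh
# Succeeds when the routing table (in /proc/net/route format) has a row
# whose destination is 00000000 (0.0.0.0) and whose gateway is non-zero.
# Usage: has_default_gateway [FILE]   (FILE defaults to /proc/net/route)
has_default_gateway() {
    awk 'NR > 1 && $2 == "00000000" && $3 != "00000000" { found = 1 }
         END { exit !found }' "${1:-/proc/net/route}"
}

# Demo against a sample table (columns truncated for brevity).
sample=$(mktemp)
printf 'Iface\tDestination\tGateway\tFlags\n' > "$sample"
printf 'eth0\t0000FEA9\t00000000\t0001\n' >> "$sample"
has_default_gateway "$sample" || echo "no default route yet"

# 0101A8C0 is 192.168.1.1 in the little-endian hex form the kernel prints.
printf 'eth0\t00000000\t0101A8C0\t0003\n' >> "$sample"
has_default_gateway "$sample" && echo "default route present"
rm -f "$sample"
```

Keeping the awk program in a function in a script file sidesteps the quoting layers that an inline Makefile recipe adds.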
The awk approach had two problems:
- $2/$3 were eaten by Make (needed $$2/$$3)
- awk single quotes broke the outer sh -c '...' quoting

'ip route | grep -q "^default"' avoids both issues: no dollar signs,
no nested single quotes. 'ip' is available pre-install on all LXD images
(busybox on Alpine, iproute2 on everything else).

Verified locally with alpine/3.21 (busybox sh): 29 passed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
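
The nested-quote problem named in this commit message is easy to reproduce in plain shell. The sketch below shows only the shell layer, using the standard '\'' idiom to embed a single quote inside a single-quoted sh -c argument; inside a Makefile recipe each $ would additionally need to be written as $$, since Make expands $ itself before the shell sees the line.

```shell
#!/bin/sh
# The outer single-quoted string is closed, a literal quote is added with
# \', and the string is reopened. The inner shell therefore receives:
#   awk '{ print $2 }'
out=$(printf 'a b c\n' | sh -c 'awk '\''{ print $2 }'\''')
echo "$out"   # prints: b
```

The ip route | grep -q "^default" form chosen in the commit needs neither workaround, which is why it survives both Make and the outer quoting unchanged.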
… .venv

lxc file push has no exclude option, so switch to a tar pipe.
Exclude .venv, .git, and __pycache__ to avoid transferring large/
platform-specific files to containers.

Also rm -rf .venv before setup so reused containers don't have a stale
host-built venv with shebangs pointing to host paths, which caused
'uv run pytest' to fail with 'No such file or directory' on Arch Linux
and ubuntu-minimal-daily:26.04.

Verified locally: 21/21 distros pass (plus alpine/3.21 from earlier).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@lengau changed the title from "ci: use ping to wait for network in LXD containers" to "fix(ci): improve LXD container setup reliability" on Apr 22, 2026
@lengau lengau force-pushed the work/ci-more-distros branch from 0ebb7e3 to 584c9c2 Compare April 22, 2026 03:38
ip route confirms routing is up but DNS may still be initialising.
Add a nslookup check (available via busybox on Alpine and widely
elsewhere) after the default-route check to avoid transient DNS
failures during apk/apt-get/dnf/etc.

Fixes arm64 CI failure where alpine/3.23 got a DNS transient error
when fetching dl-cdn.alpinelinux.org immediately after network came up.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@lengau lengau force-pushed the work/ci-more-distros branch from 584c9c2 to ad0e8c3 Compare April 22, 2026 04:05
@lengau lengau merged commit 1156a9b into main Apr 22, 2026
22 checks passed
@lengau lengau deleted the work/ci-more-distros branch April 22, 2026 04:21