
fix: synchronize with udev via BSD flock after losetup to avoid race condition#330

Draft
Copilot wants to merge 6 commits into main from copilot/fix-race-condition-mounting-loop-devices

Conversation

Copilot AI commented Apr 21, 2026

After losetup --find --show --partscan, udev holds an exclusive BSD flock on the whole-disk device while processing it. Mounting a partition immediately after attaching can fail with "special device /dev/loopNpM does not exist" because udev's inotify watch handling hasn't finished. The correct fix, per systemd's BLOCK_DEVICE_LOCKING documentation, is to hold a shared flock on the loop device while performing partition operations.

Changes

  • imagecraft/pack/image.py

    • Image.attach_loopdev() now opens the loop device and acquires fcntl.LOCK_SH after attaching — this blocks until udev releases its exclusive lock, then holds the shared lock for the full duration of the context manager so udev cannot interfere with partition operations.
  • imagecraft/services/image.py

    • ImageService.attach_images() opens each loop device and acquires LOCK_SH after attaching (for both fresh attaches and reused devices). File handles are stored in _loop_fds and closed (releasing the lock) in detach_images(). The fd is closed on flock failure to prevent resource leaks.
  • Tests

    • Removed polling-based tests; tests now mock fcntl.flock and builtins.open
    • Added test_attach_images_flock_released_on_detach to verify the shared lock is released when images are detached
# After losetup --find --show --partscan returns /dev/loop7:
with open("/dev/loop7", "rb") as loop_fd:
    fcntl.flock(loop_fd, fcntl.LOCK_SH)  # blocks until udev is done
    # partition nodes are guaranteed to exist; lock held while we use them
    yield "/dev/loop7"
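The service side keeps the same shared-lock pattern plus some file-handle bookkeeping. A rough sketch of that bookkeeping follows; the class and method names are illustrative stand-ins rather than ImageService's real API, although _loop_fds matches the attribute described above:

import fcntl


class LoopLockTracker:
    """Illustrative stand-in for the fd bookkeeping described above."""

    def __init__(self) -> None:
        self._loop_fds = []  # open handles, each holding a shared flock

    def lock_device(self, device: str) -> None:
        fd = open(device, "rb")  # noqa: SIM115 (held deliberately)
        try:
            # Blocks until udev drops its exclusive lock on the device.
            fcntl.flock(fd, fcntl.LOCK_SH)
        except OSError:
            fd.close()  # don't leak the handle if flock fails
            raise
        self._loop_fds.append(fd)

    def release_all(self) -> None:
        # Closing each fd releases its BSD flock; done before detaching.
        for fd in self._loop_fds:
            fd.close()
        self._loop_fds.clear()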

Copilot AI linked an issue Apr 21, 2026 that may be closed by this pull request
Copilot AI and others added 2 commits April 21, 2026 22:17
… condition

After `losetup --find --show --partscan`, the kernel creates partition device
nodes (e.g. /dev/loop7p1, /dev/loop7p2) asynchronously via udev. If code tries
to mount a partition immediately after attaching, it may fail with
"special device /dev/loopNpM does not exist".

Fix this by adding `_wait_for_loopdev_partitions()` which:
1. Calls `udevadm settle` to flush all pending udev events
2. Polls for each expected partition device node with a configurable timeout
3. Raises ImageError if nodes don't appear within the timeout

The wait is called from:
- `Image.attach_loopdev()` in imagecraft/pack/image.py
- `ImageService.attach_images()` in imagecraft/services/image.py (fresh attaches only)

Adds unit tests for the new helper function and the integration points.
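A rough sketch of what such a helper looks like is below; the exact signature, timeout, and error type are illustrative (imagecraft's helper raises ImageError), and this whole approach is replaced by flock-based synchronization later in the PR:

import subprocess
import time
from pathlib import Path


def _wait_for_loopdev_partitions(loop_device: str, partition_nums: list[int],
                                 timeout: float = 10.0) -> None:
    # 1. Flush pending udev events; settle failures are treated as non-fatal.
    try:
        subprocess.run(["udevadm", "settle"], check=True)
    except (subprocess.CalledProcessError, OSError):
        pass
    # 2. Poll for each expected partition node until the timeout expires.
    deadline = time.monotonic() + timeout
    missing = [f"{loop_device}p{num}" for num in partition_nums]
    while missing and time.monotonic() < deadline:
        missing = [node for node in missing if not Path(node).exists()]
        if missing:
            time.sleep(0.1)
    # 3. Give up if any node never appeared (imagecraft raises ImageError here).
    if missing:
        raise RuntimeError(f"Partition device nodes did not appear: {missing}")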

Agent-Logs-Url: https://github.com/canonical/imagecraft/sessions/dfb57e1c-b428-4e70-b2b1-83a7428353fc

Co-authored-by: lengau <4305943+lengau@users.noreply.github.com>
…rtition number helper

- Replace broad `except Exception` with `except (subprocess.CalledProcessError, OSError)`
  to be precise about what failures are expected from udevadm settle
- Extract duplicated partition-number calculation into `_volume_partition_nums()`
  helper function and import it into services/image.py to avoid code duplication

Agent-Logs-Url: https://github.com/canonical/imagecraft/sessions/dfb57e1c-b428-4e70-b2b1-83a7428353fc

Co-authored-by: lengau <4305943+lengau@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix race condition when mounting loop devices" to "fix: wait for loop device partition nodes after losetup to avoid race condition" Apr 21, 2026
Copilot AI requested a review from lengau April 21, 2026 22:22
@mwhudson

This misattributes the reason for the issue and thus is not the correct fix. It's not a race with udev triggers; it's a race with udev's inotify watch. As explained in https://systemd.io/BLOCK_DEVICE_LOCKING/, the correct fix is to flock the device containing the partition while doing anything (such as mounting) with the partition itself.

Comment thread imagecraft/pack/image.py Outdated
emit.debug(
f"Attached image {self.disk_path} as loop device {self.loop_device}"
)
_wait_for_loopdev_partitions(
Collaborator

@copilot We should wait for the partitions before logging that we've attached the device.

Author

Fixed in f7b72b2 — the wait is now called before emit.debug(...) so the log only fires once all partition nodes are confirmed to exist.

Comment thread imagecraft/pack/image.py Outdated
]


def _wait_for_loopdev_partitions(
Collaborator

@copilot since this is being used in another file it can't be private.

Author

Fixed in f7b72b2 — renamed _wait_for_loopdev_partitions -> wait_for_loopdev_partitions and _volume_partition_nums -> volume_partition_nums, and updated all call sites.

…ug log

- Rename _wait_for_loopdev_partitions -> wait_for_loopdev_partitions
- Rename _volume_partition_nums -> volume_partition_nums
  (both are imported by services/image.py so they can't be private)
- In Image.attach_loopdev(), call wait_for_loopdev_partitions() before
  emit.debug() so the log only fires once the partitions are actually ready
- Update all references in services/image.py and both test files

Agent-Logs-Url: https://github.com/canonical/imagecraft/sessions/065c4f4e-7f04-47a7-9ac7-f5d74ae39dfc

Co-authored-by: lengau <4305943+lengau@users.noreply.github.com>
Copilot AI requested a review from lengau April 22, 2026 05:32
@lengau
Collaborator

lengau commented Apr 22, 2026

@copilot please address mwhudson's feedback too

Copilot AI and others added 2 commits April 22, 2026 05:43
Per systemd BLOCK_DEVICE_LOCKING, the correct fix for the partition-node
race is to hold a shared BSD flock on the loop device while using its
partitions. udev holds an exclusive lock while processing; a shared lock
blocks until udev is done and then keeps it held during our operations.

- imagecraft/pack/image.py:
  - Remove volume_partition_nums() and wait_for_loopdev_partitions()
  - Remove _UDEVADM_BIN, _PARTITION_WAIT_TIMEOUT, time, subprocess imports
  - In Image.attach_loopdev(): open loop device and acquire LOCK_SH for
    the duration of the context manager
- imagecraft/services/image.py:
  - Remove import of removed helpers
  - Add _loop_fds list to track open file handles holding flocks
  - In attach_images(): open device and acquire LOCK_SH after attaching
    (both fresh attach and reuse); store fd in _loop_fds
  - In detach_images(): close all fds (releasing flocks) before detaching
- tests: remove TestWaitForLoopdevPartitions; update all tests to mock
  fcntl.flock and builtins.open; add test for flock release on detach

Agent-Logs-Url: https://github.com/canonical/imagecraft/sessions/1d13f3fb-01b2-4424-b470-cbc2611b70e0

Co-authored-by: lengau <4305943+lengau@users.noreply.github.com>
- Wrap fcntl.flock in try/except in attach_images() to close the fd if
  flock raises, preventing a resource leak
- Move inline `import fcntl` statements to module level in both test files

Agent-Logs-Url: https://github.com/canonical/imagecraft/sessions/1d13f3fb-01b2-4424-b470-cbc2611b70e0

Co-authored-by: lengau <4305943+lengau@users.noreply.github.com>
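The reworked tests described in these commits boil down to patching fcntl.flock and builtins.open instead of touching real devices. A minimal illustration of that mocking pattern follows; the helper under test here is a local stand-in, not imagecraft's real attach code:

import fcntl
from unittest import mock


def _attach_and_lock(device: str):
    # Local stand-in mirroring the open-then-flock pattern under test.
    fd = open(device, "rb")  # noqa: SIM115
    fcntl.flock(fd, fcntl.LOCK_SH)
    return fd


def test_attach_acquires_shared_lock():
    fake_fd = mock.MagicMock()
    with mock.patch("builtins.open", return_value=fake_fd) as mocked_open, \
            mock.patch("fcntl.flock") as mocked_flock:
        result = _attach_and_lock("/dev/loop7")
    mocked_open.assert_called_once_with("/dev/loop7", "rb")
    mocked_flock.assert_called_once_with(fake_fd, fcntl.LOCK_SH)
    assert result is fake_fd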
Author

Copilot AI commented Apr 22, 2026

@copilot please address mwhudson's feedback too

Done in 3814238. Replaced the udevadm settle + polling approach with proper BSD flock synchronization as described in https://systemd.io/BLOCK_DEVICE_LOCKING/:

  • udev holds an exclusive flock on the whole-disk device while processing it; acquiring a shared LOCK_SH blocks until udev is done, then holds it while we use the partitions
  • Image.attach_loopdev(): opens the loop device and holds LOCK_SH for the duration of the context manager
  • ImageService.attach_images(): same pattern for every attached device (fresh and reuse); fds stored in _loop_fds and closed (releasing the lock) in detach_images()
  • Removed wait_for_loopdev_partitions(), volume_partition_nums(), udevadm call, and all polling logic

Warning

Firewall rules blocked me from connecting to one or more addresses.

I tried to connect to the following addresses, but was blocked by firewall rules:

  • systemd.io
    • Triggering command: /home/REDACTED/work/_temp/ghcca-node/node/bin/node /home/REDACTED/work/_temp/ghcca-node/node/bin/node --enable-source-maps /home/REDACTED/work/_temp/copilot-developer-action-main/dist/index.js (dns block)


Copilot AI changed the title from "fix: wait for loop device partition nodes after losetup to avoid race condition" to "fix: synchronize with udev via BSD flock after losetup to avoid race condition" Apr 22, 2026
Comment thread imagecraft/pack/image.py
# exclusive lock while it processes the new device; taking a shared
# lock here blocks until udev is done and then holds it for the
# duration of the context so udev does not interfere with our use.
with open(self.loop_device, "rb") as loop_fd:

@copilot i think this is going to hold the lock for longer than we want and block udev from running on the device at all, which might not be what is wanted. In my experience it has been better to lock the device only while the partition is being mounted.

# exclusive lock while it processes the device; a shared lock here
# blocks until udev is done and then holds it while we use the
# device's partitions, preventing udev from interfering.
loop_fd = open(attached_device, "rb") # noqa: SIM115 (held deliberately)

@copilot again i think this holds the lock for too long. we should only hold the lock while mounting or formatting the partition
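Narrowing the lock scope along the lines of this feedback would look roughly like the sketch below; the function name, mount invocation, and error handling are illustrative, not imagecraft's code:

import fcntl
import subprocess


def mount_partition(whole_disk: str, partition: str, mountpoint: str) -> None:
    # Hold the shared lock on the whole-disk device only for the duration of
    # the mount itself, leaving udev free to process the device at other times.
    with open(whole_disk, "rb") as disk_fd:
        fcntl.flock(disk_fd, fcntl.LOCK_SH)  # waits for udev's exclusive lock
        subprocess.run(["mount", partition, mountpoint], check=True)
    # Leaving the with-block closes disk_fd, which releases the flock.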

Development

Successfully merging this pull request may close these issues.

Race condition when mounting loop devices

3 participants