Skip to content

Fix #858: manage linux platform layer async worker thread lifecycle#862

Open
Nox-MSFT wants to merge 1 commit intodevelopfrom
user/msft/nox-msft/issue-858
Open

Fix #858: manage linux platform layer async worker thread lifecycle#862
Nox-MSFT wants to merge 1 commit intodevelopfrom
user/msft/nox-msft/issue-858

Conversation

@Nox-MSFT
Copy link
Copy Markdown
Collaborator

The Download/Backup/Install/Apply/Restore callbacks in LinuxPlatformLayer spawned a std::thread and called detach() on it. The worker captured raw pointers to workCompletionData and workflowData, which are only guaranteed valid until WorkCompletionCallback fires. Agent shutdown (or workflow teardown) while the worker was still running could therefore invalidate those pointers and cause use-after-free / core dumps (related to #852).

Changes:

  • Add an instance-owned std::thread (_activeWorker) plus a std::mutex (_activeWorkerMutex) to the LinuxPlatformLayer.
  • Introduce a TrackWorker() helper that joins any previously installed worker under the mutex and takes ownership of the new one. Replace each of the five worker.detach() sites with a single call to TrackWorker(std::move(worker)), keeping the workflow-engine guarantee that at most one async operation is in flight while also making the ownership model explicit.
  • Extend ~LinuxPlatformLayer to signal cancellation and join the active worker before ExtensionManager::Uninit(). This guarantees that by the time the platform layer is destroyed, no worker can still be touching workCompletionData / workflowData.

Tests (src/platform_layers/linux_platform_layer/tests/linux_adu_core_impl_ut.cpp):

  • Happy-path async completion for Apply and Download (worker runs on a separate thread and invokes WorkCompletionCallback).
  • Structural coverage for Backup, Install, Restore through the same TrackWorker path.
  • Sequential invocations: five back-to-back ApplyCallback calls all complete without leaking or detaching the prior worker.
  • Destructor waits for a naturally-completing worker (regression for 🐞 Bug: Detached thread in ApplyCallback may access invalidated pointers (lifecycle violation) #858).
  • Destructor blocks until an in-flight worker finishes: the completion callback is pinned via a hold flag, Unregister is invoked from a second thread, and the test asserts Unregister does not return until the hold is released. Pre-fix (with detach) this would have observed Unregister returning early, proving the join is effective.

The Download/Backup/Install/Apply/Restore callbacks in LinuxPlatformLayer
spawned a std::thread and called detach() on it. The worker captured raw
pointers to workCompletionData and workflowData, which are only guaranteed
valid until WorkCompletionCallback fires. Agent shutdown (or workflow
teardown) while the worker was still running could therefore invalidate
those pointers and cause use-after-free / core dumps (related to #852).

Changes:
- Add an instance-owned std::thread (_activeWorker) plus a std::mutex
  (_activeWorkerMutex) to the LinuxPlatformLayer.
- Introduce a TrackWorker() helper that joins any previously installed
  worker under the mutex and takes ownership of the new one. Replace
  each of the five worker.detach() sites with a single call to
  TrackWorker(std::move(worker)), keeping the workflow-engine guarantee
  that at most one async operation is in flight while also making the
  ownership model explicit.
- Extend ~LinuxPlatformLayer to signal cancellation and join the active
  worker before ExtensionManager::Uninit(). This guarantees that by the
  time the platform layer is destroyed, no worker can still be touching
  workCompletionData / workflowData.

Tests (src/platform_layers/linux_platform_layer/tests/linux_adu_core_impl_ut.cpp):
- Happy-path async completion for Apply and Download (worker runs on a
  separate thread and invokes WorkCompletionCallback).
- Structural coverage for Backup, Install, Restore through the same
  TrackWorker path.
- Sequential invocations: five back-to-back ApplyCallback calls all
  complete without leaking or detaching the prior worker.
- Destructor waits for a naturally-completing worker (regression for
  #858).
- Destructor blocks until an in-flight worker finishes: the completion
  callback is pinned via a hold flag, Unregister is invoked from a
  second thread, and the test asserts Unregister does not return until
  the hold is released. Pre-fix (with detach) this would have
  observed Unregister returning early, proving the join is effective.
@Nox-MSFT Nox-MSFT marked this pull request as ready for review April 27, 2026 22:52
@Nox-MSFT Nox-MSFT requested review from chgennar and mirskiy April 27, 2026 22:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant