15 changes: 13 additions & 2 deletions framework/common/vk_initializers.h
@@ -1,4 +1,4 @@
/* Copyright (c) 2019-2022, Sascha Willems
/* Copyright (c) 2019-2024, Sascha Willems
*
* SPDX-License-Identifier: Apache-2.0
*
@@ -546,7 +546,7 @@ inline VkPipelineMultisampleStateCreateInfo pipeline_multisample_state_create_in
}

inline VkPipelineDynamicStateCreateInfo pipeline_dynamic_state_create_info(
const VkDynamicState * dynamic_states,
const VkDynamicState *dynamic_states,
uint32_t dynamicStateCount,
VkPipelineDynamicStateCreateFlags flags = 0)
{
@@ -652,5 +652,16 @@ inline VkSpecializationInfo specialization_info(uint32_t map_entry_count, const
specialization_info.pData = data;
return specialization_info;
}

inline VkTimelineSemaphoreSubmitInfo timeline_semaphore_submit_info(uint32_t wait_value_count, uint64_t *wait_values, uint32_t signal_value_count, uint64_t *signal_values)
{
return VkTimelineSemaphoreSubmitInfo{
VK_STRUCTURE_TYPE_TIMELINE_SEMAPHORE_SUBMIT_INFO,
NULL,
wait_value_count,
wait_values,
signal_value_count,
signal_values};
}
} // namespace initializers
} // namespace vkb
37 changes: 16 additions & 21 deletions samples/extensions/timeline_semaphore/CMakeLists.txt
@@ -1,4 +1,4 @@
# Copyright (c) 2021, Arm Limited and Contributors
# Copyright (c) 2021-2024, Arm Limited and Contributors
#
# SPDX-License-Identifier: Apache-2.0
#
@@ -15,24 +15,19 @@
# limitations under the License.
#

if (NOT WIN32)
# Not enabled on Windows at this time due to bugs.
# Out-of-order submission in presentation causes kernel level issues,
# and need to be figured out before this sample can be enabled on Windows.
get_filename_component(FOLDER_NAME ${CMAKE_CURRENT_LIST_DIR} NAME)
get_filename_component(PARENT_DIR ${CMAKE_CURRENT_LIST_DIR} PATH)
get_filename_component(CATEGORY_NAME ${PARENT_DIR} NAME)
get_filename_component(FOLDER_NAME ${CMAKE_CURRENT_LIST_DIR} NAME)
get_filename_component(PARENT_DIR ${CMAKE_CURRENT_LIST_DIR} PATH)
get_filename_component(CATEGORY_NAME ${PARENT_DIR} NAME)

add_sample_with_tags(
ID ${FOLDER_NAME}
CATEGORY ${CATEGORY_NAME}
AUTHOR "Hans-Kristian Arntzen"
NAME "Timeline semaphore"
DESCRIPTION "Demonstrates use of timeline semaphores to express complex queue dependency graphs"
SHADER_FILES_GLSL
"timeline_semaphore/game_of_life_update.comp"
"timeline_semaphore/game_of_life_mutate.comp"
"timeline_semaphore/game_of_life_init.comp"
"timeline_semaphore/render.vert"
"timeline_semaphore/render.frag")
endif()
add_sample_with_tags(
ID ${FOLDER_NAME}
CATEGORY ${CATEGORY_NAME}
AUTHOR "Hans-Kristian Arntzen"
NAME "Timeline semaphore"
DESCRIPTION "Demonstrates use of timeline semaphores to express complex queue dependency graphs"
SHADER_FILES_GLSL
"timeline_semaphore/game_of_life_update.comp"
"timeline_semaphore/game_of_life_mutate.comp"
"timeline_semaphore/game_of_life_init.comp"
"timeline_semaphore/render.vert"
"timeline_semaphore/render.frag")
160 changes: 78 additions & 82 deletions samples/extensions/timeline_semaphore/README.adoc
@@ -1,5 +1,5 @@
////
- Copyright (c) 2021-2023, Arm Limited and Contributors
- Copyright (c) 2021-2024, Arm Limited and Contributors
-
- SPDX-License-Identifier: Apache-2.0
-
@@ -189,69 +189,58 @@ This sample could trivially be done with binary semaphores of course, so in this

=== Async worker thread - out-of-order submission

The key aspect we use to demonstrate out of order submission is a dedicated worker thread which does all work related to simulation on the async compute queue.
It never synchronizes with the main thread except at teardown, so the only way it synchronizes is through timeline semaphores.
Submission order is completely out-of-order in this case and forward progress in the async queue is generally blocked by the main thread submitting more work.
The key aspect we use to demonstrate out-of-order submission is a pair of dedicated worker threads: one performs all work related to simulation on the async compute queue, and the other performs drawing on the graphics queue.
They never synchronize with the main thread except at teardown, so the only way to synchronize them is through timeline semaphores.
To avoid issues when running the sample on Windows platforms (particularly when resizing the window), forward progress in the queues is throttled by the main thread (i.e. the timeline is only allowed to advance while a render call is active).
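
The throttling pattern described above can be modelled on the CPU without any Vulkan calls. The following is a minimal sketch, not the sample's code: `CpuTimeline` is a hypothetical stand-in for a timeline semaphore (a monotonically increasing 64-bit counter with host-side wait and signal), and `run_worker_handshake` and the value scheme `2N+1` / `2N+2` are illustrative assumptions. It shows a worker thread that communicates with the main thread only through the counter.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical CPU-side stand-in for a timeline semaphore:
// a monotonically increasing counter that threads can wait on.
class CpuTimeline
{
  public:
	// Block until the counter reaches at least `value`.
	void wait(uint64_t value)
	{
		std::unique_lock<std::mutex> lock(mutex_);
		cond_.wait(lock, [&] { return counter_ >= value; });
	}

	// Raise the counter to `value`; a timeline never moves backwards.
	void signal(uint64_t value)
	{
		{
			std::lock_guard<std::mutex> lock(mutex_);
			if (value > counter_)
			{
				counter_ = value;
			}
		}
		cond_.notify_all();
	}

	// Read the current counter value.
	uint64_t value()
	{
		std::lock_guard<std::mutex> lock(mutex_);
		return counter_;
	}

  private:
	std::mutex              mutex_;
	std::condition_variable cond_;
	uint64_t                counter_ = 0;
};

// The main thread throttles a worker: the worker may only process frame N
// after the main thread has signalled 2N+1, and it reports completion by
// signalling 2N+2. Returns the frames in the order the worker processed them.
std::vector<uint64_t> run_worker_handshake(uint64_t frame_count)
{
	CpuTimeline           timeline;
	std::vector<uint64_t> processed;

	std::thread worker([&] {
		for (uint64_t frame = 0; frame < frame_count; ++frame)
		{
			timeline.wait(2 * frame + 1);          // wait for the main thread
			processed.push_back(frame);            // stand-in for queue work
			timeline.signal(2 * frame + 2);        // report completion
		}
	});

	for (uint64_t frame = 0; frame < frame_count; ++frame)
	{
		timeline.signal(2 * frame + 1);        // allow the worker to advance
		timeline.wait(2 * frame + 2);          // wait until the frame is done
	}

	worker.join();
	assert(timeline.value() == 2 * frame_count);
	return processed;
}
```

Calling `run_worker_handshake(3)` processes frames 0, 1 and 2 in order. The worker never touches a lock or flag owned by the main thread; all ordering comes from the counter values, which is the property the sample relies on for its GPU queues.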


=== Data flow

To simulate "Game of Life", we allocate two images of 64x64 RGBA8.
First, one image is initialized with initial state, and from here there is a ping-pong where image N is updated, while reading from image 1 - N.

After updating image N, the main thread will sample from image N.
Before async compute updates the same image index N again, it must wait for graphics queue to complete.
With the double buffer in play, the async queue can run ahead for a little while and it will be mostly stalled by graphics queue.

The sequential flow of the rendering is something like, assuming two timeline semaphores A and G:

* Async compute write image 1.
* Async compute signal A = 1.
* Graphics wait A = 1.
* Graphics read image 1.
* Graphics signal G = 1.
* Async compute wait A = 1.
(Could use pipeline barrier of course, but hey!)
* Async compute write image 0.
* Async compute signal A = 2.
* Graphics wait A = 2.
* Graphics read image 0.
* Graphics signal G = 2.
* Async compute wait G = 1.
(Resolve write-after-read hazard)
* Async compute wait A = 2.
(Could use pipeline barrier of course, but hey!)
* Async compute wait host A = 1.
(Wait for command buffer to retire so we can re-record it!)
* Async compute write image 1.
* Async compute signal A = 3.
* Graphics wait A = 3.
* Graphics read image 1.
* Graphics signal G = 3.

The sequential flow of the rendering is something like:

* Compute: wait for "submit"
* Graphics: wait for "submit"
* Main: acquire the swapchain image
* Main: signal "submit"
* Main: wait for "present"
* Compute: wait for "image_acquired" (binary semaphore)
* Graphics: wait for "draw"
* Compute: write image
* Compute: signal "draw"
* Compute: wait for "end of frame"
* Graphics: read image
* Graphics: signal "present"
* Graphics: wait for "end of frame"
* Main: present swapchain
* Main: signal "end of frame"
* Compute: wait for "submit"
* Graphics: wait for "submit"

And so on ...
With out-of-order signalling, we can end up observing this order of submissions instead.

* Async compute write image 1.
* Async compute signal A = 1.
* Async compute wait A = 1.
* Async compute write image 0.
* Async compute signal A = 2.
* Async compute wait G = 1.
(Out of order submission, queue progress is stalled, but we can keep recording)
* Async compute wait A = 2.
* Async compute wait host A = 1.
* Async compute write image 1.
* Async compute signal A = 3.
* Graphics wait A = 1.
* Graphics read image 1.
* Graphics signal G = 1.
(Unblocks queue forward progress)
* Graphics wait A = 2.
* Graphics read image 0.
* Graphics signal G = 2.
* Graphics wait A = 3.
* Graphics read image 1.
* Graphics signal G = 3.
* Compute: wait for "submit"
* Graphics: wait for "submit"
* Main: acquire the swapchain image
* Main: signal "submit"
* Graphics: wait for "draw"
* Compute: wait for "image_acquired" (binary semaphore)
* Compute: write image
* Compute: signal "draw"
* Graphics: read image
* Graphics: signal "present"
* Main: wait for "present"
* Main: present swapchain
* Compute: wait for "end of frame"
* Main: signal "end of frame"
* Graphics: wait for "end of frame"
* Compute: wait for "submit"
* Graphics: wait for "submit"
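
One way to express the named stages in the flow above on a single timeline is to give every (frame, stage) pair its own counter value. The sketch below is an assumption made for illustration: the stage names come from the list, but the `Stage` enum layout, `stage_value` and `is_unblocked` are hypothetical and not taken from the sample (the source only shows that a `Timeline::MAX_STAGES` constant exists).

```cpp
#include <cassert>
#include <cstdint>

// Stage labels taken from the flow listed above. The enum layout is an
// assumption for illustration, not the sample's definition.
enum class Stage : uint64_t
{
	Submit     = 0,
	Draw       = 1,
	Present    = 2,
	EndOfFrame = 3,
	MaxStages  = 4
};

// Map (frame, stage) to a unique, strictly increasing timeline value.
// Values start at 1 so that a fresh counter at 0 satisfies no wait.
constexpr uint64_t stage_value(uint64_t frame, Stage stage)
{
	return frame * static_cast<uint64_t>(Stage::MaxStages) +
	       static_cast<uint64_t>(stage) + 1;
}

// A wait on (frame, stage) is satisfied once the counter has reached
// that stage's value.
constexpr bool is_unblocked(uint64_t counter, uint64_t frame, Stage stage)
{
	return counter >= stage_value(frame, stage);
}

// The last stage of one frame is ordered before the first stage of the next.
static_assert(stage_value(0, Stage::EndOfFrame) < stage_value(1, Stage::Submit),
              "stages must be ordered across frames");
```

With this mapping, "Compute: wait for submit" in frame 1 becomes a wait for value 5, and it cannot be satisfied before "Main: signal end of frame" in frame 0 has raised the counter to 4. The order in which the waits and signals are submitted does not matter, only the values do.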

When submitting out of order, it is important not to submit work far ahead of where the GPU actually is, since latency then becomes extremely large.
The natural place to keep the submission explosion under control is where we wait for the timeline on the host, since we need to re-record command buffers anyway.
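
As a rough illustration of bounding the run-ahead with a host wait, the helper below computes which timeline value the host would wait for before recording a given frame. It assumes, purely for the sake of the sketch, that completing frame N signals value N and that frames are numbered from 1; `host_wait_value` and `max_frames_in_flight` are hypothetical names, not part of the sample.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: completing frame N signals timeline value N (frames start at 1).
// Returns the value the host should wait for before recording `frame`, so
// that at most `max_frames_in_flight` frames are submitted but unfinished.
// A result of 0 means no wait is needed, since a timeline starts at 0.
constexpr uint64_t host_wait_value(uint64_t frame, uint64_t max_frames_in_flight)
{
	return frame > max_frames_in_flight ? frame - max_frames_in_flight : 0;
}

static_assert(host_wait_value(1, 2) == 0, "first frames need no wait");
static_assert(host_wait_value(3, 2) == 1, "frame 3 waits for frame 1");
```

With two frames in flight, recording frame 3 waits for frame 1 to retire. That is also the point at which frame 1's command buffer is safe to re-record.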
Expand All @@ -269,37 +258,44 @@ Instead, just wait for timeline semaphores on host to "drain" the GPU, or if you
Similar to `vkDeviceWaitIdle`, when tearing down the application, an out-of-order submission might be waiting on work which never comes, and that queue becomes deadlocked.
To alleviate this, we can make use of host signalling of timeline semaphores to unblock everything in one fell swoop.

From `TimelineSemaphore::finish()`:
From `TimelineSemaphore::finish_timeline_workers()`:

[,cpp]
----
graphics_worker.alive = false;
compute_worker.alive = false;

signal_timeline(Timeline::MAX_STAGES);

if (graphics_worker.thread.joinable())
{
graphics_worker.thread.join();
}

if (compute_worker.thread.joinable())
{
compute_worker.thread.join();
}
----

From `TimelineSemaphore::finish_timeline_workers()`:

[,cpp]
----
// Draining queues which submit out-of-order can be quite tricky, since QueueWaitIdle can deadlock for threads which want to run ahead.
// If we call Submit waiting for a semaphore which is yet to be signalled,
// QueueWaitIdle will not finish until a signal in another thread happens.
// Here's an approach we can use to safely tear down the queue.

// Drain the main thread timeline.
// The async queue might be stalled waiting on the main queue to finish rendering a future frame which it never completes,
// but we might never hit that count, since we're tearing down the application now.
wait_timeline_cpu(main_thread_timeline);

// Now we're guaranteed that the graphics timeline is at N and the async compute queue is blocked at N + num_frames + 1, waiting for N + 1 to finish.
// Since we're not reading any more in graphics queue, we can jump bump the timeline on CPU towards infinity.
// On the next loop iteration, we will exit the rendering loop and QueueWaitIdle will not be blocked on async thread anymore.
// Just bump the timeline by INT32_MAX which is min-spec for maxTimelineSemaphoreValueDifference.
// This is a useful way to mark a timeline semaphore as "permanently" signalled.
main_thread_timeline.timeline += std::numeric_limits<int32_t>::max();

// Order matters here, this works kinda like a condition variable.
// If the timeline update is observed, we should see that the worker is not alive anymore.
async_compute_worker.alive = false;
signal_timeline_cpu(main_thread_timeline, main_thread_timeline_lock);

// This will now complete in finite time.
if (async_compute_worker.thread.joinable())
{
async_compute_worker.thread.join();
}
graphics_worker.alive = false;
compute_worker.alive = false;

signal_timeline(Timeline::MAX_STAGES);

if (graphics_worker.thread.joinable())
{
graphics_worker.thread.join();
}

if (compute_worker.thread.joinable())
{
compute_worker.thread.join();
}
----
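
The ordering in the teardown code above (clear `alive` first, then signal the timeline) can be reproduced in a CPU-only model. This is a hedged sketch rather than the sample's implementation: `CpuTimeline` is a hypothetical stand-in for a timeline semaphore, `run_and_tear_down` is an illustrative name, and the second `done` timeline exists only to make the example deterministic.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <limits>
#include <mutex>
#include <thread>

// Hypothetical CPU-side stand-in for a timeline semaphore.
class CpuTimeline
{
  public:
	// Block until the counter reaches at least `value`.
	void wait(uint64_t value)
	{
		std::unique_lock<std::mutex> lock(mutex_);
		cond_.wait(lock, [&] { return counter_ >= value; });
	}

	// Raise the counter to `value`; it never moves backwards.
	void signal(uint64_t value)
	{
		{
			std::lock_guard<std::mutex> lock(mutex_);
			if (value > counter_)
			{
				counter_ = value;
			}
		}
		cond_.notify_all();
	}

  private:
	std::mutex              mutex_;
	std::condition_variable cond_;
	uint64_t                counter_ = 0;
};

// Runs a worker for `frames_before_shutdown` frames, then tears it down while
// it is blocked waiting for a frame that will never be submitted.
// Returns the number of frames the worker completed.
uint64_t run_and_tear_down(uint64_t frames_before_shutdown)
{
	CpuTimeline       go;          // main -> worker: "frame N may run"
	CpuTimeline       done;        // worker -> main: "frame N finished"
	std::atomic<bool> alive{true};
	uint64_t          completed = 0;

	std::thread worker([&] {
		for (uint64_t frame = 1;; ++frame)
		{
			go.wait(frame);
			if (!alive.load())
			{
				break;        // woken by the teardown signal, not by real work
			}
			++completed;
			done.signal(frame);
		}
	});

	for (uint64_t frame = 1; frame <= frames_before_shutdown; ++frame)
	{
		go.signal(frame);
		done.wait(frame);
	}

	// Order matters: clear the flag first, then unblock every pending wait.
	alive.store(false);
	go.signal(std::numeric_limits<uint64_t>::max());

	worker.join();        // completes in finite time
	return completed;
}
```

If the flag were cleared after the signal, the worker could wake up, still observe `alive == true`, and treat the teardown signal as permission to run another frame. The model jumps straight to the largest 64-bit value; on a real device the size of such a jump is limited by `maxTimelineSemaphoreValueDifference`.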

=== Out-of-order submission fallbacks for single queue implementations