
docs: expand and reorganize tutorial #5

Open
psychocoderHPC wants to merge 13 commits into dev from docs/tutorial-refresh

Conversation

Owner

@psychocoderHPC psychocoderHPC commented Apr 2, 2026

Summary by CodeRabbit

  • New Features

    • Added comprehensive tutorial library with 20+ example snippets covering kernels, memory operations, algorithms, random numbers, atomics, shared memory, and warp-level operations.
    • Added backend-parameterized test infrastructure for cross-backend compatibility validation.
  • Documentation

    • Added 25+ new tutorial pages covering foundations, kernel fundamentals, hierarchical execution, algorithms, memory operations, performance tuning, and vendor interoperability.
    • Enhanced existing documentation with collapsible source file references and improved navigation structure.
  • Chores

    • Removed debug logging from build configuration.
    • Minor whitespace cleanup in documentation.

@psychocoderHPC psychocoderHPC force-pushed the docs/tutorial-refresh branch from e59ee25 to 7db2242 on April 2, 2026 15:43

coderabbitai bot commented Apr 8, 2026

📝 Walkthrough


This pull request introduces a comprehensive tutorial overhaul for alpaka, adding 19 new C++ example/test snippets covering kernels, shared memory, algorithms, random numbers, and numerics, along with 30+ new Sphinx documentation pages that establish a structured learning path. Existing snippets are converted to backend-parameterized tests via TEMPLATE_LIST_TEST_CASE, and documentation capitalization is standardized to lowercase alpaka.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| Documentation Infrastructure | docs/source/conf.py, docs/source/index.rst | Updated Sphinx theme configuration (collapse navigation, depth) and reorganized tutorial toctree with new topic structure and adjusted maxdepth. |
| Core Tutorial Pages | docs/source/tutorial/foundations.rst, docs/source/tutorial/intro.rst, docs/source/tutorial/mentalModel.rst, docs/source/tutorial/execution.rst | Introduced new foundational pages defining mental model (IdxRange, FrameSpec, makeIdxMap), tutorial structure, execution configuration, and device enumeration concepts. |
| Kernel Tutorial Pages | docs/source/tutorial/kernels.rst, docs/source/tutorial/kernel.rst, docs/source/tutorial/hierarchy.rst, docs/source/tutorial/multidim.rst, docs/source/tutorial/sharedMemory.rst, docs/source/tutorial/memFence.rst, docs/source/tutorial/chunked.rst, docs/source/tutorial/atomics.rst, docs/source/tutorial/miniProject.rst | Added comprehensive kernel-focused tutorials covering basic kernels, execution hierarchy (blocks/threads/warps), multidimensional kernels, shared memory, memory fences, chunked frames, atomics, and a mini project combining multiple concepts. |
| Numerics and Algorithms Tutorials | docs/source/tutorial/numerics.rst, docs/source/tutorial/algorithms.rst, docs/source/tutorial/random.rst, docs/source/tutorial/math.rst, docs/source/tutorial/intrinsics.rst, docs/source/tutorial/warp.rst | Added tutorial pages documenting host-side algorithms, random number generation, math functions, bit intrinsics, and warp-level communication. |
| Data and Memory Tutorials | docs/source/tutorial/memory.rst, docs/source/tutorial/memoryOperations.rst, docs/source/tutorial/views.rst, docs/source/tutorial/device.rst, docs/source/tutorial/queue.rst, docs/source/tutorial/vector.rst, docs/source/tutorial/events.rst | Updated and added pages covering memory allocation/operations, views/subviews, device selection, queues, vector types, and event-based synchronization. |
| Migration and Advanced Tutorials | docs/source/tutorial/migration.rst, docs/source/tutorial/migrationMap.rst, docs/source/tutorial/portingKernel.rst, docs/source/tutorial/backendDifferences.rst, docs/source/tutorial/vendorInterop.rst, docs/source/tutorial/tuning.rst | Added migration guidance for CUDA/HIP/SYCL users, backend-specific considerations, kernel porting example, vendor interop patterns, and performance tuning strategy. |
| Existing Documentation Updates | docs/source/advanced/cmake.rst, docs/source/advanced/datastorage.rst, docs/source/basic/terms.rst, docs/source/basic/cheatsheet.rst, docs/source/basic/example.rst, docs/source/basic/install.rst, docs/source/basic/library.rst, docs/source/contribution/*, docs/source/dev/logging.rst | Standardized capitalization to lowercase alpaka, added "Complete Source File" collapsible sections to example pages, and cleaned up whitespace. |
| Example/Snippet Infrastructure | docs/snippets/example/include/docsTest.hpp, docs/snippets/example/CMakeLists.txt | Added new test header defining docs::test::TestBackends type alias for backend-parameterized tests; removed cmake status logging. |
| New Kernel Example Snippets | docs/snippets/example/02_execution.cpp, docs/snippets/example/08_events.cpp, docs/snippets/example/12_kernelIntro.cpp, docs/snippets/example/13_hierarchy.cpp, docs/snippets/example/16_sharedMemory.cpp, docs/snippets/example/18_multidimKernel.cpp, docs/snippets/example/22_atomics.cpp, docs/snippets/example/24_math.cpp, docs/snippets/example/26_warp.cpp, docs/snippets/example/28_chunkedFrames.cpp, docs/snippets/example/30_random.cpp, docs/snippets/example/31_monteCarloPi.cpp, docs/snippets/example/32_intrinsics.cpp, docs/snippets/example/34_memFence.cpp, docs/snippets/example/36_portingKernel.cpp, docs/snippets/example/38_vendorInterop.cpp, docs/snippets/example/40_imagePipeline.cpp | Added 17 new Catch2 templated test/example snippets covering device enumeration, events, basic kernels, hierarchy, shared memory, multidimensional kernels, atomics, math functions, warp shuffles, chunked frames, random numbers, Monte Carlo, bit intrinsics, memory fences, SAXPY porting, vendor interop, and image pipelines. |
| Updated Example Snippets | docs/snippets/example/05_device.cpp, docs/snippets/example/06_queue.cpp, docs/snippets/example/10_memory.cpp, docs/snippets/example/15_kernel.cpp, docs/snippets/example/20_simdKernel.cpp, docs/snippets/example/11_views.cpp | Converted test cases from single-backend TEST_CASE to TEMPLATE_LIST_TEST_CASE over docs::test::TestBackends; strengthened kernel operator signatures to require onAcc::concepts::Acc auto const&; switched device/executor selection from hardcoded values to backend-driven configuration; added early availability checks. |
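
The conversion to backend-parameterized tests can be pictured with a small standard-library sketch. It does not use Catch2 or alpaka: `SerialBackend`, `FakeGpuBackend`, `runCase`, and `runForAll` are hypothetical stand-ins, and the local `TestBackends` alias only imitates the role of docs::test::TestBackends. The idea shown is one test body instantiated once per backend type, with an early availability check that skips rather than fails.

```cpp
#include <string>
#include <tuple>
#include <vector>

// Hypothetical stand-ins for backend tags; these are NOT alpaka types.
struct SerialBackend
{
    static constexpr char const* name = "serial";
    static bool available() { return true; }
};

struct FakeGpuBackend
{
    static constexpr char const* name = "fakeGpu";
    static bool available() { return false; } // pretend no device is present
};

// Imitates the role of docs::test::TestBackends: a single list of backend types.
using TestBackends = std::tuple<SerialBackend, FakeGpuBackend>;

// The test body, written once and instantiated per backend.
template<typename TBackend>
void runCase(std::vector<std::string>& executed)
{
    if(!TBackend::available())
        return; // early availability check: skip instead of failing
    executed.emplace_back(TBackend::name);
}

// Instantiates the test body for every type in the list (C++17 fold expression).
template<typename... TBackends>
void runForAll(std::tuple<TBackends...> const&, std::vector<std::string>& executed)
{
    (runCase<TBackends>(executed), ...);
}

// Runs all cases and reports which backends actually executed.
std::vector<std::string> runAllCases()
{
    std::vector<std::string> executed;
    runForAll(TestBackends{}, executed);
    return executed;
}
```

In a Catch2 suite the type list would be passed to TEMPLATE_LIST_TEST_CASE instead of a hand-written fold expression; the sketch only illustrates why a hardcoded device choice has to become a template parameter.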

Sequence Diagram(s)

sequenceDiagram
    participant Host
    participant Device as onHost::Device
    participant Queue as onHost::Queue
    participant Kernel as Kernel Functor

    Host->>Host: 1. Enumerate backends via<br/>onHost::allBackends(...)
    Host->>Host: 2. Select available backend<br/>from TestBackends
    Host->>Device: 3. Create device via<br/>selector.makeDevice(0)
    Host->>Queue: 4. Create queue for device
    Host->>Queue: 5. Allocate device buffers<br/>via onHost::malloc
    Host->>Queue: 6. Copy host data to device<br/>via onHost::memcpy
    Host->>Queue: 7. Enqueue kernel execution<br/>with FrameSpec
    Queue->>Kernel: 8. Launch kernel on device<br/>with executor & accelerator
    Kernel->>Kernel: 9. Each worker iterates<br/>via makeIdxMap over range
    Kernel->>Device: 10. Perform work<br/>(atomics, shared mem, etc.)
    Queue->>Host: 11. Return after enqueue<br/>(for non-blocking queue)
    Host->>Queue: 12. onHost::wait(queue)<br/>to synchronize completion
    Host->>Queue: 13. Copy result back<br/>via onHost::memcpy
    Host->>Host: 14. Validate results<br/>via CHECK assertions

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 Whiskers twitching with tutorial delight,
New kernels hop through documentation bright,
From device selection to warp-shuffle ways,
The alpaka burrow's expanded its maze!
Backend-neutral and structured with care,
Let learners discover portability everywhere! 🎓✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: expanding and reorganizing the tutorial documentation across numerous new and updated files. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch docs/tutorial-refresh

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/snippets/example/06_queue.cpp (1)

45-45: ⚠️ Potential issue | 🟡 Minor

Fix typo in tutorial comment (Line 45).

“untile” should be “until”.

✏️ Proposed fix
-    // no wait required, enqueue will wait untile the task is finished
+    // no wait required, enqueue will wait until the task is finished
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/snippets/example/06_queue.cpp` at line 45, Replace the typo in the
inline comment inside docs/snippets/example/06_queue.cpp where the comment reads
"// no wait required, enqueue will wait untile the task is finished" by changing
"untile" to "until" so it reads "// no wait required, enqueue will wait until
the task is finished"; update the exact comment text in the file (search for the
"enqueue will wait untile" substring) to apply this simple spelling fix.
🧹 Nitpick comments (4)
docs/source/tutorial/memory.rst (1)

8-12: Consider adding an explicit cross-link to the memory-operations page.

Since this chapter now focuses on allocation concepts, a short :ref: link to the dedicated memory operations section would improve navigation for readers looking for copy/fill/memset details.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/source/tutorial/memory.rst` around lines 8 - 12, Add an explicit RTD
cross-reference to the memory operations page by inserting a short
:ref:`memory-operations` link (rendered as "memory operations") into the
paragraph that introduces allocation concepts — e.g., append or parenthetically
add "see :ref:`memory-operations` for copy/fill/memset details" after the
sentence about allocation concepts so readers can jump directly to the
copy/fill/memset reference.
docs/snippets/example/14_algorithms.cpp (1)

13-13: Unused include <bit>.

The <bit> header doesn't appear to be used anywhere in this file. Consider removing it to keep the includes clean.

Proposed fix
 #include <array>
-#include <bit>
 #include <functional>
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/snippets/example/14_algorithms.cpp` at line 13, Remove the unused
include directive by deleting the line that contains `#include` <bit> from the
file (it is not referenced anywhere in this snippet), then rebuild/run tests to
ensure no compile errors; focus on the include removal and keeping other
includes intact.
docs/snippets/example/16_sharedMemory.cpp (1)

148-151: Don't size the dyn-shared cache from thread count unless that's the contract.

This currently assumes m_spec.getNumThreads().x() matches the number of cached elements. The kernel indexes cache[idx.x()] over range::frameExtent, so changing the launch to multiple frame elements per thread would under-allocate the shared buffer. Either derive the byte count from the cached frame extent or call out the 1:1 assumption explicitly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/snippets/example/16_sharedMemory.cpp` around lines 148 - 151, The
dyn-shared size calculation in operator()(auto const executor, auto const& out,
auto const& in, int factor) currently uses m_spec.getNumThreads().x() which
assumes a 1:1 mapping between threads and cached elements; change this to derive
the byte count from the cached frame extent (the number of elements indexed by
cache[idx.x()] over range::frameExtent) or, if the 1:1 mapping is intended, add
an explicit assertion/comment documenting the contract; update the size
expression to use the cached frame extent (or add an assert that
m_spec.getNumThreads().x() == cached_frame_extent) so the shared buffer is
correctly allocated when multiple frame elements per thread are launched.
docs/source/tutorial/backendDifferences.rst (1)

48-49: Consider adding an explicit cross-reference link.

The text mentions "The dedicated vendor-interop chapter" but doesn't include a :doc: link. For consistency with other tutorial pages and reader convenience, consider:

-The dedicated vendor-interop chapter shows the pattern.
+The dedicated :doc:`vendorInterop` chapter shows the pattern.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/source/tutorial/backendDifferences.rst` around lines 48 - 49, Add an
explicit Sphinx cross-reference to the vendor-interop chapter where the text
currently says "The dedicated vendor-interop chapter shows the pattern"; replace
or augment that phrase with a :doc: role linking to the vendor-interop page
(e.g. :doc:`vendorInterop`, matching docs/source/tutorial/vendorInterop.rst) so
readers can click through directly from backendDifferences.rst.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/snippets/example/28_chunkedFrames.cpp`:
- Around line 79-81: The code currently computes numFrames by integer-dividing
hostOut.size() by frameExtent (via Vec{...} / frameExtent), which silently
truncates any remainder; before constructing numFrames and onHost::FrameSpec do
an explicit check that hostOut.size() is evenly divisible by the total number of
elements per frame (compute the frame element count from frameExtent), and if
not, fail fast (throw, assert, or log+exit) so trailing elements aren't silently
dropped; refer to frameExtent, hostOut, numFrames and onHost::FrameSpec when
adding the divisibility check and error path.
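
The divisibility check described above can be sketched without alpaka. `exactFrameCount` is a hypothetical helper, and throwing is only one of the fail-fast options the comment lists; the real snippet works with `Vec` extents rather than plain integers.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical helper, not alpaka API: number of whole frames needed to cover
// `numElements` when each frame holds `elementsPerFrame` elements.
// Throws instead of silently dropping a trailing partial frame.
std::size_t exactFrameCount(std::size_t numElements, std::size_t elementsPerFrame)
{
    if(elementsPerFrame == 0u)
        throw std::invalid_argument("frame extent must be non-zero");
    if(numElements % elementsPerFrame != 0u)
        throw std::invalid_argument("element count is not a multiple of the frame extent");
    return numElements / elementsPerFrame;
}
```

With 70 elements and a frame of 16, plain integer division would yield 4 frames and leave 6 elements unprocessed; the helper rejects that input instead.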

In `@docs/snippets/example/34_memFence.cpp`:
- Around line 79-81: The unbounded consumer spin using onAcc::atomicCas on
readyFlag can hang the test; replace the infinite while loop that spins on
onAcc::atomicCas(acc, &readyFlag[0u], 0u, 0u) with a bounded polling loop that
tries a fixed number of iterations (or polls with a timeout), and if the loop
exhausts without success increment mismatchCounter to report the timeout and
break out so the test fails instead of hanging; ensure you reference and update
the same readyFlag index and use mismatchCounter to record the failure.
- Around line 32-43: The second barrier (onAcc::syncBlockThreads(acc)) makes the
release/acquire memFence pair redundant; to demonstrate fence semantics replace
the barrier-based synchronization with fence-only ordering by removing the
second onAcc::syncBlockThreads(acc) call (the one immediately before reading
observedB) and keep the release fence in the tid==0 block and the acquire fence
before reading observedA; this preserves shared, tid, onAcc::memFence and the
fence semantics you intend to show.

In `@docs/snippets/example/38_vendorInterop.cpp`:
- Around line 75-79: The host-path currently uses std::transform over
input.getExtents().x(), which truncates multidimensional
alpaka::concepts::IMdSpan inputs; either restrict this overload to 1D spans or
compute the full element count by multiplying all dimensions from
input.getExtents() and use that count (and appropriate pointer arithmetic) in
std::transform; locate the host dispatch that uses input,
input.getExtents().x(), outPtr and the lambda and change it to validate/require
a 1D IMdSpan or replace input.getExtents().x() with the product of all extents
(or call alpaka::onHost::transform like the fallback) so all elements are
transformed.

In `@docs/source/tutorial/atomics.rst`:
- Around line 74-78: The docs incorrectly state the default scope as
onAcc::scope::Device (capital D); update the text to use the correct lowercase
constant onAcc::scope::device for consistency with the API and the other
examples (see onAcc::atomicAdd and onAcc::scope::block/onAcc::scope::device).

In `@docs/source/tutorial/memFence.rst`:
- Around line 45-48: Update the prose in the producer/consumer description and
the summary to explicitly state that the ready flag operations must be atomic:
mention that the producer must perform an atomic store/update (e.g., atomicExch)
to set the ready flag and the consumer must read/update it atomically (e.g.,
atomicCas or atomic load), and ensure the text near the examples that reference
atomicExch and atomicCas explicitly uses the word "atomic" so readers don’t miss
the requirement.

In `@docs/source/tutorial/random.rst`:
- Around line 31-33: Update the documentation snippet to use the lowercase
instance name used throughout the codebase: replace the type reference
`rand::interval::CO` with the constant instance `rand::interval::co` so the
examples (which use `rand::engine::Philox4x32x10` and
`rand::distribution::UniformReal<float>`) compile when copy-pasted; ensure any
other occurrences in the same file also use `rand::interval::co` instead of
`rand::interval::CO`.

---

Outside diff comments:
In `@docs/snippets/example/06_queue.cpp`:
- Line 45: Replace the typo in the inline comment inside
docs/snippets/example/06_queue.cpp where the comment reads "// no wait required,
enqueue will wait untile the task is finished" by changing "untile" to "until"
so it reads "// no wait required, enqueue will wait until the task is finished";
update the exact comment text in the file (search for the "enqueue will wait
untile" substring) to apply this simple spelling fix.

---

Nitpick comments:
In `@docs/snippets/example/14_algorithms.cpp`:
- Line 13: Remove the unused include directive by deleting the line that
contains `#include` <bit> from the file (it is not referenced anywhere in this
snippet), then rebuild/run tests to ensure no compile errors; focus on the
include removal and keeping other includes intact.

In `@docs/snippets/example/16_sharedMemory.cpp`:
- Around line 148-151: The dyn-shared size calculation in operator()(auto const
executor, auto const& out, auto const& in, int factor) currently uses
m_spec.getNumThreads().x() which assumes a 1:1 mapping between threads and
cached elements; change this to derive the byte count from the cached frame
extent (the number of elements indexed by cache[idx.x()] over
range::frameExtent) or, if the 1:1 mapping is intended, add an explicit
assertion/comment documenting the contract; update the size expression to use
the cached frame extent (or add an assert that m_spec.getNumThreads().x() ==
cached_frame_extent) so the shared buffer is correctly allocated when multiple
frame elements per thread are launched.

In `@docs/source/tutorial/backendDifferences.rst`:
- Around line 48-49: Add an explicit Sphinx cross-reference to the
vendor-interop chapter where the text currently says "The dedicated
vendor-interop chapter shows the pattern"; replace or augment that phrase with a
:doc: role linking to the vendor-interop page (e.g. :doc:`vendorInterop`,
matching docs/source/tutorial/vendorInterop.rst) so readers can click through directly from
backendDifferences.rst.

In `@docs/source/tutorial/memory.rst`:
- Around line 8-12: Add an explicit RTD cross-reference to the memory operations
page by inserting a short :ref:`memory-operations` link (rendered as "memory
operations") into the paragraph that introduces allocation concepts — e.g.,
append or parenthetically add "see :ref:`memory-operations` for copy/fill/memset
details" after the sentence about allocation concepts so readers can jump
directly to the copy/fill/memset reference.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1d03d702-71d7-442b-aebc-c7776b784bef

📥 Commits

Reviewing files that changed from the base of the PR and between e24efe3 and 2b84202.

📒 Files selected for processing (61)
  • docs/snippets/example/02_execution.cpp
  • docs/snippets/example/05_device.cpp
  • docs/snippets/example/06_queue.cpp
  • docs/snippets/example/08_events.cpp
  • docs/snippets/example/10_memory.cpp
  • docs/snippets/example/11_views.cpp
  • docs/snippets/example/12_kernelIntro.cpp
  • docs/snippets/example/13_hierarchy.cpp
  • docs/snippets/example/14_algorithms.cpp
  • docs/snippets/example/15_kernel.cpp
  • docs/snippets/example/16_sharedMemory.cpp
  • docs/snippets/example/18_multidimKernel.cpp
  • docs/snippets/example/20_simdKernel.cpp
  • docs/snippets/example/22_atomics.cpp
  • docs/snippets/example/24_math.cpp
  • docs/snippets/example/26_warp.cpp
  • docs/snippets/example/28_chunkedFrames.cpp
  • docs/snippets/example/30_random.cpp
  • docs/snippets/example/31_monteCarloPi.cpp
  • docs/snippets/example/32_intrinsics.cpp
  • docs/snippets/example/34_memFence.cpp
  • docs/snippets/example/36_portingKernel.cpp
  • docs/snippets/example/38_vendorInterop.cpp
  • docs/snippets/example/40_imagePipeline.cpp
  • docs/snippets/example/CMakeLists.txt
  • docs/snippets/example/include/docsTest.hpp
  • docs/source/advanced/cmake.rst
  • docs/source/advanced/datastorage.rst
  • docs/source/basic/terms.rst
  • docs/source/conf.py
  • docs/source/index.rst
  • docs/source/tutorial/algorithms.rst
  • docs/source/tutorial/atomics.rst
  • docs/source/tutorial/backendDifferences.rst
  • docs/source/tutorial/chunked.rst
  • docs/source/tutorial/device.rst
  • docs/source/tutorial/events.rst
  • docs/source/tutorial/execution.rst
  • docs/source/tutorial/foundations.rst
  • docs/source/tutorial/hierarchy.rst
  • docs/source/tutorial/intrinsics.rst
  • docs/source/tutorial/intro.rst
  • docs/source/tutorial/kernel.rst
  • docs/source/tutorial/kernels.rst
  • docs/source/tutorial/math.rst
  • docs/source/tutorial/memFence.rst
  • docs/source/tutorial/memory.rst
  • docs/source/tutorial/memoryOperations.rst
  • docs/source/tutorial/mentalModel.rst
  • docs/source/tutorial/migration.rst
  • docs/source/tutorial/migrationMap.rst
  • docs/source/tutorial/miniProject.rst
  • docs/source/tutorial/multidim.rst
  • docs/source/tutorial/numerics.rst
  • docs/source/tutorial/portingKernel.rst
  • docs/source/tutorial/random.rst
  • docs/source/tutorial/sharedMemory.rst
  • docs/source/tutorial/tuning.rst
  • docs/source/tutorial/vendorInterop.rst
  • docs/source/tutorial/views.rst
  • docs/source/tutorial/warp.rst
💤 Files with no reviewable changes (1)
  • docs/snippets/example/CMakeLists.txt

Comment thread docs/snippets/example/28_chunkedFrames.cpp
Comment on lines +32 to +43
if(tid == 0u)
{
    shared[0] = 10;
    onAcc::memFence(acc, onAcc::scope::block, onAcc::order::release);
    shared[1] = 20;
}

onAcc::syncBlockThreads(acc);

auto observedB = shared[1];
onAcc::memFence(acc, onAcc::scope::block, onAcc::order::acquire);
auto observedA = shared[0];

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n docs/snippets/example/34_memFence.cpp

Repository: psychocoderHPC/alpaka3

Length of output: 5362


The second syncBlockThreads() makes the release/acquire fence pair redundant.

The barrier at line 39 ensures all threads synchronize before reading, which means all of thread 0's writes are already visible to other threads regardless of the fence semantics. This renders the release fence at line 35 unnecessary for achieving the intended ordering in this example, turning what should demonstrate fence semantics into a barrier-based example instead.

🧰 Tools
🪛 Cppcheck (2.20.0)

[error] 40-40: Found an exit path from function with non-void return type that has missing return statement

(missingReturn)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/snippets/example/34_memFence.cpp` around lines 32 - 43, The second
barrier (onAcc::syncBlockThreads(acc)) makes the release/acquire memFence pair
redundant; to demonstrate fence semantics replace the barrier-based
synchronization with fence-only ordering by removing the second
onAcc::syncBlockThreads(acc) call (the one immediately before reading observedB)
and keep the release fence in the tid==0 block and the acquire fence before
reading observedA; this preserves shared, tid, onAcc::memFence and the fence
semantics you intend to show.
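
For comparison, here is how a release/acquire fence pair is normally used in standard C++ on the host. This is an analogue, not alpaka device code, and whether `onAcc::memFence` follows the same rules is an assumption. In the C++ memory model the fences alone do not synchronize two threads: they order the plain data accesses around an atomic flag that the producer stores and the consumer loads. `fencePublishDemo` is a hypothetical name.

```cpp
#include <atomic>
#include <thread>

// Host-side analogue using only the C++ standard library.
// Returns the payload value observed by the consumer thread.
int fencePublishDemo()
{
    int payload = 0;           // plain, non-atomic data
    std::atomic<int> ready{0}; // atomic flag that carries the synchronization
    int observed = -1;

    std::thread producer([&] {
        payload = 10;                                        // 1. write the data
        std::atomic_thread_fence(std::memory_order_release); // 2. order it before the flag store
        ready.store(1, std::memory_order_relaxed);           // 3. publish via the atomic flag
    });

    std::thread consumer([&] {
        while(ready.load(std::memory_order_relaxed) == 0)    // 4. wait for the flag
        {
            std::this_thread::yield();
        }
        std::atomic_thread_fence(std::memory_order_acquire); // 5. order the data read after the flag load
        observed = payload;                                  // 6. sees the value written in step 1
    });

    producer.join();
    consumer.join();
    return observed;
}
```

The design point is that the flag, not a barrier, tells the consumer when to read; removing a barrier without adding such a flag would leave the reader with no guarantee that the writer has run yet.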

Comment on lines +79 to +81
while(onAcc::atomicCas(acc, &readyFlag[0u], 0u, 0u) == 0u)
{
}

⚠️ Potential issue | 🟠 Major

Bound the consumer spin loop.

If readyFlag is never published on one backend, this test hangs the docs suite instead of failing. Turn the busy-wait into a bounded poll and report timeout through mismatchCounter.

🛡️ Proposed fix
-            while(onAcc::atomicCas(acc, &readyFlag[0u], 0u, 0u) == 0u)
-            {
-            }
+            uint32_t spins = 0u;
+            while(onAcc::atomicCas(acc, &readyFlag[0u], 0u, 0u) == 0u)
+            {
+                if(++spins == 1000000u)
+                {
+                    onAcc::atomicAdd(acc, &mismatchCounter[0u], 1u);
+                    return;
+                }
+            }
🧰 Tools
🪛 Cppcheck (2.20.0)

[error] 80-80: Found an exit path from function with non-void return type that has missing return statement

(missingReturn)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/snippets/example/34_memFence.cpp` around lines 79 - 81, The unbounded
consumer spin using onAcc::atomicCas on readyFlag can hang the test; replace the
infinite while loop that spins on onAcc::atomicCas(acc, &readyFlag[0u], 0u, 0u)
with a bounded polling loop that tries a fixed number of iterations (or polls
with a timeout), and if the loop exhausts without success increment
mismatchCounter to report the timeout and break out so the test fails instead of
hanging; ensure you reference and update the same readyFlag index and use
mismatchCounter to record the failure.
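
The bounded-poll idea can also be shown in isolation with `std::atomic` on the host. `waitForFlag` and `consume` are hypothetical names, and the spin budget is arbitrary; the kernel in the PR would use `onAcc::atomicCas` and `onAcc::atomicAdd` as in the proposed fix.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper, not alpaka API: polls `flag` at most `maxSpins` times.
// Returns true if the flag became non-zero, false on timeout.
bool waitForFlag(std::atomic<std::uint32_t> const& flag, std::uint32_t maxSpins)
{
    for(std::uint32_t spins = 0u; spins < maxSpins; ++spins)
    {
        if(flag.load(std::memory_order_acquire) != 0u)
            return true;
    }
    return false;
}

// Mirrors the pattern in the proposed fix: a timeout is recorded as a mismatch,
// so the test fails instead of hanging.
std::uint32_t consume(std::atomic<std::uint32_t> const& flag, std::uint32_t maxSpins)
{
    std::uint32_t mismatchCounter = 0u;
    if(!waitForFlag(flag, maxSpins))
        ++mismatchCounter;
    return mismatchCounter;
}
```

A fixed iteration count is a coarse timeout, since its wall-clock length differs between backends, but it is enough to turn a hang into a reported failure.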

Comment on lines +75 to +79
std::transform(
    input.data(),
    input.data() + input.getExtents().x(),
    outPtr,
    [=](float value) { return scale * value + shift; });

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify current element-count logic and related usage context.
rg -n -C2 'input\.getExtents\(\)\.x\(\)' docs/snippets/example/38_vendorInterop.cpp
rg -n -C2 'AffineTransform::call\(' docs/snippets/example
rg -n -C2 'fnDispatch\(' docs/snippets/example/38_vendorInterop.cpp

Repository: psychocoderHPC/alpaka3

Length of output: 1208


🏁 Script executed:

cat -n docs/snippets/example/38_vendorInterop.cpp | sed -n '60,85p'

Repository: psychocoderHPC/alpaka3

Length of output: 1023


🏁 Script executed:

cat -n docs/snippets/example/38_vendorInterop.cpp | sed -n '40,60p'

Repository: psychocoderHPC/alpaka3

Length of output: 844


🏁 Script executed:

cat -n docs/snippets/example/38_vendorInterop.cpp | sed -n '85,110p'

Repository: psychocoderHPC/alpaka3

Length of output: 1164


🏁 Script executed:

# Check if there are any other 2D/3D usage examples in the file
rg -n 'getExtents\(\)' docs/snippets/example/38_vendorInterop.cpp
# Also check alpaka::onHost::transform to understand how it handles multidimensional spans
rg -n 'onHost::transform' docs/snippets/example/38_vendorInterop.cpp

Repository: psychocoderHPC/alpaka3

Length of output: 165


Host dispatch truncates multidimensional inputs.

The function at line 63-81 accepts any alpaka::concepts::IMdSpan (including 2D/3D), but at line 77 it only processes input.getExtents().x() elements, leaving the remaining dimensions untransformed. The Alpaka fallback at lines 44-57 correctly handles multidimensional spans via alpaka::onHost::transform. Constrain this overload to 1D spans explicitly or compute the total element count from all extents before std::transform.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/snippets/example/38_vendorInterop.cpp` around lines 75 - 79, The
host-path currently uses std::transform over input.getExtents().x(), which
truncates multidimensional alpaka::concepts::IMdSpan inputs; either restrict
this overload to 1D spans or compute the full element count by multiplying all
dimensions from input.getExtents() and use that count (and appropriate pointer
arithmetic) in std::transform; locate the host dispatch that uses input,
input.getExtents().x(), outPtr and the lambda and change it to validate/require
a 1D IMdSpan or replace input.getExtents().x() with the product of all extents
(or call alpaka::onHost::transform like the fallback) so all elements are
transformed.
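
The second option, transforming the product of all extents, might look like the sketch below. `Span2D`, `elementCount`, and `affine` are hypothetical stand-ins rather than alpaka types, and the sketch assumes the data is contiguous with no row pitch or padding; a pitched layout would need per-row handling instead.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a contiguous 2D span; NOT alpaka::concepts::IMdSpan.
struct Span2D
{
    float const* ptr;
    std::array<std::size_t, 2> extents; // {rows, columns}
};

// Total element count = product of all extents, not just the fastest dimension.
std::size_t elementCount(Span2D const& s)
{
    return std::accumulate(s.extents.begin(), s.extents.end(), std::size_t{1}, std::multiplies<std::size_t>{});
}

// Applies scale * value + shift to every element of the contiguous span.
std::vector<float> affine(Span2D const& input, float scale, float shift)
{
    std::size_t const count = elementCount(input);
    std::vector<float> out(count);
    std::transform(input.ptr, input.ptr + count, out.begin(), [=](float value) { return scale * value + shift; });
    return out;
}
```

For a 2 x 3 input, using only the column extent would transform 3 of the 6 elements; the product covers all of them.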

Comment thread docs/source/tutorial/atomics.rst Outdated
Comment thread docs/source/tutorial/memFence.rst
Comment thread docs/source/tutorial/random.rst Outdated

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
docs/snippets/example/16_sharedMemory.cpp (1)

149-157: Avoid assert-only sizing logic for dynamic shared memory.

At Line 154, divisibility is enforced only via assert. In release builds, this can under-allocate shared memory when extents aren’t evenly divisible by block count, which can cascade into OOB accesses in DynamicScaleKernel.

Proposed hardening
         uint32_t operator()(auto const executor, auto const& out, auto const& in, int factor) const
         {
             alpaka::unused(executor, out, in, factor);
             auto const totalCachedElements = in.getExtents().product();
             auto const numBlocks = m_spec.getNumBlocks().product();
-            assert(totalCachedElements % numBlocks == 0u);
-            auto const cachedFrameExtent = totalCachedElements / numBlocks;
+            if(numBlocks == 0u)
+                return 0u;
+            assert(totalCachedElements % numBlocks == 0u);
+            auto const cachedFrameExtent = (totalCachedElements + numBlocks - 1u) / numBlocks;
             return static_cast<uint32_t>(cachedFrameExtent * sizeof(int));
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/snippets/example/16_sharedMemory.cpp` around lines 149 - 157, The sizing
currently relies on an assert inside operator()(...) which is removed in release
builds and can under-allocate dynamic shared memory; replace the assert-based
divisibility assumption with a safe runtime computation: compute numBlocks =
m_spec.getNumBlocks().product() and cachedFrameExtent = (totalCachedElements +
numBlocks - 1) / numBlocks (i.e., round up) to guarantee enough shared memory,
and optionally check for numBlocks == 0 and handle/throw/log; update the return
to use the rounded-up cachedFrameExtent * sizeof(int) so DynamicScaleKernel
never under-allocates.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 5bc28376-006a-4887-aaa5-a8b5a169d4d1

📥 Commits

Reviewing files that changed from the base of the PR and between 2b84202 and bf34dc3.

📒 Files selected for processing (38)
  • docs/snippets/example/06_queue.cpp
  • docs/snippets/example/14_algorithms.cpp
  • docs/snippets/example/16_sharedMemory.cpp
  • docs/snippets/example/28_chunkedFrames.cpp
  • docs/source/advanced/cmake.rst
  • docs/source/advanced/datastorage.rst
  • docs/source/basic/cheatsheet.rst
  • docs/source/basic/example.rst
  • docs/source/basic/install.rst
  • docs/source/basic/library.rst
  • docs/source/basic/terms.rst
  • docs/source/contribution/sphinx.rst
  • docs/source/contribution/tools.rst
  • docs/source/dev/logging.rst
  • docs/source/tutorial/algorithms.rst
  • docs/source/tutorial/atomics.rst
  • docs/source/tutorial/backendDifferences.rst
  • docs/source/tutorial/chunked.rst
  • docs/source/tutorial/device.rst
  • docs/source/tutorial/events.rst
  • docs/source/tutorial/execution.rst
  • docs/source/tutorial/hierarchy.rst
  • docs/source/tutorial/intrinsics.rst
  • docs/source/tutorial/kernel.rst
  • docs/source/tutorial/math.rst
  • docs/source/tutorial/memFence.rst
  • docs/source/tutorial/memory.rst
  • docs/source/tutorial/memoryOperations.rst
  • docs/source/tutorial/miniProject.rst
  • docs/source/tutorial/multidim.rst
  • docs/source/tutorial/portingKernel.rst
  • docs/source/tutorial/queue.rst
  • docs/source/tutorial/random.rst
  • docs/source/tutorial/sharedMemory.rst
  • docs/source/tutorial/vector.rst
  • docs/source/tutorial/vendorInterop.rst
  • docs/source/tutorial/views.rst
  • docs/source/tutorial/warp.rst
💤 Files with no reviewable changes (4)
  • docs/source/contribution/sphinx.rst
  • docs/source/dev/logging.rst
  • docs/source/contribution/tools.rst
  • docs/source/basic/library.rst
✅ Files skipped from review due to trivial changes (28)
  • docs/source/basic/example.rst
  • docs/source/basic/install.rst
  • docs/source/basic/cheatsheet.rst
  • docs/source/tutorial/device.rst
  • docs/source/tutorial/queue.rst
  • docs/source/tutorial/vector.rst
  • docs/source/advanced/datastorage.rst
  • docs/source/tutorial/events.rst
  • docs/source/tutorial/miniProject.rst
  • docs/source/tutorial/multidim.rst
  • docs/source/tutorial/hierarchy.rst
  • docs/source/tutorial/atomics.rst
  • docs/source/tutorial/math.rst
  • docs/source/tutorial/warp.rst
  • docs/source/tutorial/backendDifferences.rst
  • docs/source/tutorial/views.rst
  • docs/source/tutorial/vendorInterop.rst
  • docs/source/tutorial/memFence.rst
  • docs/source/tutorial/chunked.rst
  • docs/source/tutorial/execution.rst
  • docs/source/tutorial/kernel.rst
  • docs/source/tutorial/sharedMemory.rst
  • docs/source/tutorial/portingKernel.rst
  • docs/source/tutorial/algorithms.rst
  • docs/source/tutorial/intrinsics.rst
  • docs/source/tutorial/random.rst
  • docs/source/advanced/cmake.rst
  • docs/source/tutorial/memoryOperations.rst
🚧 Files skipped from review as they are similar to previous changes (2)
  • docs/source/basic/terms.rst
  • docs/source/tutorial/memory.rst



2 participants