
Fix shared memory offset not reset between CUDA kernels. #442

Open
duburcqa wants to merge 15 commits into main from duburcqa/fix_shared_mem_offset

Conversation

@duburcqa
Contributor

@duburcqa duburcqa commented Mar 31, 2026

This PR fixes two independent bugs related to large shared memory:
[1] CUDA Graph does not support large shared memory at all.
[2] All CUDA kernels in the same compilation unit (i.e. part of the same Quadrants kernel) share the same pool of singleton tensor types for efficiency. The current version mutated the shape of the tensor types corresponding to shared memory in place, thereby corrupting other tasks being compiled at the same time. As a result of the corruption, the amount of shared memory requested by those other tasks becomes 0 (i.e. shared_array_bytes = 0 because tensor_type->get_num_elements() == 0 from then on). Consequently, no shared memory is available for any other task, leading to illegal memory accesses at runtime.

To address bug [1], the attribute CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES needs to be set in the kernel context, similar to what we already do for "classical" CUDA kernels.

To address bug [2], a new type instance with the correct dtype and size 0 is created specifically for large shared memory (so-called "dynamically allocated shared memory"), instead of mutating the shared singleton in place.
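
The corruption mechanism of bug [2] can be illustrated with a minimal, self-contained Python sketch. The class and pool names below are illustrative stand-ins, not the actual Quadrants types:

```python
# Minimal sketch of bug [2]: tasks share singleton "tensor type" objects,
# so mutating one in place corrupts every other task holding a reference.
# TensorType and the pool are hypothetical names for illustration only.

class TensorType:
    def __init__(self, shape):
        self.shape = shape

    def get_num_elements(self):
        n = 1
        for s in self.shape:
            n *= s
        return n

pool = {"f32x1024": TensorType((1024,))}  # shared singleton pool

# Buggy behaviour: mutate the singleton in place for dynamic shared memory.
buggy = pool["f32x1024"]
buggy.shape = (0,)  # corrupts every holder of this type
assert pool["f32x1024"].get_num_elements() == 0  # other tasks now see size 0

# Fixed behaviour: create a fresh size-0 instance instead of mutating.
pool["f32x1024"].shape = (1024,)  # restore the singleton for the demo
fresh = TensorType((0,))          # dedicated type for dynamic shared memory
assert pool["f32x1024"].get_num_elements() == 1024  # singleton untouched
assert fresh.get_num_elements() == 0
```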


Note that I just discovered these limitations:

// By default, CUDA can allocate static shared arrays of up to 48KB.
// Allocating a larger array requires dynamic shared memory.
// Therefore, when a shared array requests a size greater than 48KB,
// we switch it to dynamic allocation.
// In the current version, only one dynamic instance is allowed.
// TODO: remove this limit.
constexpr std::size_t cuda_dynamic_shared_array_threshold_bytes = 49152;

// Used by auto mesh_local to determine the shared-mem size per block (in bytes).
// TODO: get this at runtime.
constexpr std::size_t default_shared_mem_size = 65536;

I suggest addressing them at a later stage.
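
The static-vs-dynamic decision described by those comments can be sketched as follows; this is an illustrative model of the documented 48KB threshold, not the actual codebase API:

```python
# Hypothetical sketch of the allocation decision: shared arrays above the
# 48KB static limit must switch to dynamic shared memory.
CUDA_DYNAMIC_SHARED_ARRAY_THRESHOLD_BYTES = 49152  # 48 KB

def choose_shared_allocation(requested_bytes: int) -> str:
    """Return the allocation strategy for a shared array of the given size."""
    if requested_bytes > CUDA_DYNAMIC_SHARED_ARRAY_THRESHOLD_BYTES:
        return "dynamic"  # larger arrays require dynamic shared memory
    return "static"

print(choose_shared_allocation(4096))   # well under the limit -> static
print(choose_shared_allocation(65536))  # 64 KB -> dynamic
```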


@test_utils.test(arch=[qd.cuda], print_full_traceback=False)
def test_shared_array_not_accumulated_across_offloads():
"""Shared memory from one offloaded task must not leak into the next."""
Collaborator

@hughperkins hughperkins Mar 31, 2026

add a comment about how this test works conceptually.

Collaborator

notably there seem to be no asserts in this test?

Contributor Author

I could add asserts. For now it is just checking that it does not crash.

Collaborator

yes, let's have some asserts please.

Contributor Author

done.

@hughperkins
Collaborator

Could you explain step by step:

  • what is your hypothesis for what the bug is?
  • how this PR addresses it?

@duburcqa
Contributor Author

Could you explain step by step:

Done (see PR description and unit test description).

@duburcqa
Contributor Author

duburcqa commented Mar 31, 2026

No AI was involved in this PR, including its description, code comments, and conversations. I take full responsibility for the lines added and removed in this PR. I'm confident that this PR is rock solid and does not introduce any bug that was not preexisting. I won't blame any issue on anyone or anything but me.


@test_utils.test(arch=[qd.cuda, qd.metal])
def test_shared_array_not_accumulated_across_offloads():
# Execute 2 successive offloaded tasks both allocating more than half of
Collaborator

Thanks.

Could we have a test where the 'type' of the shared memory in each of several offloaded tasks, within the same qd kernel, is different in some way? (e.g. scalar vs vector vs matrix; and maybe f32 vs i8 vs f64, etc.)

Contributor Author

I cannot without breaking Metal support.

Contributor Author

Well, actually I can if limited to 32 bits.

Contributor Author

done.

Collaborator

I was thinking like:

@pytest.mark.parametrize("dtype1", [qd.u8, qd.i8, qd.f32])
@pytest.mark.parametrize("dtype2", [qd.u8, qd.i8, qd.f32])

Although to be honest, now that I've understood the issue, it seems the tests I asked for don't actually match what was broken. 🤔 Probably good to test with different types though.

Perhaps the test should be:

  • always use the same shape and type for both arrays
  • parametrize shape over [(100,), (100, 50),]
  • parametrize type over [qd.u8, qd.i8, qd.f32]

so conceptually:

@parametrize shape over [(100,), (100, 50),]
@parametrize dtype over [qd.u8, qd.i8, qd.f32]
def (dtype, shape):
    dtype1 = dtype2 = dtype
    shape1 = shape2 = shape
    ... test here ...

Thoughts?

Contributor Author

I'm not OK with this. The objective of this unit test is not just to test what was broken; the PR description already clearly states what was broken. It validates a larger principle: the interaction between the shared memory of successive offloaded tasks. The fact that most of these test cases were passing before is no reason to modify it. The current version catches the issue, and it is not the goal of this test to cover all possible variants of tensor types. That is unrelated and would be part of a follow-up.

Collaborator

There are two objectives of unit tests on a fix I feel:

  1. reproduce the bug, so we can check that
    a. our understanding of the cause of the bug is at least plausible
    b. our fix fixes the issue, conditional on our understanding being correct
  2. the fix doesn't break new things

In this case, for 1. a single test, that has two offloaded tasks, both using the same type and shape of memory is sufficient. Assuming it has an assert checking the allocated shared memory for the second offloaded task.

For 2., you are creating a new type, with a shape and an element type. Two things that could go wrong:

  • somehow you are hardcoding the element type in the new created type, and it only works on eg f32
  • somehow you are hardcoding the shape in the new created type, and it only works on eg 1d shapes

For 2. therefore we should try:

  • at least two different element types, check that works ok
  • at least two different shapes, check that works ok

If you can point me to existing tests for 2., then we don't need to add new tests for this. Please paste urls to those tests into this thread.
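
The test matrix proposed above (same dtype and shape for both shared arrays, varied across element types and ranks) can be sketched in plain Python; the dtype strings mirror qd.u8/qd.i8/qd.f32 but are stand-ins, since the real kernel body depends on the Quadrants API:

```python
# Sketch of the proposed parametrization grid: each case runs both offloaded
# tasks with identical dtype and shape, varying the pair across the grid.
from itertools import product

dtypes = ["u8", "i8", "f32"]
shapes = [(100,), (100, 50)]

cases = list(product(dtypes, shapes))
print(len(cases))  # 3 dtypes x 2 shapes = 6 combinations

for dtype, shape in cases:
    dtype1 = dtype2 = dtype
    shape1 = shape2 = shape
    # ... run both offloaded tasks with these parameters and assert results ...
```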

Collaborator

Assuming it has an assert checking the allocated shared memory for the second offloaded task.

(Or something that is a bit more deterministic than checking for a crash, I feel.)

Collaborator

(Note that I realized about 15 minutes after posting my initial parametrize request that we need the shape to be of size 65536, right? (or possibly 65536 / sizeof(element)?). So our shapes should be [65536 / sizeof(element)] and [sqrt(65536), sqrt(65536) / sizeof(element)] (with appropriate int casts/rounding).)
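
The shape arithmetic suggested in that note can be sketched concretely; `shapes_for` is a hypothetical helper assuming a 64KB shared-memory budget:

```python
# Pick 1-D and 2-D shapes whose total byte size equals the 64KB budget for a
# given element size, following the sqrt(65536) = 256 suggestion above.
import math

TARGET_BYTES = 65536

def shapes_for(elem_size: int):
    """Return a 1-D and a 2-D shape totalling TARGET_BYTES bytes."""
    n = TARGET_BYTES // elem_size
    side = math.isqrt(TARGET_BYTES)  # 256
    return [(n,), (side, side // elem_size)]

print(shapes_for(1))  # u8/i8: [(65536,), (256, 256)]
print(shapes_for(4))  # f32:   [(16384,), (256, 64)]
```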

if qd.cfg.arch == qd.cuda:
    max_shared_bytes = qd.lang.impl.get_max_shared_memory_bytes()
else:
    # Metal guarantees 32KB of threadgroup memory
Collaborator

didn't you say elsewhere that AMD provides 64?

Collaborator

I feel it would be more portable to simply put this information into the function above though, so it's all in one place?

Contributor Author

didn't you say elsewhere that AMD provides 64?

It just happens that this is currently the case for most AMD GPUs, but it could change (and maybe very old AMD GPUs use a smaller size, I don't know).

I feel it would be more portable to simply put this information into the function above though, so it's all in one place?

I don't want to hardcode something that is just an educated guess in a function that is supposed to provide the true information.

Collaborator

Sure, but can we put all this knowledge into the function, rather than hardcoding it all over the code?

Collaborator

I feel that whatever heuristic we are using (e.g. "always 32KB") should be localized in the get_max_shared_memory_bytes function please.

Contributor Author

I disagree.

Contributor Author

@duburcqa duburcqa Mar 31, 2026

What about adding an optional argument, def get_max_shared_memory_bytes(strict: bool = True) -> int, that would return the best guess available if strict is False?

Collaborator

What happens if you pass in False? Does it throw an exception?

Contributor Author

Yes, if you pass nothing or False, then it throws an exception.

Contributor Author

Updated. I finally decided to use the signature def get_max_shared_memory_bytes(*, strict):. I think it is a reasonable compromise: nothing happens behind the scenes without the user being aware of it.
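
A hedged sketch of the agreed signature, based on the proposal earlier in this thread: a keyword-only `strict` flag, where strict mode raises if the value cannot be queried from the device, and non-strict mode falls back to a conservative guess. The helper `_query_device_shared_memory` and the fallback value are assumptions for illustration, not the actual implementation:

```python
# Illustrative sketch: strict=True queries the device and raises when the
# value is unavailable; strict=False returns a conservative best guess.

def _query_device_shared_memory():
    """Stand-in for a real device query; pretend we are not on CUDA."""
    return None

def get_max_shared_memory_bytes(*, strict: bool) -> int:
    device_value = _query_device_shared_memory()
    if device_value is not None:
        return device_value
    if strict:
        raise RuntimeError("max shared memory is not queryable on this backend")
    return 32 * 1024  # conservative guess, e.g. Metal's 32KB guarantee

try:
    get_max_shared_memory_bytes(strict=True)
except RuntimeError:
    pass  # strict mode raises when the value cannot be queried
print(get_max_shared_memory_bytes(strict=False))  # prints 32768
```

Keeping `strict` keyword-only forces callers to state explicitly whether a best-effort guess is acceptable, which matches the "nothing behind the scenes" goal.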

qd.simt.block.sync()
a[i] = acc

# gpu_graph requires device-resident ndarrays
Collaborator

cf passing in numpy arrays?

Contributor Author

@duburcqa duburcqa Mar 31, 2026

Added more details.

d_np = np.random.randn(N).astype(np.float32)

@qd.kernel
def calc(
Collaborator

I know it's not part of your changes, but could we rename this calc_reference please? It was confusing for me until I got to line 142, and then I realized how these two functions relate to each other.

Collaborator

Also, could we add a comment about what the kernel is doing? Maybe it's doing a dot product (or perhaps an outer product)? If so, can we call it dot_product_reference and dot_product_shared_array (or outer product, if it's an outer product)?

Contributor Author

done.

@duburcqa duburcqa force-pushed the duburcqa/fix_shared_mem_offset branch from 0467f46 to 51bc424 on March 31, 2026 22:18
@duburcqa duburcqa force-pushed the duburcqa/fix_shared_mem_offset branch from f9848ee to 0670652 on March 31, 2026 22:36
@duburcqa duburcqa force-pushed the duburcqa/fix_shared_mem_offset branch from 98e5da1 to be5e613 on April 1, 2026 09:18
@duburcqa duburcqa force-pushed the duburcqa/fix_shared_mem_offset branch from be5e613 to ec3a152 on April 1, 2026 09:20
@hughperkins
Collaborator

Could you update the PR title please?
