
Fix shared memory offset not reset between CUDA kernels. #442

Open
duburcqa wants to merge 15 commits into main from duburcqa/fix_shared_mem_offset

Conversation

@duburcqa
Contributor

@duburcqa duburcqa commented Mar 31, 2026

This PR fixes two independent bugs related to large shared memory:
[1] CUDA Graph does not support large shared memory at all.
[2] All CUDA kernels in the same compilation unit (i.e. part of the same Quadrants kernel) share the same pool of singleton tensor types for efficiency. The current version mutated the shape of the tensor types corresponding to shared memory in place, thereby corrupting other tasks being compiled at the same time. As a result of the corruption, the amount of shared memory requested by those other tasks becomes 0 (i.e. shared_array_bytes = 0 because tensor_type->get_num_elements() == 0 from then on). Consequently, no shared memory is available for any other task, leading to illegal memory accesses at runtime.

To address bug [1], the attribute CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES needs to be set in the kernel context, similar to what we already do for "classical" CUDA kernels.

To address bug [2], a new type instance with the correct dtype and size 0 is created specifically for large shared memory (so-called "dynamically allocated shared memory"), instead of mutating the shared singleton in place.
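
The corruption mechanism of bug [2] can be illustrated with a minimal, self-contained Python sketch. The class and pool names below are illustrative stand-ins, not the actual Quadrants types:

```python
# Minimal sketch of bug [2]: tasks share singleton "tensor type" objects,
# so mutating one in place corrupts every other task holding a reference.
# TensorType and the pool are hypothetical names for illustration only.

class TensorType:
    def __init__(self, shape):
        self.shape = shape

    def get_num_elements(self):
        n = 1
        for s in self.shape:
            n *= s
        return n

pool = {"f32x1024": TensorType((1024,))}  # shared singleton pool

# Buggy behaviour: mutate the singleton in place for dynamic shared memory.
buggy = pool["f32x1024"]
buggy.shape = (0,)  # corrupts every holder of this type
assert pool["f32x1024"].get_num_elements() == 0  # other tasks now see size 0

# Fixed behaviour: create a fresh size-0 instance instead of mutating.
pool["f32x1024"].shape = (1024,)  # restore the singleton for the demo
fresh = TensorType((0,))          # dedicated type for dynamic shared memory
assert pool["f32x1024"].get_num_elements() == 1024  # singleton untouched
assert fresh.get_num_elements() == 0
```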


Note that I just discovered these limitations:

// By default, CUDA can allocate static shared arrays of up to 48KB.
// Allocating a larger array requires dynamic shared memory.
// Therefore, when a shared array requests a size greater than 48KB,
// we switch it to dynamic allocation.
// In the current version, only one dynamic instance is allowed.
// TODO: remove this limit.
constexpr std::size_t cuda_dynamic_shared_array_threshold_bytes = 49152;

// Used by auto mesh_local to determine the shared-mem size per block (in bytes).
// TODO: get this at runtime.
constexpr std::size_t default_shared_mem_size = 65536;

I suggest addressing them at a later stage.
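
The static-vs-dynamic decision described by those comments can be sketched as follows; this is an illustrative model of the documented 48KB threshold, not the actual codebase API:

```python
# Hypothetical sketch of the allocation decision: shared arrays above the
# 48KB static limit must switch to dynamic shared memory.
CUDA_DYNAMIC_SHARED_ARRAY_THRESHOLD_BYTES = 49152  # 48 KB

def choose_shared_allocation(requested_bytes: int) -> str:
    """Return the allocation strategy for a shared array of the given size."""
    if requested_bytes > CUDA_DYNAMIC_SHARED_ARRAY_THRESHOLD_BYTES:
        return "dynamic"  # larger arrays require dynamic shared memory
    return "static"

print(choose_shared_allocation(4096))   # well under the limit -> static
print(choose_shared_allocation(65536))  # 64 KB -> dynamic
```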


@test_utils.test(arch=[qd.cuda], print_full_traceback=False)
def test_shared_array_not_accumulated_across_offloads():
"""Shared memory from one offloaded task must not leak into the next."""
Collaborator

@hughperkins hughperkins Mar 31, 2026

add a comment about how this test works conceptually.

Collaborator

notably there seem to be no asserts in this test?

Contributor Author

I could add asserts. For now it is just checking that it does not crash.

Collaborator

yes, let's have some asserts please.

Contributor Author

done.

@hughperkins
Collaborator

Could you explain step by step:

  • what is your hypothesis for what the bug is?
  • how this PR addresses it?

@duburcqa
Contributor Author

Could you explain step by step:

Done (see PR description and unit test description).

@duburcqa
Contributor Author

duburcqa commented Mar 31, 2026

No AI was involved in this PR, including its description, code comments, and conversations. I take full responsibility for the lines added and removed in this PR. I'm confident that this PR is rock solid and does not introduce any bug that was not preexisting. I won't blame any issue on anyone or anything but me.


@test_utils.test(arch=[qd.cuda, qd.metal])
def test_shared_array_not_accumulated_across_offloads():
# Execute 2 successive offloaded tasks both allocating more than half of
Collaborator

Thanks.

Could we have a test where the 'type' of the shared memory in each of several offloaded tasks, within the same qd kernel, is different in some way? (e.g. scalar vs vector vs matrix; and maybe f32 vs i8 vs f64, etc.)

Contributor Author

I cannot without breaking Metal support.

Contributor Author

Well, actually I can if limited to 32 bits.

Contributor Author

done.

Collaborator

I was thinking like:

@pytest.mark.parametrize("dtype1", [qd.u8, qd.i8, qd.f32])
@pytest.mark.parametrize("dtype2", [qd.u8, qd.i8, qd.f32])

Although to be honest, now that I've understood the issue, it seems the tests I asked for don't actually match what was broken. 🤔 Probably good to test with different types though.

Perhaps the test should be:

  • always use the same shape and type for both arrays
  • parametrize shape over [(100,), (100, 50),]
  • parametrize type over [qd.u8, qd.i8, qd.f32]

so conceptually:

@parametrize shape over [(100,), (100, 50),]
@parametrize dtype over [qd.u8, qd.i8, qd.f32]
def (dtype, shape):
    dtype1 = dtype2 = dtype
    shape1 = shape2 = shape
    ... test here ...

Thoughts?

Contributor Author

I'm not OK with this. The objective of this unit test is not just to test what was broken; the PR description already clearly states what was broken. It validates a larger principle: the interaction between the shared memory of successive offloaded tasks. The fact that most of these test cases were passing before is no reason to modify it. The current version catches the issue, and it is not the goal of this test to cover all possible variants of tensor types. That is unrelated and would be part of a follow-up.

Collaborator

There are two objectives of unit tests on a fix I feel:

  1. reproduce the bug, so we can check that
    a. our understanding of the cause of the bug is at least plausible
    b. our fix fixes the issue, conditional on our understanding being correct
  2. the fix doesn't break new things

In this case, for 1. a single test, that has two offloaded tasks, both using the same type and shape of memory is sufficient. Assuming it has an assert checking the allocated shared memory for the second offloaded task.

For 2., you are creating a new type, with a shape and an element type. Two things that could go wrong:

  • somehow you are hardcoding the element type in the new created type, and it only works on eg f32
  • somehow you are hardcoding the shape in the new created type, and it only works on eg 1d shapes

For 2. therefore we should try:

  • at least two different element types, check that works ok
  • at least two different shapes, check that works ok

If you can point me to existing tests for 2., then we don't need to add new tests for this. Please paste urls to those tests into this thread.
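
The test matrix proposed above (same dtype and shape for both shared arrays, varied across element types and ranks) can be sketched in plain Python; the dtype strings mirror qd.u8/qd.i8/qd.f32 but are stand-ins, since the real kernel body depends on the Quadrants API:

```python
# Sketch of the proposed parametrization grid: each case runs both offloaded
# tasks with identical dtype and shape, varying the pair across the grid.
from itertools import product

dtypes = ["u8", "i8", "f32"]
shapes = [(100,), (100, 50)]

cases = list(product(dtypes, shapes))
print(len(cases))  # 3 dtypes x 2 shapes = 6 combinations

for dtype, shape in cases:
    dtype1 = dtype2 = dtype
    shape1 = shape2 = shape
    # ... run both offloaded tasks with these parameters and assert results ...
```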

Collaborator

Assuming it has an assert checking the allocated shared memory for the second offloaded task.

(Or something that is a bit more deterministic than checking for a crash, I feel.)

Collaborator

(Note that I realized about 15 minutes after posting my initial parametrize request that we need the shape to be of size 65536, right? (or possibly 65536 / sizeof(element)?). So our shapes should be [65536 / sizeof(element)] and [sqrt(65536), sqrt(65536) / sizeof(element)] (with appropriate int casts/rounding).)
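
The shape arithmetic suggested in that note can be sketched concretely; `shapes_for` is a hypothetical helper assuming a 64KB shared-memory budget:

```python
# Pick 1-D and 2-D shapes whose total byte size equals the 64KB budget for a
# given element size, following the sqrt(65536) = 256 suggestion above.
import math

TARGET_BYTES = 65536

def shapes_for(elem_size: int):
    """Return a 1-D and a 2-D shape totalling TARGET_BYTES bytes."""
    n = TARGET_BYTES // elem_size
    side = math.isqrt(TARGET_BYTES)  # 256
    return [(n,), (side, side // elem_size)]

print(shapes_for(1))  # u8/i8: [(65536,), (256, 256)]
print(shapes_for(4))  # f32:   [(16384,), (256, 64)]
```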

if qd.cfg.arch == qd.cuda:
    max_shared_bytes = qd.lang.impl.get_max_shared_memory_bytes()
else:
    # Metal guarantees 32KB of threadgroup memory
Collaborator

didn't you say elsewhere that AMD provides 64?

Collaborator

I feel it would be more portable to simply put this information into the function above though, so it's all in one place?

Contributor Author

didn't you say elsewhere that AMD provides 64?

It just happens that this is currently the case for most AMD GPUs, but it could change (and maybe very old AMD GPUs use a smaller size, I don't know).

I feel it would be more portable to simply put this information into the function above though, so it's all in one place?

I don't want to hardcode something that is just an educated guess in a function that is supposed to provide the true information.

Collaborator

Sure, but can we put all this knowledge into the function, rather than hardcoding it all over the code?

Collaborator

I feel that whatever heuristic we are using (e.g. "always 32KB") should be localized in the get_max_shared_memory_bytes function please.

Contributor Author

I disagree.

Contributor Author

@duburcqa duburcqa Mar 31, 2026

What about adding an optional argument, def get_max_shared_memory_bytes(strict: bool = True) -> int, that would return the best guess available if strict is False?

Collaborator

What happens if you pass in False? Does it throw an exception?

Contributor Author

Yes, if you pass nothing or False, then it throws an exception.

Contributor Author

Updated. I finally decided to use the signature def get_max_shared_memory_bytes(*, strict):. I think it is a reasonable compromise: nothing happens behind the scenes without the user being aware of it.
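
A hedged sketch of the agreed signature, based on the proposal earlier in this thread: a keyword-only `strict` flag, where strict mode raises if the value cannot be queried from the device, and non-strict mode falls back to a conservative guess. The helper `_query_device_shared_memory` and the fallback value are assumptions for illustration, not the actual implementation:

```python
# Illustrative sketch: strict=True queries the device and raises when the
# value is unavailable; strict=False returns a conservative best guess.

def _query_device_shared_memory():
    """Stand-in for a real device query; pretend we are not on CUDA."""
    return None

def get_max_shared_memory_bytes(*, strict: bool) -> int:
    device_value = _query_device_shared_memory()
    if device_value is not None:
        return device_value
    if strict:
        raise RuntimeError("max shared memory is not queryable on this backend")
    return 32 * 1024  # conservative guess, e.g. Metal's 32KB guarantee

try:
    get_max_shared_memory_bytes(strict=True)
except RuntimeError:
    pass  # strict mode raises when the value cannot be queried
print(get_max_shared_memory_bytes(strict=False))  # prints 32768
```

Keeping `strict` keyword-only forces callers to state explicitly whether a best-effort guess is acceptable, which matches the "nothing behind the scenes" goal.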

qd.simt.block.sync()
a[i] = acc

# gpu_graph requires device-resident ndarrays
Collaborator

cf passing in numpy arrays?

Contributor Author

@duburcqa duburcqa Mar 31, 2026

Added more details.

d_np = np.random.randn(N).astype(np.float32)

@qd.kernel
def calc(
Collaborator

I know it's not part of your changes, but could we rename this calc_reference please? It was confusing for me until I got to line 142, and then I realized how these two functions relate to each other.

Collaborator

Also, could we add a comment about what the kernel is doing? Maybe it's doing a dot product (or perhaps an outer product)? If so, can we call it dot_product_reference and dot_product_shared_array (or outer product, if it's an outer product)?

Contributor Author

done.

@duburcqa duburcqa force-pushed the duburcqa/fix_shared_mem_offset branch from 0467f46 to 51bc424 on March 31, 2026 22:18
@duburcqa duburcqa force-pushed the duburcqa/fix_shared_mem_offset branch from f9848ee to 0670652 on March 31, 2026 22:36
@duburcqa duburcqa force-pushed the duburcqa/fix_shared_mem_offset branch from 98e5da1 to be5e613 on April 1, 2026 09:18
@duburcqa duburcqa force-pushed the duburcqa/fix_shared_mem_offset branch from be5e613 to ec3a152 on April 1, 2026 09:20
@hughperkins
Collaborator

Could you update the PR title please?
