Fix shared memory offset not reset between CUDA kernels. #442
Conversation
tests/python/test_shared_array.py (outdated)

```python
@test_utils.test(arch=[qd.cuda], print_full_traceback=False)
def test_shared_array_not_accumulated_across_offloads():
    """Shared memory from one offloaded task must not leak into the next."""
```
add a comment about how this test works conceptually.
notably there seem to be no asserts in this test?
I could add asserts. For now it is just checking that it does not crash.
yes, let's have some asserts please.
Could you explain step by step:

Done (see PR description and unit test description).

No AI was involved in this PR, including its description, code comments, and conversations. I take full responsibility for the lines added and removed in this PR. I'm confident that this PR is rock solid and does not introduce any bug that was not preexisting. I won't blame any issue on anybody or anything but me.
```python
@test_utils.test(arch=[qd.cuda, qd.metal])
def test_shared_array_not_accumulated_across_offloads():
    # Execute 2 successive offloaded tasks, both allocating more than half of
```
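The failure mode this test targets can be illustrated with a small, self-contained Python model of a per-task shared-memory allocator. This is a sketch, not the real Quadrants codegen; the names `SharedMemoryAllocator` and `compile_tasks` are illustrative, and the 48 KB pool size is an assumed typical CUDA per-block limit.

```python
MAX_SHARED_BYTES = 48 * 1024  # assumed per-block shared memory limit


class SharedMemoryAllocator:
    """Toy model: hands out shared-memory offsets within one pool."""

    def __init__(self):
        self.offset = 0  # bytes of shared memory handed out so far

    def alloc(self, nbytes):
        base = self.offset
        self.offset += nbytes
        if self.offset > MAX_SHARED_BYTES:
            raise MemoryError("shared memory request exceeds device limit")
        return base

    def reset(self):
        # The fix: each offloaded task starts again from offset 0.
        self.offset = 0


def compile_tasks(task_sizes, reset_between_tasks):
    """Return the base offset assigned to each successive task."""
    allocator = SharedMemoryAllocator()
    offsets = []
    for nbytes in task_sizes:
        if reset_between_tasks:
            allocator.reset()
        offsets.append(allocator.alloc(nbytes))
    return offsets


# Two tasks each requesting more than half the pool: with the reset both
# fit at offset 0; without it, the second allocation overflows the pool.
sizes = [30 * 1024, 30 * 1024]
print(compile_tasks(sizes, reset_between_tasks=True))  # [0, 0]
try:
    compile_tasks(sizes, reset_between_tasks=False)
except MemoryError as exc:
    print("without reset:", exc)
```

Requesting more than half the pool in each task is what makes the bug deterministic: any leaked offset from task one guarantees task two cannot fit.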
Thanks.
Could we have a test where the 'type' of the shared memory in each of several offloaded tasks, within the same qd kernel, is different in some way? (eg scalar vs vector vs matrix; and maybe f32 vs i8 vs f64 etc)
I cannot without breaking support of Metal.
Well, actually I can if limited to 32 bits.
I was thinking like:

```python
@pytest.mark.parametrize("dtype1", [qd.u8, qd.i8, qd.f32])
@pytest.mark.parametrize("dtype2", [qd.u8, qd.i8, qd.f32])
```

Although to be honest, now that I've understood the issue, it seems the tests I asked for don't actually match what was broken. 🤔 Probably good to test with different types though.
Perhaps the test should be:
- always use the same shape and type for both arrays
- parametrize shape over `[(100,), (100, 50)]`
- parametrize type over `[qd.u8, qd.i8, qd.f32]`

So conceptually:

```python
@pytest.mark.parametrize("shape", [(100,), (100, 50)])
@pytest.mark.parametrize("dtype", [qd.u8, qd.i8, qd.f32])
def test_shared_array(dtype, shape):
    dtype1 = dtype2 = dtype
    shape1 = shape2 = shape
    ...  # test here
```

Thoughts?
I'm not OK with this. The objective of this unit test is not just to test what was broken; the PR description already clearly states what was broken. It validates a larger principle: the interaction between the shared memory of successive offloaded tasks. It is not because most of these test cases were passing before that the test should be modified. The current version catches the issue, and it is not the goal of this test to cover every possible variant of tensor types. That is unrelated, and would be part of a follow-up.
There are two objectives of unit tests on a fix, I feel:
1. reproduce the bug, so we can check that:
   a. our understanding of the cause of the bug is at least plausible
   b. our fix fixes the issue, conditional on our understanding being correct
2. the fix doesn't break new things

In this case, for 1., a single test that has two offloaded tasks, both using the same type and shape of memory, is sufficient. Assuming it has an assert checking the allocated shared memory for the second offloaded task.

For 2., you are creating a new type, with a shape and an element type. Two things could go wrong:
- somehow you are hardcoding the element type in the newly created type, and it only works on e.g. f32
- somehow you are hardcoding the shape in the newly created type, and it only works on e.g. 1-D shapes

For 2., therefore, we should try:
- at least two different element types, and check that works OK
- at least two different shapes, and check that works OK

If you can point me to existing tests for 2., then we don't need to add new tests for this. Please paste URLs to those tests into this thread.
> Assuming it has an assert checking the allocated shared memory for the second offloaded task.

(Or something that is a bit more deterministic than checking for a crash, I feel.)
(Note that I realized about 15 minutes after posting my initial parametrize request that we need the shape to be of size 65536, right? (Or possibly 65536 / sizeof(element)?) So our shapes should be `[65536 / sizeof(element)]` and `[sqrt(65536), sqrt(65536) / sizeof(element)]` (with appropriate int casts/rounding).)
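The shapes suggested above can be computed mechanically. A small sketch, assuming a 65536-byte pool and the element sizes of the dtypes discussed (`u8`/`i8` = 1 byte, `f32` = 4 bytes); the helper `shapes_for` is hypothetical, not part of the codebase:

```python
import math

SHARED_BYTES = 65536  # assumed shared memory pool size from the discussion

# Assumed element sizes in bytes for the dtypes mentioned above.
DTYPE_SIZES = {"u8": 1, "i8": 1, "f32": 4}


def shapes_for(dtype):
    """Return a 1-D and a 2-D shape that each fill the assumed pool."""
    elem = DTYPE_SIZES[dtype]
    shape_1d = (SHARED_BYTES // elem,)
    side = math.isqrt(SHARED_BYTES)  # 256 for a 65536-byte pool
    shape_2d = (side, side // elem)
    return shape_1d, shape_2d


print(shapes_for("f32"))  # ((16384,), (256, 64))
print(shapes_for("u8"))   # ((65536,), (256, 256))
```

Both shapes multiply out to exactly 65536 bytes per dtype, so either one saturates the pool on its own.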
tests/python/test_shared_array.py (outdated)

```python
if qd.cfg.arch == qd.cuda:
    max_shared_bytes = qd.lang.impl.get_max_shared_memory_bytes()
else:
    # Metal guarantees 32KB of threadgroup memory
```
didn't you say elsewhere that AMD provides 64?
I feel it would be more portable to simply put this information into the function above though, so it's all in one place?
> didn't you say elsewhere that AMD provides 64?

It just happens that it is currently the case for most AMD GPUs, but this could change (and maybe very old AMD GPUs use a smaller size, I don't know).
> I feel it would be more portable to simply put this information into the function above though, so it's all in one place?

I don't want to hardcode something that is just an educated guess in a function that is supposed to provide the true information.
Sure, but can we put all this knowledge into the function, rather than spraying hardcoded values around the code?
I feel that whatever heuristic we are using (e.g. "always 32KB") should be localized in the get_max_shared_memory_bytes function, please.
What about adding an optional argument, `def get_max_shared_memory_bytes(strict: bool = True) -> int`, that would return the best guess available if False?
What happens if you pass in False? Does it throw an exception?
Yes, if you pass nothing or false, then it throws an exception.
Updated. I finally decided to use this signature: `def get_max_shared_memory_bytes(*, strict)`. I think it is a reasonable compromise. Nothing happens behind the scenes without the user being aware of it.
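A minimal sketch of what the agreed-upon signature could look like, assuming the semantics discussed above (the `_query_device_shared_memory` helper is a hypothetical stand-in for the real backend query, and the 32 KB fallback is the educated guess from this thread, not a queried hardware property):

```python
def _query_device_shared_memory():
    # Hypothetical stand-in for the real backend query; returns None
    # when the backend cannot report the limit (as discussed for Metal).
    return None


def get_max_shared_memory_bytes(*, strict):
    """Return the max shared memory size in bytes.

    With strict=True, raise if the backend cannot report the true value;
    with strict=False, fall back to a best-effort guess.
    """
    queried = _query_device_shared_memory()
    if queried is not None:
        return queried
    if strict:
        raise RuntimeError("max shared memory size is not queryable on this backend")
    return 32 * 1024  # best-effort guess (e.g. Metal threadgroup memory)
```

Making `strict` keyword-only means every call site spells out `strict=...`, so a reader can immediately tell whether a result might be a guess.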
tests/python/test_shared_array.py (outdated)

```python
qd.simt.block.sync()
a[i] = acc

# gpu_graph requires device-resident ndarrays
```
cf passing in numpy arrays?
Added more details.
tests/python/test_shared_array.py (outdated)

```python
d_np = np.random.randn(N).astype(np.float32)

@qd.kernel
def calc(
```
I know it's not part of your changes, but could we rename this calc_reference please? It was confusing for me until I got to line 142, when I realized how these two functions relate to each other.
Also, could we add a comment about what the kernel is doing? Maybe it's doing a dot product (or perhaps an outer product)? If so, can we call it dot_product_reference and dot_product_shared_array (or outer_product, if it's an outer product)?
Could you update the PR title please.
This PR fixes 2 independent bugs related to large shared memory:

[1] CUDA Graph not supporting large shared memory at all.

[2] All CUDA kernels of the same compilation unit (i.e. part of the same Quadrants kernel) share the same pool of singleton tensor types for efficiency. The current version was mutating in place the shape of tensor types corresponding to shared memory, thereby corrupting other tasks being compiled at the same time. The effect of the corruption is that the amount of shared memory requested by other tasks will be 0 (i.e. `shared_array_bytes = 0` because `tensor_type->get_num_elements() == 0` from then on). Because of this, no shared memory will be available for any other task, leading to illegal memory accesses at runtime.

To address bug [1], the flag `CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES` needed to be toggled on in the kernel context, similar to what we are already doing for "classical" CUDA kernels.

To address bug [2], a new type instance of the correct dtype and size 0 is created specifically for large shared memory (so-called "dynamically allocated shared memory").
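Bug [2] can be modeled in a few lines of plain Python, independent of the real IR. The `TensorType`/`TypePool` names are illustrative only; the point is that mutating a pooled singleton corrupts every other reference to it, while building a fresh instance leaves the pool intact:

```python
class TensorType:
    """Toy stand-in for a compiler tensor type."""

    def __init__(self, dtype, num_elements):
        self.dtype = dtype
        self.num_elements = num_elements


class TypePool:
    """One singleton TensorType per (dtype, num_elements) key."""

    def __init__(self):
        self._pool = {}

    def get(self, dtype, num_elements):
        key = (dtype, num_elements)
        if key not in self._pool:
            self._pool[key] = TensorType(dtype, num_elements)
        return self._pool[key]


# Buggy approach: two tasks share the same pooled singleton, so marking
# task A's array as dynamically allocated (size 0) silently corrupts B.
pool = TypePool()
task_a_type = pool.get("f32", 1024)
task_b_type = pool.get("f32", 1024)  # same object as task_a_type
task_a_type.num_elements = 0
assert task_b_type.num_elements == 0  # corruption: B now requests 0 bytes

# Fixed approach: create a fresh size-0 instance for dynamic shared
# memory instead of touching the pooled singleton.
pool2 = TypePool()
task_c_type = pool2.get("f32", 1024)
dynamic_type = TensorType(task_c_type.dtype, 0)  # new instance
assert pool2.get("f32", 1024).num_elements == 1024  # pool untouched
```

This is exactly why a `shared_array_bytes = 0` showed up in unrelated tasks: they were reading `num_elements` off the same mutated singleton.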
Note that I just discovered these limitations:
I suggest addressing them at a later stage.