Conversation
| QD_ERROR_IF(!cond_kernel_func_, | ||
| "Condition kernel not available; cannot build graph_do_while"); | ||
| if (!cond_kernel_func_) { | ||
| // Device does not support graph_do_while (requires SM 9.0+). |
There was a problem hiding this comment.
What I want to happen is:
- if we are on SM90+, and the conditional kernel is not available => throw an exception; ths user must install the pre-requiaites so they get the full performance out of their SM90. We don't want people using an SM90, assuming they are getting full performance, and then thinking Genesis is slow, I feel
- if < SM90, then fall back to host side do while loop.
Quetions:
- To what extent does this change align with these two constraints?
- To what extent are we testing both conditions? (or at least, testing the first constraint?)
There was a problem hiding this comment.
I see. Updated accordingly.
There was a problem hiding this comment.
To what extent are we testing both conditions? (or at least, testing the first constraint?)
Just skimmed through the unit tests and it seems we are. I will double check tomorrow.
There was a problem hiding this comment.
if we are on SM90+, and the conditional kernel is not available => throw an exception;
I have zero idea about how to test this. You are asking me to test something that requires a faulty device.
There was a problem hiding this comment.
if < SM90, then fall back to host side do while loop.
This is already unit tested.
|
|
||
|
|
||
| @test_utils.test() | ||
| @test_utils.test(arch=qd.gpu) |
There was a problem hiding this comment.
should work on ALL bakcends, including cpu. Please remove the filter on gpu. Why do you feel this test should not run on cpu?
There was a problem hiding this comment.
Just wanted to be conservative. I can remove this guard.
There was a problem hiding this comment.
Like, we only need gpu in Genesis.
|
|
||
|
|
||
| @test_utils.test() | ||
| @test_utils.test(arch=qd.gpu) |
There was a problem hiding this comment.
again, gpu graph is supported on cpu. Perhaps I should rename it 🤔 (I had a similar challenge with Opus, when it was named 'cuda_graph': it refused to run it on anthing other than cuda. So I renamed it. perhaps I should just call it graph?)
There was a problem hiding this comment.
So I renamed it. perhaps I should just call it graph?)
I think you should yes.
|
|
||
|
|
||
| @test_utils.test() | ||
| @test_utils.test(arch=qd.gpu) |
|
|
||
|
|
||
| @test_utils.test() | ||
| @test_utils.test(arch=qd.gpu) |
|
|
||
|
|
||
| @test_utils.test() | ||
| @test_utils.test(arch=qd.gpu) |
|
|
||
|
|
||
| @test_utils.test() | ||
| @test_utils.test(arch=qd.gpu) |
|
|
||
|
|
||
| @test_utils.test() | ||
| @test_utils.test(arch=qd.gpu) |
|
|
||
|
|
||
| @test_utils.test() | ||
| @test_utils.test(arch=qd.gpu) |
b9c7936 to
a879710
Compare
This PR checks SM compatibility if native GPU graph is not supported by a given CUDA kernel at runtime, and fallback to host-side do-while loop if not supported, instead of systematically raising an exception.
Moreover, message about fallbacking to host-side do-while loop if GPU Graph is not supported has been demoted from warning to information. Warning is supposed to require user action to be fixed. Here, there is nothing to fix, it is part of the expected workflow. Still, it is nice to inform the user that it is not doing what one may expect intuitively.
Resolves Genesis-Embodied-AI/Genesis#2631