CV-CUDA Kernels Inherit torch.cuda.current_stream() #9308
Conversation
Thanks for the PR @justincdavis and for bringing this up. I'll have to think more, but this seems reasonable from a quick look. Re testing: does CVCUDA expose an API to get the current stream it's working on, something like https://docs.pytorch.org/docs/stable/generated/torch.cuda.current_stream.html ? If it does, maybe a small test like this one would be enough:

```python
new_stream = torch.cuda.Stream()

def assert_cvcuda_is_using_torch_stream():
    assert cvcuda.current_stream() == new_stream

with torch.cuda.stream(new_stream):
    _cvcuda_shared_stream(assert_cvcuda_is_using_torch_stream)()
```
Hi @NicolasHug, CVCUDA does expose this behavior! I added a simple positive/negative test to check the handles of the two streams.
Summary
CV-CUDA has its own default stream which it uses to execute its kernels. This behavior is fine when using the `Compose` transform API with explicit conversion at the start and end, or when using `F.cvcuda_to_tensor` and `F.tensor_to_cvcuda`, because CV-CUDA synchronizes its own stream when sharing memory externally. However, there are a few edge cases which I believe give us motivation to have CV-CUDA share the PyTorch current CUDA stream:

- When calling `torch.cuda.synchronize()` after using a `cvcuda.Tensor` in the functional API, there is no work for PyTorch to synchronize on, since the work has been queued on a different stream.
- When executing inside a `with torch.cuda.stream(...)` (or similar) context, the work still gets scheduled on a separate stream. In certain scenarios this could result in degraded performance via context switching, and in general it is non-intuitive behavior.
- Users must otherwise issue a `cvcuda.Stream.current.sync()` call, which introduces unneeded complexity and library-mixing in user code.

I propose we implement a decorator/wrapper function which handles assignment of the current CV-CUDA stream based on the current `torch.cuda` stream. This makes the behavior of CV-CUDA kernels in TorchVision much closer to that of the variants operating on PyTorch tensors.
Implementation
Example of wrapping the existing vertical_flip kernel for CV-CUDA:
Testing
As of right now, there is no testing strategy in place for this change. The naive approach would be to assert, via `torch.cuda.synchronize()`, that the CV-CUDA kernels do not block without this behavior and do block with it (through the higher-level functional version). An alternative could potentially use `torch.cuda.Event`.
Feedback
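As a sketch of the `torch.cuda.Event` alternative: record an event on the current torch stream immediately after the kernel call; only work queued on that stream is captured by the event, so synchronizing the event waits for the kernel exactly when stream sharing is in effect. The helper name is invented for illustration, and the fallback makes it trivially pass without a GPU:

```python
try:
    import torch
except ImportError:  # let the sketch load without torch installed
    torch = None


def kernel_work_on_current_stream(run_kernel):
    """Check (loosely) that run_kernel() queued work on the current torch stream.

    An event recorded on the current stream captures all work previously
    queued on that stream; work queued on a different stream is not ordered
    behind it. Returns True trivially when CUDA is unavailable.
    """
    if torch is None or not torch.cuda.is_available():
        return True  # nothing to order without a GPU
    run_kernel()
    event = torch.cuda.Event()
    event.record(torch.cuda.current_stream())
    event.synchronize()  # waits only for work the event captured
    return event.query()
```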
I would love to get feedback on whether this change should be pursued and, if this is behavior the team wants in TorchVision, on the testing strategy.