
Conversation

@justincdavis

Summary

CV-CUDA has its own default stream on which it executes its kernels. This is fine when using the Compose transform API with explicit conversion at the start and end, or when using F.cvcuda_to_tensor and F.tensor_to_cvcuda, because CV-CUDA synchronizes its own stream when sharing memory with an external library. However, there are a few edge cases which I believe give us motivation to have CV-CUDA share the PyTorch current CUDA stream:

  1. When calling torch.cuda.synchronize() after a functional-API call on a cvcuda.Tensor, PyTorch has no work to synchronize on, since the work was queued on a different stream.
  2. If a user selects a specific CUDA stream with the torch.cuda.stream(...) context manager or a similar call, the CV-CUDA work still gets scheduled on a separate stream. In certain scenarios this can degrade performance via context switching, and in general it is non-intuitive behavior.
  3. Should a user want to synchronize while using the functional API and the CV-CUDA backend, they would have to call cvcuda.Stream.current.sync(), which introduces unneeded complexity and library mixing in user code.

I propose we implement a decorator/wrapper function that assigns the current CV-CUDA stream from the current torch.cuda stream at call time. This makes the CV-CUDA kernels in TorchVision behave much more like their PyTorch-tensor counterparts.

Implementation

import functools
from typing import Callable, ParamSpec, TypeVar

import torch

P = ParamSpec("P")
R = TypeVar("R")


def _cvcuda_shared_stream(fn: Callable[P, R]) -> Callable[P, R]:
    # import cvcuda once, at function-wrapping time
    cvcuda = _import_cvcuda()

    @functools.wraps(fn)
    def wrapper(*args: P.args, **kwargs: P.kwargs) -> R:
        # get the current torch CUDA stream at call time
        stream = torch.cuda.current_stream()

        # cvcuda.Stream supports the context-manager protocol to assign
        # the thread-local current stream
        with cvcuda.as_stream(stream):
            # the wrapped cvcuda operator uses the current stream by default;
            # inside this context manager that is the torch stream above
            result = fn(*args, **kwargs)

        return result

    return wrapper

Example of wrapping the existing vertical_flip kernel for CV-CUDA:

def _vertical_flip_image_cvcuda(image: "cvcuda.Tensor") -> "cvcuda.Tensor":
    return _import_cvcuda().flip(image, flipCode=0)


if CVCUDA_AVAILABLE:
    _register_kernel_internal(vertical_flip, _import_cvcuda().Tensor)(
        _cvcuda_shared_stream(_vertical_flip_image_cvcuda)
    )
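For readers unfamiliar with the kernel registry, `_register_kernel_internal` is a TorchVision-internal helper that maps (functional, input type) pairs to backend kernels. A toy stand-in registry (every name below is hypothetical, not the real TorchVision implementation) illustrates how registering by input type routes `vertical_flip` to the CV-CUDA kernel:

```python
# Toy type-dispatch registry, illustrating (not reproducing) what
# torchvision's _register_kernel_internal does. All names are stand-ins.
_KERNEL_REGISTRY: dict = {}


def register_kernel(functional, input_type):
    def decorator(kernel):
        _KERNEL_REGISTRY[(functional, input_type)] = kernel
        return kernel
    return decorator


def vertical_flip(inpt):
    # Dispatch on the runtime type of the input.
    kernel = _KERNEL_REGISTRY.get((vertical_flip, type(inpt)))
    if kernel is None:
        raise TypeError(f"no vertical_flip kernel for {type(inpt).__name__}")
    return kernel(inpt)


class FakeCvcudaTensor:  # stand-in for cvcuda.Tensor
    def __init__(self, rows):
        self.rows = rows


@register_kernel(vertical_flip, FakeCvcudaTensor)
def _vertical_flip_fake(image):
    # a real backend would call cvcuda.flip(image, flipCode=0) here
    return FakeCvcudaTensor(image.rows[::-1])


out = vertical_flip(FakeCvcudaTensor([1, 2, 3]))
print(out.rows)  # -> [3, 2, 1]
```

In the real code the registered kernel is additionally wrapped with `_cvcuda_shared_stream`, so dispatch and stream assignment compose transparently for the caller.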

Testing

As of right now, there is no testing strategy in place for this change. The naive approach would be to assert, via torch.cuda.synchronize(), that the CV-CUDA kernels do not block without this behavior and do block with it (through the higher-level functional API). An alternative could use torch.cuda.Event.
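A positive/negative check can also be made concrete without a GPU by stubbing the two stream APIs with unittest.mock and asserting that the wrapper threads the current torch stream through to CV-CUDA. A hedged sketch (the `current_stream`/`as_stream` names mirror the wrapper above; the dependency-injected `shared_stream` signature is a stand-in, not the real TorchVision code):

```python
import functools
from unittest import mock


def shared_stream(fn, torch_cuda, cvcuda):
    """Stand-in mirroring _cvcuda_shared_stream, written against injected
    modules so it can run without a GPU (illustrative only)."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        stream = torch_cuda.current_stream()
        with cvcuda.as_stream(stream):
            return fn(*args, **kwargs)

    return wrapper


torch_cuda = mock.Mock()
cvcuda = mock.MagicMock()  # MagicMock so as_stream(...) supports `with`
torch_stream = object()
torch_cuda.current_stream.return_value = torch_stream

kernel = shared_stream(lambda: "ok", torch_cuda, cvcuda)

# Positive: the cvcuda stream context was entered with the torch stream.
assert kernel() == "ok"
cvcuda.as_stream.assert_called_once_with(torch_stream)
cvcuda.as_stream.return_value.__enter__.assert_called_once()

# Negative: without the wrapper, cvcuda never sees the torch stream.
cvcuda.reset_mock()
(lambda: "ok")()
cvcuda.as_stream.assert_not_called()
print("ok")
```

A real GPU test would replace the mocks with actual streams and compare handles, as discussed below in the review thread.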

Feedback

I would love to get feedback on whether this change should be pursued and the testing strategy if this is behavior the team wants in TorchVision.

@pytorch-bot

pytorch-bot bot commented Dec 9, 2025


🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/vision/9308



@meta-cla meta-cla bot added the cla signed label Dec 9, 2025
@NicolasHug
Member

NicolasHug commented Dec 10, 2025

Thanks for the PR @justincdavis and for bringing this up. I'll have to think more, but this seems reasonable from a quick look.

Re testing, does CVCUDA expose an API to get the current stream it's working on, something like https://docs.pytorch.org/docs/stable/generated/torch.cuda.current_stream.html ? If it does, maybe a small test like this one would be enough

new_stream = torch.cuda.Stream()

def assert_cvcuda_is_using_torch_stream():
    assert cvcuda.Stream.current.handle == new_stream.cuda_stream

with torch.cuda.stream(new_stream):
    _cvcuda_shared_stream(assert_cvcuda_is_using_torch_stream)()

@justincdavis
Author

justincdavis commented Dec 10, 2025

Hi @NicolasHug, CV-CUDA does expose this! I added a simple positive/negative test that checks the handles of the two streams.
