Skip to content

Conversation

@stiepan
Copy link
Member

@stiepan stiepan commented Nov 21, 2025

Description

  1. Adds ccx.utils.StridedLayout for describing ndim-tensor layout (shape, strides in counts, itemsize).
    • Creating layout directly from shape/strides:
      StridedLayout(shape, strides, itemsize), StridedLayout(a.shape, a.strides, a.itemsize, divide_strides=True)
    • Creating dense layout with specific stride order (C|F or permutation) StridedLayout.dense(shape, itemsize, stride_order)
    • Creating dense layout from another one: StridedLayout.dense_like and self.to_dense
    • Implements properties: ndim, shape, volume, strides, strides_in_bytes, stride_order, min_offset, max_offset, is_contiguous(c|f|any), is_unique
    • required_size_in_bytes method for required allocation size
    • Stride manipulation helpers for reshaping, (un)squeezing, permuting, flattening, repacking (changing itemsize, as in viewing float tenor as complex one), broadcasting, slicing.

From Python, StridedLayout is immutable, stride manipulation methods return a new instance. In Cython, to avoid temporary objects in a sequence of operations, layout manipulations methods can be run in place.
Please take a look at the StridedLayout docs for more details and examples.

  1. Enables wrapping external allocation into Buffer (Buffer.from_handle(ptr, owner=obj)). The owner and memory resource cannot be specified together. The owner reference is kept until the Buffer is closed. Without the memory resource, Buffer now queries driver for host/device accessibility and device_id of the ptr.

  2. StridedMemoryView uses now StridedLayout to represent the shape/strides.

    • For DLPack/CAI imported tensors, the layout is lazily created if needed.
    • There's a new class method from_buffer(buffer, layout, optional dtype) to create SMV from Buffer and StridedLayout. For example to implement empty_like() method for numpy array, but allocated on a device, one could:
    def device_tensor_like(a : numpy.ndarray, device : ccx.Device) -> StridedMemoryView:
        a_view = StridedMemoryView(a, -1)
        # get the original layout of ``a`` and convert it to a dense layout
        # to avoid overallocating memory (e.g. if the ``a`` was sliced)
        layout = a_view.layout.to_dense()
        # get the required size in bytes to fit the tensor
        required_size = layout.required_size_in_bytes()
        # allocate the memory on the device
        device.set_current()
        mem = device.allocate(required_size)
        # create a view on the newly allocated device memory
        b_view = StridedMemoryView.from_buffer(mem, layout, a_view.dtype)
        return b_view
    
    • The StridedMemoryView can be now exported via dlpack. (delayed for later)
    • The StridedMemoryView.copy_from, StridedMemoryView.copy_to allow copying data between views (in a follow-up PR).

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@copy-pr-bot
Copy link
Contributor

copy-pr-bot bot commented Nov 21, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


args_viewable_as_strided_memory

:template: dataclass.rst
Copy link
Member Author

@stiepan stiepan Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • dataclass.rst does not render methods.
  • class.rst omits cythonized properties

cyclass places attributes section just after the main class docstring. this way we can document the actual attributes at the end of the main docstring and they are followed by docstring of all the properties.

@stiepan stiepan changed the title Introduce strided layout memview feat: Introduce StridedLayout, support wrapping external allocations in Buffer, support creating StridedMemoryView from Buffer and StridedLayout, export SMV via dlpack. Nov 21, 2025
@stiepan stiepan changed the title feat: Introduce StridedLayout, support wrapping external allocations in Buffer, support creating StridedMemoryView from Buffer and StridedLayout, export SMV via dlpack. feat: Introduce StridedLayout, support wrapping external allocations in Buffer, add StridedMemoryView.from_buffer, export SMV via dlpack. Nov 21, 2025
@leofang leofang added triage Needs the team's attention feature New feature or request cuda.core Everything related to the cuda.core module labels Nov 24, 2025
Copy link
Member

@leofang leofang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking in EOD progress. I haven't reviewed layout/memoryview.

Also, I assume you're working migrating the tests?

dl_tensor = &dlm_tensor_ver.dl_tensor
dlm_tensor_ver.version.major = DLPACK_MAJOR_VERSION
dlm_tensor_ver.version.minor = DLPACK_MINOR_VERSION
cpython.Py_INCREF(buf)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to call decref in the except: branch.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cleaners helpers do so. I moved the cpython.Py_INCREF call next to assigning its address to manager_ctx, for the clearer "release" ptr semantics. If the manager_ctx is set, the DECREF needs to be called.

driver.CUpointer_attribute.CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
driver.CUpointer_attribute.CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL,
)
return driver.cuPointerGetAttributes(len(attrs), attrs, ptr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: cimport this from cydriver

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙃

I've actually had that this way initially, but seeing all the cdriver imports are gone from buffer, went along with the Python API. I can undig the previous variant.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, sorry. I think it's not "gone" gone, most likely @Andy-Jost found that we don't need many driver API calls in this file after the refactoring (#1205). But pointer attribute checks are in the hot path so we should cythonize it.

In fact, I am trying to catch up with what @fbusato is doing in C++ (NVIDIA/cccl#6733), which is an equivalent check (but for C++ mdspan instead of Python SMV).

Copy link
Member Author

@stiepan stiepan Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the reference! Looking at the logic in cccl, I adjusted managed memory "discovery". I am not sure if we need to go into so much details as trying to get particular memory pool and check if readability flag is set there, I did not add this, but can adjust if needed.

In any case, I moved back to cydriver API and added tests with host/device/managed/pinned from cuda malloc and pinned from cuda register.

For pinned memory, I am not 100% sure what happens on devices for which
CU_DEVICE_ATTRIBUTE_CAN_USE_HOST_POINTER_FOR_REGISTERED_MEM is false. I.e. if one registers host memory with cuda host register and passes the original pointer, while the pointer to access on device is different. I.e. would it be still memory_type = 0 or memory_type = host and what would be a desired is_device_accessible value.

#include <numeric>


#define STRIDED_LAYOUT_MAX_NDIM 32
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: is there any particular reason (apart from using 32-bit masks) we want to limit this to 32? There are libraries that use a higher limit, for example cuTENSOR (@yangcal would remember what the current limit is).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially, precompiling the copy kernels was on the table, so sticking to 32-ndim was more appealing. Now, I don't think there's anything preventing us from going to 64-dims. Especially, if there's really a need for that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Signed-off-by: Kamil Tokarski <ktokarski@nvidia.com>
…pack export

Signed-off-by: Kamil Tokarski <ktokarski@nvidia.com>
Signed-off-by: Kamil Tokarski <ktokarski@nvidia.com>
Signed-off-by: Kamil Tokarski <ktokarski@nvidia.com>
…n in reshape, fix to dense with sliced views

Signed-off-by: Kamil Tokarski <ktokarski@nvidia.com>
Signed-off-by: Kamil Tokarski <ktokarski@nvidia.com>
Signed-off-by: Kamil Tokarski <ktokarski@nvidia.com>
Signed-off-by: Kamil Tokarski <ktokarski@nvidia.com>
Signed-off-by: Kamil Tokarski <ktokarski@nvidia.com>
@stiepan stiepan force-pushed the introduce_strided_layout_memview branch from c238c1a to dc27268 Compare November 26, 2025 18:06
@stiepan stiepan marked this pull request as ready for review November 26, 2025 18:16
@stiepan stiepan changed the title feat: Introduce StridedLayout, support wrapping external allocations in Buffer, add StridedMemoryView.from_buffer, export SMV via dlpack. feat: Introduce StridedLayout, support wrapping external allocations in Buffer, add StridedMemoryView.from_buffer Nov 26, 2025
@stiepan
Copy link
Member Author

stiepan commented Nov 27, 2025

The dlpack fix moved out of this PR is here #1291

Signed-off-by: Kamil Tokarski <ktokarski@nvidia.com>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@leofang,

NB.

from cuda.bindings import driver
driver.cuPointerGetAttributes(1, [driver.CUpointer_attribute.CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL], 0)

returns

(<CUresult.CUDA_SUCCESS: 0>, [4294967294])
while the actual value returned by the underlying API is -2. Looks like some unsigned/signed type mismatch.

… add tests for the attributes

Signed-off-by: Kamil Tokarski <ktokarski@nvidia.com>
Signed-off-by: Kamil Tokarski <ktokarski@nvidia.com>
Signed-off-by: Kamil Tokarski <ktokarski@nvidia.com>
Signed-off-by: Kamil Tokarski <ktokarski@nvidia.com>
Signed-off-by: Kamil Tokarski <ktokarski@nvidia.com>
Signed-off-by: Kamil Tokarski <ktokarski@nvidia.com>
Signed-off-by: Kamil Tokarski <ktokarski@nvidia.com>
@leofang
Copy link
Member

leofang commented Nov 30, 2025

Let's kick off CI

@leofang
Copy link
Member

leofang commented Nov 30, 2025

/ok to test 3a904e7

@github-actions
Copy link

Signed-off-by: Kamil Tokarski <ktokarski@nvidia.com>
@stiepan
Copy link
Member Author

stiepan commented Dec 1, 2025

A single test case failed - testing pointer attributes for host memory "manually pinned" with cuMemHostRregister. It failed on pre-condition assert that the memory is not device accessible before it is registered. And it failed in a second of two cases testing this. The test did not clean-up properly - it was missing unregister call. I am guessing we ended up with the same pointer in the second case. The 38ddb36 should fix that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cuda.core Everything related to the cuda.core module feature New feature or request triage Needs the team's attention

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants