# Overview

## What is `cuda.core`?

`cuda.core` provides a Pythonic interface to the CUDA runtime and other functionality,
including:

- Compiling and launching CUDA kernels
- Asynchronous concurrent execution with CUDA graphs, streams, and events
- Coordinating work across multiple CUDA devices
- Allocating, transferring, and managing device memory
- Runtime linking of device code with Link-Time Optimization (LTO)
- and much more!

Rather than providing 1:1 equivalents of the CUDA driver and runtime APIs
(for that, see [`cuda.bindings`][bindings]), `cuda.core` provides high-level constructs such as:

- {class}`Device <cuda.core.experimental.Device>` class for GPU device operations and context management.
- {class}`Buffer <cuda.core.experimental.Buffer>` and {class}`MemoryResource <cuda.core.experimental.MemoryResource>` classes for memory allocation and management.
- {class}`Program <cuda.core.experimental.Program>` for JIT compilation of CUDA kernels.
- {class}`GraphBuilder <cuda.core.experimental.GraphBuilder>` for building and executing CUDA graphs.
- {class}`Stream <cuda.core.experimental.Stream>` and {class}`Event <cuda.core.experimental.Event>` for asynchronous execution and timing (see the timing sketch after this list).

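To give a taste of the latter, events recorded on a stream can be used to time asynchronous GPU work. Below is a minimal sketch; it assumes that an {class}`EventOptions <cuda.core.experimental.EventOptions>` with `enable_timing=True` can be passed to {meth}`Stream.record() <cuda.core.experimental.Stream.record>`, and that subtracting two timing-enabled events yields the elapsed time in milliseconds (check the documentation for your version):

```python
from cuda.core.experimental import Device, EventOptions

dev = Device()
dev.set_current()
s = dev.create_stream()

opts = EventOptions(enable_timing=True)
start = s.record(options=opts)  # record a timing-enabled event
# ... enqueue kernels or memory copies on stream `s` here ...
end = s.record(options=opts)
end.sync()  # block the host until `end` is reached

# Assumption: subtracting two timing-enabled events gives milliseconds
print(f"elapsed: {end - start} ms")
```
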
## Example: Compiling and Launching a CUDA Kernel

To get a taste of `cuda.core`, let's walk through a simple example that compiles and launches a vector addition kernel.
You can find the complete example in [`vector_add.py`][vector_add_example].

First, we define a string containing the CUDA C++ kernel. Note that this is a templated kernel written as a grid-stride loop, so each thread can process more than one element:

```python
# compute c = a + b
code = """
template<typename T>
__global__ void vector_add(const T* A,
                           const T* B,
                           T* C,
                           size_t N) {
    const unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    // Grid-stride loop: each thread handles elements tid, tid + stride, ...
    for (size_t i=tid; i<N; i+=gridDim.x*blockDim.x) {
        C[i] = A[i] + B[i];
    }
}
"""
```

Next, we create a {class}`Device <cuda.core.experimental.Device>` object
and a corresponding {class}`Stream <cuda.core.experimental.Stream>`.
Don't forget to call {meth}`Device.set_current() <cuda.core.experimental.Device.set_current>`,
which makes the device's CUDA context current on the calling thread!

```python
from cuda.core.experimental import Device, LaunchConfig, Program, ProgramOptions, launch

dev = Device()
dev.set_current()
s = dev.create_stream()
```

Next, we compile the CUDA C++ kernel from earlier to a CUBIN using the {class}`Program <cuda.core.experimental.Program>` class.
Because C++ templates are only instantiated on demand, we pass the `name_expressions` parameter to the {meth}`Program.compile() <cuda.core.experimental.Program.compile>` method to specify which kernel template instantiations to compile:

```python
arch = "".join(f"{i}" for i in dev.compute_capability)
program_options = ProgramOptions(std="c++17", arch=f"sm_{arch}")
prog = Program(code, code_type="c++", options=program_options)
mod = prog.compile("cubin", name_expressions=("vector_add<float>",))
```
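
As an aside, the same program can target other output types. A minimal sketch, assuming `"ptx"` is among the `target_type` values supported by your `cuda.core` version:

```python
# Assumption: "ptx" is a supported target_type for Program.compile;
# the resulting PTX can be JIT-compiled by the driver at load time.
ptx_mod = prog.compile("ptx", name_expressions=("vector_add<float>",))
```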

Next, we retrieve the compiled kernel from the CUBIN and prepare the arguments and launch configuration.
We're using [CuPy][cupy] arrays as inputs for this example, but you can use PyTorch tensors too
(we show how to do this in one of our [examples][examples]).

```python
import cupy as cp

ker = mod.get_kernel("vector_add<float>")

# Prepare input/output arrays (using CuPy)
rng = cp.random.default_rng()
size = 50000
a = rng.random(size, dtype=cp.float32)
b = rng.random(size, dtype=cp.float32)
c = cp.empty_like(a)

# Configure launch parameters: enough blocks to cover all `size` elements
block = 256
grid = (size + block - 1) // block
config = LaunchConfig(grid=grid, block=block)
```

Finally, we use the {func}`launch <cuda.core.experimental.launch>` function to execute our kernel on the specified stream with the given configuration and arguments.
Note the use of `.data.ptr` to get the pointer to the array data.

```python
# The trailing argument matches the kernel's size_t N parameter (uint64 on 64-bit platforms)
launch(s, config, ker, a.data.ptr, b.data.ptr, c.data.ptr, cp.uint64(size))
s.sync()
```
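
After the stream is synchronized, the result is ready on the device. A quick sanity check, mirroring the complete [`vector_add.py`][vector_add_example] example:

```python
# c should now hold the elementwise sum computed by the kernel
assert cp.allclose(c, a + b)
```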

This example demonstrates one of the core workflows enabled by `cuda.core`: compiling and launching CUDA code.
Note the clean, Pythonic interface, and the absence of any direct calls to the CUDA runtime/driver APIs!

## Examples and Recipes

As we mentioned before, `cuda.core` can do much more than just compile and launch kernels!

The best way to explore and learn the different features of `cuda.core` is through
our [`examples`][examples]. Find one that matches your use case, and modify it to fit your needs!

[bindings]: https://nvidia.github.io/cuda-python/cuda-bindings/latest/
[cai]: https://numba.readthedocs.io/en/stable/cuda/cuda_array_interface.html
[cupy]: https://cupy.dev/
[dlpack]: https://dmlc.github.io/dlpack/latest/
[examples]: https://github.com/NVIDIA/cuda-python/tree/main/cuda_core/examples
[vector_add_example]: https://github.com/NVIDIA/cuda-python/tree/main/cuda_core/examples/vector_add.py