
Commit dc63296

Add getting started page
1 parent f161652 commit dc63296

File tree

3 files changed: +128 -7 lines changed


cuda_core/docs/source/api.rst

Lines changed: 2 additions & 0 deletions
@@ -21,6 +21,8 @@ CUDA runtime
    GraphBuilder
    launch
    Buffer
+   Stream
+   Event
    MemoryResource
    DeviceMemoryResource
    LegacyPinnedMemoryResource
Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@

# Overview

## What is `cuda.core`?

`cuda.core` provides a Pythonic interface to the CUDA runtime and other functionality,
including:

- Compiling and launching CUDA kernels
- Asynchronous concurrent execution with CUDA graphs, streams, and events
- Coordinating work across multiple CUDA devices
- Allocating, transferring, and managing device memory
- Runtime linking of device code with Link-Time Optimization (LTO)
- and much more!

Rather than providing 1:1 equivalents of the CUDA driver and runtime APIs
(for that, see [`cuda.bindings`][bindings]), `cuda.core` provides high-level constructs such as:

- {class}`Device <cuda.core.experimental.Device>` class for GPU device operations and context management.
- {class}`Buffer <cuda.core.experimental.Buffer>` and {class}`MemoryResource <cuda.core.experimental.MemoryResource>` classes for memory allocation and management.
- {class}`Program <cuda.core.experimental.Program>` for JIT compilation of CUDA kernels.
- {class}`GraphBuilder <cuda.core.experimental.GraphBuilder>` for building and executing CUDA graphs.
- {class}`Stream <cuda.core.experimental.Stream>` and {class}`Event <cuda.core.experimental.Event>` for asynchronous execution and timing.

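To make these constructs a bit more concrete, here is a minimal sketch of the `Device`/`Buffer` side of the API. It assumes `Device.allocate()` and `Buffer.close()` behave as listed in the API reference; check there for the exact signatures:

```python
from cuda.core.experimental import Device

dev = Device()        # handle to the default CUDA device
dev.set_current()     # make it the active device for this thread
s = dev.create_stream()

# Ask the device's current memory resource for a 1 MiB buffer
buf = dev.allocate(1 << 20, stream=s)
print(f"allocated {buf.size} bytes on device {dev.device_id}")

buf.close()  # return the memory to the memory resource
s.sync()
```
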
## Example: Compiling and Launching a CUDA kernel

To get a taste for `cuda.core`, let's walk through a simple example that compiles and launches a vector addition kernel.
You can find the complete example in [`vector_add.py`][vector_add_example].

First, we define a string containing the CUDA C++ kernel. Note that this is a templated kernel:

```python
# compute c = a + b
code = """
template<typename T>
__global__ void vector_add(const T* A,
                           const T* B,
                           T* C,
                           size_t N) {
    const unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (size_t i=tid; i<N; i+=gridDim.x*blockDim.x) {
        C[i] = A[i] + B[i];
    }
}
"""
```

Next, we create a {class}`Device <cuda.core.experimental.Device>` object
and a corresponding {class}`Stream <cuda.core.experimental.Stream>`.
Don't forget to use {meth}`Device.set_current() <cuda.core.experimental.Device.set_current>`!

```python
from cuda.core.experimental import Device, LaunchConfig, Program, ProgramOptions, launch

dev = Device()
dev.set_current()
s = dev.create_stream()
```

Next, we compile the CUDA C++ kernel from earlier using the {class}`Program <cuda.core.experimental.Program>` class.
The result of the compilation is saved as a CUBIN.
Note the use of the `name_expressions` parameter to the {meth}`Program.compile() <cuda.core.experimental.Program.compile>` method to specify which kernel template instantiations to compile:

```python
arch = "".join(f"{i}" for i in dev.compute_capability)
program_options = ProgramOptions(std="c++17", arch=f"sm_{arch}")
prog = Program(code, code_type="c++", options=program_options)
mod = prog.compile("cubin", name_expressions=("vector_add<float>",))
```

Next, we retrieve the compiled kernel from the CUBIN and prepare the arguments and kernel configuration.
We're using [CuPy][cupy] arrays as inputs for this example, but you can use PyTorch tensors too
(we show how to do this in one of our [examples][examples]).

```python
import cupy as cp

ker = mod.get_kernel("vector_add<float>")

# Prepare input/output arrays (using CuPy)
rng = cp.random.default_rng()
size = 50000
a = rng.random(size, dtype=cp.float32)
b = rng.random(size, dtype=cp.float32)
c = cp.empty_like(a)

# Configure launch parameters
block = 256
grid = (size + block - 1) // block
config = LaunchConfig(grid=grid, block=block)
```

Finally, we use the {func}`launch <cuda.core.experimental.launch>` function to execute our kernel on the specified stream with the given configuration and arguments. Note the use of `.data.ptr` to get the pointer to the array data.

```python
launch(s, config, ker, a.data.ptr, b.data.ptr, c.data.ptr, cp.uint64(size))
s.sync()
```

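As a quick sanity check (optional, and using only the CuPy arrays defined above), we can compare the kernel's output against CuPy's own element-wise addition:

```python
# Verify the kernel result against CuPy's element-wise add
assert cp.allclose(c, a + b)
```
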
This example demonstrates one of the core workflows enabled by `cuda.core`: compiling and launching CUDA code.
Note the clean, Pythonic interface, and the absence of any direct calls to the CUDA runtime/driver APIs!

## Examples and Recipes

As we mentioned before, `cuda.core` can do much more than just compile and launch kernels!

The best way to explore and learn the different features of `cuda.core` is through
our [`examples`][examples]. Find one that matches your use case, and modify it to fit your needs!

[bindings]: https://nvidia.github.io/cuda-python/cuda-bindings/latest/
[cai]: https://numba.readthedocs.io/en/stable/cuda/cuda_array_interface.html
[cupy]: https://cupy.dev/
[dlpack]: https://dmlc.github.io/dlpack/latest/
[examples]: https://github.com/NVIDIA/cuda-python/tree/main/cuda_core/examples
[vector_add_example]: https://github.com/NVIDIA/cuda-python/tree/main/cuda_core/examples/vector_add.py

cuda_core/docs/source/index.rst

Lines changed: 14 additions & 7 deletions
@@ -1,23 +1,30 @@
 .. SPDX-License-Identifier: Apache-2.0
 
-``cuda.core``: Pythonic access to CUDA core functionalities
-===========================================================
+``cuda.core``: Pythonic access to CUDA core functionality
+=========================================================
 
-The new Python module ``cuda.core`` offers idiomatic, pythonic access to CUDA runtime
-and other functionalities.
+Welcome to the documentation for ``cuda.core``.
 
 .. toctree::
    :maxdepth: 2
    :caption: Contents:
 
-   release
-   install.md
+   getting-started
+   install
    interoperability
    api
    contribute
-   conduct.md
+
+.. toctree::
+   :maxdepth: 1
+
+   conduct
    license
 
+.. toctree::
+   :maxdepth: 2
+
+   release
 
 Indices and tables
 ==================
