| 1 | +# CUDA and ROCm Feature Guide (Living Document) |
| 2 | + |
| 3 | +Last updated: 2025-09-22 |
| 4 | + |
| 5 | +This guide summarizes current, officially documented features of NVIDIA CUDA and AMD ROCm that we leverage across this project. It is designed to be easy to maintain as new versions ship. Where possible, we link to authoritative sources instead of restating volatile details. |
| 6 | + |
| 7 | +Tip: Prefer the linked release notes and programming guides for exact, version-specific behavior. The update checklist is at the end of this document. |
| 8 | + |
| 9 | +--- |
| 10 | + |
| 11 | +## Current Versions at a Glance |
| 12 | + |
| 13 | +- CUDA: 13.0 Update 1 (13.0.U1) |
| 14 | + - Source of truth: NVIDIA CUDA Toolkit Release Notes |
| 15 | + - Driver requirement overview: CUDA Compatibility Guide for Drivers |
| 16 | +- ROCm: 7.0.1 |
| 17 | + - Source of truth: ROCm Release History and ROCm docs index |
| 18 | + |
| 19 | +Reference links are provided at the bottom for maintenance. |
| 20 | + |
| 21 | +--- |
| 22 | + |
| 23 | +## CUDA 13.x overview |
| 24 | + |
| 25 | +Highlights pulled from NVIDIA’s official docs (see links): |
| 26 | + |
| 27 | +- General platform |
| 28 | + - CUDA 13.x is ABI-stable within the major series and requires an r580+ driver on Linux. |
| 29 | + - Increased MPS server client limits on Ampere and newer architectures (subject to architectural limits). |
| 30 | +- Compiler and runtime |
| 31 | + - NVCC/NVRTC updates; PTX ISA updates (see PTX 9.0 notes in release docs). |
| 32 | + - Programmatic Dependent Launch (PDL) support in select library kernels on sm_90+. |
| 33 | +- Developer tools |
| 34 | + - Nsight Systems and Nsight Compute continue as the primary profilers. |
| 35 | + - Compute Sanitizer updates; Visual Profiler and nvprof are removed in 13.0. |
| 36 | +- Deprecations and removals |
| 37 | + - Dropped offline compilation/library support for pre-Turing architectures (Maxwell, Pascal, Volta) in CUDA 13.0. Continue to use 12.x to target these. |
| 38 | + - Windows Toolkit no longer bundles a display driver (install separately). |
| 39 | + - Removed multi-device cooperative group launch APIs; several legacy headers removed. |
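The architecture cutoff above can be expressed as a small check. This is a minimal sketch, not an NVIDIA API: the function name and constant are hypothetical, and it assumes the device is Maxwell or newer (older parts are not covered by 12.x either). It relies only on the fact that Turing is compute capability 7.5, while Maxwell (5.x), Pascal (6.x), and Volta (7.0) fall below it.

```python
# Hypothetical helper for illustration; not part of any NVIDIA tool or library.

# First compute capability still targeted by CUDA 13.0 (Turing = 7.5).
MIN_CC_FOR_CUDA_13 = (7, 5)


def toolkit_series_for(compute_capability: str) -> str:
    """Return the CUDA major series to use for a device, e.g. "7.5" -> "13.x".

    Assumes a Maxwell-or-newer device; anything below Turing maps to 12.x.
    """
    major_str, _, minor_str = compute_capability.partition(".")
    cc = (int(major_str), int(minor_str or 0))
    return "13.x" if cc >= MIN_CC_FOR_CUDA_13 else "12.x"
```

For example, `toolkit_series_for("6.0")` (a Pascal part) selects the 12.x series, while `toolkit_series_for("9.0")` (Hopper) selects 13.x.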
| 40 | + |
| 41 | +Architectures and typical use cases (non-exhaustive): |
| 42 | + |
| 43 | +- Blackwell/Blackwell Ultra (SM100/SM103 data center parts; SM120 for RTX 50): next‑gen AI/HPC; FP4/FP8 workflows via libraries. |
| 44 | +- Hopper (H100/H200, SM90): transformer engine, thread block clusters, DPX; AI training/HPC. |
| 45 | +- Ada (RTX 40): workstation/development; AV1 encode; content creation/AI dev. |
| 46 | +- Ampere (A100/RTX 30): MIG, 3rd‑gen tensor cores; research/mixed workloads. |
| 47 | + |
| 48 | +Core libraries snapshot (examples; see library release notes for specifics): |
| 49 | + |
| 50 | +- cuBLAS/cuBLASLt: autotuning options; improvements on newer architectures; mixed precision and block‑scaled formats. |
| 51 | +- cuFFT: new error codes; performance changes; dropped pre‑Turing support. |
| 52 | +- cuSPARSE: generic API enhancements; 64‑bit indices in SpGEMM; various bug fixes. |
| 53 | +- Math/NPP/nvJPEG: targeted perf/accuracy improvements and API cleanups. |
| 54 | + |
| 55 | +Authoritative references: |
| 56 | + |
| 57 | +- CUDA Toolkit Release Notes (13.0 U1) |
| 58 | +- CUDA Compatibility Guide for Drivers |
| 59 | +- Nsight Systems Release Notes; Nsight Compute Release Notes |
| 60 | +- CUDA C++ Programming Guide changelog |
| 61 | + |
| 62 | +--- |
| 63 | + |
| 64 | +## ROCm 7.0.x overview |
| 65 | + |
| 66 | +Highlights from AMD’s official docs (see links): |
| 67 | + |
| 68 | +- ROCm 7.0.1 is the latest as of 2025‑09‑17; consult the release history for point updates. |
| 69 | +- HIP as the primary programming model, with CUDA‑like APIs and HIP‑Clang toolchain. |
| 70 | +- Windows support targets HIP SDK for development; full ROCm stack targets Linux. |
| 71 | +- ROCm Libraries monorepo: multiple core math and support libraries are consolidated in the ROCm Libraries monorepo for unified CI/build. Projects included (as of rocm‑7.0.1): composablekernel, hipblas, hipblas-common, hipblaslt, hipcub, hipfft, hiprand, hipsolver, hipsparse, hipsparselt, miopen, rocblas, rocfft, rocprim, rocrand, rocsolver, rocsparse, rocthrust. Shared components: rocroller, tensile, mxdatagenerator. Most of these are marked “Completed” in the monorepo migration status and the monorepo is the source of truth; see its README for current status. |
| 72 | +- Tooling and system components: ROCr runtime, ROCm SMI, rocprof/rocprofiler, rocgdb/rocm‑debug‑agent. |
| 73 | + |
| 74 | +Nomenclature: project names in the monorepo are standardized to match released package names (for example, hipblas/hipfft/rocsparse instead of mixed casing). |
| 75 | + |
| 76 | +Architectures (illustrative, not exhaustive): |
| 77 | + |
| 78 | +- CDNA3 (MI300 family): AI training and HPC; unified memory on APUs (MI300A), large HBM configs (MI300X). |
| 79 | +- RDNA3 (Radeon 7000 series): workstation/gaming; AV1 encode/decode; hardware ray tracing. |
| 80 | + |
| 81 | +Common libraries (see ROCm Libraries reference and monorepo): |
| 82 | + |
| 83 | +- BLAS/solver/sparse: rocBLAS / hipBLAS, hipBLASLt, rocSOLVER / hipSOLVER, rocSPARSE / hipSPARSE, hipSPARSELt. |
| 84 | +- FFT/random/core: rocFFT / hipFFT, rocRAND / hipRAND, rocPRIM / hipCUB, rocThrust. |
| 85 | +- Kernel building blocks: composablekernel; shared dependencies like Tensile and rocRoller (used by rocBLAS/hipBLASLt). |
| 86 | +- ML/DL: MIOpen; framework integrations via the ROCm for AI guide. |
| 87 | + |
| 88 | +Authoritative references: |
| 89 | + |
| 90 | +- ROCm Docs index (What is ROCm?, install, reference) |
| 91 | +- ROCm Release History (7.0.1, 7.0.0, …) |
| 92 | +- ROCm libraries reference; tools/compilers/runtimes reference |
| 93 | +- ROCm Libraries monorepo (status, structure, releases): https://github.com/ROCm/rocm-libraries |
| 94 | + |
| 95 | +--- |
| 96 | + |
| 97 | +## Cross‑platform mapping (CUDA ⇄ HIP) |
| 98 | + |
| 99 | +Quick mapping for common concepts. Always check specific APIs for support and behavior differences. |
| 100 | + |
| 101 | +- Kernel launch |
| 102 | + - CUDA: kernel<<<grid, block, shared, stream>>>(args); HIP: the same <<<...>>> syntax under hip‑clang, or the hipLaunchKernelGGL macro |
| 103 | +- Memory management |
| 104 | + - CUDA: cudaMalloc/cudaMemcpy/etc.; HIP: hipMalloc/hipMemcpy/etc. |
| 105 | +- Streams and events |
| 106 | + - CUDA: cudaStream_t/cudaEvent_t; HIP: hipStream_t/hipEvent_t |
| 107 | +- Graphs |
| 108 | + - CUDA: cudaGraph_t and cudaGraphExec_t; HIP: hipGraph_t and hipGraphExec_t; HIP feature coverage evolves, so verify against the ROCm docs. |
| 109 | +- Cooperative groups |
| 110 | + - CUDA: cooperative_groups.h; HIP: hip/hip_cooperative_groups.h; multi‑device variants differ (and some CUDA multi‑device APIs were removed in 13.0). |
| 111 | +- Libraries |
| 112 | + - cuBLAS ↔ hipBLAS/rocBLAS; cuFFT ↔ hipFFT/rocFFT; cuSPARSE ↔ hipSPARSE/rocSPARSE; Thrust/CUB ↔ rocThrust/hipCUB/rocPRIM. |
| 113 | + |
| 114 | +Porting aids: |
| 115 | + |
| 116 | +- HIPIFY (hipify‑perl, hipify‑clang) for source translation; hip‑clang for compilation. |
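To make the mapping above concrete, here is a toy sketch of the simplest kind of rename involved. It is not how the HIPIFY tools work internally and is no substitute for them; the table and function name are hypothetical and cover only identifiers named in the list above.

```python
import re

# Toy illustration only: real ports should use the HIPIFY tools, which handle
# far more than identifier renames. Table limited to names from this guide.
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaStream_t": "hipStream_t",
    "cudaEvent_t": "hipEvent_t",
    "cudaGraph_t": "hipGraph_t",
}

# Match whole identifiers only, so names outside the table (for example
# cudaMemcpyAsync) are left untouched rather than partially rewritten.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, CUDA_TO_HIP)) + r")\b")


def translate(source: str) -> str:
    """Replace known CUDA identifiers with their HIP counterparts."""
    return _PATTERN.sub(lambda m: CUDA_TO_HIP[m.group(1)], source)
```

Anything the table does not list passes through unchanged, which is a reminder that every ported call still needs checking against the HIP documentation.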
| 117 | + |
| 118 | +--- |
| 119 | + |
| 120 | +## Compatibility and supported platforms |
| 121 | + |
| 122 | +- CUDA drivers and OS |
| 123 | + - See the CUDA Compatibility Guide for minimum driver versions by toolkit series (e.g., 13.x requires r580+ on Linux). Starting with 13.0, the Windows Toolkit installer no longer bundles a display driver; install it separately. |
| 124 | +- CUDA architectures |
| 125 | + - 13.0 drops offline compilation/library support for Maxwell/Pascal/Volta; continue to use 12.x for those targets. |
| 126 | +- ROCm OS/GPU support |
| 127 | + - See ROCm install guides and GPU/accelerator support references for Linux and Windows HIP SDK system requirements. |
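A setup script can check the driver floor before building. The sketch below is illustrative: the function and table names are hypothetical, it compares only the driver's major branch, and the r580 value is taken from this guide, so re-check it against the CUDA Compatibility Guide.

```python
import re

# Minimum Linux driver branch per CUDA major series, as stated in this guide.
# Assumption: comparing the major branch only is precise enough for a preflight check.
MIN_LINUX_DRIVER_MAJOR = {"13": 580}


def driver_supports(toolkit_major: str, driver_version: str) -> bool:
    """True if the driver's major branch meets the minimum for the toolkit series."""
    match = re.match(r"\s*(\d+)", driver_version)
    if match is None:
        raise ValueError(f"unrecognized driver version: {driver_version!r}")
    return int(match.group(1)) >= MIN_LINUX_DRIVER_MAJOR[toolkit_major]
```

The version string can be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and passed in as `driver_version`.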
| 128 | + |
| 129 | +--- |
| 130 | + |
| 131 | +## Educational integration (this repository) |
| 132 | + |
| 133 | +This course demonstrates both CUDA and HIP across modules. Key tool updates to note: |
| 134 | + |
| 135 | +- Profiling and analysis |
| 136 | + - NVIDIA: Nsight Systems, Nsight Compute, CUPTI changes in 13.x, Compute Sanitizer |
| 137 | + - AMD: rocprof/rocprofiler, ROCm SMI |
| 138 | +- Memory and graphs |
| 139 | + - CUDA: CUDA Graphs; memory pools and VMM; asynchronous copy |
| 140 | + - ROCm: HIP graph APIs (coverage evolves); ROCr runtime memory features |
| 141 | + |
| 142 | +Example module alignment (indicative; see each module’s README for details): |
| 143 | + |
| 144 | +- Module 1: Runtime APIs, device queries, build/tooling |
| 145 | +- Module 2: Memory management (device, pinned, unified/coherent where available) |
| 146 | +- Module 3: Synchronization and cooperation (warp/wavefront‑level, cooperative groups) |
| 147 | +- Module 4: Streams, events, graphs, and multi‑GPU basics |
| 148 | +- Module 5: Profiling and debugging (Nsight Tools, Compute Sanitizer, rocprof, rocm‑smi) |
| 149 | +- Module 6+: Libraries (BLAS/FFT/SPARSE) and domain examples (AI/HPC) |
| 150 | + |
| 151 | +### New features by module (CUDA 13.x and ROCm 7.0.x) |
| 152 | + |
| 153 | +| Module | CUDA (what you’ll learn) | ROCm/HIP (what you’ll learn) | |
| 154 | +|---|---|---| |
| 155 | +| Module 1: Getting Started | Toolchain (nvcc), project layout, kernel launch basics (grid/block/thread indexing), device vs host code, cudaMalloc/cudaMemcpy, device query and error handling | Toolchain (hipcc/hip-clang), hipLaunchKernelGGL, hipMalloc/hipMemcpy, hipGetDeviceProperties, mapping CUDA concepts to HIP | |
| 156 | +| Module 2: Memory & Data Movement | Global/shared/constant/texture memory usage, coalesced access, pinned memory, unified memory and prefetch, async copies and measuring bandwidth | HIP memory APIs and ROCr memory model, pinned host buffers, unified/coherent memory notes, async transfers, using rocm-smi/rocprof to observe bandwidth | |
| 157 | +| Module 3: Parallel Patterns & Sync | Reductions, scans, sorting; warp-level primitives; cooperative groups; shared memory tiling; atomics and barriers; occupancy considerations | rocPRIM/hipCUB/rocThrust equivalents; wavefront-level ops; HIP cooperative groups; LDS usage; atomics and synchronization semantics | |
| 158 | +| Module 4: Concurrency, Streams & Multi‑GPU | Streams/events, priorities, CUDA Graphs (capture/instantiate/launch), peer-to-peer (UVA/P2P), basic multi‑GPU patterns | hipStream/hipEvent, HIP Graph API coverage and usage, peer access where supported, multi‑GPU fundamentals with ROCm tools | |
| 159 | +| Module 5: Profiling, Debugging & Sanitizers | Nsight Systems (timeline/tracing), Nsight Compute (kernel analysis), Compute Sanitizer (racecheck/memcheck), intro to CUPTI-based profiling | rocprof/rocprofiler for traces and metrics, rocm-smi telemetry, rocgdb/ROCm Debug Agent basics, best practices for profiling | |
| 160 | +| Module 6: Math & Core Libraries | cuBLAS/cuBLASLt (GEMM, batched ops, mixed precision), cuFFT, cuSPARSE, Thrust/CUB algorithms, choosing/tuning library routines | rocBLAS/hipBLAS, rocFFT/hipFFT, rocSPARSE/hipSPARSE, rocThrust/hipCUB/rocPRIM; Tensile-backed tuning in rocBLAS; API parity tips | |
| 161 | +| Module 7: Advanced Algorithms & Optimization | Tiling and cache use, shared memory bank conflicts, cooperative groups for complex patterns, intro to memory pools/VMM, kernel fusion patterns | Wavefront-aware tuning, LDS patterns, rocPRIM building blocks, HIP-specific perf tips, memory behavior across devices | |
| 162 | +| Module 8: AI/ML Workflows | cuDNN basics, TensorRT concepts (dynamic shapes/precision), mixed precision (FP16/BF16/FP8 via libs), graphs for inference pipelines | MIOpen basics, framework setup on ROCm (PyTorch/TF where supported), MIGraphX or framework runtimes, mixed precision support | |
| 163 | +| Module 9: Packaging, Deployment & Containers | CUDA containers (base/runtime/devel), driver/runtime compatibility, minimal deployment artifacts, reproducible builds | ROCm container bases (rocm/dev), runtime setup (kernel modules, groups/permissions), compatibility guidance and reproducibility | |
| 164 | + |
| 165 | +--- |
| 166 | + |
| 167 | +## Maintenance: how to update this document |
| 168 | + |
| 169 | +When CUDA or ROCm releases a new version, follow this checklist: |
| 170 | + |
| 171 | +1) Update versions at the top |
| 172 | + - CUDA: consult the CUDA Toolkit Release Notes page; record the latest major.minor and update level (e.g., 13.0 Update 1) and the driver requirements. |
| 173 | + - ROCm: consult ROCm Release History; record latest (e.g., 7.0.1). |
| 174 | +2) Scan notable changes |
| 175 | + - CUDA: skim “New Features”, “Deprecated or Dropped Features”, and library sections (cuBLAS/cuFFT/…); note any course‑impacting changes. |
| 176 | + - ROCm: skim “What is ROCm?”, “ROCm libraries”, and “Tools/Compilers/Runtimes” sections for new features or renamed packages. |
| 177 | +3) Verify cross‑platform notes |
| 178 | + - Confirm HIP Graph API coverage and any caveats; update mapping if needed. |
| 179 | +4) Update references |
| 180 | + - Keep the link reference list (below) current; avoid copying long tables and link out to authoritative docs instead. |
| 181 | +5) Record the date in “Last updated”. |
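Step 5 is easy to forget, so it may be worth scripting. The following is a minimal sketch that assumes the document carries exactly one line of the form `Last updated: YYYY-MM-DD`; the function name is hypothetical.

```python
import re
from datetime import date

# Matches the whole "Last updated: YYYY-MM-DD" line (trailing spaces allowed).
_LAST_UPDATED = re.compile(
    r"^(Last updated: )\d{4}-\d{2}-\d{2}[ \t]*$", re.MULTILINE
)


def bump_last_updated(text: str, today: date) -> str:
    """Rewrite the 'Last updated' line to the given date, leaving all else intact."""
    new_text, count = _LAST_UPDATED.subn(
        lambda m: m.group(1) + today.isoformat(), text
    )
    if count != 1:
        raise ValueError(f"expected exactly one 'Last updated' line, found {count}")
    return new_text
```

Raising when the line is missing or duplicated keeps the script from silently doing nothing after a heading or format change.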
| 182 | + |
| 183 | +Tip: Avoid claiming specific percentage speedups unless you include a citation. Prefer phrasing like “performance improvements in X; see release notes.” |
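The tip above can be enforced with a rough lint pass. This is a heuristic sketch under stated assumptions: a "citation" is taken to mean a URL or Markdown link on the same line, and the function name is hypothetical.

```python
import re

_PERCENT = re.compile(r"\d+(?:\.\d+)?\s?%")   # e.g. "30%", "12.5 %"
_LINK = re.compile(r"https?://|\]\(")         # bare URL or Markdown link target


def uncited_percentages(text: str) -> list[tuple[int, str]]:
    """Return (line number, line) pairs that state a percentage without a link."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if _PERCENT.search(line) and not _LINK.search(line)
    ]
```

Flagged lines are candidates for review rather than errors; a percentage may be legitimate if the citation sits on an adjacent line.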
| 184 | + |
| 185 | +--- |
| 186 | + |
| 187 | +## Reference links (authoritative sources) |
| 188 | + |
| 189 | +- NVIDIA |
| 190 | + - CUDA Toolkit Release Notes: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html |
| 191 | + - CUDA Compatibility Guide (drivers): https://docs.nvidia.com/deploy/cuda-compatibility/index.html |
| 192 | + - CUDA C++ Programming Guide (changelog): https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#changelog |
| 193 | + - Nsight Systems Release Notes: https://docs.nvidia.com/nsight-systems/ReleaseNotes/index.html |
| 194 | + - Nsight Compute Release Notes: https://docs.nvidia.com/nsight-compute/ReleaseNotes/index.html |
| 195 | +- AMD |
| 196 | + - ROCm docs index: https://rocm.docs.amd.com/en/latest/index.html |
| 197 | + - ROCm release history: https://rocm.docs.amd.com/en/latest/release/versions.html |
| 198 | + - ROCm libraries reference: https://rocm.docs.amd.com/en/latest/reference/api-libraries.html |
| 199 | + - ROCm tools/compilers/runtimes: https://rocm.docs.amd.com/en/latest/reference/rocm-tools.html |
| 200 | + - HIP documentation: https://rocm.docs.amd.com/projects/HIP/en/latest/index.html |