From 49be2b470f23716609cc60eb638aab70d09081d7 Mon Sep 17 00:00:00 2001
From: m96-chan
Date: Sun, 14 Dec 2025 15:08:04 +0900
Subject: [PATCH 1/3] fix(jit): add driver-only mode support for LaunchConfig
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Replace CUDA Runtime types with portable alternatives:

- Add Dim3 struct for driver-only mode (replaces dim3)
- Use StreamHandle from stream.hpp (replaces cudaStream_t)

This fixes the test-driver-only-windows CI job which failed because
dim3 and cudaStream_t are CUDA Runtime types not available when
building with PYGPUKIT_DRIVER_ONLY=ON.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5
---
 native/jit/kernel.hpp | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/native/jit/kernel.hpp b/native/jit/kernel.hpp
index acf6933..fbd4e5b 100644
--- a/native/jit/kernel.hpp
+++ b/native/jit/kernel.hpp
@@ -7,6 +7,18 @@
 #include
 #include
 
+#ifdef PYGPUKIT_DRIVER_ONLY
+// Driver-only mode: define our own Dim3 struct
+struct Dim3 {
+  unsigned int x, y, z;
+  Dim3(unsigned int x_ = 1, unsigned int y_ = 1, unsigned int z_ = 1)
+      : x(x_), y(y_), z(z_) {}
+};
+#else
+#include
+using Dim3 = dim3;
+#endif
+
 namespace pygpukit {
 
 // Forward declaration
@@ -14,10 +26,10 @@ class JITKernel;
 
 // Kernel launch configuration
 struct LaunchConfig {
-  dim3 grid;
-  dim3 block;
+  Dim3 grid;
+  Dim3 block;
   size_t shared_mem;
-  cudaStream_t stream;
+  StreamHandle stream;
 
   LaunchConfig()
       : grid(1), block(256), shared_mem(0), stream(nullptr) {}
@@ -25,7 +37,7 @@ struct LaunchConfig {
 
   LaunchConfig(unsigned int grid_x, unsigned int block_x)
       : grid(grid_x), block(block_x), shared_mem(0), stream(nullptr) {}
 
-  LaunchConfig(dim3 g, dim3 b, size_t smem = 0, cudaStream_t s = nullptr)
+  LaunchConfig(Dim3 g, Dim3 b, size_t smem = 0, StreamHandle s = nullptr)
       : grid(g), block(b), shared_mem(smem), stream(s) {}
 };

From 4caac203fa05ba6a9c0accb21b48bfaffedcc3be Mon Sep 17 00:00:00 2001
From: m96-chan
Date: Sun, 14 Dec 2025 15:34:51 +0900
Subject: [PATCH 2/3] docs(readme): clarify CUDA requirements and scope
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Remove false claim about "no mandatory external SDKs"
- Add note: CUDA drivers and NVRTC are currently required
- Clarify PyGPUkit is NOT a PyTorch/CuPy replacement
- Update "Lightweight" feature description

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5
---
 README.md | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 737726b..6654619 100644
--- a/README.md
+++ b/README.md
@@ -22,8 +22,10 @@ PyGPUkit aims to be the "micro-runtime for GPU computing": small, fast, and idea
 
 ## Opening Paragraph (Goal Statement)
 
-PyGPUkit aims to free developers from the complexity of CUDA Toolkit, Anaconda, and fragile GPU environments.
-Its goal is to make GPU programming and model execution feel like using a standard Python library: installable via pip, minimal setup, and no mandatory external SDKs. PyGPUkit provides high-performance GPU kernels, memory management, scheduling, and model execution (e.g. SafeTensors) through a NumPy-like API and a Kubernetes-inspired resource model, allowing developers to use GPUs explicitly, predictably, and productively without fighting their environment.
+PyGPUkit aims to simplify GPU development by reducing dependency on complex CUDA Toolkit installations and fragile GPU environments.
+Its goal is to make GPU programming feel like using a standard Python library: installable via pip with minimal setup. PyGPUkit provides high-performance GPU kernels, memory management, and scheduling through a NumPy-like API and a Kubernetes-inspired resource model, allowing developers to use GPUs explicitly, predictably, and productively.
+
+> **Note:** PyGPUkit currently requires CUDA drivers and NVRTC. It is NOT a PyTorch/CuPy replacement—it's a lightweight runtime for custom GPU workloads, research, and real-time systems where full ML frameworks are overkill.
 
 ---
 
@@ -76,7 +78,7 @@ Its goal is to make GPU programming and model execution feel like using a standa
 ---
 
 ## Features
-- **Lightweight** — no PyTorch/CuPy overhead
+- **Lightweight** — smaller footprint than PyTorch/CuPy (not a replacement)
 - **Modular** — runtime / memory / scheduler / JIT / ops
 - **Rust Backend** — memory pool, scheduler, dispatch in Rust
 - **GPUArray** with NumPy interop

From 64f78e0d7e4643673bd934633f6ab23ef313a4be Mon Sep 17 00:00:00 2001
From: m96-chan
Date: Sun, 14 Dec 2025 15:37:08 +0900
Subject: [PATCH 3/3] docs(readme): condense roadmap for released versions
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Replaced 37-line checklist with compact table for v0.1-v0.2.3.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5
---
 README.md | 48 +++++++++---------------------------------------
 1 file changed, 9 insertions(+), 39 deletions(-)

diff --git a/README.md b/README.md
index 6654619..7425765 100644
--- a/README.md
+++ b/README.md
@@ -323,45 +323,15 @@ PyGPUkit/
 
 ## Roadmap
 
-### **v0.1 (Released)**
-- [x] GPUArray
-- [x] NVRTC JIT
-- [x] add/mul/matmul ops
-- [x] Basic stream manager
-- [x] Packaging + wheels
-
-### **v0.2.0 (Released)**
-- [x] Rust Memory Pool (LRU, size-class)
-- [x] Rust Scheduler (priority, memory reservation)
-- [x] Rust Transfer Engine (async H2D/D2H)
-- [x] Rust Kernel Dispatch Controller
-- [x] Admission Control
-- [x] QoS Policy Framework (Guaranteed/Burstable/BestEffort)
-- [x] Kernel Pacing Engine
-- [x] Micro-Slicing Framework
-- [x] Pinned Memory Support
-- [x] Kernel Cache (PTX caching)
-- [x] GPU Partitioning
-- [x] Tiled Matmul (shared memory)
-- [x] 106 Rust tests
-
-### **v0.2.1 — Stabilization Phase (Released)**
-- [x] Admission / QoS spec finalization
-- [x] Python API inconsistency fixes
-- [x] Rust error propagation unification
-
-### **v0.2.2 — Performance Phase (Released)**
-- [x] Ampere-optimized SGEMM with cp.async pipeline
-- [x] 4-stage software pipelining for latency hiding
-- [x] float4 vectorized memory loads
-- [x] 18.2 TFLOPS on RTX 3090 Ti (46% efficiency)
-- [x] SM 80+ (Ampere) architecture requirement
-
-### **v0.2.3 — TF32 TensorCore Phase (Released)**
-- [x] TF32 TensorCore GEMM with PTX mma.sync
-- [x] cp.async double-buffered pipeline
-- [x] 27.5 TFLOPS on RTX 3090 Ti
-- [x] PTX fragment mapping documentation
+### **v0.1 — v0.2.3 (Released)**
+
+| Version | Highlights |
+|---------|------------|
+| **v0.1** | GPUArray, NVRTC JIT, add/mul/matmul, wheels |
+| **v0.2.0** | Rust scheduler (QoS, admission control, partitioning), memory pool (LRU), kernel cache, 106 Rust tests |
+| **v0.2.1** | API stabilization, error propagation |
+| **v0.2.2** | Ampere SGEMM (cp.async, float4), 18 TFLOPS FP32 |
+| **v0.2.3** | TF32 TensorCore (PTX mma.sync), 27.5 TFLOPS |
 
 ### **v0.2.4 — Benchmark & Reliability Phase**
 - [ ] Actual PyTorch/NumPy comparison benchmarks