From 49be2b470f23716609cc60eb638aab70d09081d7 Mon Sep 17 00:00:00 2001
From: m96-chan
Date: Sun, 14 Dec 2025 15:08:04 +0900
Subject: [PATCH 1/3] fix(jit): add driver-only mode support for LaunchConfig
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Replace CUDA Runtime types with portable alternatives:

- Add Dim3 struct for driver-only mode (replaces dim3)
- Use StreamHandle from stream.hpp (replaces cudaStream_t)

This fixes the test-driver-only-windows CI job which failed because
dim3 and cudaStream_t are CUDA Runtime types not available when
building with PYGPUKIT_DRIVER_ONLY=ON.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5
---
 native/jit/kernel.hpp | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/native/jit/kernel.hpp b/native/jit/kernel.hpp
index acf6933..fbd4e5b 100644
--- a/native/jit/kernel.hpp
+++ b/native/jit/kernel.hpp
@@ -7,6 +7,18 @@
 #include
 #include
 
+#ifdef PYGPUKIT_DRIVER_ONLY
+// Driver-only mode: define our own Dim3 struct
+struct Dim3 {
+  unsigned int x, y, z;
+  Dim3(unsigned int x_ = 1, unsigned int y_ = 1, unsigned int z_ = 1)
+      : x(x_), y(y_), z(z_) {}
+};
+#else
+#include
+using Dim3 = dim3;
+#endif
+
 namespace pygpukit {
 
 // Forward declaration
@@ -14,10 +26,10 @@ class JITKernel;
 
 // Kernel launch configuration
 struct LaunchConfig {
-  dim3 grid;
-  dim3 block;
+  Dim3 grid;
+  Dim3 block;
   size_t shared_mem;
-  cudaStream_t stream;
+  StreamHandle stream;
 
   LaunchConfig()
       : grid(1), block(256), shared_mem(0), stream(nullptr) {}
@@ -25,7 +37,7 @@ struct LaunchConfig {
 
   LaunchConfig(unsigned int grid_x, unsigned int block_x)
       : grid(grid_x), block(block_x), shared_mem(0), stream(nullptr) {}
 
-  LaunchConfig(dim3 g, dim3 b, size_t smem = 0, cudaStream_t s = nullptr)
+  LaunchConfig(Dim3 g, Dim3 b, size_t smem = 0, StreamHandle s = nullptr)
       : grid(g), block(b), shared_mem(smem), stream(s) {}
 };

From 4caac203fa05ba6a9c0accb21b48bfaffedcc3be Mon Sep 17 00:00:00 2001
From: m96-chan
Date: Sun, 14 Dec 2025 15:34:51 +0900
Subject: [PATCH 2/3] docs(readme): clarify CUDA requirements and scope
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Remove false claim about "no mandatory external SDKs"
- Add note: CUDA drivers and NVRTC are currently required
- Clarify PyGPUkit is NOT a PyTorch/CuPy replacement
- Update "Lightweight" feature description

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5
---
 README.md | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 737726b..6654619 100644
--- a/README.md
+++ b/README.md
@@ -22,8 +22,10 @@ PyGPUkit aims to be the "micro-runtime for GPU computing": small, fast, and idea
 
 ## Opening Paragraph (Goal Statement)
 
-PyGPUkit aims to free developers from the complexity of CUDA Toolkit, Anaconda, and fragile GPU environments.
-Its goal is to make GPU programming and model execution feel like using a standard Python library: installable via pip, minimal setup, and no mandatory external SDKs. PyGPUkit provides high-performance GPU kernels, memory management, scheduling, and model execution (e.g. SafeTensors) through a NumPy-like API and a Kubernetes-inspired resource model, allowing developers to use GPUs explicitly, predictably, and productively without fighting their environment.
+PyGPUkit aims to simplify GPU development by reducing dependency on complex CUDA Toolkit installations and fragile GPU environments.
+Its goal is to make GPU programming feel like using a standard Python library: installable via pip with minimal setup. PyGPUkit provides high-performance GPU kernels, memory management, and scheduling through a NumPy-like API and a Kubernetes-inspired resource model, allowing developers to use GPUs explicitly, predictably, and productively.
+
+> **Note:** PyGPUkit currently requires CUDA drivers and NVRTC. It is NOT a PyTorch/CuPy replacement—it's a lightweight runtime for custom GPU workloads, research, and real-time systems where full ML frameworks are overkill.
 
 ---
 
@@ -76,7 +78,7 @@ Its goal is to make GPU programming and model execution feel like using a standa
 ---
 
 ## Features
-- **Lightweight** — no PyTorch/CuPy overhead
+- **Lightweight** — smaller footprint than PyTorch/CuPy (not a replacement)
 - **Modular** — runtime / memory / scheduler / JIT / ops
 - **Rust Backend** — memory pool, scheduler, dispatch in Rust
 - **GPUArray** with NumPy interop

From 64f78e0d7e4643673bd934633f6ab23ef313a4be Mon Sep 17 00:00:00 2001
From: m96-chan
Date: Sun, 14 Dec 2025 15:37:08 +0900
Subject: [PATCH 3/3] docs(readme): condense roadmap for released versions
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Replaced 37-line checklist with compact table for v0.1-v0.2.3.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5
---
 README.md | 48 +++++++++---------------------------------------
 1 file changed, 9 insertions(+), 39 deletions(-)

diff --git a/README.md b/README.md
index 6654619..7425765 100644
--- a/README.md
+++ b/README.md
@@ -323,45 +323,15 @@ PyGPUkit/
 
 ## Roadmap
 
-### **v0.1 (Released)**
-- [x] GPUArray
-- [x] NVRTC JIT
-- [x] add/mul/matmul ops
-- [x] Basic stream manager
-- [x] Packaging + wheels
-
-### **v0.2.0 (Released)**
-- [x] Rust Memory Pool (LRU, size-class)
-- [x] Rust Scheduler (priority, memory reservation)
-- [x] Rust Transfer Engine (async H2D/D2H)
-- [x] Rust Kernel Dispatch Controller
-- [x] Admission Control
-- [x] QoS Policy Framework (Guaranteed/Burstable/BestEffort)
-- [x] Kernel Pacing Engine
-- [x] Micro-Slicing Framework
-- [x] Pinned Memory Support
-- [x] Kernel Cache (PTX caching)
-- [x] GPU Partitioning
-- [x] Tiled Matmul (shared memory)
-- [x] 106 Rust tests
-
-### **v0.2.1 — Stabilization Phase (Released)**
-- [x] Admission / QoS spec finalization
-- [x] Python API inconsistency fixes
-- [x] Rust error propagation unification
-
-### **v0.2.2 — Performance Phase (Released)**
-- [x] Ampere-optimized SGEMM with cp.async pipeline
-- [x] 4-stage software pipelining for latency hiding
-- [x] float4 vectorized memory loads
-- [x] 18.2 TFLOPS on RTX 3090 Ti (46% efficiency)
-- [x] SM 80+ (Ampere) architecture requirement
-
-### **v0.2.3 — TF32 TensorCore Phase (Released)**
-- [x] TF32 TensorCore GEMM with PTX mma.sync
-- [x] cp.async double-buffered pipeline
-- [x] 27.5 TFLOPS on RTX 3090 Ti
-- [x] PTX fragment mapping documentation
+### **v0.1 — v0.2.3 (Released)**
+
+| Version | Highlights |
+|---------|------------|
+| **v0.1** | GPUArray, NVRTC JIT, add/mul/matmul, wheels |
+| **v0.2.0** | Rust scheduler (QoS, admission control, partitioning), memory pool (LRU), kernel cache, 106 Rust tests |
+| **v0.2.1** | API stabilization, error propagation |
+| **v0.2.2** | Ampere SGEMM (cp.async, float4), 18 TFLOPS FP32 |
+| **v0.2.3** | TF32 TensorCore (PTX mma.sync), 27.5 TFLOPS |
 
 ### **v0.2.4 — Benchmark & Reliability Phase**
 - [ ] Actual PyTorch/NumPy comparison benchmarks