Merged
56 changes: 14 additions & 42 deletions README.md
@@ -22,8 +22,10 @@ PyGPUkit aims to be the "micro-runtime for GPU computing": small, fast, and idea

## Opening Paragraph (Goal Statement)

-PyGPUkit aims to free developers from the complexity of CUDA Toolkit, Anaconda, and fragile GPU environments.
-Its goal is to make GPU programming and model execution feel like using a standard Python library: installable via pip, minimal setup, and no mandatory external SDKs. PyGPUkit provides high-performance GPU kernels, memory management, scheduling, and model execution (e.g. SafeTensors) through a NumPy-like API and a Kubernetes-inspired resource model, allowing developers to use GPUs explicitly, predictably, and productively without fighting their environment.
+PyGPUkit aims to simplify GPU development by reducing dependency on complex CUDA Toolkit installations and fragile GPU environments.
+Its goal is to make GPU programming feel like using a standard Python library: installable via pip with minimal setup. PyGPUkit provides high-performance GPU kernels, memory management, and scheduling through a NumPy-like API and a Kubernetes-inspired resource model, allowing developers to use GPUs explicitly, predictably, and productively.

+> **Note:** PyGPUkit currently requires CUDA drivers and NVRTC. It is NOT a PyTorch/CuPy replacement—it's a lightweight runtime for custom GPU workloads, research, and real-time systems where full ML frameworks are overkill.

---

@@ -76,7 +78,7 @@ Its goal is to make GPU programming and model execution feel like using a standa
---

## Features
-- **Lightweight** — no PyTorch/CuPy overhead
+- **Lightweight** — smaller footprint than PyTorch/CuPy (not a replacement)
- **Modular** — runtime / memory / scheduler / JIT / ops
- **Rust Backend** — memory pool, scheduler, dispatch in Rust
- **GPUArray** with NumPy interop
@@ -321,45 +323,15 @@ PyGPUkit/

## Roadmap

-### **v0.1 (Released)**
-- [x] GPUArray
-- [x] NVRTC JIT
-- [x] add/mul/matmul ops
-- [x] Basic stream manager
-- [x] Packaging + wheels
-
-### **v0.2.0 (Released)**
-- [x] Rust Memory Pool (LRU, size-class)
-- [x] Rust Scheduler (priority, memory reservation)
-- [x] Rust Transfer Engine (async H2D/D2H)
-- [x] Rust Kernel Dispatch Controller
-- [x] Admission Control
-- [x] QoS Policy Framework (Guaranteed/Burstable/BestEffort)
-- [x] Kernel Pacing Engine
-- [x] Micro-Slicing Framework
-- [x] Pinned Memory Support
-- [x] Kernel Cache (PTX caching)
-- [x] GPU Partitioning
-- [x] Tiled Matmul (shared memory)
-- [x] 106 Rust tests
-
-### **v0.2.1 — Stabilization Phase (Released)**
-- [x] Admission / QoS spec finalization
-- [x] Python API inconsistency fixes
-- [x] Rust error propagation unification
-
-### **v0.2.2 — Performance Phase (Released)**
-- [x] Ampere-optimized SGEMM with cp.async pipeline
-- [x] 4-stage software pipelining for latency hiding
-- [x] float4 vectorized memory loads
-- [x] 18.2 TFLOPS on RTX 3090 Ti (46% efficiency)
-- [x] SM 80+ (Ampere) architecture requirement
-
-### **v0.2.3 — TF32 TensorCore Phase (Released)**
-- [x] TF32 TensorCore GEMM with PTX mma.sync
-- [x] cp.async double-buffered pipeline
-- [x] 27.5 TFLOPS on RTX 3090 Ti
-- [x] PTX fragment mapping documentation
+### **v0.1 — v0.2.3 (Released)**
### **v0.1 — v0.2.3 (Released)**
+
+| Version | Highlights |
+|---------|------------|
+| **v0.1** | GPUArray, NVRTC JIT, add/mul/matmul, wheels |
+| **v0.2.0** | Rust scheduler (QoS, admission control, partitioning), memory pool (LRU), kernel cache, 106 Rust tests |
+| **v0.2.1** | API stabilization, error propagation |
+| **v0.2.2** | Ampere SGEMM (cp.async, float4), 18.2 TFLOPS FP32 |
+| **v0.2.3** | TF32 TensorCore (PTX mma.sync), 27.5 TFLOPS |

### **v0.2.4 — Benchmark & Reliability Phase**
- [ ] Actual PyTorch/NumPy comparison benchmarks
20 changes: 16 additions & 4 deletions native/jit/kernel.hpp
@@ -7,25 +7,37 @@
#include <vector>
#include <memory>

+#ifdef PYGPUKIT_DRIVER_ONLY
+// Driver-only mode: define our own Dim3 struct
+struct Dim3 {
+    unsigned int x, y, z;
+    Dim3(unsigned int x_ = 1, unsigned int y_ = 1, unsigned int z_ = 1)
+        : x(x_), y(y_), z(z_) {}
+};
+#else
+#include <cuda_runtime.h>
+using Dim3 = dim3;
+#endif

namespace pygpukit {

// Forward declaration
class JITKernel;

// Kernel launch configuration
struct LaunchConfig {
-    dim3 grid;
-    dim3 block;
+    Dim3 grid;
+    Dim3 block;
    size_t shared_mem;
-    cudaStream_t stream;
+    StreamHandle stream;

    LaunchConfig()
        : grid(1), block(256), shared_mem(0), stream(nullptr) {}

    LaunchConfig(unsigned int grid_x, unsigned int block_x)
        : grid(grid_x), block(block_x), shared_mem(0), stream(nullptr) {}

-    LaunchConfig(dim3 g, dim3 b, size_t smem = 0, cudaStream_t s = nullptr)
+    LaunchConfig(Dim3 g, Dim3 b, size_t smem = 0, StreamHandle s = nullptr)
        : grid(g), block(b), shared_mem(smem), stream(s) {}
};
