PyGPUkit v0.2.3 — Reliability Phase + TF32 TensorCore
PyGPUkit v0.2.3 focuses on stability, reproducibility, and large-scale correctness. It also introduces the first version of Tensor Core acceleration, aimed at the 22 TFLOPS performance target on Ampere GPUs (RTX 3090 Ti / A100).
✔️ Core Reliability Features
1. Kernel Cache (LRU) – Completion
- Persistent kernel selection cache
- Architecture-aware kernel fingerprinting (SM, register file, shared mem)
- LRU eviction policy with max-size limit
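The three items above could fit together roughly as follows. This is a minimal in-memory sketch, not PyGPUkit's actual implementation: `KernelCache`, `fingerprint`, and the key fields are hypothetical names chosen for illustration, and persistence to disk is left out.

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional


def fingerprint(kernel_name: str, sm: int, regs_per_sm: int, smem_bytes: int) -> tuple:
    """Hypothetical cache key: kernel name plus the architecture traits listed above."""
    return (kernel_name, sm, regs_per_sm, smem_bytes)


class KernelCache:
    """LRU cache with a fixed maximum number of entries."""

    def __init__(self, max_size: int = 128) -> None:
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self._max_size = max_size
        self._entries: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key: Hashable, kernel: Any) -> None:
        self._entries[key] = kernel
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)  # evict the least recently used entry

    def __len__(self) -> int:
        return len(self._entries)
```

Keying on the architecture fingerprint means a kernel selected for one GPU is never reused on a GPU with different resources, at the cost of one cache entry per distinct device type.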
2. Driver-only Mode Stabilization
- NVRTC error handling & retry logic
- JIT warm-up cache
- Fallback path for mismatched PTX ISA
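One way to picture the retry and fallback behavior is a loop over an ordered list of compile option sets. The sketch below is an assumption about the shape of such logic, not the project's code: `compile_with_fallback`, `CompileError`, and the `compile_fn` callable are hypothetical, and the real NVRTC call is deliberately left behind that callable.

```python
import time
from typing import Callable, Optional, Sequence, Tuple


class CompileError(RuntimeError):
    """Raised by the (hypothetical) compile callable when JIT compilation fails."""


def compile_with_fallback(
    compile_fn: Callable[[str, Tuple[str, ...]], bytes],
    source: str,
    option_sets: Sequence[Tuple[str, ...]],
    retries: int = 2,
    delay_s: float = 0.0,
) -> bytes:
    """Try each option set in order, retrying each one `retries` times."""
    last_error: Optional[CompileError] = None
    for options in option_sets:
        for _ in range(retries):
            try:
                return compile_fn(source, options)
            except CompileError as exc:
                last_error = exc
                if delay_s > 0:
                    time.sleep(delay_s)
    # Every option set failed; surface the most recent error as the cause.
    raise CompileError("all compile option sets failed") from last_error
```

The first option set would normally be the preferred target, with later sets acting as the fallback path when the preferred one is rejected.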
3. Cross-Platform Support (Windows + Linux)
- Uniform CMake configuration
- CUDA Driver API path resolver
- os.add_dll_directory on Windows
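On Windows, Python 3.8+ no longer searches `PATH` for the dependencies of extension modules, which is why `os.add_dll_directory` is needed. A minimal sketch of the idea, assuming the CUDA installer's `CUDA_PATH` environment variable is set. The helper names `candidate_cuda_bin_dirs` and `register_cuda_dll_dirs` are hypothetical, and the actual resolver may look in more places.

```python
import os
import sys
from pathlib import Path
from typing import List, Mapping


def candidate_cuda_bin_dirs(environ: Mapping[str, str]) -> List[Path]:
    """Return directories that may contain CUDA DLLs (only CUDA_PATH here)."""
    dirs: List[Path] = []
    cuda_path = environ.get("CUDA_PATH")
    if cuda_path:
        dirs.append(Path(cuda_path) / "bin")
    return dirs


def register_cuda_dll_dirs(environ: Mapping[str, str] = os.environ) -> list:
    """Register existing candidate directories with the Windows DLL loader.

    Returns the handles from os.add_dll_directory; keep them alive for as
    long as the DLLs must stay resolvable. This is a no-op on other platforms.
    """
    if sys.platform != "win32":
        return []
    handles = []
    for directory in candidate_cuda_bin_dirs(environ):
        if directory.is_dir():
            handles.append(os.add_dll_directory(str(directory)))
    return handles
```

Calling `register_cuda_dll_dirs()` once, before the native extension is imported, would be the typical placement.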
4. Large GPU Memory Stress Test
- Continuous 16 GB alloc/free loop
- Fragmentation measurement API
- Memory pool corruption detection
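The roadmap does not define the fragmentation metric. One common choice, shown here purely as an assumption, is the share of free memory that cannot be served as a single contiguous block. `fragmentation_ratio` is a hypothetical helper that takes the sizes of the pool's free blocks.

```python
from typing import Sequence


def fragmentation_ratio(free_block_sizes: Sequence[int]) -> float:
    """Return 1 - largest_free_block / total_free, in the range [0, 1).

    0.0 means all free memory is one contiguous block (or nothing is free);
    values near 1.0 mean free memory is scattered across many small blocks.
    """
    total_free = sum(free_block_sizes)
    if total_free == 0:
        return 0.0
    return 1.0 - max(free_block_sizes) / total_free
```

Sampling this value after each iteration of the alloc/free loop would show whether fragmentation grows over time or stays bounded.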
⚡ New Additions for v0.2.3 (Tensor Core Roadmap)
5. TF32 TensorCore GEMM (Ampere+) — Phase 1
Goal: 22–30 TFLOPS on RTX 3090 Ti
Deliverables:
- Tensor Core WMMA API (TF32 input → FP32 accumulate)
- Kernel dispatcher:
  - FP32 FMA fallback
  - TF32 Tensor Core (if SM ≥ 80)
- Unit tests (1e-3 tolerance)
- Performance test (4096×4096, 8192×8192)
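The dispatcher rule in the deliverables can be sketched in a few lines. This assumes the SM version is encoded as `major * 10 + minor` (80 for A100, 86 for RTX 3090 Ti, 90 for H100). `GemmKernel`, `select_gemm_kernel`, and the `allow_tf32` switch are hypothetical names, not PyGPUkit's API.

```python
from enum import Enum


class GemmKernel(Enum):
    FP32_FMA = "fp32_fma"
    TF32_TENSOR_CORE = "tf32_tensor_core"


def select_gemm_kernel(sm: int, allow_tf32: bool = True) -> GemmKernel:
    """Pick the TF32 Tensor Core path on SM >= 80, else the FP32 FMA fallback.

    `allow_tf32` is an assumed opt-out for callers that need full FP32
    precision, since TF32 keeps only 10 explicit mantissa bits on the inputs.
    """
    if allow_tf32 and sm >= 80:
        return GemmKernel.TF32_TENSOR_CORE
    return GemmKernel.FP32_FMA
```

For example, `select_gemm_kernel(86)` selects the Tensor Core path, while `select_gemm_kernel(75)` falls back to FP32 FMA.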
6. Architecture Scaling (3090 Ti → A100 → H100)
- TF32 kernel parameter adaptation
- Shared memory / register tuning per SM count
- High-end GPU detection logic
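Parameter adaptation of this kind often comes down to choosing the largest tile configuration that fits the device's shared memory. The sketch below illustrates that idea only. `TileConfig`, the candidate list, and `pick_tile_config` are hypothetical, the tile sizes are placeholders rather than tuned values, and the shared memory budget is assumed to be queried from the device at runtime.

```python
from typing import NamedTuple, Sequence


class TileConfig(NamedTuple):
    block_m: int
    block_n: int
    block_k: int
    stages: int  # number of pipelined shared-memory buffers

    def smem_bytes(self) -> int:
        """Shared memory for the A and B tiles, 4 bytes per element."""
        a_tile = self.block_m * self.block_k
        b_tile = self.block_k * self.block_n
        return (a_tile + b_tile) * 4 * self.stages


# Placeholder candidates, ordered from most to least shared memory.
CANDIDATES = (
    TileConfig(128, 128, 32, 3),
    TileConfig(128, 128, 16, 2),
    TileConfig(64, 64, 16, 2),
)


def pick_tile_config(
    smem_budget_bytes: int, candidates: Sequence[TileConfig] = CANDIDATES
) -> TileConfig:
    """Return the first candidate whose tiles fit in the shared memory budget."""
    for config in candidates:
        if config.smem_bytes() <= smem_budget_bytes:
            return config
    raise ValueError("no tile configuration fits the shared memory budget")
```

A real tuner would also weigh register pressure and SM count, which this sketch ignores.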
🎯 Performance Target
| GPU | Target TFLOPS | Notes |
| --- | --- | --- |
| RTX 3090 Ti | 22–30 TFLOPS | cuBLAS-equivalent |
| A100 | 40–60 TFLOPS | TF32 native |
| H100 | 80+ TFLOPS | BF16 path later |
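To relate these targets to measured kernel times, the usual convention counts 2·M·N·K floating-point operations per GEMM (one multiply and one add per inner-product term). The helpers below are a small sketch of that arithmetic. `gemm_flops` and `gemm_tflops` are hypothetical names, and the timing itself is assumed to come from the benchmark harness.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """Floating-point operations for C[m, n] = A[m, k] @ B[k, n]."""
    return 2 * m * n * k


def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved throughput in TFLOPS for one GEMM that took `seconds`."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return gemm_flops(m, n, k) / seconds / 1e12
```

Under this convention a 4096×4096×4096 GEMM is about 0.137 TFLOP of work, so reaching 22 TFLOPS means completing it in roughly 6.25 ms.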
🧩 Final v0.2.3 Structure
v0.2.3
├─ Reliability Core
│ ├─ Kernel Cache LRU
│ ├─ Driver-only Stabilization
│ ├─ Cross-platform Support
│ └─ Large Memory Fragmentation Test
│
└─ Tensor Core Line
  ├─ TF32 TensorCore GEMM (22–30 TFLOPS)
  └─ Architecture Scaling (3090 Ti → A100 → H100)