feat(v0.2): Complete v0.2 Feature Set - Tiled Matmul, Rust Components, Driver-Only Mode#32

Merged
m96-chan merged 1 commit into main from feature/v0.2-tiled-matmul
Dec 12, 2025

Conversation

@m96-chan
Owner

Summary

  • Tiled Matmul Kernel: 64x64 shared memory tiles with double-buffered prefetch (peaks at ~6.9 TFLOPS on RTX 3090 Ti for 4096x4096)
  • Rust Memory Pool: LRU eviction, size-class free lists, thread-safe with parking_lot
  • Rust Scheduler: Priority-based task queue with memory reservation
  • Rust Async Transfer Engine: Separate H2D/D2H streams with priority ordering
  • Rust Kernel Dispatch Controller: Per-stream in-flight limits with scheduler integration
  • Driver-Only Mode: CUDA Driver API infrastructure (no cudart dependency when enabled)
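To make the tiling idea in the first bullet concrete, here is a minimal CPU-side sketch of a blocked matrix multiply in plain Python. It is not the CUDA kernel from this PR: the function name `tiled_matmul` and its `tile` parameter are hypothetical, and the double-buffered prefetch is not modeled. It only illustrates how the output is computed tile by tile, accumulating partial products over the K dimension.

```python
def tiled_matmul(a, b, tile=64):
    """Blocked C = A @ B over lists of lists (illustrative CPU reference)."""
    m, k = len(a), len(a[0])
    k_b, n = len(b), len(b[0])
    if k != k_b:
        raise ValueError("inner dimensions must match")
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):          # tile row of C
        for j0 in range(0, n, tile):      # tile column of C
            for k0 in range(0, k, tile):  # walk the K dimension one tile at a time
                # A GPU kernel would stage the A and B tiles for this k0 step
                # in shared memory before doing the inner products below.
                for i in range(i0, min(i0 + tile, m)):
                    row_a, row_c = a[i], c[i]
                    for kk in range(k0, min(k0 + tile, k)):
                        a_ik, row_b = row_a[kk], b[kk]
                        for j in range(j0, min(j0 + tile, n)):
                            row_c[j] += a_ik * row_b[j]
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(tiled_matmul(a, b, tile=2))
```

The `min(...)` bounds handle matrices whose dimensions are not multiples of the tile size, which a real kernel would typically handle with boundary checks or padding.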

Benchmark Results (RTX 3090 Ti)

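| Matrix Size | Kernel | Time (ms) | GFLOPS | vs NumPy |
|-------------|--------|-----------|--------|----------|
| 512x512     | L2-opt | 0.23      | 1180.5 | 3.6x     |
| 1024x1024   | L2-opt | 1.48      | 1446.2 | 2.2x     |
| 2048x2048   | Tiled  | 3.83      | 4487.5 | 6.1x     |
| 4096x4096   | Tiled  | 19.80     | 6941.8 | 7.6x     |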
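The GFLOPS column can be cross-checked against the timings. The sketch below assumes the usual convention of 2·N³ floating-point operations for an N×N matmul; the PR does not state the formula it used, so this is an assumption, and the helper name `matmul_gflops` is hypothetical. Because the published times are rounded, the recomputed values match the table only to within a few percent.

```python
def matmul_gflops(n, time_ms):
    """Throughput of an n x n matmul, assuming 2 * n**3 floating-point operations."""
    flops = 2 * n ** 3
    seconds = time_ms * 1e-3
    return flops / seconds / 1e9


# (matrix size, time in ms, GFLOPS reported in the table)
rows = [(512, 0.23, 1180.5), (1024, 1.48, 1446.2),
        (2048, 3.83, 4487.5), (4096, 19.80, 6941.8)]
for n, time_ms, reported in rows:
    print(f"{n}x{n}: recomputed {matmul_gflops(n, time_ms):.1f}, reported {reported}")
```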

Test plan

  • Run python examples/demo_v02.py - All 6 sections pass
  • Tiled matmul correctness verified (rel_error < 1e-3)
  • CI build passes (Linux + Windows)
  • Driver-Only mode test on self-hosted runner
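For the correctness item above, the PR does not say how `rel_error` is defined. One common choice, assumed here, is the maximum absolute difference divided by the maximum absolute reference value. The function below is a hypothetical sketch of that definition, not the check used in `examples/demo_v02.py`.

```python
def rel_error(result, reference):
    """Max absolute difference, normalized by the largest reference magnitude."""
    max_diff = 0.0
    max_ref = 0.0
    for row_res, row_ref in zip(result, reference):
        for x, y in zip(row_res, row_ref):
            max_diff = max(max_diff, abs(x - y))
            max_ref = max(max_ref, abs(y))
    # Fall back to the absolute difference if the reference is all zeros.
    return max_diff / max_ref if max_ref > 0.0 else max_diff


gpu_result = [[1.0, 2.0]]        # stand-in for the kernel output
cpu_reference = [[1.0, 2.001]]   # stand-in for the NumPy reference
assert rel_error(gpu_result, cpu_reference) < 1e-3
```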

🤖 Generated with Claude Code

…Only Mode

v0.2 Complete Feature Set:

1. Rust Async Transfer Engine
   - Separate H2D/D2H streams with priority ordering
   - Concurrent transfer limits
   - Transfer lifecycle management

2. Rust Kernel Dispatch Controller
   - Per-stream in-flight limits
   - Scheduler task integration
   - Launch request queuing

3. Driver-Only Mode (C++)
   - CUDA Driver API wrappers (driver_api.hpp)
   - Unified context management (driver_context.hpp)
   - Conditional compilation with PYGPUKIT_DRIVER_ONLY
   - No cudart dependency when enabled

4. CI Integration
   - Added test-driver-only-windows job
   - Self-hosted runner with GPU for validation

5. Demo & Benchmarks
   - examples/demo_v02.py: Full feature demonstration
   - Peak: 6941.8 GFLOPS on RTX 3090 Ti (Tiled Matmul)
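As an illustration of item 2 (per-stream in-flight limits with launch request queuing), here is a small Python sketch of the bookkeeping involved. The actual controller is written in Rust and its API is not shown in this PR, so the class `DispatchController` and its methods are hypothetical, and scheduler integration is left out.

```python
from collections import defaultdict, deque


class DispatchController:
    """Caps concurrent launches per stream; extra requests wait in a FIFO queue."""

    def __init__(self, max_in_flight_per_stream):
        self.limit = max_in_flight_per_stream
        self.in_flight = defaultdict(int)   # stream id -> launches currently running
        self.pending = defaultdict(deque)   # stream id -> queued launch requests

    def submit(self, stream_id, kernel_name):
        """Return True if the launch may start now, False if it was queued."""
        if self.in_flight[stream_id] < self.limit:
            self.in_flight[stream_id] += 1
            return True
        self.pending[stream_id].append(kernel_name)
        return False

    def complete(self, stream_id):
        """Record one finished launch; return the queued kernel promoted, if any."""
        if self.in_flight[stream_id] == 0:
            raise RuntimeError("no in-flight launch on this stream")
        self.in_flight[stream_id] -= 1
        if self.pending[stream_id]:
            self.in_flight[stream_id] += 1
            return self.pending[stream_id].popleft()
        return None


ctl = DispatchController(max_in_flight_per_stream=2)
print(ctl.submit(0, "matmul_a"), ctl.submit(0, "matmul_b"), ctl.submit(0, "matmul_c"))
print(ctl.complete(0))
```

Keeping the limit per stream means a saturated stream queues its own requests without blocking launches on other streams.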

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
m96-chan merged commit 7978879 into main on Dec 12, 2025
13 checks passed