## Supported Type Combinations
| Input A | Input B | Accumulate | Output | Status | Notes |
|---|---|---|---|---|---|
| FP32 | FP32 | FP32 | FP32 | ✅ Supported | Standard CUDA FMA path |
| TF32 | TF32 | FP32 | FP32 | ✅ Supported | Tensor Core (Ampere+), matmul only |
| FP16 | FP16 | FP32 | FP16 | 🔜 Planned | Tensor Core, mixed precision |
| FP16 | FP16 | FP32 | FP32 | 🔜 Planned | Common for training/inference |
| BF16 | BF16 | FP32 | BF16 | 🔜 Planned | Preferred for modern LLMs |
| BF16 | BF16 | FP32 | FP32 | 🔜 Planned | Higher numerical stability |
| FP16 | FP16 | FP16 | FP16 | ❌ Out of scope (v0.x) | Low stability, niche use |
| INT8 | INT8 | INT32 | FP32 | ❌ Out of scope (v0.x) | Requires calibration & quantization |
| Mixed (FP16/BF16/FP32) | Mixed | FP32 | FP32 | ❌ Not yet | Explicit casts required |
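All planned FP16/BF16 rows share the same shape: narrow inputs, FP32 accumulation. As a rough illustration of that pattern (a minimal sketch, not this project's kernels; `dot_f16_accum_f32` is a hypothetical name), a CUDA dot product with FP16 inputs and an FP32 accumulator might look like:

```cuda
#include <cuda_fp16.h>

// FP16 inputs, FP32 accumulation, FP32 output — the FP16/FP16/FP32/FP32 row.
__global__ void dot_f16_accum_f32(const __half* a, const __half* b,
                                  float* out, int n) {
    float acc = 0.0f;  // accumulate in FP32 to limit rounding error
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        // widen each FP16 operand to FP32 before the fused multiply-add
        acc = fmaf(__half2float(a[i]), __half2float(b[i]), acc);
    }
    atomicAdd(out, acc);  // an FP16-output variant would __float2half() here
}
```

The FP16/FP16/FP16/FP16 row differs only in accumulating in `__half`, which is exactly where the "low stability" note comes from: once the accumulator grows, small products fall below FP16 precision and are absorbed.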
## Current Implementation Status (v0.2.4)
| Operation | FP32 | TF32 | FP16 | BF16 |
|---|---|---|---|---|
| add | ✅ | — | 🔜 | 🔜 |
| mul | ✅ | — | 🔜 | 🔜 |
| matmul | ✅ (18 TFLOPS) | ✅ (27.38 TFLOPS) | 🔜 | 🔜 |
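The TF32 matmul column runs on Tensor Cores. As a point of reference for what that entails, here is a single-tile sketch using the stock `nvcuda::wmma` API — for illustration only, since this runtime targets CUDA Toolkit independence, and `tile_mma_tf32` is a hypothetical name:

```cuda
#include <mma.h>
using namespace nvcuda;

// One 16x16x8 TF32 Tensor Core tile with FP32 accumulation (Ampere+).
// Launch with exactly one warp, e.g. tile_mma_tf32<<<1, 32>>>(dA, dB, dC);
__global__ void tile_mma_tf32(const float* a, const float* b, float* c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 8, wmma::precision::tf32, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 8, wmma::precision::tf32, wmma::row_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 8, float> fc;

    wmma::fill_fragment(fc, 0.0f);
    wmma::load_matrix_sync(fa, a, 8);   // A tile is 16x8, row-major
    wmma::load_matrix_sync(fb, b, 16);  // B tile is 8x16, row-major

    // TF32 inputs must be explicitly rounded down from FP32
    for (int i = 0; i < fa.num_elements; ++i) fa.x[i] = wmma::__float_to_tf32(fa.x[i]);
    for (int i = 0; i < fb.num_elements; ++i) fb.x[i] = wmma::__float_to_tf32(fb.x[i]);

    wmma::mma_sync(fc, fa, fb, fc);     // accumulate in FP32 on Tensor Cores
    wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
}
```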
## Design Notes
- FP32 and TF32 form the baseline and are already supported.
- FP16 and BF16 with FP32 accumulation are considered mandatory for practical ML workloads and are planned next.
- Automatic mixed precision (AMP), loss scaling, and quantized INT8 execution are intentionally excluded from v0.x to keep the API simple and predictable.
- Explicit `cast()` operations are preferred over implicit type promotion, following NumPy-style explicit-cast semantics (see the sketch after this list).
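To make the explicit-cast rule concrete at the device level: the library-level `cast()` surface is not pinned down here, but the underlying conversions map onto stock CUDA intrinsics, roughly like this (kernel names are hypothetical):

```cuda
#include <cuda_fp16.h>
#include <cuda_bf16.h>

// Mixed FP16/BF16/FP32 inputs are rejected; the caller converts first.
__global__ void cast_f32_to_f16(const float* src, __half* dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = __float2half(src[i]);       // explicit FP32 -> FP16
}

__global__ void cast_f32_to_bf16(const float* src, __nv_bfloat16* dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = __float2bfloat16(src[i]);   // explicit FP32 -> BF16
}
```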
## Scope Clarification (Non-goals for v0.x)
- Automatic mixed precision (AMP)
- INT8 / quantized kernels
- Training-specific features (loss scaling, stochastic rounding)
These may be considered after the runtime becomes fully CUDA Toolkit–independent and the core API stabilizes.
## Related Issues