[v0.2.5+] Support Types #58

@m96-chan

Description

Supported Type Combinations

| Input A | Input B | Accumulate | Output | Status | Notes |
|---------|---------|------------|--------|--------|-------|
| FP32 | FP32 | FP32 | FP32 | ✅ Supported | Standard CUDA FMA path |
| TF32 | TF32 | FP32 | FP32 | ✅ Supported | TensorCore (Ampere+), matmul only |
| FP16 | FP16 | FP32 | FP16 | 🔜 Planned | TensorCore, mixed precision |
| FP16 | FP16 | FP32 | FP32 | 🔜 Planned | Common for training/inference |
| BF16 | BF16 | FP32 | BF16 | 🔜 Planned | Preferred for modern LLMs |
| BF16 | BF16 | FP32 | FP32 | 🔜 Planned | Higher numerical stability |
| FP16 | FP16 | FP16 | FP16 | ❌ Out of scope (v0.x) | Low stability, niche use |
| INT8 | INT8 | INT32 | FP32 | ❌ Out of scope (v0.x) | Requires calibration & quantization |
| Mixed (FP16/BF16/FP32) | Mixed | FP32 | FP32 | ❌ Not yet | Explicit casts required |
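
As a sketch of what the planned FP16 × FP16 → FP32-accumulate rows mean at the hardware level, the following warp-level kernel uses CUDA's WMMA API: inputs are loaded as FP16 fragments while the accumulator fragment stays FP32. This is illustrative only (the kernel name and single-tile scope are assumptions, not this project's implementation) and requires sm_70+.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a single 16x16 tile of C = A * B with FP16 inputs
// and an FP32 accumulator (the "FP16 x FP16 -> FP32" combination above).
// Hypothetical kernel, for illustration only.
__global__ void wmma_fp16_acc_fp32(const half *A, const half *B, float *C,
                                   int lda, int ldb, int ldc) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);          // zero the FP32 accumulator
    wmma::load_matrix_sync(a_frag, A, lda);     // FP16 inputs
    wmma::load_matrix_sync(b_frag, B, ldb);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // FP16 multiply, FP32 add
    wmma::store_matrix_sync(C, c_frag, ldc, wmma::mem_row_major);
}
```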

Current Implementation Status (v0.2.4)

| Operation | FP32 | TF32 | FP16 | BF16 |
|-----------|------|------|------|------|
| add |  |  | 🔜 | 🔜 |
| mul |  |  | 🔜 | 🔜 |
| matmul | ✅ (18 TFLOPS) | ✅ (27.38 TFLOPS) | 🔜 | 🔜 |
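
The TF32 number in the matmul row reflects the TensorCore path: FP32 inputs are rounded to TF32 (10-bit mantissa) before the multiply while accumulation stays in full FP32, which is also why TF32 applies only to matmul. Below is a minimal WMMA sketch of that path, assuming Ampere+ (sm_80) and written for illustration rather than as this project's kernel.

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp computes a 16x16 tile of C = A * B: FP32 in memory,
// TF32 TensorCore multiplies, FP32 accumulation. Illustrative only.
__global__ void wmma_tf32_tile(const float *A, const float *B, float *C,
                               int lda, int ldb, int ldc) {
    wmma::fragment<wmma::matrix_a, 16, 16, 8, wmma::precision::tf32, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 8, wmma::precision::tf32, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 8, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, A, lda);
    wmma::load_matrix_sync(b_frag, B, ldb);
    // Round each element from FP32 to TF32 explicitly, as the CUDA
    // programming guide requires for tf32 fragments.
    for (int i = 0; i < a_frag.num_elements; i++)
        a_frag.x[i] = wmma::__float_to_tf32(a_frag.x[i]);
    for (int i = 0; i < b_frag.num_elements; i++)
        b_frag.x[i] = wmma::__float_to_tf32(b_frag.x[i]);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // TF32 mul, FP32 add
    wmma::store_matrix_sync(C, c_frag, ldc, wmma::mem_row_major);
}
```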

Design Notes

  • FP32 and TF32 are the baseline and already supported.
  • FP16 and BF16 with FP32 accumulation are considered mandatory for practical ML workloads and are planned next.
  • Automatic mixed precision (AMP), loss scaling, and quantized INT8 execution are intentionally excluded from v0.x to keep the API simple and predictable.
  • Explicit cast() operations are preferred over implicit type promotion, following NumPy-style semantics (see the sketch below).
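
For example, explicit conversions at the kernel level could look like the following, using CUDA's conversion intrinsics. The library-level cast() API is not specified in this issue, so the kernel (name included) is only a hypothetical illustration of the intended "cast explicitly, compute in FP32" semantics.

```cuda
#include <cuda_fp16.h>

// Hypothetical kernel: scale an FP16 tensor by an FP32 scalar with
// explicit casts in both directions -- no implicit promotion anywhere.
__global__ void scale_fp16(const __half *x, __half *y, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float xf = __half2float(x[i]);  // explicit FP16 -> FP32 cast
        y[i] = __float2half(xf * s);    // compute in FP32, cast back explicitly
    }
}
```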

Scope Clarification (Non-goals for v0.x)

  • Automatic mixed precision (AMP)
  • INT8 / quantized kernels
  • Training-specific features (loss scaling, stochastic rounding)

These may be considered after the runtime becomes fully CUDA Toolkit–independent and the core API stabilizes.

