## Supported Type Combinations
| Input A | Input B | Accumulate | Output | Status | Notes |
|---|---|---|---|---|---|
| FP32 | FP32 | FP32 | FP32 | ✅ Supported | Standard CUDA FMA path |
| TF32 | TF32 | FP32 | FP32 | ✅ Supported | Tensor Core (Ampere+), matmul only |
| FP16 | FP16 | FP32 | FP16 | 🔜 Planned | Tensor Core, mixed precision |
| FP16 | FP16 | FP32 | FP32 | 🔜 Planned | Common for training/inference |
| BF16 | BF16 | FP32 | BF16 | 🔜 Planned | Preferred for modern LLMs |
| BF16 | BF16 | FP32 | FP32 | 🔜 Planned | Higher numerical stability |
| FP16 | FP16 | FP16 | FP16 | ❌ Out of scope (v0.x) | Low stability, niche use |
| INT8 | INT8 | INT32 | FP32 | ❌ Out of scope (v0.x) | Requires calibration & quantization |
| Mixed (FP16/BF16/FP32) | Mixed | FP32 | FP32 | ❌ Not yet | Explicit casts required |
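All planned FP16/BF16 rows share the same shape: narrow inputs, FP32 accumulation. As a rough illustration of that pattern (a minimal sketch, not this project's kernels; `dot_f16_accum_f32` is a hypothetical name), a CUDA dot product with FP16 inputs and an FP32 accumulator might look like:

```cuda
#include <cuda_fp16.h>

// FP16 inputs, FP32 accumulation, FP32 output — the FP16/FP16/FP32/FP32 row.
__global__ void dot_f16_accum_f32(const __half* a, const __half* b,
                                  float* out, int n) {
    float acc = 0.0f;  // accumulate in FP32 to limit rounding error
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        // widen each FP16 operand to FP32 before the fused multiply-add
        acc = fmaf(__half2float(a[i]), __half2float(b[i]), acc);
    }
    atomicAdd(out, acc);  // an FP16-output variant would __float2half() here
}
```

The FP16/FP16/FP16/FP16 row differs only in accumulating in `__half`, which is exactly where the "low stability" note comes from: once the accumulator grows, small products fall below FP16 precision and are absorbed.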
## Current Implementation Status (v0.2.4)
| Operation | FP32 | TF32 | FP16 | BF16 |
|---|---|---|---|---|
| add | ✅ | — | 🔜 | 🔜 |
| mul | ✅ | — | 🔜 | 🔜 |
| matmul | ✅ (18 TFLOPS) | ✅ (27.38 TFLOPS) | 🔜 | 🔜 |
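The TF32 matmul column runs on Tensor Cores. As a point of reference for what that entails, here is a single-tile sketch using the stock `nvcuda::wmma` API — for illustration only, since this runtime targets CUDA Toolkit independence, and `tile_mma_tf32` is a hypothetical name:

```cuda
#include <mma.h>
using namespace nvcuda;

// One 16x16x8 TF32 Tensor Core tile with FP32 accumulation (Ampere+).
// Launch with exactly one warp, e.g. tile_mma_tf32<<<1, 32>>>(dA, dB, dC);
__global__ void tile_mma_tf32(const float* a, const float* b, float* c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 8, wmma::precision::tf32, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 8, wmma::precision::tf32, wmma::row_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 8, float> fc;

    wmma::fill_fragment(fc, 0.0f);
    wmma::load_matrix_sync(fa, a, 8);   // A tile is 16x8, row-major
    wmma::load_matrix_sync(fb, b, 16);  // B tile is 8x16, row-major

    // TF32 inputs must be explicitly rounded down from FP32
    for (int i = 0; i < fa.num_elements; ++i) fa.x[i] = wmma::__float_to_tf32(fa.x[i]);
    for (int i = 0; i < fb.num_elements; ++i) fb.x[i] = wmma::__float_to_tf32(fb.x[i]);

    wmma::mma_sync(fc, fa, fb, fc);     // accumulate in FP32 on Tensor Cores
    wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
}
```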
## Design Notes
- FP32 and TF32 form the baseline and are already supported.
- FP16 and BF16 with FP32 accumulation are considered mandatory for practical ML workloads and are planned next.
- Automatic mixed precision (AMP), loss scaling, and quantized INT8 execution are intentionally excluded from v0.x to keep the API simple and predictable.
- Explicit `cast()` operations are preferred over implicit type promotion, following NumPy-style explicit-cast semantics (see the sketch after this list).
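To make the explicit-cast rule concrete at the device level: the library-level `cast()` surface is not pinned down here, but the underlying conversions map onto stock CUDA intrinsics, roughly like this (kernel names are hypothetical):

```cuda
#include <cuda_fp16.h>
#include <cuda_bf16.h>

// Mixed FP16/BF16/FP32 inputs are rejected; the caller converts first.
__global__ void cast_f32_to_f16(const float* src, __half* dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = __float2half(src[i]);       // explicit FP32 -> FP16
}

__global__ void cast_f32_to_bf16(const float* src, __nv_bfloat16* dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = __float2bfloat16(src[i]);   // explicit FP32 -> BF16
}
```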
## Scope Clarification (Non-goals for v0.x)
- Automatic mixed precision (AMP)
- INT8 / quantized kernels
- Training-specific features (loss scaling, stochastic rounding)
These may be considered after the runtime becomes fully CUDA Toolkit–independent and the core API stabilizes.
## Related Issues