SF16 Implementation Details
This document describes the SF16 implementation in this repository, focusing on what is actually compiled and executed when `SF16_TRUE_FORWARD=1` is enabled.
1. Scope and Terminology
- SF16 refers to Q1.15 boundary simulation (`simulate_q115`) applied to tensors while running on native GPU kernels.
- Q1.15 mode is enabled with `-DENABLE_Q115`.
- True-forward behavior is enabled with `-DSF16_TRUE_FORWARD=1`.
Important: this is not a pure int16 compute pipeline. In Q1.15 mode, the system uses BF16 (`__nv_bfloat16`) for storage and computation, while SF16 behavior is enforced through explicit quantize/dequantize steps at defined forward boundaries.
2. Core Numeric Format (Q1.15)
Q1.15 represents normalized values in the range [-1, 1) using signed 16-bit fixed-point semantics.
- Scale factor: \(2^{15} = 32768\)
- Resolution: \(1 / 32768 \approx 3.05 \times 10^{-5}\)
- Integer range: `[-32768, 32767]`
- Float range: `[-1.0, 0.999969482421875]`
Implementation details:
- Overflow threshold: `Q115_OVERFLOW_THRESHOLD = 0.999969482421875`
- `float_to_q115` clamps values to `[-Q115_OVERFLOW_THRESHOLD, Q115_OVERFLOW_THRESHOLD]`
- `simulate_q115(x)` performs a quantize–dequantize roundtrip: `q115_to_float(float_to_q115(x))`
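For reference, a minimal sketch of these helpers, assuming round-to-nearest and the clamping described above (the exact rounding mode used in the repository may differ):

```cuda
#include <cstdint>
#include <cmath>

// Sketch of the Q1.15 helpers named above; rounding mode is an assumption.
#define Q115_SCALE 32768.0f
#define Q115_OVERFLOW_THRESHOLD 0.999969482421875f  // 32767 / 32768

__host__ __device__ inline int16_t float_to_q115(float x) {
    // Clamp to the representable range, then scale and round to an int16.
    x = fminf(fmaxf(x, -Q115_OVERFLOW_THRESHOLD), Q115_OVERFLOW_THRESHOLD);
    return (int16_t)lrintf(x * Q115_SCALE);
}

__host__ __device__ inline float q115_to_float(int16_t q) {
    return (float)q / Q115_SCALE;
}

__host__ __device__ inline float simulate_q115(float x) {
    // Quantize-dequantize roundtrip: snaps x to the nearest Q1.15 step.
    return q115_to_float(float_to_q115(x));
}
```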
3. Storage and Compute Model
When `ENABLE_Q115` is active:

- Tensor storage type (`floatX`) is BF16 (`__nv_bfloat16`)
- The cuBLAS low-precision datatype is BF16
- With `SF16_TRUE_FORWARD`, the compute mode is `CUBLAS_COMPUTE_32F_FAST_16BF` (with a compatibility fallback where needed)
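As an illustration of the compute mode named above (not the repository's actual setup code), a compute-type selection with the cuBLASLt API could look like this:

```cuda
#include <cublasLt.h>

// Create a matmul descriptor with the compute mode described above,
// falling back to plain FP32 accumulation where FAST_16BF is unsupported.
cublasLtMatmulDesc_t create_sf16_matmul_desc() {
#if defined(ENABLE_Q115) && defined(SF16_TRUE_FORWARD)
    cublasComputeType_t compute = CUBLAS_COMPUTE_32F_FAST_16BF;
#else
    cublasComputeType_t compute = CUBLAS_COMPUTE_32F;
#endif
    cublasLtMatmulDesc_t desc = nullptr;
    if (cublasLtMatmulDescCreate(&desc, compute, CUDA_R_32F) != CUBLAS_STATUS_SUCCESS) {
        // Compatibility fallback mentioned in the text.
        cublasLtMatmulDescCreate(&desc, CUBLAS_COMPUTE_32F, CUDA_R_32F);
    }
    return desc;
}
```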
This design maintains high GPU performance while enforcing SF16 behavior at explicit forward boundaries.
4. Forward Boundary Enforcement
4.1 Matmul Outputs
Matmul outputs are quantized at the forward boundary using `q115_simulate_kernel`. In `SF16_TRUE_FORWARD` mode, each value undergoes, in order:

1. `simulate_q131(v)` (higher-precision register simulation)
2. `simulate_q115(v)` (SF16 storage boundary)

This kernel is applied to forward outputs (i.e., when not in backward mode) after the cuBLASLt matmul.
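Sketched in isolation, such a boundary kernel could look like the following (the kernel name comes from this document; the element-wise structure and the Q1.31 analogue `simulate_q131` are assumptions):

```cuda
// Boundary kernel sketch: snap each forward matmul output to the SF16 grid.
// floatX is the BF16 storage type from Section 3; simulate_q131 is assumed
// to be the Q1.31 analogue of simulate_q115 (a roundtrip at a 2^31 scale).
__global__ void q115_simulate_kernel(floatX* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = (float)out[i];
#if defined(SF16_TRUE_FORWARD)
    v = simulate_q131(v);               // higher-precision register simulation
#endif
    out[i] = (floatX)simulate_q115(v);  // SF16 storage boundary
}
```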
4.2 Attention
Under Q1.15 mode:

- Dynamic attention scaling is neutralized.
- Softmax inputs are read through `simulate_q115(...)`.
- Softmax outputs are quantized before storage:
  `__stcs(..., (floatX)simulate_q115(ev * norm));`

The standard attention temperature factor \(1 / \sqrt{HS}\) remains unchanged.
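Put together, the softmax boundary might look like this single-row sketch (`preatt`/`att` names and the multi-pass structure are assumptions; only the two `simulate_q115` call sites are documented above):

```cuda
#include <cfloat>
#include <cuda_bf16.h>
typedef __nv_bfloat16 floatX;  // storage type under ENABLE_Q115

// Softmax over one attention row with SF16 boundaries on reads and writes.
__device__ void softmax_row_q115(floatX* att, const floatX* preatt, int len) {
    float maxval = -FLT_MAX;
    for (int t = 0; t < len; t++) {
        // Softmax inputs are read through simulate_q115.
        maxval = fmaxf(maxval, simulate_q115((float)preatt[t]));
    }
    float sum = 0.0f;
    for (int t = 0; t < len; t++) {
        sum += expf(simulate_q115((float)preatt[t]) - maxval);
    }
    float norm = 1.0f / sum;
    for (int t = 0; t < len; t++) {
        float ev = expf(simulate_q115((float)preatt[t]) - maxval);
        // Softmax outputs are quantized before the streaming store.
        __stcs(&att[t], (floatX)simulate_q115(ev * norm));
    }
}
```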
4.3 Logits

With `SF16_TRUE_FORWARD`, logit scaling is disabled, and logit values are read through `simulate_q115(...)`.
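In fragment form, the read side of the loss computation then reduces to the following (buffer and index names are assumptions for illustration):

```cuda
// With logit scaling disabled, each logit passes through simulate_q115
// before the softmax / cross-entropy step. logits, idx, maxval, and
// sum_inv are hypothetical names.
float v = simulate_q115((float)logits[idx]);
float p = expf(v - maxval) * sum_inv;  // probability over SF16-read logits
```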
4.4 Normalization and Residual Paths
Normalization and residual outputs are clamped and quantized:

- Clamp range: the Q1.15 representable range `[-Q115_OVERFLOW_THRESHOLD, Q115_OVERFLOW_THRESHOLD]` (see Section 2)
- Outputs are written using `simulate_q115(...)` in forward kernels
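As an illustration, a forward residual kernel following these two rules could be sketched as (kernel and parameter names are hypothetical):

```cuda
// Residual add with SF16 boundary: clamp to the Q1.15 range, then snap
// to the Q1.15 grid before storing.
__global__ void residual_forward_q115(floatX* out, const floatX* a,
                                      const floatX* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = (float)a[i] + (float)b[i];
    v = fminf(fmaxf(v, -Q115_OVERFLOW_THRESHOLD), Q115_OVERFLOW_THRESHOLD);
    out[i] = (floatX)simulate_q115(v);
}
```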
4.5 Activation and Encoding

Activation functions (e.g., GELU) apply `simulate_q115` on both sides of the computation:

- Inputs are read through `simulate_q115(...)`
- Outputs are quantized with `simulate_q115(...)` before storage

Input embedding sums are quantized before storage.
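A GELU forward following this pattern could be sketched as below (the tanh approximation and the names are assumptions; only the two `simulate_q115` call sites are documented above):

```cuda
// GELU forward (tanh approximation) with SF16 boundaries on read and write.
__global__ void gelu_forward_q115(floatX* out, const floatX* inp, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = simulate_q115((float)inp[i]);  // SF16 read boundary
    float cube = 0.044715f * x * x * x;
    float y = 0.5f * x * (1.0f + tanhf(0.7978845608f * (x + cube)));
    out[i] = (floatX)simulate_q115(y);       // SF16 write boundary
}
```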
5. cuDNN Interaction
To preserve strict SF16 boundary semantics, cuDNN is not supported in Q1.15 mode.
Compile-time enforcement:
```c
#if defined(ENABLE_Q115) && defined(ENABLE_CUDNN)
#error "ENABLE_CUDNN is not supported with ENABLE_Q115. Disable USE_CUDNN for SF16 builds."
#endif
```
Default configuration: `USE_CUDNN` is left disabled for SF16 builds, consistent with the guard above.