
SF16 Implementation #11

@aloshdenny


SF16 Implementation Details

This document describes the SF16 implementation in this repository, focusing on what is actually compiled and executed when SF16_TRUE_FORWARD=1 is enabled.

1. Scope and Terminology

  • SF16 refers to Q1.15 boundary simulation (simulate_q115) applied to tensors while running on native GPU kernels.
  • Q1.15 mode is enabled with -DENABLE_Q115.
  • True-forward behavior is enabled with -DSF16_TRUE_FORWARD=1.

Important: this is not a pure int16 compute pipeline. In Q1.15 mode, the system uses BF16 (__nv_bfloat16) for storage and computation, while SF16 behavior is enforced through explicit quantize/dequantize steps at defined forward boundaries.
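For reference, a build that enables this mode might look like the following. Only the -DENABLE_Q115 and -DSF16_TRUE_FORWARD=1 flags come from this document; the source file name, optimization flags, and output name are illustrative placeholders for this repository's actual Makefile targets.

    # hypothetical build line; adjust file/target names to this repository's Makefile
    nvcc -O3 --use_fast_math -DENABLE_Q115 -DSF16_TRUE_FORWARD=1 \
        train_gpt2.cu -lcublas -lcublasLt -o train_gpt2_sf16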

2. Core Numeric Format (Q1.15)

Q1.15 represents normalized values in the range [-1, 1) using signed 16-bit fixed-point semantics.

  • Scale factor: $2^{15} = 32768$
  • Resolution: $1 / 32768 \approx 3.05 \times 10^{-5}$
  • Integer range: [-32768, 32767]
  • Float range: [-1.0, 0.999969482421875]

Implementation details:

  • Overflow threshold: Q115_OVERFLOW_THRESHOLD = 0.999969482421875

  • float_to_q115 clamps values to [-Q115_OVERFLOW_THRESHOLD, Q115_OVERFLOW_THRESHOLD]

  • simulate_q115(x) performs a quantize–dequantize roundtrip:

    q115_to_float(float_to_q115(x))
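
A minimal sketch of these helpers, assuming the clamp-then-round behavior described above (the function names and the clamp range match this document; the exact rounding mode is an assumption):

    #include <math.h>
    #include <stdint.h>

    // Q1.15 helpers as described above; round-to-nearest is an assumption.
    #define Q115_OVERFLOW_THRESHOLD 0.999969482421875f   // 32767 / 32768

    __host__ __device__ inline int16_t float_to_q115(float x) {
        // clamp to the representable Q1.15 range, then scale and round
        x = fminf(fmaxf(x, -Q115_OVERFLOW_THRESHOLD), Q115_OVERFLOW_THRESHOLD);
        return (int16_t)rintf(x * 32768.0f);
    }

    __host__ __device__ inline float q115_to_float(int16_t q) {
        return (float)q / 32768.0f;
    }

    __host__ __device__ inline float simulate_q115(float x) {
        // quantize-dequantize roundtrip: snaps the value to the nearest Q1.15 grid point
        return q115_to_float(float_to_q115(x));
    }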
    

3. Storage and Compute Model

When ENABLE_Q115 is active:

  • Tensor storage type (floatX) is BF16 (__nv_bfloat16)

  • cuBLAS low-precision datatype is BF16

  • With SF16_TRUE_FORWARD, compute mode is:

    CUBLAS_COMPUTE_32F_FAST_16BF
    

    (with compatibility fallback where needed)

This design maintains high GPU performance while enforcing SF16 behavior at explicit forward boundaries.
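
A minimal sketch of how the matmul descriptor could be configured under these flags; the function wrapper is hypothetical, and plain FP32 compute is shown only as an assumed example of the compatibility fallback:

    #include <cublasLt.h>

    // Hedged sketch: request the fast-BF16 compute path when SF16_TRUE_FORWARD is on,
    // falling back to plain FP32 compute otherwise.
    static cublasLtMatmulDesc_t create_sf16_matmul_desc(void) {
    #if defined(ENABLE_Q115) && defined(SF16_TRUE_FORWARD)
        cublasComputeType_t compute_type = CUBLAS_COMPUTE_32F_FAST_16BF;
    #else
        cublasComputeType_t compute_type = CUBLAS_COMPUTE_32F;
    #endif
        cublasLtMatmulDesc_t desc;
        // scale/accumulate in FP32; A/B/C storage is BF16 (CUDA_R_16BF) in the matmul call
        cublasLtMatmulDescCreate(&desc, compute_type, CUDA_R_32F);
        return desc;
    }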

4. Forward Boundary Enforcement

4.1 Matmul Outputs

Matmul outputs are quantized at the forward boundary using q115_simulate_kernel.

In SF16_TRUE_FORWARD mode, each value undergoes:

  1. simulate_q131(v) (higher-precision register simulation)
  2. simulate_q115(v) (SF16 storage boundary)

This kernel is applied to forward outputs (i.e., when not in backward mode) after cuBLASLt matmul.
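
A minimal sketch of what this boundary kernel could look like. The kernel name follows the document; floatX and simulate_q115 are the types/helpers described earlier, and simulate_q131 is assumed to be the analogous roundtrip at $2^{31}$ resolution (done in double here so the extra precision is not lost to float arithmetic).

    // Hedged sketch of the forward-boundary quantization applied after cuBLASLt matmuls.
    __device__ inline float simulate_q131(float x) {
        const double scale  = 2147483648.0;           // 2^31 (assumed Q1.31 analogue)
        const double thresh = 1.0 - 1.0 / scale;
        double v = fmin(fmax((double)x, -thresh), thresh);
        return (float)(rint(v * scale) / scale);
    }

    __global__ void q115_simulate_kernel(floatX* out, size_t n) {
        size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float v = (float)out[i];
        v = simulate_q131(v);   // higher-precision register simulation
        v = simulate_q115(v);   // SF16 storage boundary
        out[i] = (floatX)v;
    }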

4.2 Attention

Under Q1.15 mode:

  • Dynamic attention scaling is neutralized:

    att_scale = 1.0f
    
  • Softmax inputs are read through simulate_q115(...)

  • Softmax outputs are quantized before storage:

    __stcs(..., (floatX)simulate_q115(ev * norm));
    

The standard attention temperature factor $1 / \sqrt{HS}$ remains unchanged.
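
A condensed, illustrative sketch of the quantized softmax under Q1.15 mode. The variable names ev, norm, and att_scale follow the snippets above; floatX and simulate_q115 are as described earlier; the one-thread-per-row structure is purely for illustration and is not the real kernel's parallelization.

    #include <cfloat>

    // Hedged sketch: row-wise softmax with SF16 boundaries.
    __global__ void softmax_q115_sketch(floatX* att, int rows, int cols) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= rows) return;
        floatX* x = att + (size_t)row * cols;
        const float att_scale = 1.0f;                 // dynamic scaling neutralized
        float maxval = -FLT_MAX;
        for (int j = 0; j < cols; j++)                // inputs read through simulate_q115
            maxval = fmaxf(maxval, simulate_q115((float)x[j]) * att_scale);
        float sumval = 0.0f;
        for (int j = 0; j < cols; j++)
            sumval += expf(simulate_q115((float)x[j]) * att_scale - maxval);
        float norm = 1.0f / sumval;
        for (int j = 0; j < cols; j++) {
            float ev = expf(simulate_q115((float)x[j]) * att_scale - maxval);
            __stcs(&x[j], (floatX)simulate_q115(ev * norm));   // quantize before storage
        }
    }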

4.3 Logits

  • With SF16_TRUE_FORWARD, logit scaling is disabled:

    Q115_LOGIT_SCALE = 1.0f
    
  • Logit values are read through simulate_q115(...)

4.4 Normalization and Residual Paths

Normalization and residual outputs are clamped and quantized:

  • Clamp range:

    [-Q115_OVERFLOW_THRESHOLD, +Q115_OVERFLOW_THRESHOLD]
    
  • Outputs are written using simulate_q115(...) in forward kernels
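
A minimal sketch of the residual path under these rules; the kernel name and launch shape are assumptions, while the clamp range and the simulate_q115 write follow the description above.

    // Hedged sketch: residual add with SF16 boundary enforcement.
    __global__ void residual_forward_q115_sketch(floatX* out, const floatX* inp1,
                                                 const floatX* inp2, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float v = (float)inp1[i] + (float)inp2[i];
        // clamp into the representable Q1.15 range, then quantize at the boundary
        v = fminf(fmaxf(v, -Q115_OVERFLOW_THRESHOLD), Q115_OVERFLOW_THRESHOLD);
        out[i] = (floatX)simulate_q115(v);
    }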

4.5 Activation and Encoding

  • Activation functions (e.g., GELU):

    • Inputs are read via simulate_q115
    • Outputs are written via simulate_q115
  • Input embedding sums are quantized before storage:

    simulate_q115(wte + wpe)
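
A minimal sketch of the activation and encoding boundaries. The tanh-approximation GELU is the usual GPT-2-style formulation and the kernel names are assumptions; in the encoder sketch, gathering by token/position id is omitted and the arrays are treated as already-gathered rows.

    // Hedged sketch: GELU with quantized read and quantized write.
    __global__ void gelu_forward_q115_sketch(floatX* out, const floatX* inp, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float x = simulate_q115((float)inp[i]);                        // quantized read
        float cube = 0.044715f * x * x * x;
        float y = 0.5f * x * (1.0f + tanhf(0.7978845608f * (x + cube)));
        out[i] = (floatX)simulate_q115(y);                             // quantized write
    }

    // Hedged sketch: token + position embedding sum quantized before storage.
    __global__ void encoder_forward_q115_sketch(floatX* out, const floatX* wte,
                                                const floatX* wpe, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        out[i] = (floatX)simulate_q115((float)wte[i] + (float)wpe[i]);
    }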
    

5. cuDNN Interaction

To preserve strict SF16 boundary semantics, cuDNN is not supported in Q1.15 mode.

Compile-time enforcement:

#if defined(ENABLE_Q115) && defined(ENABLE_CUDNN)
#error "ENABLE_CUDNN is not supported with ENABLE_Q115. Disable USE_CUDNN for SF16 builds."
#endif

Default configuration:

USE_CUDNN ?= 0
