
SF16 Implementation #11

@aloshdenny


SF16 Implementation Details

This document describes the SF16 implementation in this repository, focusing on what is actually compiled and executed when SF16_TRUE_FORWARD=1 is enabled.

1. Scope and Terminology

  • SF16 refers to Q1.15 boundary simulation (simulate_q115) applied to tensors while running on native GPU kernels.
  • Q1.15 mode is enabled with -DENABLE_Q115.
  • True-forward behavior is enabled with -DSF16_TRUE_FORWARD=1.

Important: this is not a pure int16 compute pipeline. In Q1.15 mode, the system uses BF16 (__nv_bfloat16) for storage and computation, while SF16 behavior is enforced through explicit quantize/dequantize steps at defined forward boundaries.
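For reference, a build that enables this mode might look like the following. Only the -DENABLE_Q115 and -DSF16_TRUE_FORWARD=1 flags come from this document; the source file name, optimization flags, and output name are illustrative placeholders for this repository's actual Makefile targets.

    # hypothetical build line; adjust file/target names to this repository's Makefile
    nvcc -O3 --use_fast_math -DENABLE_Q115 -DSF16_TRUE_FORWARD=1 \
        train_gpt2.cu -lcublas -lcublasLt -o train_gpt2_sf16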

2. Core Numeric Format (Q1.15)

Q1.15 represents normalized values in the range [-1, 1) using signed 16-bit fixed-point semantics.

  • Scale factor: $2^{15} = 32768$
  • Resolution: $1 / 32768 \approx 3.05 \times 10^{-5}$
  • Integer range: [-32768, 32767]
  • Float range: [-1.0, 0.999969482421875]

Implementation details:

  • Overflow threshold: Q115_OVERFLOW_THRESHOLD = 0.999969482421875

  • float_to_q115 clamps values to [-Q115_OVERFLOW_THRESHOLD, Q115_OVERFLOW_THRESHOLD]

  • simulate_q115(x) performs a quantize–dequantize roundtrip:

    q115_to_float(float_to_q115(x))
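
A minimal sketch of these helpers, assuming the clamp-then-round behavior described above (the function names and the clamp range match this document; the exact rounding mode is an assumption):

    #include <math.h>
    #include <stdint.h>

    // Q1.15 helpers as described above; round-to-nearest is an assumption.
    #define Q115_OVERFLOW_THRESHOLD 0.999969482421875f   // 32767 / 32768

    __host__ __device__ inline int16_t float_to_q115(float x) {
        // clamp to the representable Q1.15 range, then scale and round
        x = fminf(fmaxf(x, -Q115_OVERFLOW_THRESHOLD), Q115_OVERFLOW_THRESHOLD);
        return (int16_t)rintf(x * 32768.0f);
    }

    __host__ __device__ inline float q115_to_float(int16_t q) {
        return (float)q / 32768.0f;
    }

    __host__ __device__ inline float simulate_q115(float x) {
        // quantize-dequantize roundtrip: snaps the value to the nearest Q1.15 grid point
        return q115_to_float(float_to_q115(x));
    }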
    

3. Storage and Compute Model

When ENABLE_Q115 is active:

  • Tensor storage type (floatX) is BF16 (__nv_bfloat16)

  • cuBLAS low-precision datatype is BF16

  • With SF16_TRUE_FORWARD, compute mode is:

    CUBLAS_COMPUTE_32F_FAST_16BF
    

    (with compatibility fallback where needed)

This design maintains high GPU performance while enforcing SF16 behavior at explicit forward boundaries.
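
A minimal sketch of how the matmul descriptor could be configured under these flags; the function wrapper is hypothetical, and plain FP32 compute is shown only as an assumed example of the compatibility fallback:

    #include <cublasLt.h>

    // Hedged sketch: request the fast-BF16 compute path when SF16_TRUE_FORWARD is on,
    // falling back to plain FP32 compute otherwise.
    static cublasLtMatmulDesc_t create_sf16_matmul_desc(void) {
    #if defined(ENABLE_Q115) && defined(SF16_TRUE_FORWARD)
        cublasComputeType_t compute_type = CUBLAS_COMPUTE_32F_FAST_16BF;
    #else
        cublasComputeType_t compute_type = CUBLAS_COMPUTE_32F;
    #endif
        cublasLtMatmulDesc_t desc;
        // scale/accumulate in FP32; A/B/C storage is BF16 (CUDA_R_16BF) in the matmul call
        cublasLtMatmulDescCreate(&desc, compute_type, CUDA_R_32F);
        return desc;
    }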

4. Forward Boundary Enforcement

4.1 Matmul Outputs

Matmul outputs are quantized at the forward boundary using q115_simulate_kernel.

In SF16_TRUE_FORWARD mode, each value undergoes:

  1. simulate_q131(v) (higher-precision register simulation)
  2. simulate_q115(v) (SF16 storage boundary)

This kernel is applied to forward outputs (i.e., when not in backward mode) after cuBLASLt matmul.
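
A minimal sketch of what this boundary kernel could look like. The kernel name follows the document; floatX and simulate_q115 are the types/helpers described earlier, and simulate_q131 is assumed to be the analogous roundtrip at $2^{31}$ resolution (done in double here so the extra precision is not lost to float arithmetic).

    // Hedged sketch of the forward-boundary quantization applied after cuBLASLt matmuls.
    __device__ inline float simulate_q131(float x) {
        const double scale  = 2147483648.0;           // 2^31 (assumed Q1.31 analogue)
        const double thresh = 1.0 - 1.0 / scale;
        double v = fmin(fmax((double)x, -thresh), thresh);
        return (float)(rint(v * scale) / scale);
    }

    __global__ void q115_simulate_kernel(floatX* out, size_t n) {
        size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float v = (float)out[i];
        v = simulate_q131(v);   // higher-precision register simulation
        v = simulate_q115(v);   // SF16 storage boundary
        out[i] = (floatX)v;
    }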

4.2 Attention

Under Q1.15 mode:

  • Dynamic attention scaling is neutralized:

    att_scale = 1.0f
    
  • Softmax inputs are read through simulate_q115(...)

  • Softmax outputs are quantized before storage:

    __stcs(..., (floatX)simulate_q115(ev * norm));
    

The standard attention temperature factor $1 / \sqrt{HS}$ remains unchanged.
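
A condensed, illustrative sketch of the quantized softmax under Q1.15 mode. The variable names ev, norm, and att_scale follow the snippets above; floatX and simulate_q115 are as described earlier; the one-thread-per-row structure is purely for illustration and is not the real kernel's parallelization.

    #include <cfloat>

    // Hedged sketch: row-wise softmax with SF16 boundaries.
    __global__ void softmax_q115_sketch(floatX* att, int rows, int cols) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= rows) return;
        floatX* x = att + (size_t)row * cols;
        const float att_scale = 1.0f;                 // dynamic scaling neutralized
        float maxval = -FLT_MAX;
        for (int j = 0; j < cols; j++)                // inputs read through simulate_q115
            maxval = fmaxf(maxval, simulate_q115((float)x[j]) * att_scale);
        float sumval = 0.0f;
        for (int j = 0; j < cols; j++)
            sumval += expf(simulate_q115((float)x[j]) * att_scale - maxval);
        float norm = 1.0f / sumval;
        for (int j = 0; j < cols; j++) {
            float ev = expf(simulate_q115((float)x[j]) * att_scale - maxval);
            __stcs(&x[j], (floatX)simulate_q115(ev * norm));   // quantize before storage
        }
    }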

4.3 Logits

  • With SF16_TRUE_FORWARD, logit scaling is disabled:

    Q115_LOGIT_SCALE = 1.0f
    
  • Logit values are read through simulate_q115(...)

4.4 Normalization and Residual Paths

Normalization and residual outputs are clamped and quantized:

  • Clamp range:

    [-Q115_OVERFLOW_THRESHOLD, +Q115_OVERFLOW_THRESHOLD]
    
  • Outputs are written using simulate_q115(...) in forward kernels
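
A minimal sketch of the residual path under these rules; the kernel name and launch shape are assumptions, while the clamp range and the simulate_q115 write follow the description above.

    // Hedged sketch: residual add with SF16 boundary enforcement.
    __global__ void residual_forward_q115_sketch(floatX* out, const floatX* inp1,
                                                 const floatX* inp2, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float v = (float)inp1[i] + (float)inp2[i];
        // clamp into the representable Q1.15 range, then quantize at the boundary
        v = fminf(fmaxf(v, -Q115_OVERFLOW_THRESHOLD), Q115_OVERFLOW_THRESHOLD);
        out[i] = (floatX)simulate_q115(v);
    }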

4.5 Activation and Encoding

  • Activation functions (e.g., GELU):

    • Inputs are read via simulate_q115
    • Outputs are written via simulate_q115
  • Input embedding sums are quantized before storage:

    simulate_q115(wte + wpe)
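
A minimal sketch of the activation and encoding boundaries. The tanh-approximation GELU is the usual GPT-2-style formulation and the kernel names are assumptions; in the encoder sketch, gathering by token/position id is omitted and the arrays are treated as already-gathered rows.

    // Hedged sketch: GELU with quantized read and quantized write.
    __global__ void gelu_forward_q115_sketch(floatX* out, const floatX* inp, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float x = simulate_q115((float)inp[i]);                        // quantized read
        float cube = 0.044715f * x * x * x;
        float y = 0.5f * x * (1.0f + tanhf(0.7978845608f * (x + cube)));
        out[i] = (floatX)simulate_q115(y);                             // quantized write
    }

    // Hedged sketch: token + position embedding sum quantized before storage.
    __global__ void encoder_forward_q115_sketch(floatX* out, const floatX* wte,
                                                const floatX* wpe, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        out[i] = (floatX)simulate_q115((float)wte[i] + (float)wpe[i]);
    }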
    

5. cuDNN Interaction

To preserve strict SF16 boundary semantics, cuDNN is not supported in Q1.15 mode.

Compile-time enforcement:

#if defined(ENABLE_Q115) && defined(ENABLE_CUDNN)
#error "ENABLE_CUDNN is not supported with ENABLE_Q115. Disable USE_CUDNN for SF16 builds."
#endif

Default configuration:

USE_CUDNN ?= 0
