NestedFP: High-Performance, Memory-Efficient Dual-Precision Floating Point Support for LLMs [Paper]
NestedFP is a high-performance, memory-efficient dual-precision framework for LLM serving that supports both FP8 and FP16 inference from a single FP16 model without additional memory overhead. It introduces a lightweight FP16 → (FP8 + residual) decomposition and CUTLASS-based custom kernels integrated into vLLM, delivering FP8 accuracy on par with standard quantized FP8 models while preserving full FP16 precision. NestedFP further enables dynamic, SLO-aware serving by allowing runtime precision selection.
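As a rough conceptual illustration only (not the actual CUTLASS kernel path, and not how NestedFP packs the residual bits; see the paper for the real scheme), the FP16 → (FP8 + residual) idea can be sketched with stock PyTorch casts:

```python
# Conceptual sketch only: NestedFP's real decomposition, residual packing,
# and custom CUTLASS kernels are described in the paper. This just shows
# the FP16 -> (FP8 + residual) idea with plain PyTorch casts.
import torch

w_fp16 = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")

# FP8 component, usable on its own by the low-precision GEMM path.
w_fp8 = w_fp16.to(torch.float8_e4m3fn)

# Residual: the part of the FP16 weight that the FP8 cast loses.
residual = w_fp16 - w_fp8.to(torch.float16)

# The FP16 path recombines both components to recover the original weights
# (up to floating-point rounding), so no second copy of the model is kept.
w_recombined = w_fp8.to(torch.float16) + residual
```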
- CUDA 12.6
Note: Tested on Ubuntu 22.04 with GCC/G++ 11.4.0 and 12.3.0
sudo apt install ninja-build # For faster compilation

Run the following command to install NestedFP and its dependencies:

./install.sh

NestedFP/
├── vllm/ # vLLM source with NestedFP modifications
├── cutlass/ # CUTLASS source with custom kernels
├── nestedfp/ # Python–C++ interface and build scripts for custom CUTLASS kernels
├── scripts/
│ ├── acc_eval.sh # accuracy evaluation script
│ ├── vllm_simple_server.py # vLLM server launcher for streaming requests
│ ├── vllm_simple_client.py # vLLM client for sending requests
│ └── kernel/
│ ├── run_fp16_single.sh # FP16 kernel search (single GPU)
│ └── run_fp16_multi.sh # FP16 kernel search (multi GPU)
└── example/ # example usage scripts
NestedFP requires explicitly selecting the precision mode before each experiment by modifying two files:
- NestedFP/vllm/vllm/model_executor/layers/quantization/nestedfp.py
- NestedFP/vllm/vllm/v1/core/sched/scheduler.py
| Mode | nestedfp.py Line 93 | nestedfp.py Line 95 | scheduler.py Lines 445-447 |
|---|---|---|---|
| FP8 | ✅ Uncomment | ❌ Comment | ❌ Comment |
| FP16 | ❌ Comment | ✅ Uncomment | ❌ Comment |
| Dynamic Precision Selection | ✅ Uncomment | ❌ Comment | ✅ Uncomment |
Note: Only one precision mode can be active at a time. If performance results are inconsistent, clear .cache/vllm/torch_compile_cache/ after switching experiment modes.
For accuracy evaluation, configure the precision mode to FP8 (see Precision Mode Configuration).
Command Format:
./scripts/acc_eval.sh <GPU_ID> <MODEL_PATH> <TASK> --nestedfp

Parameters:
- <GPU_ID> — GPU index to use for evaluation
- <MODEL_PATH> — Path to the model directory
- <TASK> — lm_eval task to run
Example:

./scripts/acc_eval.sh 0 Mistral-Small-24B-Base-2501 bbh_zeroshot --nestedfp

All evaluation results will be saved to: ./results/acc_eval/
The kernel search script benchmarks 80 candidate CUTLASS kernels for each target GEMM configuration (including batch size) on NVIDIA H100 GPUs and reports their performance for manual kernel selection.
Optimal kernels for each GEMM shape:
You can find the reference mapping from GEMM shapes to their optimal CUTLASS kernels in our customized vLLM at: NestedFP/vllm/vllm/model_executor/layers/quantization/utils/nestedfp_utils.py
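As a purely hypothetical sketch of how such a shape-to-kernel lookup might be consumed (the names KERNEL_TABLE and pick_kernel below are illustrative placeholders, not part of the codebase; the real mapping lives in nestedfp_utils.py):

```python
# Hypothetical illustration only: KERNEL_TABLE, pick_kernel, and the kernel
# ids below are placeholders. The actual mapping is defined in
# NestedFP/vllm/vllm/model_executor/layers/quantization/utils/nestedfp_utils.py.
KERNEL_TABLE = {
    # (N, K) -> list of (max_M, kernel_id), sorted by max_M
    (5120, 32768): [(256, 3), (1024, 17), (2048, 42)],
}

def pick_kernel(n: int, k: int, m: int) -> int:
    """Return the kernel id whose M bucket covers the requested batch size."""
    buckets = KERNEL_TABLE[(n, k)]
    for max_m, kernel_id in buckets:
        if m <= max_m:
            return kernel_id
    # Fall back to the largest benchmarked bucket when M exceeds the range.
    return buckets[-1][1]
```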
For kernel search, configure the precision mode to FP16 (see Precision Mode Configuration).
Command Format:
./scripts/kernel/run_fp16_single.sh N K GPU M_START M_END

Parameters:
- N — N dimension of the GEMM shape
- K — K dimension of the GEMM shape
- GPU — GPU index to use for the search
- M_START — Starting M dimension for the search range
- M_END — Ending M dimension for the search range
FP16 Kernel Search:
./scripts/kernel/run_fp16_single.sh 5120 32768 0 32 2048

This command searches for the optimal kernel with:
- N = 5120, K = 32768
- GPU 0
- M dimension range: 32 to 2048 (in steps of 32)
For throughput test, configure the precision mode to FP16 (see Precision Mode Configuration).
First, load your model using the following command:
python scripts/vllm_simple_server.py \
--model <MODEL_PATH> \
--max-num-batched-tokens 8192 \
--port 8000

Parameters:
- <MODEL_PATH> — Path to the model directory
- --quantization nestedfp — Enables NestedFP mode (omit this flag for baseline FP16 mode)
When the model loads successfully, you'll see:
INFO: Started server process [200291]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
In a separate terminal, execute:
python scripts/vllm_simple_client.py \
--model <MODEL_PATH> \
--api-url http://0.0.0.0:8000/v1/completions \
--test-mode throughput

Parameters:
- <MODEL_PATH> — Path to the model directory
- --nestedfp — Enables NestedFP mode (omit this flag for baseline FP16 mode)
Vanilla FP16 Execution:
# Start server
python scripts/vllm_simple_server.py \
--model Mistral-Small-24B-Base-2501 \
--max-num-batched-tokens 8192 \
--port 8000
# Run client
python scripts/vllm_simple_client.py \
--model Mistral-Small-24B-Base-2501 \
--api-url http://0.0.0.0:8000/v1/completions \
--test-mode throughput

NestedFP FP16 Execution:
# Start server with NestedFP
python scripts/vllm_simple_server.py \
--model Mistral-Small-24B-Base-2501 \
--max-num-batched-tokens 8192 \
--port 8000 \
--quantization nestedfp
# Run client with NestedFP
python scripts/vllm_simple_client.py \
--model Mistral-Small-24B-Base-2501 \
--api-url http://0.0.0.0:8000/v1/completions \
--test-mode throughput \
--nestedfp

You can customize the test parameters by passing additional options to vllm_simple_client.py:
- Input/output token length
- Batch size
See the script's help documentation for all available options.
For the dynamic precision selection experiment, configure the precision mode to Dynamic Precision Selection (see Precision Mode Configuration). Trace files are provided in the trace directory. Precision automatically switches to FP8 when the number of batched tokens exceeds 1024.
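A minimal sketch of the switching rule described above (the function name and return values are illustrative only; the actual decision is made inside the modified vLLM scheduler):

```python
# Illustrative sketch of the dynamic precision policy described above; the
# real logic lives in the modified scheduler.py, and select_precision is
# not part of the codebase.
FP8_TOKEN_THRESHOLD = 1024  # switch point used in this experiment setup

def select_precision(num_batched_tokens: int) -> str:
    """Choose the GEMM precision for the current scheduling step."""
    # Large batches are throughput-critical, so they take the FP8 path;
    # smaller batches keep the full FP16 path.
    return "fp8" if num_batched_tokens > FP8_TOKEN_THRESHOLD else "fp16"
```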
Load your model using the following command:
python scripts/vllm_simple_server.py \
--model <MODEL_PATH> \
--max-num-batched-tokens 2048 \
--port 8000 \
--tensor-parallel-size <TP_SIZE>

Parameters:
- <MODEL_PATH> — Path to the model directory
- <TP_SIZE> — Tensor parallel size
- --quantization nestedfp — Enables NestedFP mode (omit this flag for baseline FP16 mode)
When the model loads successfully, you'll see:
INFO: Started server process [200291]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
In a separate terminal, execute:
python scripts/vllm_simple_client.py \
--model <MODEL_PATH> \
--api-url http://0.0.0.0:8000/v1/completions \
--num-requests 1000 \
--middle-ratio 0.7 \
--test-mode trace \
--trace-file <TRACE_FILE>

Parameters:
- <MODEL_PATH> — Path to the model directory
- <TRACE_FILE> — Path to the trace file
- --nestedfp — Enables NestedFP mode (omit this flag for baseline FP16 mode)
Vanilla Trace-based Test:
# Start server
python scripts/vllm_simple_server.py \
--model Llama-3.1-70B \
--max-num-batched-tokens 2048 \
--port 8000 \
--tensor-parallel-size 4
# Run client
python scripts/vllm_simple_client.py \
--model Llama-3.1-70B \
--api-url http://0.0.0.0:8000/v1/completions \
--num-requests 1000 \
--middle-ratio 0.7 \
--test-mode trace \
--trace-file trace/azure_conv_0514_1400_20min_10.0x_tc.csv

NestedFP Trace-based Test:
# Start server with NestedFP
python scripts/vllm_simple_server.py \
--model Llama-3.1-70B \
--max-num-batched-tokens 2048 \
--port 8010 \
--tensor-parallel-size 4 \
--quantization nestedfp
# Run client with NestedFP
python scripts/vllm_simple_client.py \
--model Llama-3.1-70B \
--api-url http://0.0.0.0:8010/v1/completions \
--num-requests 1000 \
--middle-ratio 0.7 \
--test-mode trace \
--trace-file trace/azure_conv_0514_1400_20min_10.0x_tc.csv \
--nestedfp

Please cite our paper if you find our work useful:
@inproceedings{lee2025nestedfp,
title={NestedFP: High-Performance, Memory-Efficient Dual-Precision Floating Point Support for LLMs},
author={Haeun Lee and Omin Kwon and Yeonhong Park and Jae W. Lee},
year={2025},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems}
}