NestedFP: High-Performance, Memory-Efficient Dual-Precision Floating Point Support for LLMs [Paper]
NestedFP is a high-performance, memory-efficient dual-precision framework for LLM serving that supports both FP8 and FP16 inference from a single FP16 model without additional memory overhead. It introduces a lightweight FP16 → (FP8 + residual) decomposition and CUTLASS-based custom kernels integrated into vLLM, delivering FP8 accuracy on par with standard quantized FP8 models while preserving full FP16 precision. NestedFP further enables dynamic, SLO-aware serving by allowing runtime precision selection.
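As a rough conceptual illustration only (not the actual CUTLASS kernel path, and not how NestedFP packs the residual bits; see the paper for the real scheme), the FP16 → (FP8 + residual) idea can be sketched with stock PyTorch casts:

```python
# Conceptual sketch only: NestedFP's real decomposition, residual packing,
# and custom CUTLASS kernels are described in the paper. This just shows
# the FP16 -> (FP8 + residual) idea with plain PyTorch casts.
import torch

w_fp16 = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")

# FP8 component, usable on its own by the low-precision GEMM path.
w_fp8 = w_fp16.to(torch.float8_e4m3fn)

# Residual: the part of the FP16 weight that the FP8 cast loses.
residual = w_fp16 - w_fp8.to(torch.float16)

# The FP16 path recombines both components to recover the original weights
# (up to floating-point rounding), so no second copy of the model is kept.
w_recombined = w_fp8.to(torch.float16) + residual
```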
- CUDA 12.6
Note: Tested on Ubuntu 22.04 with GCC/G++ 11.4.0 and 12.3.0
sudo apt install ninja-build # For faster compilation

Run the following command to install NestedFP and its dependencies:

./install.sh

NestedFP/
├── vllm/ # vLLM source with NestedFP modifications
├── cutlass/ # CUTLASS source with custom kernels
├── nestedfp/ # Python–C++ interface and build scripts for custom CUTLASS kernels
├── scripts/
│ ├── acc_eval.sh # accuracy evaluation script
│ ├── vllm_simple_server.py # vLLM server launcher for streaming requests
│ ├── vllm_simple_client.py # vLLM client for sending requests
│ └── kernel/
│ ├── run_fp16_single.sh # FP16 kernel search (single GPU)
│ └── run_fp16_multi.sh # FP16 kernel search (multi GPU)
└── example/ # example usage scripts
NestedFP requires explicitly selecting the precision mode before each experiment by modifying two files:
- NestedFP/vllm/vllm/model_executor/layers/quantization/nestedfp.py
- NestedFP/vllm/vllm/v1/core/sched/scheduler.py
| Mode | nestedfp.py Line 93 | nestedfp.py Line 95 | scheduler.py Lines 445-447 |
|---|---|---|---|
| FP8 | ✅ Uncomment | ❌ Comment | ❌ Comment |
| FP16 | ❌ Comment | ✅ Uncomment | ❌ Comment |
| Dynamic Precision Selection | ✅ Uncomment | ❌ Comment | ✅ Uncomment |
Note: Only one precision mode can be active at a time. If performance results are inconsistent, clear .cache/vllm/torch_compile_cache/ after switching experiment modes.
For accuracy evaluation, configure the precision mode to FP8 (see Precision Mode Configuration).
Command Format:
./scripts/acc_eval.sh <GPU_ID> <MODEL_PATH> <TASK> --nestedfp

Parameters:
- <GPU_ID> — GPU index to use for evaluation
- <MODEL_PATH> — Path to the model directory
- <TASK> — lm_eval task to run
Example:

./scripts/acc_eval.sh 0 Mistral-Small-24B-Base-2501 bbh_zeroshot --nestedfp

All evaluation results will be saved to: ./results/acc_eval/
The kernel search script benchmarks 80 candidate CUTLASS kernels for each target GEMM configuration (including batch size) on NVIDIA H100 GPUs and reports their performance for manual kernel selection.
Optimal kernels for each GEMM shape:
You can find the reference mapping from GEMM shapes to their optimal CUTLASS kernels in our customized vLLM at: NestedFP/vllm/vllm/model_executor/layers/quantization/utils/nestedfp_utils.py
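As a purely hypothetical sketch of how such a shape-to-kernel lookup might be consumed (the names KERNEL_TABLE and pick_kernel below are illustrative placeholders, not part of the codebase; the real mapping lives in nestedfp_utils.py):

```python
# Hypothetical illustration only: KERNEL_TABLE, pick_kernel, and the kernel
# ids below are placeholders. The actual mapping is defined in
# NestedFP/vllm/vllm/model_executor/layers/quantization/utils/nestedfp_utils.py.
KERNEL_TABLE = {
    # (N, K) -> list of (max_M, kernel_id), sorted by max_M
    (5120, 32768): [(256, 3), (1024, 17), (2048, 42)],
}

def pick_kernel(n: int, k: int, m: int) -> int:
    """Return the kernel id whose M bucket covers the requested batch size."""
    buckets = KERNEL_TABLE[(n, k)]
    for max_m, kernel_id in buckets:
        if m <= max_m:
            return kernel_id
    # Fall back to the largest benchmarked bucket when M exceeds the range.
    return buckets[-1][1]
```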
For kernel search, configure the precision mode to FP16 (see Precision Mode Configuration).
Command Format:
./scripts/kernel/run_fp16_single.sh N K GPU M_START M_END

Parameters:
- N — N dimension of the GEMM shape
- K — K dimension of the GEMM shape
- GPU — GPU index to use for the search
- M_START — Starting M dimension for the search range
- M_END — Ending M dimension for the search range
FP16 Kernel Search:
./scripts/kernel/run_fp16_single.sh 5120 32768 0 32 2048

This command searches for the optimal kernel with:
- N = 5120, K = 32768
- GPU 0
- M dimension range: 32 to 2048 (in steps of 32)
For throughput test, configure the precision mode to FP16 (see Precision Mode Configuration).
First, load your model using the following command:
python scripts/vllm_simple_server.py \
--model <MODEL_PATH> \
--max-num-batched-tokens 8192 \
--port 8000

Parameters:
- <MODEL_PATH> — Path to the model directory
- --quantization nestedfp — Enables NestedFP mode (omit this flag for baseline FP16 mode)
When the model loads successfully, you'll see:
INFO: Started server process [200291]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
In a separate terminal, execute:
python scripts/vllm_simple_client.py \
--model <MODEL_PATH> \
--api-url http://0.0.0.0:8000/v1/completions \
--test-mode throughput

Parameters:
- <MODEL_PATH> — Path to the model directory
- --nestedfp — Enables NestedFP mode (omit this flag for baseline FP16 mode)
Vanilla FP16 Execution:
# Start server
python scripts/vllm_simple_server.py \
--model Mistral-Small-24B-Base-2501 \
--max-num-batched-tokens 8192 \
--port 8000
# Run client
python scripts/vllm_simple_client.py \
--model Mistral-Small-24B-Base-2501 \
--api-url http://0.0.0.0:8000/v1/completions \
--test-mode throughput

NestedFP FP16 Execution:
# Start server with NestedFP
python scripts/vllm_simple_server.py \
--model Mistral-Small-24B-Base-2501 \
--max-num-batched-tokens 8192 \
--port 8000 \
--quantization nestedfp
# Run client with NestedFP
python scripts/vllm_simple_client.py \
--model Mistral-Small-24B-Base-2501 \
--api-url http://0.0.0.0:8000/v1/completions \
--test-mode throughput \
--nestedfp

You can customize the test parameters by passing additional options to vllm_simple_client.py:
- Input/output token length
- Batch size
See the script's help documentation for all available options.
For the dynamic precision selection experiment, configure the precision mode to Dynamic Precision Selection (see Precision Mode Configuration). Trace files are provided in the trace directory. Precision automatically switches to FP8 when the number of batched tokens exceeds 1024.
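A minimal sketch of the switching rule described above (the function name and return values are illustrative only; the actual decision is made inside the modified vLLM scheduler):

```python
# Illustrative sketch of the dynamic precision policy described above; the
# real logic lives in the modified scheduler.py, and select_precision is
# not part of the codebase.
FP8_TOKEN_THRESHOLD = 1024  # switch point used in this experiment setup

def select_precision(num_batched_tokens: int) -> str:
    """Choose the GEMM precision for the current scheduling step."""
    # Large batches are throughput-critical, so they take the FP8 path;
    # smaller batches keep the full FP16 path.
    return "fp8" if num_batched_tokens > FP8_TOKEN_THRESHOLD else "fp16"
```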
Load your model using the following command:
python scripts/vllm_simple_server.py \
--model <MODEL_PATH> \
--max-num-batched-tokens 2048 \
--port 8000 \
--tensor-parallel-size <TP_SIZE>

Parameters:
- <MODEL_PATH> — Path to the model directory
- <TP_SIZE> — Tensor parallel size
- --quantization nestedfp — Enables NestedFP mode (omit this flag for baseline FP16 mode)
When the model loads successfully, you'll see:
INFO: Started server process [200291]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
In a separate terminal, execute:
python scripts/vllm_simple_client.py \
--model <MODEL_PATH> \
--api-url http://0.0.0.0:8000/v1/completions \
--num-requests 1000 \
--middle-ratio 0.7 \
--test-mode trace \
--trace-file <TRACE_FILE>

Parameters:
- <MODEL_PATH> — Path to the model directory
- <TRACE_FILE> — Path to the trace file
- --nestedfp — Enables NestedFP mode (omit this flag for baseline FP16 mode)
Vanilla Trace-based Test:
# Start server
python scripts/vllm_simple_server.py \
--model Llama-3.1-70B \
--max-num-batched-tokens 2048 \
--port 8000 \
--tensor-parallel-size 4
# Run client
python scripts/vllm_simple_client.py \
--model Llama-3.1-70B \
--api-url http://0.0.0.0:8000/v1/completions \
--num-requests 1000 \
--middle-ratio 0.7 \
--test-mode trace \
--trace-file trace/azure_conv_0514_1400_20min_10.0x_tc.csv

NestedFP Trace-based Test:
# Start server with NestedFP
python scripts/vllm_simple_server.py \
--model Llama-3.1-70B \
--max-num-batched-tokens 2048 \
--port 8010 \
--tensor-parallel-size 4 \
--quantization nestedfp
# Run client with NestedFP
python scripts/vllm_simple_client.py \
--model Llama-3.1-70B \
--api-url http://0.0.0.0:8010/v1/completions \
--num-requests 1000 \
--middle-ratio 0.7 \
--test-mode trace \
--trace-file trace/azure_conv_0514_1400_20min_10.0x_tc.csv \
--nestedfp

Please cite our paper if you find our work useful:
@inproceedings{lee2025nestedfp,
title={NestedFP: High-Performance, Memory-Efficient Dual-Precision Floating Point Support for LLMs},
author={Haeun Lee and Omin Kwon and Yeonhong Park and Jae W. Lee},
year={2025},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems}
}