1 change: 1 addition & 0 deletions CHANGELOG.rst
@@ -14,6 +14,7 @@ NVIDIA Model Optimizer Changelog (Linux)
- Add support for Kimi K2 Thinking model quantization from the original int4 checkpoint.
- Add support for ``params`` constraint based automatic neural architecture search in Minitron pruning (``mcore_minitron``) as an alternative to manual pruning (using ``export_config``). See `examples/pruning/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/pruning>`_ for more details on its usage.
- Add support for calibration data with multiple samples in ``npz`` format in the ONNX Autocast workflow.
- Add support for vLLM fakequant reload using ModelOpt state for HF models. See `examples/vllm_serve/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/vllm_serve#load-qatptq-model-and-serve-in-vllm-wip>`_ for more details.

0.41 (2026-01-19)
^^^^^^^^^^^^^^^^^
58 changes: 46 additions & 12 deletions examples/vllm_serve/README.md
@@ -23,9 +23,11 @@ You can either edit the `quant_config` dictionary in `vllm_serve_fakequant.py`,
| Environment Variable | Description | Default |
|-----------------|--------------------------------------------------|---------------------|
| QUANT_DATASET | Dataset name for calibration | cnn_dailymail |
| QUANT_CALIB_SIZE| Number of samples used for calibration | 512 |
| QUANT_CFG | Quantization config (e.g., `NVFP4_DEFAULT_CFG`) | None |
| KV_QUANT_CFG | KV-cache quantization config | None |
| QUANT_FILE_PATH | Optional path to exported quantizer state dict `quantizer_state.pth` (from MCore export) | None |
| MODELOPT_STATE_PATH | Optional path to exported `vllm_fq_modelopt_state.pth` from HF export (restores quantizer state and parameters) | None |
| CALIB_BATCH_SIZE | Calibration batch size | 1 |

Set these variables in your shell or Docker environment as needed to customize calibration.
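
For example, a run that calibrates on a smaller sample count before serving might look like the following sketch (the dataset, sample count, and batch size are illustrative, not recommendations):

```bash
# Illustrative values; omit any variable to fall back to the defaults listed above.
QUANT_DATASET=cnn_dailymail \
QUANT_CALIB_SIZE=256 \
QUANT_CFG=NVFP4_DEFAULT_CFG \
CALIB_BATCH_SIZE=4 \
python vllm_serve_fakequant.py <model_path> -tp 8 --host 0.0.0.0 --port 8000
```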

@@ -56,21 +58,53 @@ lm_eval --model local-completions --tasks gsm8k --model_args model=<model_name>,

## Load QAT/PTQ model and serve in vLLM (WIP)

Step 1: export the model with bf16 weights and quantizer state. To export the model:

- For **HF** models, use `hf_ptq_export.py`:

```bash
python hf_ptq_export.py \
    --pyt_ckpt_path <MODEL_PATH> \
    --quant_cfg NVFP4_DEFAULT_CFG \
    --export_path <EXPORT_DIR> \
    --trust_remote_code
```

This creates `<EXPORT_DIR>/vllm_fq_modelopt_state.pth` (ModelOpt quantizer state for vLLM fake-quant reload) and saves the HF-exported model under `<EXPORT_DIR>` (config/tokenizer/weights).
Note: `--pyt_ckpt_path` can point to either an HF checkpoint or a ModelOpt-saved checkpoint (e.g., a QAT/QAD checkpoint produced by `examples/llm_qat/main.py`). If the input checkpoint is already quantized, the script will **skip re-quantization** and only export artifacts for vLLM fakequant reload; a sketch of such a re-export is shown after the MCore example below.

- For **MCore** models, use `modelopt.torch.export.export_mcore_gpt_to_hf_vllm_fq`:

```python
from modelopt.torch.export import export_mcore_gpt_to_hf_vllm_fq

export_mcore_gpt_to_hf_vllm_fq(
    unwrapped_model,              # Quantized MCore model
    args.pretrained_model_name,   # HF model id/path (for config/tokenizer)
    export_dir=args.export_dir,   # Directory where exported files will be stored
)
```

This generates `quantizer_state.pth`, which contains quantizer tensors for vLLM reload via `QUANT_FILE_PATH`.
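
As noted in the HF section above, an already-quantized checkpoint can be re-exported directly. A sketch, assuming a QAT/QAD checkpoint directory produced by `examples/llm_qat/main.py` (the path is a placeholder):

```bash
# <QAT_CHECKPOINT_DIR> is a placeholder; because the checkpoint is already quantized,
# hf_ptq_export.py skips re-quantization and only exports the vLLM fakequant reload artifacts.
python hf_ptq_export.py \
    --pyt_ckpt_path <QAT_CHECKPOINT_DIR> \
    --quant_cfg NVFP4_DEFAULT_CFG \
    --export_path <EXPORT_DIR> \
    --trust_remote_code
```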

Step 2: use the exported artifacts when serving (a quick smoke test of the running server is shown after these commands):

- **HF export**: pass the exported `vllm_fq_modelopt_state.pth` via `MODELOPT_STATE_PATH`

```bash
# HF
MODELOPT_STATE_PATH=<vllm_fq_modelopt_state.pth> python vllm_serve_fakequant.py <model_path> -tp 8 --host 0.0.0.0 --port 8000
```

- **MCore export**: pass the exported `quantizer_state.pth` via `QUANT_FILE_PATH` and set `QUANT_CFG` to match the MCore quantization recipe

```bash
# MCore
QUANT_CFG=<quant_cfg> QUANT_FILE_PATH=<quantizer_state.pth> python vllm_serve_fakequant.py <model_path> -tp 8 --host 0.0.0.0 --port 8000
```
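
Once the server is up, a quick request is an easy sanity check. This is a minimal sketch, assuming `vllm_serve_fakequant.py` exposes vLLM's standard OpenAI-compatible API (the `lm_eval local-completions` example above relies on the same endpoint):

```bash
# Minimal smoke test; the /v1/completions route assumes the standard vLLM OpenAI-compatible server.
curl http://0.0.0.0:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "<model_name>", "prompt": "The capital of France is", "max_tokens": 16}'
```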

## Known Problems

1. **MCore reload does not use `MODELOPT_STATE_PATH`**; use `QUANT_FILE_PATH` and make sure `QUANT_CFG` matches the quantization recipe used for the original MCore model (otherwise quantizer keys/config won’t align).
2. AWQ reload is not supported yet.
3. KV cache quantization export and reload are not supported in MCore yet.