1 change: 1 addition & 0 deletions CHANGELOG.rst
@@ -14,6 +14,7 @@ NVIDIA Model Optimizer Changelog (Linux)
- Add support for Kimi K2 Thinking model quantization from the original int4 checkpoint.
- Add support for ``params`` constraint based automatic neural architecture search in Minitron pruning (``mcore_minitron``) as an alternative to manual pruning (using ``export_config``). See `examples/pruning/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/pruning>`_ for more details on its usage.
- Add support for calibration data with multiple samples in ``npz`` format in the ONNX Autocast workflow.
- Add support for vLLM fakequant reload using ModelOpt state for HF models. See `examples/vllm_serve/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/vllm_serve#load-qatptq-model-and-serve-in-vllm-wip>`_ for more details.

0.41 (2026-01-19)
^^^^^^^^^^^^^^^^^
58 changes: 46 additions & 12 deletions examples/vllm_serve/README.md
@@ -23,9 +23,11 @@ You can either edit the `quant_config` dictionary in `vllm_serve_fakequant.py`,
| Environment Variable | Description | Default |
|-----------------|--------------------------------------------------|---------------------|
| QUANT_DATASET | Dataset name for calibration | cnn_dailymail |
| QUANT_CALIB_SIZE| Number of samples used for calibration | 512 |
| QUANT_CFG | Quantization config (e.g., `NVFP4_DEFAULT_CFG`) | None |
| KV_QUANT_CFG | KV-cache quantization config | None |
| QUANT_FILE_PATH | Optional path to exported quantizer state dict `quantizer_state.pth` (from MCore export) | None |
| MODELOPT_STATE_PATH | Optional path to exported `vllm_fq_modelopt_state.pth` from HF export (restores quantizer state and parameters) | None |
| CALIB_BATCH_SIZE | Calibration batch size | 1 |

Set these variables in your shell or Docker environment as needed to customize calibration.
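
For example, a run that calibrates on a smaller sample count before serving might look like the following sketch (the dataset, sample count, and batch size are illustrative, not recommendations):

```bash
# Illustrative values; omit any variable to fall back to the defaults listed above.
QUANT_DATASET=cnn_dailymail \
QUANT_CALIB_SIZE=256 \
QUANT_CFG=NVFP4_DEFAULT_CFG \
CALIB_BATCH_SIZE=4 \
python vllm_serve_fakequant.py <model_path> -tp 8 --host 0.0.0.0 --port 8000
```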

@@ -56,21 +58,53 @@ lm_eval --model local-completions --tasks gsm8k --model_args model=<model_name>,

## Load QAT/PTQ model and serve in vLLM (WIP)

Step 1: export the model with bf16 weights and quantizer state. To export the model:

- For **HF** models, use `hf_ptq_export.py`:

```bash
python hf_ptq_export.py \
    --pyt_ckpt_path <MODEL_PATH> \
    --quant_cfg NVFP4_DEFAULT_CFG \
    --export_path <EXPORT_DIR> \
    --trust_remote_code
```

This creates `<EXPORT_DIR>/vllm_fq_modelopt_state.pth` (ModelOpt quantizer state for vLLM fake-quant reload) and saves the HF-exported model under `<EXPORT_DIR>` (config/tokenizer/weights).
Note: `--pyt_ckpt_path` can point to either an HF checkpoint or a ModelOpt-saved checkpoint (e.g., a QAT/QAD checkpoint produced by `examples/llm_qat/main.py`). If the input checkpoint is already quantized, the script will **skip re-quantization** and only export artifacts for vLLM fakequant reload; a sketch of such a re-export is shown after the MCore example below.

- For **MCore** models, use `modelopt.torch.export.export_mcore_gpt_to_hf_vllm_fq`:

```python
from modelopt.torch.export import export_mcore_gpt_to_hf_vllm_fq

export_mcore_gpt_to_hf_vllm_fq(
    unwrapped_model,              # Quantized MCore model
    args.pretrained_model_name,   # HF model id/path (for config/tokenizer)
    export_dir=args.export_dir,   # Directory where exported files will be stored
)
```

This generates `quantizer_state.pth`, which contains quantizer tensors for vLLM reload via `QUANT_FILE_PATH`.
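
As noted in the HF section above, an already-quantized checkpoint can be re-exported directly. A sketch, assuming a QAT/QAD checkpoint directory produced by `examples/llm_qat/main.py` (the path is a placeholder):

```bash
# <QAT_CHECKPOINT_DIR> is a placeholder; because the checkpoint is already quantized,
# hf_ptq_export.py skips re-quantization and only exports the vLLM fakequant reload artifacts.
python hf_ptq_export.py \
    --pyt_ckpt_path <QAT_CHECKPOINT_DIR> \
    --quant_cfg NVFP4_DEFAULT_CFG \
    --export_path <EXPORT_DIR> \
    --trust_remote_code
```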

Step 2: use the exported artifacts when serving (a quick smoke test of the running server is shown after these commands):

- **HF export**: pass the exported `vllm_fq_modelopt_state.pth` via `MODELOPT_STATE_PATH`

```bash
# HF
MODELOPT_STATE_PATH=<vllm_fq_modelopt_state.pth> python vllm_serve_fakequant.py <model_path> -tp 8 --host 0.0.0.0 --port 8000
```

- **MCore export**: pass the exported `quantizer_state.pth` via `QUANT_FILE_PATH` and set `QUANT_CFG` to match the MCore quantization recipe

```bash
# MCore
QUANT_CFG=<quant_cfg> QUANT_FILE_PATH=<quantizer_state.pth> python vllm_serve_fakequant.py <model_path> -tp 8 --host 0.0.0.0 --port 8000
```
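
Once the server is up, a quick request is an easy sanity check. This is a minimal sketch, assuming `vllm_serve_fakequant.py` exposes vLLM's standard OpenAI-compatible API (the `lm_eval local-completions` example above relies on the same endpoint):

```bash
# Minimal smoke test; the /v1/completions route assumes the standard vLLM OpenAI-compatible server.
curl http://0.0.0.0:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "<model_name>", "prompt": "The capital of France is", "max_tokens": 16}'
```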

## Known Problems

1. **MCore reload does not use `MODELOPT_STATE_PATH`**; use `QUANT_FILE_PATH` and make sure `QUANT_CFG` matches the quantization recipe used for the original MCore model (otherwise quantizer keys/config won’t align).
2. AWQ reload is not supported yet.
3. KV cache quantization export and reload are not supported in MCore yet.