@dsikka commented Nov 4, 2025

Summary

  • Add an option to generate mxfp4 scales when calculating qparams, depending on the quantization args (a sketch of the scale computation follows this list)
  • Add preset schemes for MXFP4 and MXFP4A16
  • Update mxfp4_packed_compressor to additionally compress the generated scales at compression time. Decompression is not yet supported, since general decompression of qparams does not currently work; it is blocked on [WIP] fix qparams decompression #514
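
For context, here is a minimal sketch of the scale computation, assuming the OCP MX convention of one power-of-two (E8M0) scale per group of 32 FP4 (E2M1) elements, stored as a biased exponent in a uint8; names and rounding details are illustrative, not this PR's actual implementation:

```python
import torch

E2M1_EMAX = 2    # FP4 E2M1 max exponent (max magnitude 6.0 = 1.5 * 2**2)
E8M0_BIAS = 127  # bias for the shared 8-bit exponent scale
GROUP_SIZE = 32

def mxfp4_scales(weight: torch.Tensor) -> torch.Tensor:
    """One uint8 E8M0 scale per group of 32 along the last dim
    (assumes the last dim is divisible by GROUP_SIZE)."""
    groups = weight.reshape(*weight.shape[:-1], -1, GROUP_SIZE)
    amax = groups.abs().amax(dim=-1)
    # shared power-of-two exponent; elements are later divided by
    # 2**exp and clamped into the FP4 range [-6, 6]
    exp = torch.floor(torch.log2(amax)) - E2M1_EMAX
    exp = exp.clamp(-E8M0_BIAS, 255 - E8M0_BIAS)
    return (exp + E8M0_BIAS).to(torch.uint8)
```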

Testing:

Base automatically changed from quant_args_dtype to main November 10, 2025 16:05
@dsikka marked this pull request as ready for review November 17, 2025 21:06
dsikka added a commit to vllm-project/llm-compressor that referenced this pull request Nov 17, 2025
# Summary
- Requires: vllm-project/compressed-tensors#509
- Add a script to generate an mxfp4-quantized model (a usage sketch follows this list)
- This feature is currently experimental, as support has not yet landed or
been tested in vLLM
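
For reference, a rough sketch of what the script could look like with the `oneshot` flow; the scheme name "MXFP4" is assumed from the presets added in vllm-project/compressed-tensors#509, and the exact arguments may differ from the script added here:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
SAVE_DIR = "Meta-Llama-3-8B-Instruct-MXFP4"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# data-free (RTN-style) quantization of all Linear weights to MXFP4;
# the scheme name is an assumption based on this PR's presets
recipe = QuantizationModifier(targets="Linear", scheme="MXFP4", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```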

# Testing:
Sample Model: 
- nm-testing/Meta-Llama-3-8B-Instruct-MXFP4

Sample Generation (Transformers):

```
========== SAMPLE GENERATION ==============
<|begin_of_text|>Hello my name is Sophia and I am a 3rd year student at the University of California, Berkeley. I am a double major in Linguistics and Psychology, with a minor in Education. I am very interested in the way that language and culture interact, and I believe that education is the key to creating a more just and equitable society.
I am a native speaker of English, and I have also studied Spanish, French, and Mandarin Chinese. I am very interested in the way that language can be used to bring
==========================================

```
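
The output above can be reproduced with a standard transformers generation loop along these lines (the prompt and token count are assumptions):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "nm-testing/Meta-Llama-3-8B-Instruct-MXFP4"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

inputs = tokenizer("Hello my name is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0]))
```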


Sample Config:
```json
"quantization_config": {
    "config_groups": {
        "group_0": {
            "format": "mxfp4-pack-quantized",
            "input_activations": {
                "actorder": null,
                "block_structure": null,
                "dynamic": true,
                "group_size": 32,
                "num_bits": 4,
                "observer": null,
                "observer_kwargs": {},
                "scale_dtype": "torch.uint8",
                "strategy": "group",
                "symmetric": true,
                "type": "float",
                "zp_dtype": null
            },
            "output_activations": null,
            "targets": [
                "Linear"
            ],
            "weights": {
                "actorder": null,
                "block_structure": null,
                "dynamic": false,
                "group_size": 32,
                "num_bits": 4,
                "observer": "minmax",
                "observer_kwargs": {},
                "scale_dtype": "torch.uint8",
                "strategy": "group",
                "symmetric": true,
                "type": "float",
                "zp_dtype": null
            }
        }
    },
    "format": "mxfp4-pack-quantized"
}
```
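
Note that `"scale_dtype": "torch.uint8"` means the group scales are stored as E8M0 biased exponents rather than floats. A sketch of decoding them back to float scales, assuming the standard OCP MX bias of 127:

```python
import torch

E8M0_BIAS = 127  # standard OCP MX bias for E8M0 scales

def decode_e8m0(scales_u8: torch.Tensor) -> torch.Tensor:
    """Expand stored uint8 exponents back to power-of-two float scales."""
    return torch.exp2(scales_u8.to(torch.float32) - E8M0_BIAS)

print(decode_e8m0(torch.tensor([125, 127, 130], dtype=torch.uint8)))
# tensor([0.2500, 1.0000, 8.0000])
```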

---------

Signed-off-by: Dipika Sikka <ds3822@columbia.edu>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>