
feat(llm): add QAT/Pruning/Sparsity model config support (#115) #126

Merged
m96-chan merged 1 commit into main from feature/issue-115-qat-pruned-model-support on Dec 30, 2025

Conversation

@m96-chan
Owner

Summary

Add support for loading models optimized with QAT (quantization-aware training), pruning, and sparsity techniques.

Closes #115

New Config Classes

  • QATQuantConfig: Parse QAT/QAD configs from TensorRT Model Optimizer and HuggingFace formats
  • PruningConfig: Detect structured/unstructured pruning configs
  • SparsityConfig: Support 2:4 structured sparsity patterns for Ampere+ TensorCores
  • ModelOptimizationInfo: Aggregate all optimization info
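
The rough shape of these classes, sketched as dataclasses. The field names, types, and helper bodies below are illustrative assumptions rather than the actual loader.py definitions, and the from_config parsers are omitted:

from dataclasses import dataclass, field

@dataclass
class QATQuantConfig:
    method: str                      # e.g. "fp8", "nvfp4", "awq", "gptq"
    dtype: str | None = None         # e.g. "e4m3" for FP8

@dataclass
class PruningConfig:
    kind: str                        # "structured" or "unstructured"
    pruned_heads: dict[int, list[int]] = field(default_factory=dict)

@dataclass
class SparsityConfig:
    pattern: str                     # e.g. "2:4"

    def is_2_4_sparse(self) -> bool:
        return self.pattern == "2:4"

@dataclass
class ModelOptimizationInfo:
    quant_config: QATQuantConfig | None = None
    pruning_config: PruningConfig | None = None
    sparsity_config: SparsityConfig | None = None

    def has_any_optimization(self) -> bool:
        return any([self.quant_config, self.pruning_config, self.sparsity_config])

    def summary(self) -> str:
        parts = []
        if self.quant_config:
            q = self.quant_config
            parts.append(f"{q.method.upper()}({q.dtype})" if q.dtype else q.method.upper())
        if self.pruning_config:
            parts.append(f"Pruned({self.pruning_config.kind})")
        if self.sparsity_config:
            parts.append(f"Sparse({self.sparsity_config.pattern})")
        return ", ".join(parts)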

Supported Formats

Format            Source                      Detection
NVFP4/FP8/INT8    TensorRT Model Optimizer    producer + quantization fields
AWQ/GPTQ/BNB      HuggingFace                 quantization_config.quant_method
Pruned heads      HuggingFace                 pruned_heads field
2:4 sparsity      Various                     sparsity_config.pattern
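
Illustrative config fragments (as Python dicts) that would trigger each row above. The key names follow the Detection column; the surrounding values are hypothetical examples, not configs this PR was tested against:

# TensorRT Model Optimizer (hf_quant_config.json): producer + quantization fields
trt_modelopt_config = {
    "producer": {"name": "modelopt", "version": "0.19.0"},
    "quantization": {"quant_algo": "FP8"},
}

# HuggingFace quantization_config: quant_method identifies AWQ/GPTQ/BNB
hf_quant_config = {
    "quantization_config": {"quant_method": "awq", "bits": 4, "group_size": 128},
}

# HuggingFace structured pruning: pruned_heads maps layer index -> removed heads
hf_pruned_config = {
    "pruned_heads": {"0": [2, 5], "7": [1]},
}

# 2:4 structured sparsity: sparsity_config.pattern
sparse_config = {
    "sparsity_config": {"pattern": "2:4"},
}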

Usage Example

from pygpukit.llm import ModelOptimizationInfo

# Parse all optimizations from config.json
opt_info = ModelOptimizationInfo.from_config(config)

if opt_info.has_any_optimization():
    print(opt_info.summary())  # e.g., "FP8(e4m3), Pruned(structured)"

# Check specific optimizations
if opt_info.sparsity_config and opt_info.sparsity_config.is_2_4_sparse():
    print("Model uses 2:4 structured sparsity")

Test plan

  • Unit test config parsing for all formats
  • Pre-commit checks pass (ruff, mypy)
  • Build succeeds

Future Work (out of scope for this PR)

  • Sparse GEMM kernels for 2:4 sparsity
  • QAT weight loading with learned scales
  • TensorRT-LLM checkpoint format support

🤖 Generated with Claude Code

Add support for loading models optimized with QAT, pruning, and sparsity:

New config classes in loader.py:
- QATQuantConfig: Parse QAT/QAD configs from TensorRT Model Optimizer
  and HuggingFace formats (AWQ, GPTQ, BNB, etc.)
- PruningConfig: Detect structured pruning (pruned_heads) and
  unstructured pruning configs
- SparsityConfig: Support 2:4 structured sparsity patterns for
  Ampere+ TensorCores
- ModelOptimizationInfo: Aggregate all optimization info with
  has_any_optimization() and summary() helpers

Supported formats:
- TensorRT Model Optimizer hf_quant_config.json (NVFP4, FP8, INT8)
- HuggingFace quantization_config (AWQ, GPTQ, BNB)
- HuggingFace pruned_heads (structured pruning)
- 2:4 sparsity pattern for sparse TensorCore ops

Reference:
- https://nvidia.github.io/TensorRT-Model-Optimizer/
- https://github.com/huggingface/nn_pruning

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
m96-chan merged commit 1e7bfc7 into main on Dec 30, 2025
13 checks passed