
feat(llm): add QAT/Pruning/Sparsity model config support (#115) #126

Merged
m96-chan merged 1 commit into main from feature/issue-115-qat-pruned-model-support on Dec 30, 2025

Conversation

@m96-chan
Owner

Summary

Add support for loading models optimized with QAT (quantization-aware training), pruning, and sparsity techniques.

Closes #115

New Config Classes

  • QATQuantConfig: Parse QAT/QAD configs from TensorRT Model Optimizer and HuggingFace formats
  • PruningConfig: Detect structured/unstructured pruning configs
  • SparsityConfig: Support 2:4 structured sparsity patterns for Ampere+ TensorCores
  • ModelOptimizationInfo: Aggregate all optimization info
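
The rough shape of these classes, sketched as dataclasses. The field names, types, and helper bodies below are illustrative assumptions rather than the actual loader.py definitions, and the from_config parsers are omitted:

from dataclasses import dataclass, field

@dataclass
class QATQuantConfig:
    method: str                      # e.g. "fp8", "nvfp4", "awq", "gptq"
    dtype: str | None = None         # e.g. "e4m3" for FP8

@dataclass
class PruningConfig:
    kind: str                        # "structured" or "unstructured"
    pruned_heads: dict[int, list[int]] = field(default_factory=dict)

@dataclass
class SparsityConfig:
    pattern: str                     # e.g. "2:4"

    def is_2_4_sparse(self) -> bool:
        return self.pattern == "2:4"

@dataclass
class ModelOptimizationInfo:
    quant_config: QATQuantConfig | None = None
    pruning_config: PruningConfig | None = None
    sparsity_config: SparsityConfig | None = None

    def has_any_optimization(self) -> bool:
        return any([self.quant_config, self.pruning_config, self.sparsity_config])

    def summary(self) -> str:
        parts = []
        if self.quant_config:
            q = self.quant_config
            parts.append(f"{q.method.upper()}({q.dtype})" if q.dtype else q.method.upper())
        if self.pruning_config:
            parts.append(f"Pruned({self.pruning_config.kind})")
        if self.sparsity_config:
            parts.append(f"Sparse({self.sparsity_config.pattern})")
        return ", ".join(parts)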

Supported Formats

Format            Source                      Detection
NVFP4/FP8/INT8    TensorRT Model Optimizer    producer + quantization fields
AWQ/GPTQ/BNB      HuggingFace                 quantization_config.quant_method
Pruned heads      HuggingFace                 pruned_heads field
2:4 sparsity      Various                     sparsity_config.pattern
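
Illustrative config fragments (as Python dicts) that would trigger each row above. The key names follow the Detection column; the surrounding values are hypothetical examples, not configs this PR was tested against:

# TensorRT Model Optimizer (hf_quant_config.json): producer + quantization fields
trt_modelopt_config = {
    "producer": {"name": "modelopt", "version": "0.19.0"},
    "quantization": {"quant_algo": "FP8"},
}

# HuggingFace quantization_config: quant_method identifies AWQ/GPTQ/BNB
hf_quant_config = {
    "quantization_config": {"quant_method": "awq", "bits": 4, "group_size": 128},
}

# HuggingFace structured pruning: pruned_heads maps layer index -> removed heads
hf_pruned_config = {
    "pruned_heads": {"0": [2, 5], "7": [1]},
}

# 2:4 structured sparsity: sparsity_config.pattern
sparse_config = {
    "sparsity_config": {"pattern": "2:4"},
}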

Usage Example

from pygpukit.llm import ModelOptimizationInfo

# Parse all optimizations from config.json
opt_info = ModelOptimizationInfo.from_config(config)

if opt_info.has_any_optimization():
    print(opt_info.summary())  # e.g., "FP8(e4m3), Pruned(structured)"

# Check specific optimizations
if opt_info.sparsity_config and opt_info.sparsity_config.is_2_4_sparse():
    print("Model uses 2:4 structured sparsity")

Test plan

  • Unit test config parsing for all formats
  • Pre-commit checks pass (ruff, mypy)
  • Build succeeds

Future Work (out of scope for this PR)

  • Sparse GEMM kernels for 2:4 sparsity
  • QAT weight loading with learned scales
  • TensorRT-LLM checkpoint format support

🤖 Generated with Claude Code

Add support for loading models optimized with QAT, pruning, and sparsity:

New config classes in loader.py:
- QATQuantConfig: Parse QAT/QAD configs from TensorRT Model Optimizer
  and HuggingFace formats (AWQ, GPTQ, BNB, etc.)
- PruningConfig: Detect structured pruning (pruned_heads) and
  unstructured pruning configs
- SparsityConfig: Support 2:4 structured sparsity patterns for
  Ampere+ TensorCores
- ModelOptimizationInfo: Aggregate all optimization info with
  has_any_optimization() and summary() helpers

Supported formats:
- TensorRT Model Optimizer hf_quant_config.json (NVFP4, FP8, INT8)
- HuggingFace quantization_config (AWQ, GPTQ, BNB)
- HuggingFace pruned_heads (structured pruning)
- 2:4 sparsity pattern for sparse TensorCore ops

Reference:
- https://nvidia.github.io/TensorRT-Model-Optimizer/
- https://github.com/huggingface/nn_pruning

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
m96-chan merged commit 1e7bfc7 into main on Dec 30, 2025
13 checks passed