Ablation-informed mixed-precision quantization for GGUF models.
Instead of applying the same quant level to every tensor, Cerebellum measures the actual sensitivity of each one and allocates bits where they matter.
- Ablate — crush each tensor individually to Q2_K, measure the perplexity impact
- Allocate — sacred tensors (high PPL delta) get promoted to Q6_K/Q8_0, demotable tensors (negative delta) stay at Q2_K, everything else fills in to meet the size budget (see the sketch after this list)
- Build — `llama-quantize --tensor-type @tensor_types.txt` applies the per-tensor overrides
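
To make the allocation policy concrete, here is a minimal sketch of the classification step, assuming the ablation results are a JSON map from tensor name to PPL delta; the 0.05 promotion threshold and the file layout are illustrative assumptions, not Cerebellum's actual defaults.

```python
# Illustrative sketch only -- not Cerebellum's implementation.
import json

SACRED_THRESHOLD = 0.05  # assumed cutoff for "high PPL delta"

def classify(results_path: str) -> dict[str, str | None]:
    """Bucket tensors by their measured Q2_K perplexity delta."""
    with open(results_path) as f:
        deltas = json.load(f)  # assumed layout: {tensor_name: ppl_delta}
    plan = {}
    for name, delta in deltas.items():
        if delta > SACRED_THRESHOLD:
            plan[name] = "Q6_K"  # sacred: crushing it hurts, spend bits here
        elif delta < 0:
            plan[name] = "Q2_K"  # demotable: Q2_K costs nothing (or helps)
        else:
            plan[name] = None    # fill tier: level chosen later by the budget
    return plan
```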
```
pip install git+https://github.com/deucebucket/cerebellum.git
```

Requires PyTorch and Transformers. llama.cpp required for ablation sweep and final quantization.
```
python -m cerebellum.imatrix_stream \
    --model Qwen/Qwen3.6-27B \
    --output imatrix.dat -v
```

Computes channel sensitivity directly from weight statistics. No calibration data, no GPU.
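
As a rough picture of how sensitivity can come from weights alone, the sketch below scores each input channel by its column-wise weight energy, a data-free stand-in for the E[x²] activation statistics a conventional imatrix collects from calibration data. This is a conceptual illustration, not imatrix_stream's actual algorithm.

```python
# Conceptual sketch, not imatrix_stream's real code: score each input
# channel of a [out_features, in_features] weight matrix by its mean
# squared weight. Channels carrying large weights are the ones that
# low-bit rounding damages most.
import torch

def weight_only_importance(weight: torch.Tensor) -> torch.Tensor:
    return weight.float().pow(2).mean(dim=0)  # one score per input channel
```

Because each matrix can be scored independently, layers can be streamed from disk one at a time, which is presumably why the pass needs no GPU.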
```
python -m cerebellum.cerebellum ablate \
    --base-gguf model-Q2_K.gguf \
    --tensors ablation_plan.json \
    --output ablation_results.json
```

Crushes each tensor to Q2_K one at a time, measures the real perplexity delta.
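
The sweep can be pictured as the loop below: a from-scratch sketch that re-quantizes the f16 model with a single tensor forced to Q2_K, runs llama-perplexity, and records the delta. The Q8_0 baseline level, the exact --tensor-type pattern, and the "PPL =" log parsing are assumptions; the real tool presumably works from the prebuilt Q2_K GGUF (--base-gguf) and is much cheaper per tensor.

```python
# Hedged sketch of an ablation sweep -- illustrative, not Cerebellum's code.
import json
import re
import subprocess

def perplexity(gguf: str, textfile: str) -> float:
    out = subprocess.run(["llama-perplexity", "-m", gguf, "-f", textfile],
                         capture_output=True, text=True, check=True)
    # Assumes the "Final estimate: PPL = ..." summary line in the output.
    return float(re.search(r"PPL = ([\d.]+)", out.stdout + out.stderr).group(1))

def sweep(tensor_names: list[str], f16: str, textfile: str) -> dict[str, float]:
    subprocess.run(["llama-quantize", f16, "baseline.gguf", "Q8_0"], check=True)
    base = perplexity("baseline.gguf", textfile)
    deltas = {}
    for name in tensor_names:
        # Force only this tensor down to Q2_K; everything else stays Q8_0.
        subprocess.run(["llama-quantize", "--tensor-type", f"{name}=q2_k",
                        f16, "ablated.gguf", "Q8_0"], check=True)
        deltas[name] = perplexity("ablated.gguf", textfile) - base
    with open("ablation_results.json", "w") as f:
        json.dump(deltas, f, indent=2)
    return deltas
```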
```
python -m cerebellum.cerebellum allocate \
    --ablation ablation_results.json \
    --budget 12.0 \
    --output tensor_types.txt
```

Generates per-tensor quant level assignments for a target file size.
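
Connecting the two halves, one plausible way to spend the remaining budget is the greedy fill below, which completes the classify() sketch from earlier: promote undecided tensors from Q2_K to Q4_K, worst deltas first, until the target size is reached. The bits-per-weight figures, the single-step promotion, and the reading of --budget as decimal GB are all assumptions, not the allocator's actual behavior.

```python
# Greedy budget fill -- an illustrative companion to classify() above.
# Approximate bits-per-weight for llama.cpp quant types (assumed values).
BPW = {"Q2_K": 2.625, "Q4_K": 4.5, "Q6_K": 6.5625, "Q8_0": 8.5}

def fill_budget(plan: dict[str, str | None],
                deltas: dict[str, float],
                n_params: dict[str, int],
                budget_gb: float) -> dict[str, str]:
    """Promote fill-tier tensors from Q2_K to Q4_K, worst deltas first."""
    assigned = {name: level or "Q2_K" for name, level in plan.items()}
    used_gb = sum(n_params[n] * BPW[assigned[n]] for n in assigned) / 8e9
    undecided = sorted((n for n in plan if plan[n] is None),
                       key=lambda n: deltas[n], reverse=True)
    for name in undecided:
        upgrade_gb = n_params[name] * (BPW["Q4_K"] - BPW["Q2_K"]) / 8e9
        if used_gb + upgrade_gb <= budget_gb:
            assigned[name] = "Q4_K"
            used_gb += upgrade_gb
    return assigned
```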
```
llama-quantize --imatrix imatrix.dat \
    --tensor-type @tensor_types.txt \
    model-f16.gguf model-cerebellum.gguf Q2_K
```

- Qwen3.6-27B-Cerebellum-v4-GGUF — 12 GB, PPL 7.034, 181 overrides
- Qwen3.6-27B-Osmosis-Q2_K-GGUF — 10 GB, PPL 7.500, imatrix baseline
Apache 2.0