deucebucket/cerebellum

Cerebellum

Ablation-informed mixed-precision quantization for GGUF models.

Instead of applying the same quant level to every tensor, Cerebellum measures the actual sensitivity of each one and allocates bits where they matter.

How It Works

  1. Ablate — crush each tensor individually to Q2_K, measure the perplexity impact
  2. Allocate — sacred tensors (high PPL delta) get promoted to Q6_K/Q8_0, demotable tensors (negative delta) stay at Q2_K, everything else fills in to meet the size budget
  3. Build — llama-quantize --tensor-type @tensor_types.txt applies the per-tensor overrides
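The allocation rule in step 2 can be sketched as a simple classification by perplexity delta. This is an illustrative sketch, not Cerebellum's actual code; the threshold value and function name are assumptions.

```python
def classify(ppl_deltas, sacred_threshold=0.05):
    """Split tensors into sacred / demotable / flexible by PPL delta.

    ppl_deltas maps tensor name -> perplexity change when that tensor
    alone is crushed to Q2_K. The threshold is illustrative.
    """
    sacred, demotable, flexible = [], [], []
    for name, delta in ppl_deltas.items():
        if delta >= sacred_threshold:
            sacred.append(name)      # crushing hurts a lot -> promote to Q6_K/Q8_0
        elif delta < 0:
            demotable.append(name)   # PPL improved at Q2_K -> safe to keep crushed
        else:
            flexible.append(name)    # fills in to meet the size budget
    return sacred, demotable, flexible
```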

Install

pip install git+https://github.com/deucebucket/cerebellum.git

Requires PyTorch and Transformers. llama.cpp is required for the ablation sweep and the final quantization.

Usage

1. Generate Importance Matrix (~60 seconds, CPU only)

python -m cerebellum.imatrix_stream \
    --model Qwen/Qwen3.6-27B \
    --output imatrix.dat -v

Computes channel sensitivity directly from weight statistics. No calibration data, no GPU.
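A minimal sketch of the "weight statistics only" idea: score each input channel by the magnitude of the weights that read from it, with no activations involved. The statistic shown (mean squared weight per channel) is an assumption for illustration; the real imatrix_stream computation and output format may differ.

```python
import numpy as np

def channel_importance(weight):
    """Proxy importance per input channel from weights alone.

    weight: (out_features, in_features) array. Channels whose weights
    carry more energy get a higher score -- no calibration data needed.
    """
    return (weight.astype(np.float64) ** 2).mean(axis=0)

# Channel 0 carries far larger weights than channel 1:
w = np.array([[1.0, 0.1],
              [2.0, 0.1]])
imp = channel_importance(w)
```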

2. Run Ablation Sweep

python -m cerebellum.cerebellum ablate \
    --base-gguf model-Q2_K.gguf \
    --tensors ablation_plan.json \
    --output ablation_results.json

Crushes each tensor to Q2_K one at a time and measures the real perplexity delta.
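The sweep loop amounts to the following sketch. In the real tool the two callbacks would be subprocess calls to llama.cpp binaries; `quantize_one` and `perplexity` here are hypothetical stand-ins.

```python
def ablation_sweep(tensors, baseline_ppl, quantize_one, perplexity):
    """Crush each tensor to Q2_K in isolation and record the PPL delta.

    quantize_one(name, qtype) -> path to a GGUF where only `name` is crushed
    perplexity(gguf)          -> measured perplexity of that GGUF
    """
    results = {}
    for name in tensors:
        gguf = quantize_one(name, "Q2_K")          # only this tensor changes
        results[name] = perplexity(gguf) - baseline_ppl
    return results
```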

3. Allocate Budget

python -m cerebellum.cerebellum allocate \
    --ablation ablation_results.json \
    --budget 12.0 \
    --output tensor_types.txt

Generates per-tensor quant level assignments for a target file size.
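One plausible shape for this step is a greedy fill: start everything at Q2_K, then promote tensors in order of decreasing PPL delta until the budget is spent. The bits-per-weight figures below are approximate GGUF block sizes, and the whole function is a sketch of the idea rather than Cerebellum's actual allocator.

```python
# Approximate bits per weight for common GGUF quant types.
BPW = {"Q2_K": 2.625, "Q6_K": 6.5625, "Q8_0": 8.5}

def allocate(deltas, n_params, budget_gb):
    """Assign a quant level per tensor, keeping total size under budget_gb.

    deltas:   tensor name -> PPL delta from the ablation sweep
    n_params: tensor name -> parameter count
    """
    assign = {name: "Q2_K" for name in deltas}

    def size_gb():
        return sum(n_params[n] * BPW[assign[n]] / 8 for n in assign) / 1e9

    # Promote the most sensitive tensors first.
    for name in sorted(deltas, key=deltas.get, reverse=True):
        if deltas[name] <= 0:
            break                      # remaining tensors are demotable
        assign[name] = "Q6_K"
        if size_gb() > budget_gb:
            assign[name] = "Q2_K"      # promotion would exceed the budget
            break
    return assign
```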

4. Build the GGUF

llama-quantize --imatrix imatrix.dat \
    --tensor-type @tensor_types.txt \
    model-f16.gguf model-cerebellum.gguf Q2_K
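For reference, tensor_types.txt is a plain list of name=type overrides, one per line. The tensor names and type assignments below are an illustrative excerpt, not real output:

```
blk.0.attn_v.weight=Q8_0
blk.0.ffn_down.weight=Q6_K
blk.1.ffn_gate.weight=Q2_K
```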

Models

License

Apache 2.0
