Shrink a Large Language Model (LLM, the teacher) into a Small Language Model (SLM, the student) using a hybrid knowledge‑distillation algorithm. This service exposes a REST API, writes artifacts to persistent storage, and ships with Kubernetes manifests for quick deployment.
Status: Developer preview. Verified on Windows (Python 3.13) and Linux. GPU optional.
```mermaid
flowchart LR
%% Client
U[Client<br/>cURL / CLI / UI] -->|POST /distill| API
%% Cluster
subgraph K8s["Kubernetes Cluster"]
API[FastAPI API<br/>Uvicorn] -->|enqueue/run| JOB[Distillation Job/Worker]
API --> PVC[(PVC: /data)]
JOB --> PVC
CM[(ConfigMap)] -. defaults .-> API
end
%% External deps
JOB -->|Teacher/Student Models| HF[(Hugging Face Hub)]
JOB -->|Datasets| DS[(Datasets Hub)]
%% Artifacts
subgraph Artifacts
ART[(Student Weights<br/>Tokenizer<br/>int8-dynamic/<br/>model.onnx)]
end
PVC --> ART
API -->|GET /status, /artifacts, /download| ART
```
Flow:
- Client calls `POST /distill` with teacher/student/dataset config.
- API kicks off a distillation run (Job/Worker).
- Worker pulls models/datasets, trains student with KD losses.
- Artifacts (student weights, tokenizer, optional int8 + ONNX) are persisted to a PVC.
- Client fetches status & downloads artifacts via the API (see the client sketch below).
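A minimal Python client sketch of this flow. Assumptions: the API runs on `localhost:8000`, `requests` is installed (it is not in `requirements.txt`), and the responses carry `job_id`/`state` fields — only the endpoint paths are documented in the API section below:

```python
import time
import requests  # assumption: installed separately

BASE = "http://localhost:8000"

# 1) Start a distillation run
resp = requests.post(f"{BASE}/distill", json={
    "teacher_model": "gpt2",
    "student_model": "distilgpt2",
    "dataset": "wikitext",
    "subset": "wikitext-2-raw-v1",
    "max_steps": 30,
})
job_id = resp.json()["job_id"]  # assumed response field

# 2) Poll until the job finishes
while True:
    status = requests.get(f"{BASE}/status/{job_id}").json()
    if status.get("state") in ("completed", "failed"):  # assumed field/values
        break
    time.sleep(10)

# 3) List and download artifacts (assumes the list endpoint returns file names)
for name in requests.get(f"{BASE}/artifacts/{job_id}").json():
    r = requests.get(f"{BASE}/download/{job_id}/{name}")
    with open(name.split("/")[-1], "wb") as f:
        f.write(r.content)
```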
Why distill:
- Latency & cost: smaller models are faster and cheaper to serve.
- Edge & offline: deploy on constrained hardware.
- Domain refresh: frequently update a compact student from an evolving teacher.
Key features:
- Hybrid KD loss — temperature‑scaled KL (soft targets) + CE (hard pseudo‑labels) + hidden‑state MSE + cosine alignment; optional attention‑map MSE.
- Parameter‑efficient finetuning — optional LoRA on the student.
- Quantization & export — post‑training dynamic INT8 and ONNX export (best‑effort); see the sketch after this list.
- Kubernetes‑ready — PVC, ConfigMap, Deployment/Service for API, and a one‑off Job.
- Hugging Face ecosystem — Transformers/Datasets/Accelerate/PEFT.
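A sketch of the quantization/export step under stated assumptions: dynamic INT8 via `torch.ao.quantization.quantize_dynamic` over `nn.Linear` layers, and a plain `torch.onnx.export` trace. The service's actual export code may differ; the output filename mirrors the artifact layout in the diagram above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("distilgpt2").eval()
tok = AutoTokenizer.from_pretrained("distilgpt2")

# Post-training dynamic INT8: rewrite nn.Linear layers for CPU inference
q_model = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Best-effort ONNX export of the fp32 model (tuple outputs, no KV cache)
model.config.return_dict = False
model.config.use_cache = False
dummy = tok("hello world", return_tensors="pt")["input_ids"]
torch.onnx.export(
    model, (dummy,), "model.onnx",
    input_names=["input_ids"], output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "sequence"}},
)
```

In practice a dedicated exporter (e.g. Hugging Face Optimum) is more robust for transformer models; the plain trace above is the minimal version.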
Quickstart (local):

```bash
python -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
# Windows: set a writeable artifacts directory
# PowerShell
mkdir data
$env:DATA_DIR = "$PWD/data"
uvicorn app:api --host 0.0.0.0 --port 8000 --reload
```

Submit a tiny demo job:

```bash
curl -X POST http://localhost:8000/distill -H "Content-Type: application/json" -d '{
"teacher_model": "gpt2",
"student_model": "distilgpt2",
"dataset": "wikitext",
"subset": "wikitext-2-raw-v1",
"max_steps": 30,
"per_device_batch_size": 2,
"grad_accum_steps": 4,
"logging_steps": 5,
"save_every": 30,
"temperature": 2.0
}'
```

List/download artifacts:

```bash
curl http://localhost:8000/artifacts/<job_id>
curl -L "http://localhost:8000/download/<job_id>/config.json" -o config.json
```

Deploy on Kubernetes:

```bash
# 1) Build & push
docker build -t <your-registry>/llm2slm:latest .
docker push <your-registry>/llm2slm:latest
# 2) Storage + defaults
kubectl apply -f k8s/pvc.yaml
kubectl apply -f k8s/config.yaml
# 3) API (edit image in k8s/api-deployment.yaml)
kubectl apply -f k8s/api-deployment.yaml
kubectl apply -f k8s/api-service.yaml
# 4) One-off distillation Job (edit image in k8s/distill-job.yaml)
kubectl apply -f k8s/distill-job.yaml
# 5) Port-forward API
kubectl port-forward svc/llm2slm-api 8080:80
# POST to http://localhost:8080/distill
```

GPU: add resource requests/limits and the NVIDIA device plugin; the code auto‑uses CUDA if available.
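The "auto‑uses CUDA" behavior typically amounts to a device check like the following (a sketch; the actual selection lives in the worker code):

```python
import torch
from transformers import AutoModelForCausalLM

# Prefer the GPU when the NVIDIA runtime exposes one, else fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForCausalLM.from_pretrained("distilgpt2").to(device)
print(f"running on {device}")
```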
API endpoints:

- `POST /distill` — start a distillation job (payload mirrors `schemas.DistillConfig`).
- `GET /status/{job_id}` — return current/completed job status.
- `GET /artifacts/{job_id}` — list generated files in the job folder.
- `GET /download/{job_id}/{path}` — download any artifact.
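A hypothetical sketch of what `schemas.DistillConfig` might look like as a pydantic model (the real schema lives in the repo; field names and defaults here are taken from the example payload):

```python
from typing import Optional

from pydantic import BaseModel

class DistillConfig(BaseModel):
    # Models and data
    teacher_model: str = "gpt2"
    student_model: str = "distilgpt2"
    dataset: str = "wikitext"
    subset: Optional[str] = None
    split: str = "train"
    # Training schedule
    max_steps: int = 200
    per_device_batch_size: int = 2
    grad_accum_steps: int = 8
    # Loss weights and temperature
    temperature: float = 2.0
    alpha_kl: float = 0.7
    beta_ce: float = 0.3
    gamma_mse: float = 0.05
    delta_cos: float = 0.05
    epsilon_attn: float = 0.0
    # LoRA and export options
    lora_r: int = 8
    lora_alpha: int = 16
    lora_dropout: float = 0.05
    use_qat: bool = False
    post_training_quant: bool = True
    export_onnx: bool = True
```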
Example payload (abridged):
```json
{
"teacher_model": "gpt2",
"student_model": "distilgpt2",
"dataset": "wikitext",
"subset": "wikitext-2-raw-v1",
"split": "train",
"max_steps": 200,
"per_device_batch_size": 2,
"grad_accum_steps": 8,
"temperature": 2.0,
"alpha_kl": 0.7,
"beta_ce": 0.3,
"gamma_mse": 0.05,
"delta_cos": 0.05,
"epsilon_attn": 0.0,
"lora_r": 8,
"lora_alpha": 16,
"lora_dropout": 0.05,
"use_qat": false,
"post_training_quant": true,
"export_onnx": true
}
```

Loss formulation.

1) KD (soft targets). For sequence tokens $i = 1,\dots,N$ (after the causal shift), let $z^t_i, z^s_i \in \mathbb{R}^V$ be teacher and student logits and $\sigma$ the softmax:

$$\mathcal{L}_{\mathrm{KD}} = \frac{T^2}{N} \sum_{i=1}^{N} \mathrm{KL}\big(\sigma(z^t_i/T) \,\|\, \sigma(z^s_i/T)\big)$$

The factor $T^2$ keeps gradient magnitudes roughly constant as the temperature $T$ varies (Hinton et al., 2015).

2) CE (hard pseudo‑labels). Let $\hat{y}_i = \arg\max_v z^t_{i,v}$ be the teacher's argmax token:

$$\mathcal{L}_{\mathrm{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \log \sigma(z^s_i)_{\hat{y}_i}$$

3) Hidden-state alignment. Let $h^s_i, h^t_i \in \mathbb{R}^d$ be last-layer hidden states:

$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{Nd} \sum_{i=1}^{N} \lVert h^s_i - h^t_i \rVert_2^2, \qquad \mathcal{L}_{\mathrm{cos}} = 1 - \frac{1}{N} \sum_{i=1}^{N} \cos\big(h^s_i, h^t_i\big)$$

4) Attention alignment (optional). For last-layer attention tensors $A^s, A^t \in \mathbb{R}^{B \times H \times L \times L}$:

$$\mathcal{L}_{\mathrm{attn}} = \mathrm{MSE}\big(A^s, A^t\big)$$

Total loss:

$$\mathcal{L} = \alpha \mathcal{L}_{\mathrm{KD}} + \beta \mathcal{L}_{\mathrm{CE}} + \gamma \mathcal{L}_{\mathrm{MSE}} + \delta \mathcal{L}_{\mathrm{cos}} + \varepsilon \mathcal{L}_{\mathrm{attn}}$$

Causal shift. For an auto-regressive LM, losses use logits at positions $1,\dots,L-1$ against tokens at positions $2,\dots,L$.

LoRA (optional). We fine-tune a low-rank adapter on the student; all losses backprop into the LoRA parameters (and the full weights if enabled). A PEFT sketch follows the notation list below.

Notation:
- $N$: total token count across the batch after the causal shift.
- $V$: vocab size; $d$: hidden size; $H$: attention heads; $L$: sequence length.
- $\alpha,\beta,\gamma,\delta,\varepsilon \ge 0$: loss weights; $T$: temperature.
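A minimal sketch of attaching LoRA to the student with PEFT, wired to the `lora_r`/`lora_alpha`/`lora_dropout` payload fields. The `target_modules` choice is an assumption (`c_attn` matches GPT‑2‑style attention blocks); the actual adapter setup lives in the worker:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

student = AutoModelForCausalLM.from_pretrained("distilgpt2")
lora_cfg = LoraConfig(
    r=8,                        # lora_r from the payload
    lora_alpha=16,              # lora_alpha
    lora_dropout=0.05,          # lora_dropout
    target_modules=["c_attn"],  # assumption: GPT-2-style attention projections
    task_type="CAUSAL_LM",
)
student = get_peft_model(student, lora_cfg)
student.print_trainable_parameters()  # only the adapter weights train by default
```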
Reference training step (simplified; `teacher`, `student`, `x`, `attn_mask`, `optimizer`, and `scheduler` are assumed to be in scope from the worker's setup):

```python
import torch
import torch.nn.functional as F

# Loss weights and temperature (from DistillConfig; example values)
T, alpha_kl, beta_ce = 2.0, 0.7, 0.3
gamma_mse, delta_cos, epsilon_attn = 0.05, 0.05, 0.0

# x: input ids (B, L); attn_mask: attention mask (B, L)
with torch.no_grad():  # teacher is frozen; only the student needs grads
    t = teacher(x, attention_mask=attn_mask,
                output_hidden_states=True, output_attentions=True)
s = student(x, attention_mask=attn_mask,
            output_hidden_states=True, output_attentions=True)

# Causal shift: positions 1..L-1 predict tokens 2..L
logits_s = s.logits[:, :-1, :]
logits_t = t.logits[:, :-1, :]
labels = x[:, 1:]  # gold next tokens (unused here; CE uses teacher argmax)
V = logits_s.size(-1)

# KD (soft targets): temperature-scaled KL over the N token rows, rescaled by T^2
logp_sT = (logits_s / T).log_softmax(-1)
p_tT = (logits_t / T).softmax(-1)
L_kd = F.kl_div(logp_sT.reshape(-1, V), p_tT.reshape(-1, V),
                reduction="batchmean") * (T * T)

# CE (hard teacher argmax as pseudo-labels)
yhat = logits_t.argmax(-1)
L_ce = F.cross_entropy(logits_s.reshape(-1, V), yhat.reshape(-1))

# Hidden-state alignment (last layer): MSE + cosine
h_s, h_t = s.hidden_states[-1], t.hidden_states[-1]
L_mse = F.mse_loss(h_s, h_t)
L_cos = 1.0 - F.cosine_similarity(h_s, h_t, dim=-1).mean()

# Attention-map alignment (optional, last layer)
L_attn = torch.zeros((), device=x.device)
if s.attentions and t.attentions:
    L_attn = F.mse_loss(s.attentions[-1], t.attentions[-1])

# Weighted total, then one optimizer step
loss = (alpha_kl * L_kd + beta_ce * L_ce + gamma_mse * L_mse
        + delta_cos * L_cos + epsilon_attn * L_attn)
loss.backward()
optimizer.step(); scheduler.step(); optimizer.zero_grad()
```

Works on Windows (Python 3.13) and Linux:
```text
fastapi==0.115.0
uvicorn[standard]==0.30.6
pydantic==2.9.2
transformers>=4.45,<4.47
tokenizers>=0.20.1
datasets==3.0.1
accelerate==0.34.2
peft==0.13.2
torch>=2.2.0
onnx>=1.17.0
onnxruntime==1.23.0
numpy>=2.0
```
References:
- Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. arXiv:1503.02531. https://arxiv.org/abs/1503.02531
- Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT: a distilled version of BERT. arXiv:1910.01108. https://arxiv.org/abs/1910.01108
- Jiao, X. et al. (2020). TinyBERT: Distilling BERT for Natural Language Understanding. arXiv:1909.10351. https://arxiv.org/abs/1909.10351
- Romero, A. et al. (2015). FitNets: Hints for Thin Deep Nets. arXiv:1412.6550. https://arxiv.org/abs/1412.6550
- Hu, E. J. et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685. https://arxiv.org/abs/2106.09685
- Jacob, B. et al. (2018). Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. CVPR. arXiv:1712.05877. https://arxiv.org/abs/1712.05877
- ONNX – Open Neural Network Exchange. https://onnx.ai/
- ONNX Runtime Documentation. https://onnxruntime.ai/docs/
- Hugging Face Transformers Docs. https://huggingface.co/docs/transformers
- Hugging Face Datasets Docs. https://huggingface.co/docs/datasets
- Hugging Face PEFT Docs. https://huggingface.co/docs/peft
- Hugging Face Accelerate Docs. https://huggingface.co/docs/accelerate
- PyTorch Documentation. https://pytorch.org/docs/stable/
- Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS. https://arxiv.org/abs/1706.03762
Freeman Augustus Jackson
Maintainer & original author. For collaboration or questions, please open an issue in the repo (https://github.com/4th/llm2slm) or reach out via GitHub.
If you use this software in your research or products, please cite it.
Plain text
Jackson, Freeman Augustus. LLM→SLM Distillation Microservice. Version 0.1.0. 2025. https://github.com/4th/llm2slm.
BibTeX
```bibtex
@software{jackson_llm2slm_2025,
author = {Jackson, Freeman Augustus},
title = {LLM→SLM Distillation Microservice},
year = {2025},
version = {0.1.0},
url = {https://github.com/4th/llm2slm},
license = {Apache-2.0}
}
```

License: Apache‑2.0 (or your preferred license).