This project fine-tunes the DeepSeek-R1-Distill-Llama-8B model on the medical-o1 dataset using efficient methods like QLoRA, LoRA adapters, and the Unsloth framework. The goal is to enhance clinical reasoning capabilities in medical QA systems while reducing memory and compute requirements.
You can view and run this project directly on Google Colab.
LoRA is a parameter-efficient fine-tuning method that injects learnable adapters into frozen model layers. It drastically reduces the number of trainable parameters.
- Rank (`r`): 16
- Alpha: 16
- Dropout: 0.05
- Benefits:
  - No need to update the full base model
  - Smaller memory footprint
  - Faster training, easier adapter sharing (`.safetensors`)
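To make the parameter saving concrete, here is a minimal sketch that counts trainable parameters for a single weight matrix. It assumes the standard LoRA formulation, in which a frozen `d_out x d_in` weight receives a trainable update `B @ A` with `B` of shape `d_out x r` and `A` of shape `r x d_in`. The 4096 x 4096 layer shape is an illustrative assumption, not a value taken from this project.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # A is (r x d_in) and B is (d_out x r); the base weight stays frozen.
    return r * d_in + d_out * r


def lora_scaling(alpha: float, r: int) -> float:
    """Factor applied to the low-rank update B @ A before it is added to W."""
    return alpha / r


d_in = d_out = 4096          # assumed layer shape, for illustration only
full = d_in * d_out          # parameters a full fine-tune would update
lora = lora_param_count(d_in, d_out, r=16)

print(f"full fine-tuning: {full:,} params")
print(f"LoRA (r=16):      {lora:,} params ({100 * lora / full:.2f}% of full)")
print(f"update scaling alpha/r = {lora_scaling(16, 16)}")
```

With rank 16 and alpha 16 the scaling factor is 1.0, so the low-rank update is added to the frozen weight without rescaling.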
Quantization is a technique that reduces the precision of a model's weights from high-precision formats (like 32-bit float) to lower-precision formats (like 4-bit integers), significantly cutting down memory usage.
In this project, we apply 4-bit quantization using `load_in_4bit=True` to make it feasible to fine-tune and run an 8B-parameter model on limited hardware.
Large Language Models (LLMs) contain billions of parameters. Storing and manipulating these parameters in full precision (FP32) consumes a huge amount of memory.
Quantization addresses this by:
- Replacing high-precision weights (e.g., 32-bit floats) with compact 4-bit integers.
- Storing scaling factors and lookup tables to map back to approximate original values.
Example:
Original weight: [0.123456, -0.987654, 1.234567] # FP32
Quantized: [1, -6, 7] # INT4 (signed range -8..7), scale ≈ 0.176
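The round trip can be sketched in a few lines. This is a simplified absmax scheme with one scale per tensor and signed 4-bit codes in [-8, 7]. It is an assumption chosen for readability; the 4-bit loaders used for QLoRA-style training typically rely on block-wise scales and a non-uniform 4-bit data type instead.

```python
def quantize_absmax_int4(weights):
    """Map floats to signed 4-bit codes in [-8, 7] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0   # largest magnitude maps to +/-7
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


weights = [0.123456, -0.987654, 1.234567]
codes, scale = quantize_absmax_int4(weights)
approx = dequantize(codes, scale)

print("codes:", codes)
print("scale:", round(scale, 6))
print("max abs error:", max(abs(w - a) for w, a in zip(weights, approx)))
```

Under this scheme the three weights map to the codes `[1, -6, 7]` with a scale of about 0.176, and the reconstruction error of each weight is at most half the scale. Other schemes produce different codes.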
| Benefit | Explanation |
|---|---|
| Reduced VRAM Usage | 4-bit weights use ~8× less memory than FP32 |
| Faster Training/Inference | Smaller matrices = faster operations |
| Lower Compute Cost | Enables training on free/Pro Colab GPUs |
| Compatible with LoRA | Works seamlessly with parameter-efficient fine-tuning |
Without quantization, training a model like DeepSeek-R1-Distill-Llama-8B (~8B parameters) would require >24GB of GPU VRAM. Using quantization:
load_in_4bit = True

you enable memory-efficient training with:
- Unsloth
- LoRA adapters
- Limited compute budgets
This technique is part of the QLoRA approach, allowing high-performing fine-tuning with low resource requirements.
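The memory figures above can be checked with back-of-the-envelope arithmetic. The sketch below counts the weights only. It deliberately ignores activations, gradients, optimizer state, and the small overhead of quantization scales, so real training needs more than these numbers.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8e9  # ~8B parameters, as stated above

for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("4-bit", 4)]:
    print(f"{name:>10}: {weight_memory_gb(NUM_PARAMS, bits):5.1f} GB")
```

The FP32 and 4-bit rows differ by a factor of eight, which is where the "~8×" figure in the table comes from.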
Unsloth is an optimized backend for Hugging Face Transformers that enables efficient training of large language models with LoRA and quantization.
- Faster downloads
- Lower memory usage
- Accelerated training loop
from unsloth import FastLanguageModel

`FastLanguageModel` is an enhanced wrapper for Hugging Face models.
Used for:
- Loading quantized models
- Preparing them for LoRA
- Training with memory-efficient ops
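As a rough sketch, the first two uses could look like the following. The LoRA values come from this README; `target_modules` and `max_seq_length` are assumptions added for illustration, and the notebook may use different settings. The Unsloth import sits inside the function so that the snippet can be read and imported on a machine without a GPU.

```python
# LoRA settings taken from this README; target_modules is an assumption.
LORA_KWARGS = dict(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
)


def load_model_with_lora(
    model_name: str = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    max_seq_length: int = 2048,  # assumed context length
):
    """Load the base model in 4-bit and attach LoRA adapters via Unsloth."""
    # Imported lazily: Unsloth needs a CUDA-capable environment to import.
    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_name,
        max_seq_length=max_seq_length,
        load_in_4bit=True,
    )
    # Only the injected adapter weights are trainable after this call.
    model = FastLanguageModel.get_peft_model(model, **LORA_KWARGS)
    return model, tokenizer
```

Calling `load_model_with_lora()` on a CUDA machine with Unsloth installed returns the adapted model together with its tokenizer.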
FastLanguageModel.from_pretrained(...)
FastLanguageModel.get_peft_model(...)
FastLanguageModel.prepare_model_for_training(...)

Dataset Used: FreedomIntelligence/medical-o1-reasoning-SFT
This dataset is a supervised fine-tuning (SFT) dataset specifically designed to evaluate and train LLMs on complex medical reasoning, including diagnosis, explanation, and treatment recommendation.
| Field | Description |
|---|---|
| Name | medical-o1-reasoning-SFT |
| Source | FreedomIntelligence on Hugging Face |
| Focus | Medical question-answering and reasoning |
| Format | JSONL / Hugging Face Datasets format |
| Fields | instruction, input, output |
| Type | Instruction-tuned (SFT) |
| Size | ~10,000 examples |
| Domain | Clinical QA, Diagnosis, Medical Education |
| License | Apache 2.0 |
{
"instruction": "Explain the pathophysiology of Type 1 Diabetes.",
"input": "",
"output": "Type 1 Diabetes is caused by autoimmune destruction of pancreatic beta cells..."
}

- Emphasizes chain-of-thought medical reasoning
- Structured for instruction tuning, compatible with LoRA + QLoRA
- Pairs well with models like `DeepSeek-R1` due to its distilled instruction format
We load and preprocess it using:
from datasets import load_dataset
dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", split="train")

And format it to this template:
### Instruction:
{instruction}
### Input:
{input}
### Response:
{output}
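A minimal sketch of that formatting step, assuming each record is a dict with the `instruction`, `input`, and `output` fields described above. The helper name `format_example` is hypothetical.

```python
PROMPT_TEMPLATE = (
    "### Instruction:\n{instruction}\n"
    "### Input:\n{input}\n"
    "### Response:\n{output}"
)


def format_example(record: dict) -> str:
    """Render one dataset record into the training prompt."""
    return PROMPT_TEMPLATE.format(
        instruction=record["instruction"],
        input=record.get("input", ""),   # the input field may be empty
        output=record["output"],
    )


sample = {
    "instruction": "Explain the pathophysiology of Type 1 Diabetes.",
    "input": "",
    "output": "Type 1 Diabetes is caused by autoimmune destruction of pancreatic beta cells...",
}
print(format_example(sample))
```

With Hugging Face Datasets, a function like this would typically be applied through `dataset.map(...)`, writing the result to a text column whose name is up to the training script.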
@misc{freedomintelligence2024medicalo1,
title={Medical O1 Reasoning Dataset},
author={FreedomIntelligence},
year={2024},
howpublished={\url{https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT}},
}

.
├── notebooks/
│   └── Fine_Tuning_DEEPSEEK_R1.ipynb   # Full pipeline
├── models/
│   └── adapter_model.safetensors       # LoRA weights only
├── assets/
│   └── pipeline_overview.png           # Image showing full flow
└── README.md
This section illustrates the end-to-end fine-tuning pipeline using DeepSeek-R1-Distill-Llama-8B, the medical-o1 dataset, and efficient training strategies like LoRA and quantization, all implemented through the Unsloth framework.
| Step | Description |
|---|---|
| 1. Load Base Model | Initialize DeepSeek-R1-Distill-Llama-8B with 4-bit quantization using `load_in_4bit=True`. This reduces memory usage drastically while maintaining performance. |
| 2. Load & Format Dataset | Use the medical-o1 dataset containing `instruction` (task), `input` (context), and `output` (expected answer). All samples are converted to a standard prompt format: `### Instruction: … ### Input: … ### Response:` |
| 3. Inject LoRA Adapters | With `FastLanguageModel.get_peft_model()`, LoRA adapters are inserted into transformer layers. LoRA hyperparameters: `r=16`, `alpha=16`, `dropout=0.05` |
| 4. Fine-Tune the Model | The model is fine-tuned using the AdamW optimizer and a linear learning rate scheduler for 3 epochs with batch size 2 on CUDA (GPU), all while only updating LoRA parameters. |
| 5. Save Adapters | After training, only the LoRA adapters are saved in `.safetensors` format at `models/adapter_model.safetensors`. This is lightweight (~100MB) and reusable. |
| 6. Inference & Evaluation | Perform pre/post fine-tuning inference on medical queries to see how model reasoning improves and to compare medical accuracy, depth, and relevance. |
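To give a feel for step 4, the sketch below works out how many optimizer steps the stated settings imply and what a linear schedule does to the learning rate over that span. It assumes roughly 10,000 training examples (the approximate size listed in the dataset table), no gradient accumulation, no warmup, and a base learning rate of 2e-4. The learning rate is an assumption, since this README does not state one.

```python
import math


def total_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps, assuming no gradient accumulation."""
    return math.ceil(num_examples / batch_size) * epochs


def linear_lr(step: int, total: int, base_lr: float) -> float:
    """Linear decay from base_lr at step 0 to 0 at the final step (no warmup)."""
    return base_lr * max(0.0, 1 - step / total)


steps = total_steps(num_examples=10_000, batch_size=2, epochs=3)
base_lr = 2e-4  # assumed; the README does not state the learning rate

print("total steps:", steps)
for s in (0, steps // 2, steps):
    print(f"step {s:>6}: lr = {linear_lr(s, steps, base_lr):.2e}")
```

With these assumptions, training runs for 15,000 steps and the learning rate falls to half its starting value at the midpoint.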
┌──────────────┐     ┌──────────────────┐     ┌────────────────┐
│  Base Model  │ ──▶ │ Quantize (4-bit) │ ──▶ │  Load Dataset  │
└──────────────┘     └──────────────────┘     └───────┬────────┘
                                                      ▼
                                      ┌──────────────────────────────┐
                                      │        Format Prompts        │
                                      └───────────────┬──────────────┘
                                                      ▼
                                      ┌──────────────────────────────┐
                                      │     Apply LoRA Adapters      │
                                      └───────────────┬──────────────┘
                                                      ▼
                                      ┌──────────────────────────────┐
                                      │     Fine-Tune (3 Epochs)     │
                                      └───────────────┬──────────────┘
                                                      ▼
                                      ┌──────────────────────────────┐
                                      │     Save Adapter Weights     │
                                      └───────────────┬──────────────┘
                                                      ▼
                                      ┌──────────────────────────────┐
                                      │ Post-Tune Medical Inference  │
                                      └──────────────────────────────┘
git clone https://github.com/soham-kar/deepseek.git
cd deepseek
pip install -r requirements.txt

Dependencies (example)
transformers>=4.40.0
accelerate
unsloth
datasets
bitsandbytes
peft

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"

# Load the base model in 4-bit so it fits on a single GPU.
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
# PeftModel expects the adapter directory (adapter_config.json plus
# adapter_model.safetensors), not the path of the weights file itself.
model = PeftModel.from_pretrained(base_model, "models")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

prompt = "### Instruction:\nExplain the mechanism of insulin resistance.\n### Response:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

| Metric | Pre-Fine-Tuning | Post-Fine-Tuning |
|---|---|---|
| Medical Accuracy | Generic | Specific & Domain-aware |
| Clinical Reasoning | Surface-Level | Step-by-Step Reasoning |
| Inference Coherence | | Consistent |
This section highlights the impact of fine-tuning. The base model produces generic, high-level responses, while the fine-tuned model demonstrates deeper clinical understanding.
| Prompt | Base Model Response | Fine-Tuned Model Response |
|---|---|---|
| Explain Type 2 Diabetes | "It is a disease affecting blood sugar..." | "Type 2 Diabetes is characterized by insulin resistance, where the body's cells do not respond to insulin effectively..." |
| What is insulin resistance? | "Insulin helps manage blood sugar..." | "Insulin resistance occurs when muscle, fat, and liver cells fail to respond properly to insulin, leading to hyperglycemia..." |
Observation: After fine-tuning, the model shows improved domain alignment with accurate medical terminology, structured reasoning, and fewer generic statements.
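One crude way to put a number on "fewer generic statements" is to count how many domain terms a response mentions. The sketch below is illustrative only: the term list is hypothetical, the metric is not validated, and it is not how the comparison above was produced. It uses the two insulin-resistance responses from the table.

```python
# Hypothetical term list, chosen for illustration; not a validated metric.
CLINICAL_TERMS = ["insulin resistance", "hyperglycemia", "muscle", "liver", "cells"]


def term_coverage(response: str, terms=CLINICAL_TERMS) -> float:
    """Fraction of reference terms that appear in the response."""
    text = response.lower()
    return sum(term in text for term in terms) / len(terms)


base = "Insulin helps manage blood sugar..."
tuned = (
    "Insulin resistance occurs when muscle, fat, and liver cells fail to "
    "respond properly to insulin, leading to hyperglycemia..."
)

print(f"base:  {term_coverage(base):.2f}")
print(f"tuned: {term_coverage(tuned):.2f}")
```

A keyword count like this says nothing about correctness, so it can only complement a manual review of the answers.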
- Merge LoRA weights into base model for export
- Build interactive demo with Streamlit/Gradio
- Experiment with `medical-mcqa` and `pubmedqa`
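For the first item, merging means folding the low-rank update back into the dense weight: `W_merged = W + (alpha / r) * (B @ A)`. The toy example below shows that arithmetic on a 2 x 2 matrix with plain Python lists. The shapes and values are made up for illustration; in practice PEFT's `merge_and_unload()` performs the equivalent operation across the adapted layers.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, a, b, alpha: float, r: int):
    """Return W + (alpha / r) * (B @ A), the merged dense weight."""
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy shapes: W is 2x2 and the rank is 1, so B is 2x1 and A is 1x2.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[0.5], [1.0]]
A = [[2.0, 4.0]]

print(merge_lora(W, A, B, alpha=2, r=1))
```

After merging, the model no longer needs the adapter files at inference time, at the cost of storing a full-size weight matrix again.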

