From d3fefc9b06f97cee3190fdca839ff7e402285fe3 Mon Sep 17 00:00:00 2001 From: Claude Date: Mon, 19 Jan 2026 21:38:09 +0000 Subject: [PATCH 1/3] Add comprehensive fine-tuning plan for gpt-oss-20b wiki agent MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This adds detailed planning documents for fine-tuning OpenAI's gpt-oss-20b to become a specialized architectural wiki agent for SemanticWiki local mode. Documents include: - Dataset preparation: CodeWikiBench, DeepWiki, synthetic data generation - Fine-tuning execution: LoRA config, hyperparameters, training scripts - Evaluation: automated metrics, CodeWikiBench, task-specific evals Target improvements over base model: - Source traceability: 50% → 90%+ - Mermaid diagram validity: 70% → 95%+ - Wiki completeness: 60% → 90%+ --- fine-tuning/01-DATASET-PREPARATION.md | 534 ++++++++++++++++ fine-tuning/02-FINE-TUNING-EXECUTION.md | 696 +++++++++++++++++++++ fine-tuning/03-EVALUATION.md | 792 ++++++++++++++++++++++++ fine-tuning/README.md | 312 ++++++++++ 4 files changed, 2334 insertions(+) create mode 100644 fine-tuning/01-DATASET-PREPARATION.md create mode 100644 fine-tuning/02-FINE-TUNING-EXECUTION.md create mode 100644 fine-tuning/03-EVALUATION.md create mode 100644 fine-tuning/README.md diff --git a/fine-tuning/01-DATASET-PREPARATION.md b/fine-tuning/01-DATASET-PREPARATION.md new file mode 100644 index 0000000..faf0d62 --- /dev/null +++ b/fine-tuning/01-DATASET-PREPARATION.md @@ -0,0 +1,534 @@ +# Dataset Preparation Plan + +This document outlines the strategy for preparing training data to fine-tune gpt-oss-20b as an architectural wiki agent for SemanticWiki in local-only mode. + +## Overview + +The training dataset will combine three sources: +1. **Real examples** from CodeWikiBench and DeepWiki +2. **Synthetic data** generated via LLM distillation +3. **SemanticWiki-specific examples** from existing wiki generations + +Target dataset size: **10,000-50,000 high-quality examples** + +--- + +## 1. Real Data Sources + +### 1.1 CodeWikiBench Dataset + +**Source:** [HuggingFace - anhnh2002/codewikibench](https://huggingface.co/datasets/anhnh2002/codewikibench) + +CodeWikiBench provides repository-level documentation examples across 22 open-source projects in 6 languages. 
#### Dataset Structure
```python
from datasets import load_dataset

dataset = load_dataset("anhnh2002/codewikibench")
# Each entry contains:
# - repo_name: Repository identifier
# - commit_id: Specific commit hash
# - docs_tree: Original documentation structure
# - structured_docs: Parsed documentation content
# - rubrics: Quality evaluation criteria
```

#### Extraction Strategy
```python
# Extract high-quality documentation examples
examples = []  # accumulate across all repositories

for repo in dataset['train']:
    # Extract architecture documentation
    arch_docs = extract_architecture_sections(repo['structured_docs'])

    # Extract component documentation with source refs
    component_docs = extract_component_docs(repo['structured_docs'])

    # Create training pairs: (code_context, documentation)
    for doc in arch_docs + component_docs:
        examples.append({
            "instruction": generate_instruction(doc),
            "input": extract_code_context(repo, doc),
            "output": doc['content']
        })
```

#### Languages Covered
| Language | Repositories | Examples |
|----------|-------------|----------|
| JavaScript/TypeScript | Chart.js, puppeteer, mermaid, svelte, marktext, storybook | ~3,000 |
| Python | graphrag, rasa, OpenHands | ~1,500 |
| C/C++ | electron, qmk_firmware, libsql, json, x64dbg | ~2,000 |
| C# | FluentValidation, ml-agents, git-credential-manager | ~1,000 |
| Java | logstash, trino, material-components-android | ~1,000 |

### 1.2 DeepWiki Crawled Data

**Source:** [DeepWiki](https://deepwiki.org/) - AI-generated documentation for 30,000+ GitHub repositories

#### Crawling Strategy
```python
import requests
from bs4 import BeautifulSoup

def crawl_deepwiki(repo_owner: str, repo_name: str) -> dict:
    """
    Crawl DeepWiki documentation for a repository.
    Replace 'github.com' with 'deepwiki.com' in URL.
    """
    base_url = f"https://deepwiki.com/{repo_owner}/{repo_name}"

    # Fetch main documentation
    response = requests.get(base_url)
    response.raise_for_status()  # fail fast on 4xx/5xx responses
    soup = BeautifulSoup(response.text, 'html.parser')

    return {
        "overview": extract_section(soup, "overview"),
        "architecture": extract_section(soup, "architecture"),
        "components": extract_section(soup, "components"),
        "data_flow": extract_section(soup, "data-flow")
    }

# Target repositories (popular, well-structured projects)
TARGET_REPOS = [
    ("facebook", "react"),
    ("vuejs", "vue"),
    ("microsoft", "vscode"),
    ("tensorflow", "tensorflow"),
    # ... 500+ curated repositories
]
```

#### Data Quality Filters
- Minimum 1,000 lines of code in repository
- Documentation must include architecture diagrams
- Must have source code references
- Exclude auto-generated API docs (focus on conceptual docs)

A minimal filter implementing these criteria is sketched below.
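The sketch applies the filters to one crawled entry. The metadata field names (`loc`, `has_architecture_diagrams`, `has_source_refs`, `is_api_reference`) are hypothetical placeholders; substitute whatever the crawler actually records.

```python
def passes_quality_filters(repo_meta: dict, doc: dict) -> bool:
    """
    Apply the data quality filters above to one crawled repository.
    NOTE: field names are hypothetical; adapt to the real crawl schema.
    """
    if repo_meta.get("loc", 0) < 1_000:           # minimum repository size
        return False
    if not doc.get("has_architecture_diagrams"):  # must include diagrams
        return False
    if not doc.get("has_source_refs"):            # must cite source code
        return False
    if doc.get("is_api_reference"):               # skip auto-generated API docs
        return False
    return True
```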
### 1.3 OpenDeepWiki (Open Source Alternative)

**Source:** [GitHub - AIDotNet/OpenDeepWiki](https://github.com/AIDotNet/OpenDeepWiki)

For repositories where DeepWiki access is limited, use OpenDeepWiki to generate documentation locally.

---

## 2. Synthetic Data Generation

### 2.1 Distillation from Claude/GPT-4

Use a stronger model to generate high-quality documentation examples.

#### Generation Pipeline
```python
from anthropic import Anthropic

client = Anthropic()

def generate_synthetic_example(code_files: list[str], repo_metadata: dict) -> dict:
    """
    Generate synthetic architectural documentation using Claude.
    """
    prompt = f"""
    You are an expert software architect creating documentation for a wiki.

    Repository: {repo_metadata['name']}
    Language: {repo_metadata['language']}

    Code files:
    {format_code_files(code_files)}

    Generate comprehensive architectural documentation including:
    1. System overview with file:line references
    2. Component descriptions with source traceability
    3. Data flow explanation
    4. Mermaid diagram for architecture

    Format as markdown with `file:line` references for every concept.
    """

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=8000,
        messages=[{"role": "user", "content": prompt}]
    )

    return {
        "instruction": "Generate architectural wiki documentation for this codebase",
        "input": format_code_files(code_files),
        "output": response.content[0].text
    }
```

#### Synthetic Data Categories

| Category | Description | Target Count |
|----------|-------------|--------------|
| Architecture Overview | High-level system design docs | 5,000 |
| Component Documentation | Individual module docs | 10,000 |
| Data Flow Documentation | Request/data lifecycle docs | 3,000 |
| Getting Started Guides | Onboarding documentation | 2,000 |
| Business Domain Mapping | Technical-to-business docs | 2,000 |
| Mermaid Diagram Generation | Architecture diagrams | 5,000 |
| Source Traceability Examples | `file:line` reference patterns | 3,000 |

### 2.2 Self-Instruct Method

Generate instruction-following examples by:
1. Seeding with 100 manually-crafted high-quality examples
2. Using gpt-oss-20b (base) to generate variations
3. Filtering with Claude for quality

```python
import random

def self_instruct_generation(seed_examples: list, num_generate: int = 1000):
    """
    Self-instruct style data augmentation.
    generate_instruction_variation, base_model, and quality_check are
    the helpers described above, sketched rather than implemented here.
    """
    generated = []

    for _ in range(num_generate):
        # Sample seed examples for context
        context_examples = random.sample(seed_examples, k=3)

        # Generate new instruction
        new_instruction = generate_instruction_variation(context_examples)

        # Generate response
        response = base_model.generate(new_instruction)

        # Quality filter with teacher model
        if quality_check(new_instruction, response):
            generated.append({
                "instruction": new_instruction,
                "output": response
            })

    return generated
```

### 2.3 Code-to-Documentation Pairs

Extract from existing well-documented repositories (see the converter sketch after this snippet):

```python
import glob
import os

def extract_code_doc_pairs(repo_path: str) -> list[dict]:
    """
    Extract code-documentation pairs from repositories
    with inline documentation or adjacent .md files.
    """
    pairs = []

    # Find code files with documentation
    for code_file in glob.glob(f"{repo_path}/**/*.ts", recursive=True):
        doc_file = code_file.replace('.ts', '.md')

        if os.path.exists(doc_file):
            pairs.append({
                "code": read_file(code_file),  # read_file: helper returning file contents
                "documentation": read_file(doc_file),
                "file_path": code_file
            })

    return pairs
```
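To feed these pairs into the same pipeline as the other sources, wrap them in the instruction format used throughout this plan. A minimal converter sketch follows; the instruction wording is illustrative, not prescriptive.

```python
def pairs_to_examples(pairs: list[dict]) -> list[dict]:
    """
    Convert raw code-doc pairs into instruction-tuning examples.
    The instruction template below is an illustrative placeholder.
    """
    examples = []
    for pair in pairs:
        examples.append({
            "instruction": (
                "Write architectural documentation for the module "
                f"`{pair['file_path']}`, including `file:line` source references."
            ),
            "input": pair["code"],
            "output": pair["documentation"],
        })
    return examples
```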
---

## 3. SemanticWiki-Specific Data

### 3.1 Tool Use Trajectories

Capture successful wiki generation sessions:

```python
# Format: instruction -> tool calls -> final documentation

TOOL_USE_EXAMPLE = {
    "instruction": "Generate architecture documentation for the authentication module",
    "trajectory": [
        {"tool": "search_codebase", "input": "authentication login", "output": "[results]"},
        {"tool": "read_file", "input": "src/auth/provider.ts", "output": "[code]"},
        {"tool": "analyze_code_structure", "input": "src/auth/", "output": "[analysis]"},
        {"tool": "write_wiki_page", "input": {"path": "auth/overview.md", "content": "..."}}
    ],
    "final_output": "# Authentication Module\n\n..."
}
```

### 3.2 Multi-Turn Conversations

Document iterative refinement patterns:

```python
MULTI_TURN_EXAMPLE = {
    "turns": [
        {"user": "Document the payment processing flow", "assistant": "[initial doc]"},
        {"user": "Add more detail about error handling", "assistant": "[refined doc]"},
        {"user": "Include sequence diagram", "assistant": "[doc with mermaid]"}
    ]
}
```

### 3.3 Source Traceability Training

Explicit training on `file:line` reference generation:

```python
TRACEABILITY_EXAMPLE = {
    "instruction": "Add source references to this documentation",
    "input": """
    The UserService handles user authentication by validating credentials
    against the database and generating JWT tokens.
    """,
    "output": """
    The `UserService` handles user authentication by validating credentials
    against the database ([`src/services/user.ts:45-67`](../src/services/user.ts#L45-L67))
    and generating JWT tokens ([`src/auth/jwt.ts:23-41`](../src/auth/jwt.ts#L23-L41)).
    """
}
```

---

## 4. Data Format

### 4.1 Harmony Format for gpt-oss

gpt-oss models require the [Harmony response format](https://github.com/openai/harmony). The sketch below uses the API names documented in the `openai-harmony` README (`load_harmony_encoding`, `Conversation`, `Message`, `Role`, `SystemContent`, `DeveloperContent`); verify them against the installed version. In Harmony, the "system prompt" in the usual sense belongs in the developer message, and channels apply to assistant turns. Note that `render_conversation_for_completion` appends a trailing assistant header intended for inference prompts; strip it (or use a training-oriented renderer, if your version provides one) when building SFT targets.

```python
import json

from openai_harmony import (
    Conversation,
    DeveloperContent,
    HarmonyEncodingName,
    Message,
    Role,
    SystemContent,
    load_harmony_encoding,
)

encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)

def format_for_harmony(example: dict) -> str:
    """
    Render an example in Harmony format for gpt-oss training.
    """
    messages = [
        Message.from_role_and_content(Role.SYSTEM, SystemContent.new()),
        Message.from_role_and_content(
            Role.DEVELOPER,
            DeveloperContent.new().with_instructions(WIKI_AGENT_SYSTEM_PROMPT),
        ),
        Message.from_role_and_content(Role.USER, example["instruction"]),
    ]

    # Tool-call steps become assistant turns on the commentary channel
    if "trajectory" in example:
        for step in example["trajectory"]:
            messages.append(
                Message.from_role_and_content(Role.ASSISTANT, json.dumps(step))
                .with_channel("commentary")
            )

    # Final response goes on the final channel
    messages.append(
        Message.from_role_and_content(Role.ASSISTANT, example["output"])
        .with_channel("final")
    )

    convo = Conversation.from_messages(messages)
    tokens = encoding.render_conversation_for_completion(convo, Role.ASSISTANT)
    # Decode back to text for the JSONL "text" field (verify the decode
    # method name against your installed openai-harmony version)
    return encoding.decode(tokens)
```

### 4.2 Alternative: ChatML Format (for Ollama/vLLM)

```python
def format_chatml(example: dict) -> str:
    """
    Standard ChatML format for broader compatibility.
    """
    return f"""<|im_start|>system
{WIKI_AGENT_SYSTEM_PROMPT}
<|im_end|>
<|im_start|>user
{example["instruction"]}

{example.get("input", "")}
<|im_end|>
<|im_start|>assistant
{example["output"]}
<|im_end|>"""
```

### 4.3 JSONL Output Format

Final training data format (the `text` field holds the full Harmony- or ChatML-rendered example):

```jsonl
{"text": "", "source": "codewikibench", "category": "architecture"}
{"text": "", "source": "synthetic", "category": "component"}
{"text": "", "source": "deepwiki", "category": "data_flow"}
```

---

## 5. Data Quality Assurance

### 5.1 Automated Quality Checks

```python
import re

def quality_check(example: dict) -> bool:
    """
    Validate training example quality.
+ """ + checks = [ + # Must have source references + has_source_references(example["output"]), + + # Minimum content length + len(example["output"]) >= 500, + + # Valid markdown + is_valid_markdown(example["output"]), + + # No hallucinated file paths + validate_file_references(example), + + # Proper mermaid syntax (if diagrams present) + validate_mermaid_diagrams(example["output"]), + ] + + return all(checks) + +def has_source_references(text: str) -> bool: + """Check for file:line reference patterns.""" + pattern = r'`[a-zA-Z0-9/_.-]+:\d+(-\d+)?`' + return bool(re.search(pattern, text)) +``` + +### 5.2 Human Review Sample + +- Review 5% of synthetic data manually +- Use LLM-as-judge for automated quality scoring +- Track quality metrics per data source + +### 5.3 Deduplication + +```python +from datasketch import MinHash, MinHashLSH + +def deduplicate_dataset(examples: list[dict]) -> list[dict]: + """ + Remove near-duplicate examples using MinHash LSH. + """ + lsh = MinHashLSH(threshold=0.8, num_perm=128) + unique_examples = [] + + for i, example in enumerate(examples): + minhash = compute_minhash(example["output"]) + + if not lsh.query(minhash): + lsh.insert(f"doc_{i}", minhash) + unique_examples.append(example) + + return unique_examples +``` + +--- + +## 6. Dataset Splits + +| Split | Percentage | Purpose | +|-------|------------|---------| +| Train | 90% | Fine-tuning | +| Validation | 5% | Hyperparameter tuning | +| Test | 5% | Final evaluation | + +### Stratification + +Ensure balanced representation across: +- Programming languages (TypeScript, Python, Java, C++, etc.) +- Documentation types (architecture, component, data flow, guides) +- Repository sizes (small <10K LOC, medium 10-100K, large >100K) +- Data sources (real vs synthetic) + +--- + +## 7. Data Pipeline Implementation + +### 7.1 Directory Structure + +``` +fine-tuning/ +├── data/ +│ ├── raw/ +│ │ ├── codewikibench/ +│ │ ├── deepwiki/ +│ │ └── synthetic/ +│ ├── processed/ +│ │ ├── train.jsonl +│ │ ├── validation.jsonl +│ │ └── test.jsonl +│ └── quality_reports/ +├── scripts/ +│ ├── crawl_deepwiki.py +│ ├── process_codewikibench.py +│ ├── generate_synthetic.py +│ ├── format_harmony.py +│ └── quality_check.py +└── configs/ + └── data_config.yaml +``` + +### 7.2 Pipeline Commands + +```bash +# Step 1: Download CodeWikiBench +python scripts/process_codewikibench.py --output data/raw/codewikibench/ + +# Step 2: Crawl DeepWiki (respect rate limits) +python scripts/crawl_deepwiki.py --repos repos.txt --output data/raw/deepwiki/ + +# Step 3: Generate synthetic data +python scripts/generate_synthetic.py \ + --source-repos /path/to/repos \ + --num-examples 20000 \ + --output data/raw/synthetic/ + +# Step 4: Format for Harmony +python scripts/format_harmony.py \ + --input data/raw/ \ + --output data/processed/ + +# Step 5: Quality check and split +python scripts/quality_check.py \ + --input data/processed/ \ + --output data/processed/ \ + --train-ratio 0.9 \ + --val-ratio 0.05 +``` + +--- + +## 8. Estimated Timeline & Resources + +| Phase | Duration | Compute Required | +|-------|----------|------------------| +| CodeWikiBench processing | 2-4 hours | CPU only | +| DeepWiki crawling | 1-2 days | CPU + network | +| Synthetic generation | 2-3 days | API calls (~$200-500) | +| Quality filtering | 4-8 hours | CPU/GPU for embeddings | +| Formatting & splitting | 1-2 hours | CPU only | + +**Total estimated cost:** $300-600 (primarily synthetic generation API costs) + +--- + +## 9. 
References + +- [CodeWikiBench Dataset](https://huggingface.co/datasets/anhnh2002/codewikibench) +- [CodeWiki Paper (arXiv:2510.24428)](https://arxiv.org/abs/2510.24428) +- [DeepWiki](https://deepwiki.org/) +- [OpenDeepWiki](https://github.com/AIDotNet/OpenDeepWiki) +- [Harmony Response Format](https://github.com/openai/harmony) +- [Synthetic Data Generation Survey](https://arxiv.org/abs/2503.14023) +- [LLM-Synthetic-Data Reading List](https://github.com/pengr/LLM-Synthetic-Data) diff --git a/fine-tuning/02-FINE-TUNING-EXECUTION.md b/fine-tuning/02-FINE-TUNING-EXECUTION.md new file mode 100644 index 0000000..0f41124 --- /dev/null +++ b/fine-tuning/02-FINE-TUNING-EXECUTION.md @@ -0,0 +1,696 @@ +# Fine-Tuning Execution Plan + +This document details the procedure for fine-tuning gpt-oss-20b to become a specialized architectural wiki agent for SemanticWiki. + +## Overview + +### Model Specifications + +| Property | Value | +|----------|-------| +| Base Model | [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) | +| Architecture | Mixture-of-Experts (MoE) Transformer | +| Total Parameters | 20.9B | +| Active Parameters | 3.6B per token | +| MoE Experts | 32 experts, Top-4 routing | +| Context Length | 128K tokens (native) | +| Quantization | MXFP4 (4.25 bits per parameter) | +| License | Apache 2.0 | + +### Fine-Tuning Approach + +We will use **LoRA (Low-Rank Adaptation)** for parameter-efficient fine-tuning: + +- **Why LoRA:** Reduces memory from 65GB+ to 14-16GB VRAM +- **Target:** Attention and MoE expert layers +- **Expected improvement:** Task-specific optimization without catastrophic forgetting + +--- + +## 1. Hardware Requirements + +### Recommended Configurations + +| Configuration | GPU | VRAM | Training Time (20K examples) | Cost Estimate | +|---------------|-----|------|------------------------------|---------------| +| **Optimal** | H100 SXM 80GB | 80GB | 17-20 minutes | ~$3-5/run | +| **Good** | A100 80GB | 80GB | 25-35 minutes | ~$4-6/run | +| **Acceptable** | RTX 4090 24GB | 24GB | 60-90 minutes | Consumer HW | +| **Budget** | RTX 3090 24GB | 24GB | 90-120 minutes | Consumer HW | + +### Cloud GPU Options + +```bash +# RunPod (recommended for quick experiments) +# H100 SXM: ~$3.89/hr +runpod create --gpu H100_SXM --template pytorch + +# Lambda Labs +# H100: ~$2.49/hr (when available) + +# AWS (SageMaker) +# ml.p5.xlarge (H100): ~$10.98/hr + +# Google Cloud (Vertex AI) +# a3-highgpu-1g (H100): ~$5.07/hr +``` + +### Memory Requirements by Method + +| Method | VRAM Required | Notes | +|--------|---------------|-------| +| Full Fine-tuning | 300GB+ | Multi-GPU required | +| BF16 LoRA | 44GB | Standard training | +| QLoRA (4-bit) | 14-16GB | Unsloth optimized | +| MXFP4 Native | 16GB | gpt-oss native format | + +--- + +## 2. 
Environment Setup

### 2.1 Dependencies

```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate

# Core dependencies (quote version specifiers so the shell
# does not treat '>' as an output redirect)
pip install "torch>=2.1.0" --index-url https://download.pytorch.org/whl/cu121
pip install "transformers>=4.40.0"
pip install "accelerate>=0.27.0"
pip install "peft>=0.10.0"
pip install "trl>=0.8.0"
pip install "bitsandbytes>=0.43.0"
pip install "datasets>=2.18.0"

# gpt-oss specific
pip install openai-harmony  # Harmony format support

# Optional: Unsloth for memory optimization
pip install unsloth
```

### 2.2 requirements.txt

```text
torch>=2.1.0
transformers>=4.40.0
accelerate>=0.27.0
peft>=0.10.0
trl>=0.8.0
bitsandbytes>=0.43.0
datasets>=2.18.0
openai-harmony>=1.0.0
wandb>=0.16.0
tensorboard>=2.16.0
einops>=0.7.0
flash-attn>=2.5.0
```

### 2.3 Model Download

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

# Download model (will use MXFP4 weights)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)
```

---

## 3. LoRA Configuration

### 3.1 Target Modules

gpt-oss-20b uses MoE architecture. Target both attention and expert layers:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,              # Rank (8, 16, 32, 64 common choices)
    lora_alpha=32,     # Scaling factor (typically 2x rank)
    lora_dropout=0.05, # Dropout for regularization

    # Target modules for gpt-oss MoE
    target_modules=[
        # Attention layers
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",

        # MoE expert layers (critical for task adaptation)
        "gate_proj",
        "up_proj",
        "down_proj",

        # Router (optional, for expert selection tuning)
        # "router",
    ],

    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Expected: trainable params: ~50M / 20.9B total (0.24%)
```

### 3.2 Rank Selection Guide

| LoRA Rank | Trainable Params | VRAM Impact | Use Case |
|-----------|------------------|-------------|----------|
| r=8 | ~25M | Minimal | Quick experiments |
| r=16 | ~50M | Low | **Recommended starting point** |
| r=32 | ~100M | Moderate | Complex task adaptation |
| r=64 | ~200M | Higher | Maximum expressiveness |

---

## 4. 
Training Configuration + +### 4.1 Hyperparameters + +```python +from trl import SFTConfig, SFTTrainer + +training_args = SFTConfig( + # Output + output_dir="./output/semanticwiki-gpt-oss", + run_name="semanticwiki-wiki-agent-v1", + + # Training duration + num_train_epochs=3, + max_steps=-1, # -1 = use epochs + + # Batch size + per_device_train_batch_size=1, # Keep low for long sequences + per_device_eval_batch_size=1, + gradient_accumulation_steps=8, # Effective batch = 8 + + # Learning rate + learning_rate=2e-4, # Higher for LoRA + lr_scheduler_type="cosine_with_min_lr", + lr_scheduler_kwargs={"min_lr": 1e-5}, + warmup_ratio=0.03, + + # Optimization + optim="adamw_torch_fused", + weight_decay=0.01, + max_grad_norm=1.0, + + # Precision + bf16=True, # Use bfloat16 (H100 optimal) + tf32=True, # TensorFloat-32 for matmuls + + # Sequence length + max_seq_length=8192, # Adjust based on VRAM + + # Logging + logging_steps=10, + logging_first_step=True, + report_to=["wandb", "tensorboard"], + + # Evaluation + eval_strategy="steps", + eval_steps=100, + + # Checkpointing + save_strategy="steps", + save_steps=500, + save_total_limit=3, + load_best_model_at_end=True, + metric_for_best_model="eval_loss", + + # Efficiency + gradient_checkpointing=True, + gradient_checkpointing_kwargs={"use_reentrant": False}, + + # Dataset + dataset_text_field="text", + packing=True, # Pack sequences for efficiency +) +``` + +### 4.2 Hyperparameter Tuning Ranges + +| Parameter | Range | Recommended Start | +|-----------|-------|-------------------| +| Learning Rate | 1e-5 to 5e-4 | 2e-4 | +| LoRA Rank | 8 to 64 | 16 | +| LoRA Alpha | 16 to 64 | 32 | +| Batch Size (effective) | 4 to 32 | 8 | +| Epochs | 1 to 5 | 3 | +| Warmup Ratio | 0.01 to 0.1 | 0.03 | +| Weight Decay | 0 to 0.1 | 0.01 | + +--- + +## 5. Training Script + +### 5.1 Full Training Script + +```python +#!/usr/bin/env python3 +""" +Fine-tune gpt-oss-20b for SemanticWiki architectural documentation. 
+""" + +import torch +from datasets import load_dataset +from transformers import ( + AutoModelForCausalLM, + AutoTokenizer, + BitsAndBytesConfig, +) +from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training +from trl import SFTConfig, SFTTrainer +import wandb + +# Configuration +MODEL_ID = "openai/gpt-oss-20b" +DATASET_PATH = "./data/processed/train.jsonl" +OUTPUT_DIR = "./output/semanticwiki-gpt-oss" + +def main(): + # Initialize wandb + wandb.init( + project="semanticwiki-finetuning", + name="gpt-oss-20b-wiki-agent-v1", + config={ + "model": MODEL_ID, + "lora_r": 16, + "learning_rate": 2e-4, + } + ) + + # Load tokenizer + tokenizer = AutoTokenizer.from_pretrained(MODEL_ID) + tokenizer.pad_token = tokenizer.eos_token + tokenizer.padding_side = "right" + + # Quantization config (for lower VRAM) + bnb_config = BitsAndBytesConfig( + load_in_4bit=True, + bnb_4bit_quant_type="nf4", + bnb_4bit_compute_dtype=torch.bfloat16, + bnb_4bit_use_double_quant=True, + ) + + # Load model + model = AutoModelForCausalLM.from_pretrained( + MODEL_ID, + quantization_config=bnb_config, + device_map="auto", + trust_remote_code=True, + attn_implementation="flash_attention_2", + ) + + # Prepare for training + model = prepare_model_for_kbit_training(model) + + # LoRA configuration + lora_config = LoraConfig( + r=16, + lora_alpha=32, + lora_dropout=0.05, + target_modules=[ + "q_proj", "k_proj", "v_proj", "o_proj", + "gate_proj", "up_proj", "down_proj", + ], + bias="none", + task_type="CAUSAL_LM", + ) + + model = get_peft_model(model, lora_config) + model.print_trainable_parameters() + + # Load dataset + dataset = load_dataset("json", data_files={ + "train": DATASET_PATH, + "validation": DATASET_PATH.replace("train", "validation"), + }) + + # Training arguments + training_args = SFTConfig( + output_dir=OUTPUT_DIR, + num_train_epochs=3, + per_device_train_batch_size=1, + gradient_accumulation_steps=8, + learning_rate=2e-4, + lr_scheduler_type="cosine_with_min_lr", + warmup_ratio=0.03, + bf16=True, + logging_steps=10, + eval_strategy="steps", + eval_steps=100, + save_strategy="steps", + save_steps=500, + save_total_limit=3, + load_best_model_at_end=True, + gradient_checkpointing=True, + max_seq_length=8192, + dataset_text_field="text", + packing=True, + report_to=["wandb"], + ) + + # Initialize trainer + trainer = SFTTrainer( + model=model, + args=training_args, + train_dataset=dataset["train"], + eval_dataset=dataset["validation"], + tokenizer=tokenizer, + ) + + # Train + trainer.train() + + # Save final model + trainer.save_model(f"{OUTPUT_DIR}/final") + tokenizer.save_pretrained(f"{OUTPUT_DIR}/final") + + # Merge LoRA weights (optional, for deployment) + merged_model = model.merge_and_unload() + merged_model.save_pretrained(f"{OUTPUT_DIR}/merged") + + wandb.finish() + +if __name__ == "__main__": + main() +``` + +### 5.2 Unsloth Optimized Script (Lower VRAM) + +```python +#!/usr/bin/env python3 +""" +Fine-tune gpt-oss-20b with Unsloth for 80% memory reduction. +Runs on 14GB VRAM (RTX 4070, 3090, etc.) 
+""" + +from unsloth import FastLanguageModel +from datasets import load_dataset +from trl import SFTTrainer, SFTConfig + +# Configuration +MODEL_ID = "openai/gpt-oss-20b" +MAX_SEQ_LENGTH = 8192 + +def main(): + # Load model with Unsloth (native MXFP4 support) + model, tokenizer = FastLanguageModel.from_pretrained( + model_name=MODEL_ID, + max_seq_length=MAX_SEQ_LENGTH, + dtype=None, # Auto-detect + load_in_4bit=True, + ) + + # Add LoRA adapters + model = FastLanguageModel.get_peft_model( + model, + r=16, + lora_alpha=32, + lora_dropout=0.05, + target_modules=[ + "q_proj", "k_proj", "v_proj", "o_proj", + "gate_proj", "up_proj", "down_proj", + ], + bias="none", + use_gradient_checkpointing="unsloth", # 30% more memory efficient + random_state=42, + ) + + # Load dataset + dataset = load_dataset("json", data_files={ + "train": "./data/processed/train.jsonl" + }) + + # Training config + training_args = SFTConfig( + output_dir="./output/semanticwiki-gpt-oss-unsloth", + num_train_epochs=3, + per_device_train_batch_size=2, # Can use larger batch with Unsloth + gradient_accumulation_steps=4, + learning_rate=2e-4, + warmup_ratio=0.03, + bf16=True, + logging_steps=10, + save_strategy="steps", + save_steps=500, + max_seq_length=MAX_SEQ_LENGTH, + dataset_text_field="text", + packing=True, + ) + + # Train + trainer = SFTTrainer( + model=model, + args=training_args, + train_dataset=dataset["train"], + tokenizer=tokenizer, + ) + + trainer.train() + + # Save + model.save_pretrained_merged( + "./output/semanticwiki-gpt-oss-unsloth/merged", + tokenizer, + save_method="merged_16bit", + ) + +if __name__ == "__main__": + main() +``` + +--- + +## 6. Training Monitoring + +### 6.1 Key Metrics to Track + +| Metric | Target | Warning Signs | +|--------|--------|---------------| +| Training Loss | Decreasing steadily | Spikes, plateaus early | +| Validation Loss | Decreasing, close to train | Increasing (overfitting) | +| Learning Rate | Following schedule | N/A | +| GPU Memory | <95% utilization | OOM errors | +| Throughput | Consistent tokens/sec | Degradation | + +### 6.2 Wandb Dashboard Setup + +```python +# Log custom metrics during training +def compute_metrics(eval_preds): + predictions, labels = eval_preds + + # Custom metrics for wiki quality + metrics = { + "has_source_refs": compute_source_ref_ratio(predictions), + "valid_markdown": compute_markdown_validity(predictions), + "mermaid_accuracy": compute_mermaid_accuracy(predictions), + } + + return metrics +``` + +### 6.3 Early Stopping + +```python +from transformers import EarlyStoppingCallback + +trainer = SFTTrainer( + # ... other args ... + callbacks=[ + EarlyStoppingCallback( + early_stopping_patience=3, + early_stopping_threshold=0.001, + ) + ], +) +``` + +--- + +## 7. 
Post-Training Processing + +### 7.1 Merge LoRA Weights + +```python +from peft import PeftModel + +# Load base model +base_model = AutoModelForCausalLM.from_pretrained(MODEL_ID) + +# Load LoRA adapter +model = PeftModel.from_pretrained(base_model, "./output/semanticwiki-gpt-oss/final") + +# Merge weights +merged_model = model.merge_and_unload() + +# Save merged model +merged_model.save_pretrained("./output/semanticwiki-gpt-oss-merged") +``` + +### 7.2 Convert to GGUF (for SemanticWiki local mode) + +```bash +# Clone llama.cpp +git clone https://github.com/ggerganov/llama.cpp +cd llama.cpp + +# Convert to GGUF +python convert_hf_to_gguf.py \ + ../output/semanticwiki-gpt-oss-merged \ + --outfile ../output/semanticwiki-wiki-agent.gguf \ + --outtype f16 + +# Quantize (optional, for smaller size) +./llama-quantize \ + ../output/semanticwiki-wiki-agent.gguf \ + ../output/semanticwiki-wiki-agent-q5_k_m.gguf \ + q5_k_m +``` + +### 7.3 Upload to Hub (Optional) + +```python +from huggingface_hub import HfApi + +api = HfApi() + +# Upload merged model +api.upload_folder( + folder_path="./output/semanticwiki-gpt-oss-merged", + repo_id="your-org/semanticwiki-wiki-agent", + repo_type="model", +) + +# Upload GGUF +api.upload_file( + path_or_fileobj="./output/semanticwiki-wiki-agent-q5_k_m.gguf", + path_in_repo="semanticwiki-wiki-agent-q5_k_m.gguf", + repo_id="your-org/semanticwiki-wiki-agent-gguf", + repo_type="model", +) +``` + +--- + +## 8. Integration with SemanticWiki + +### 8.1 Using Fine-Tuned Model + +After training, use the model with SemanticWiki: + +```bash +# Option 1: GGUF with local-llama-provider +semanticwiki generate -r ./my-project \ + --full-local \ + --model-path ~/.semanticwiki/models/semanticwiki-wiki-agent-q5_k_m.gguf + +# Option 2: Via Ollama +ollama create semanticwiki-agent -f Modelfile +semanticwiki generate -r ./my-project \ + --full-local --use-ollama --local-model semanticwiki-agent +``` + +### 8.2 Modelfile for Ollama + +```dockerfile +# Modelfile +FROM ./semanticwiki-wiki-agent-q5_k_m.gguf + +TEMPLATE """{{ if .System }}<|start|>system<|channel|>final<|end|> +{{ .System }}<|start|>end<|end|>{{ end }}{{ if .Prompt }}<|start|>user<|channel|>final<|end|> +{{ .Prompt }}<|start|>end<|end|>{{ end }}<|start|>assistant<|channel|>final<|end|> +{{ .Response }}<|start|>end<|end|>""" + +PARAMETER temperature 0.7 +PARAMETER top_p 0.9 +PARAMETER num_ctx 32768 +PARAMETER stop "<|start|>end<|end|>" +``` + +--- + +## 9. Training Time Estimates + +### By Dataset Size (H100 80GB) + +| Examples | Epochs | Estimated Time | Tokens Processed | +|----------|--------|----------------|------------------| +| 5,000 | 3 | ~10 minutes | ~50M | +| 10,000 | 3 | ~17 minutes | ~100M | +| 20,000 | 3 | ~30 minutes | ~200M | +| 50,000 | 3 | ~75 minutes | ~500M | + +### By Hardware (20K examples, 3 epochs) + +| GPU | Time | Cost | +|-----|------|------| +| H100 80GB | 30 min | ~$2-3 | +| A100 80GB | 45 min | ~$3-4 | +| RTX 4090 24GB | 90 min | Consumer | +| RTX 3090 24GB | 120 min | Consumer | + +--- + +## 10. 
Troubleshooting + +### Common Issues + +| Issue | Cause | Solution | +|-------|-------|----------| +| OOM Error | Batch too large | Reduce `per_device_train_batch_size`, increase `gradient_accumulation_steps` | +| Loss NaN | Learning rate too high | Reduce `learning_rate` to 1e-4 or 5e-5 | +| No improvement | Data quality issues | Review training data, check format | +| Slow training | No Flash Attention | Install `flash-attn`, use `attn_implementation="flash_attention_2"` | +| Harmony format errors | Incorrect tokenization | Use `openai-harmony` library for formatting | + +### Memory Optimization Checklist + +```python +# 1. Enable gradient checkpointing +gradient_checkpointing=True + +# 2. Use 4-bit quantization +load_in_4bit=True + +# 3. Use Unsloth (if available) +from unsloth import FastLanguageModel + +# 4. Reduce sequence length +max_seq_length=4096 # Instead of 8192 + +# 5. Use smaller LoRA rank +r=8 # Instead of 16 + +# 6. Enable CPU offloading +device_map="auto" # Offloads to CPU when needed +``` + +--- + +## 11. References + +- [gpt-oss-20b on HuggingFace](https://huggingface.co/openai/gpt-oss-20b) +- [OpenAI Cookbook: Fine-tuning gpt-oss](https://cookbook.openai.com/articles/gpt-oss/fine-tune-transfomers) +- [Harmony Response Format](https://github.com/openai/harmony) +- [Unsloth Documentation](https://docs.unsloth.ai/models/gpt-oss-how-to-run-and-fine-tune) +- [TRL SFTTrainer](https://huggingface.co/docs/trl/sft_trainer) +- [PEFT LoRA](https://huggingface.co/docs/peft/conceptual_guides/lora) +- [Analytics Vidhya: Fine-tuning gpt-oss](https://www.analyticsvidhya.com/blog/2025/10/finetuning-gpt-oss/) diff --git a/fine-tuning/03-EVALUATION.md b/fine-tuning/03-EVALUATION.md new file mode 100644 index 0000000..ea31261 --- /dev/null +++ b/fine-tuning/03-EVALUATION.md @@ -0,0 +1,792 @@ +# Evaluation Plan + +This document outlines the evaluation methodology to verify that the fine-tuned gpt-oss-20b model improves over the base model for architectural wiki generation in SemanticWiki. + +## Overview + +### Evaluation Goals + +1. **Demonstrate improvement** over base gpt-oss-20b on wiki generation tasks +2. **Measure task-specific capabilities** (source traceability, diagram generation, etc.) +3. **Ensure no regression** on general capabilities +4. **Benchmark against alternatives** (Claude, Qwen 2.5 Coder, DeepWiki) + +### Evaluation Strategy + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ Evaluation Pipeline │ +├─────────────────────────────────────────────────────────────────┤ +│ 1. Automated Metrics → BLEU, ROUGE, BERTScore, Custom │ +│ 2. CodeWikiBench → Standardized benchmark comparison │ +│ 3. Task-Specific Evals → Source refs, diagrams, tool use │ +│ 4. End-to-End Testing → Full wiki generation on real repos │ +│ 5. Human Evaluation → Expert review of generated wikis │ +└─────────────────────────────────────────────────────────────────┘ +``` + +--- + +## 1. Automated Metrics + +### 1.1 Standard NLG Metrics + +These metrics compare generated documentation against reference documentation. + +```python +from evaluate import load +import numpy as np + +# Load metrics +bleu = load("bleu") +rouge = load("rouge") +bertscore = load("bertscore") + +def compute_standard_metrics(predictions: list[str], references: list[str]) -> dict: + """ + Compute standard NLG metrics for documentation quality. 
+ """ + results = {} + + # BLEU (n-gram precision) + bleu_result = bleu.compute(predictions=predictions, references=references) + results["bleu"] = bleu_result["bleu"] + + # ROUGE (recall-oriented) + rouge_result = rouge.compute(predictions=predictions, references=references) + results["rouge1"] = rouge_result["rouge1"] + results["rouge2"] = rouge_result["rouge2"] + results["rougeL"] = rouge_result["rougeL"] + + # BERTScore (semantic similarity) + bertscore_result = bertscore.compute( + predictions=predictions, + references=references, + lang="en", + model_type="microsoft/deberta-xlarge-mnli" + ) + results["bertscore_f1"] = np.mean(bertscore_result["f1"]) + + return results +``` + +### 1.2 Target Scores + +| Metric | Base gpt-oss-20b | Target (Fine-tuned) | Improvement | +|--------|------------------|---------------------|-------------| +| BLEU | ~0.15 | >0.25 | +67% | +| ROUGE-L | ~0.35 | >0.50 | +43% | +| BERTScore F1 | ~0.70 | >0.80 | +14% | + +### 1.3 Limitations of Standard Metrics + +Standard metrics have known limitations for documentation: +- BLEU penalizes valid paraphrasing +- ROUGE doesn't capture semantic correctness +- BERTScore may miss structural quality + +**Recommendation:** Use standard metrics as a baseline, but rely more heavily on task-specific and human evaluation. + +--- + +## 2. CodeWikiBench Evaluation + +### 2.1 Benchmark Overview + +[CodeWikiBench](https://github.com/FSoft-AI4Code/CodeWikiBench) is the first benchmark for repository-level documentation quality. + +- **Repositories:** 22 projects across 6 languages +- **Rubrics:** Hierarchical quality assessment criteria +- **Baseline:** DeepWiki achieves 68.79% with proprietary models + +### 2.2 Running CodeWikiBench + +```bash +# Clone benchmark +git clone https://github.com/FSoft-AI4Code/CodeWikiBench +cd CodeWikiBench + +# Install dependencies +pip install -r requirements.txt + +# Load dataset +python -c " +from datasets import load_dataset +dataset = load_dataset('anhnh2002/codewikibench') +print(f'Loaded {len(dataset[\"train\"])} repositories') +" +``` + +### 2.3 Evaluation Script + +```python +from datasets import load_dataset +from codewikibench import evaluate_documentation + +def evaluate_on_codewikibench(model, tokenizer, num_repos: int = 5): + """ + Evaluate model on CodeWikiBench subset. + """ + dataset = load_dataset("anhnh2002/codewikibench") + + results = [] + for repo in dataset["train"][:num_repos]: + # Generate documentation + generated_docs = generate_wiki_for_repo( + model, tokenizer, + repo_name=repo["repo_name"], + commit_id=repo["commit_id"] + ) + + # Evaluate against rubrics + scores = evaluate_documentation( + generated=generated_docs, + reference=repo["structured_docs"], + rubrics=repo["rubrics"] + ) + + results.append({ + "repo": repo["repo_name"], + "scores": scores + }) + + return aggregate_results(results) +``` + +### 2.4 Target Performance + +| Model | CodeWikiBench Score | Notes | +|-------|---------------------|-------| +| DeepWiki (baseline) | 68.79% | Proprietary | +| CodeWiki (open) | 64.80% | Open-source | +| gpt-oss-20b (base) | ~55-60% | Estimated | +| **gpt-oss-20b (fine-tuned)** | **>70%** | **Target** | + +--- + +## 3. Task-Specific Evaluation + +### 3.1 Source Traceability Score + +Measures the model's ability to generate accurate `file:line` references. + +```python +import re +from pathlib import Path + +def evaluate_source_traceability( + generated_doc: str, + repo_path: str +) -> dict: + """ + Evaluate source reference accuracy. 
+ """ + # Extract file:line references + pattern = r'`([a-zA-Z0-9/_.-]+):(\d+)(?:-(\d+))?`' + references = re.findall(pattern, generated_doc) + + total_refs = len(references) + valid_refs = 0 + invalid_refs = [] + + for file_path, start_line, end_line in references: + full_path = Path(repo_path) / file_path + + if full_path.exists(): + lines = full_path.read_text().split('\n') + start = int(start_line) + end = int(end_line) if end_line else start + + # Check line numbers are valid + if 1 <= start <= len(lines) and 1 <= end <= len(lines): + valid_refs += 1 + else: + invalid_refs.append(f"{file_path}:{start_line} (line out of range)") + else: + invalid_refs.append(f"{file_path} (file not found)") + + return { + "total_references": total_refs, + "valid_references": valid_refs, + "accuracy": valid_refs / total_refs if total_refs > 0 else 0, + "invalid_refs": invalid_refs, + "has_references": total_refs > 0 + } +``` + +**Target Scores:** + +| Metric | Base | Target | +|--------|------|--------| +| Reference Accuracy | <50% | >90% | +| References per 1K words | ~2 | >10 | +| File existence accuracy | ~60% | >95% | + +### 3.2 Mermaid Diagram Quality + +Evaluate generated architecture diagrams. + +```python +import subprocess +import tempfile + +def evaluate_mermaid_diagrams(generated_doc: str) -> dict: + """ + Extract and validate Mermaid diagrams. + """ + # Extract mermaid blocks + mermaid_pattern = r'```mermaid\n(.*?)```' + diagrams = re.findall(mermaid_pattern, generated_doc, re.DOTALL) + + results = { + "total_diagrams": len(diagrams), + "valid_syntax": 0, + "renders_successfully": 0, + "diagram_types": [], + } + + for diagram in diagrams: + # Check syntax validity + if validate_mermaid_syntax(diagram): + results["valid_syntax"] += 1 + + # Check rendering + if render_mermaid(diagram): + results["renders_successfully"] += 1 + + # Identify diagram type + diagram_type = identify_diagram_type(diagram) + results["diagram_types"].append(diagram_type) + + results["syntax_accuracy"] = ( + results["valid_syntax"] / results["total_diagrams"] + if results["total_diagrams"] > 0 else 0 + ) + + return results + +def validate_mermaid_syntax(diagram: str) -> bool: + """Validate Mermaid diagram syntax using mmdc CLI.""" + try: + with tempfile.NamedTemporaryFile(mode='w', suffix='.mmd') as f: + f.write(diagram) + f.flush() + result = subprocess.run( + ['mmdc', '-i', f.name, '-o', '/dev/null', '--quiet'], + capture_output=True, + timeout=10 + ) + return result.returncode == 0 + except Exception: + return False + +def identify_diagram_type(diagram: str) -> str: + """Identify the type of Mermaid diagram.""" + first_line = diagram.strip().split('\n')[0].lower() + if 'flowchart' in first_line or 'graph' in first_line: + return 'flowchart' + elif 'sequencediagram' in first_line or 'sequence' in first_line: + return 'sequence' + elif 'classdiagram' in first_line or 'class' in first_line: + return 'class' + elif 'erdiagram' in first_line or 'er' in first_line: + return 'er' + elif 'statediagram' in first_line or 'state' in first_line: + return 'state' + else: + return 'unknown' +``` + +**Target Scores:** + +| Metric | Base | Target | +|--------|------|--------| +| Diagrams per wiki | ~0.5 | >3 | +| Syntax validity | ~70% | >95% | +| Renders successfully | ~60% | >90% | + +### 3.3 Tool Use Accuracy + +Evaluate the model's ability to use SemanticWiki tools correctly. + +```python +def evaluate_tool_use( + model_outputs: list[dict], + expected_tools: list[str] +) -> dict: + """ + Evaluate tool calling accuracy. 
+ """ + results = { + "total_tool_calls": 0, + "valid_tool_calls": 0, + "invalid_tool_calls": [], + "tools_used": set(), + "expected_tools_used": 0, + } + + for output in model_outputs: + if "tool_calls" in output: + for call in output["tool_calls"]: + results["total_tool_calls"] += 1 + results["tools_used"].add(call["name"]) + + if validate_tool_call(call): + results["valid_tool_calls"] += 1 + else: + results["invalid_tool_calls"].append(call) + + # Check expected tools were used + for expected in expected_tools: + if expected in results["tools_used"]: + results["expected_tools_used"] += 1 + + results["tool_accuracy"] = ( + results["valid_tool_calls"] / results["total_tool_calls"] + if results["total_tool_calls"] > 0 else 0 + ) + + results["expected_coverage"] = ( + results["expected_tools_used"] / len(expected_tools) + if expected_tools else 1 + ) + + return results + +EXPECTED_WIKI_TOOLS = [ + "search_codebase", + "read_file", + "analyze_code_structure", + "write_wiki_page", + "verify_wiki_completeness" +] +``` + +**Target Scores:** + +| Metric | Base | Target | +|--------|------|--------| +| Tool call validity | ~80% | >95% | +| Expected tools used | ~60% | >90% | +| Tool call efficiency | N/A | <20 calls per page | + +### 3.4 Documentation Completeness + +Check that generated wikis cover all required sections. + +```python +def evaluate_completeness(wiki_output: dict) -> dict: + """ + Evaluate wiki completeness against expected structure. + """ + expected_sections = { + "architecture_overview": False, + "component_docs": False, + "data_flow": False, + "getting_started": False, + "mermaid_diagrams": False, + "source_references": False, + "internal_links": False, + } + + # Check each section + for page in wiki_output.get("pages", []): + content = page.get("content", "") + + if "architecture" in page["path"].lower(): + expected_sections["architecture_overview"] = True + + if "component" in page["path"].lower() or "/components/" in page["path"]: + expected_sections["component_docs"] = True + + if "data" in content.lower() and "flow" in content.lower(): + expected_sections["data_flow"] = True + + if "getting started" in content.lower() or "quickstart" in content.lower(): + expected_sections["getting_started"] = True + + if "```mermaid" in content: + expected_sections["mermaid_diagrams"] = True + + if re.search(r'`[a-zA-Z0-9/_.-]+:\d+', content): + expected_sections["source_references"] = True + + if re.search(r'\[.*?\]\(\.\./.*?\.md\)', content): + expected_sections["internal_links"] = True + + completeness_score = sum(expected_sections.values()) / len(expected_sections) + + return { + "sections": expected_sections, + "completeness_score": completeness_score, + "missing": [k for k, v in expected_sections.items() if not v] + } +``` + +**Target Scores:** + +| Metric | Base | Target | +|--------|------|--------| +| Section completeness | ~60% | >90% | +| All required sections | No | Yes | + +--- + +## 4. 
End-to-End Evaluation + +### 4.1 Test Repository Suite + +Create a diverse set of test repositories: + +| Repository | Language | Size | Complexity | Purpose | +|------------|----------|------|------------|---------| +| simple-api | TypeScript | 2K LOC | Low | Baseline test | +| react-dashboard | TypeScript | 15K LOC | Medium | Frontend patterns | +| fastapi-backend | Python | 10K LOC | Medium | Backend patterns | +| microservices-demo | Go | 25K LOC | High | Distributed systems | +| monorepo-example | Mixed | 50K LOC | High | Large codebase | + +### 4.2 End-to-End Test Script + +```python +import subprocess +import time +from pathlib import Path + +def run_e2e_evaluation( + model_path: str, + test_repos: list[str], + output_dir: str +) -> dict: + """ + Run end-to-end wiki generation evaluation. + """ + results = [] + + for repo_path in test_repos: + repo_name = Path(repo_path).name + wiki_output = Path(output_dir) / repo_name + + # Time the generation + start_time = time.time() + + # Run SemanticWiki with fine-tuned model + result = subprocess.run([ + "semanticwiki", "generate", + "-r", repo_path, + "--full-local", + "--model-path", model_path, + "--output", str(wiki_output) + ], capture_output=True, text=True) + + generation_time = time.time() - start_time + + # Evaluate output + if result.returncode == 0: + wiki_quality = evaluate_wiki_output(wiki_output, repo_path) + else: + wiki_quality = {"error": result.stderr} + + results.append({ + "repo": repo_name, + "success": result.returncode == 0, + "generation_time": generation_time, + "quality": wiki_quality + }) + + return aggregate_e2e_results(results) + +def evaluate_wiki_output(wiki_path: Path, repo_path: str) -> dict: + """ + Comprehensive evaluation of generated wiki. + """ + wiki_content = load_wiki(wiki_path) + + return { + "traceability": evaluate_source_traceability( + wiki_content["full_text"], repo_path + ), + "diagrams": evaluate_mermaid_diagrams(wiki_content["full_text"]), + "completeness": evaluate_completeness(wiki_content), + "broken_links": check_broken_links(wiki_path), + "word_count": count_words(wiki_content["full_text"]), + "page_count": len(wiki_content["pages"]), + } +``` + +### 4.3 Performance Benchmarks + +| Metric | Base gpt-oss-20b | Target (Fine-tuned) | +|--------|------------------|---------------------| +| Generation time (10K LOC) | ~15 min | <10 min | +| Token efficiency | ~50K tokens | <30K tokens | +| Retry rate | ~30% | <10% | +| Success rate | ~80% | >95% | + +--- + +## 5. Human Evaluation + +### 5.1 Evaluation Rubric + +Expert reviewers rate generated documentation on: + +| Criterion | Weight | Description | +|-----------|--------|-------------| +| **Accuracy** | 25% | Technical correctness of descriptions | +| **Completeness** | 20% | Coverage of system architecture | +| **Traceability** | 20% | Quality of source code references | +| **Clarity** | 15% | Readability and organization | +| **Diagrams** | 10% | Quality of visual representations | +| **Usefulness** | 10% | Would a developer find this helpful? | + +### 5.2 Evaluation Protocol + +```markdown +## Human Evaluation Instructions + +For each generated wiki, evaluate on a scale of 1-5: + +### 1. Accuracy (1-5) +- Does the documentation correctly describe the code? +- Are technical details accurate? +- Are there any factual errors? + +### 2. Completeness (1-5) +- Are all major components documented? +- Is the architecture overview comprehensive? +- Are data flows explained? + +### 3. Traceability (1-5) +- Are source references provided? 
+- Do file:line references point to correct locations? +- Can you navigate from docs to code easily? + +### 4. Clarity (1-5) +- Is the writing clear and professional? +- Is the structure logical? +- Is technical jargon explained? + +### 5. Diagrams (1-5) +- Are diagrams relevant and accurate? +- Do they aid understanding? +- Are they properly formatted? + +### 6. Usefulness (1-5) +- Would this help a new developer onboard? +- Does it explain the "why" not just the "what"? +- Would you recommend this documentation? +``` + +### 5.3 Sample Size + +- **Minimum:** 20 wiki generations (4 reviewers × 5 repos) +- **Recommended:** 50 wiki generations (5 reviewers × 10 repos) +- **Statistical power:** Detect 0.5 point improvement with 95% confidence + +### 5.4 Inter-Rater Reliability + +Calculate Krippendorff's alpha to ensure reviewer agreement: + +```python +import krippendorff + +def calculate_inter_rater_reliability(ratings: list[list[float]]) -> float: + """ + Calculate inter-rater reliability using Krippendorff's alpha. + """ + alpha = krippendorff.alpha( + reliability_data=ratings, + level_of_measurement="interval" + ) + return alpha + +# Target: α > 0.7 (acceptable agreement) +``` + +--- + +## 6. Comparison Baselines + +### 6.1 Models to Compare + +| Model | Type | Purpose | +|-------|------|---------| +| gpt-oss-20b (base) | Open | Primary baseline | +| gpt-oss-20b (fine-tuned) | Open | Our model | +| Claude Sonnet | Proprietary | Quality ceiling | +| Qwen 2.5 Coder 14B | Open | Current SemanticWiki local | +| DeepWiki | Proprietary | Specialized baseline | + +### 6.2 Comparison Script + +```python +def compare_models( + test_repos: list[str], + models: dict[str, callable] +) -> dict: + """ + Compare multiple models on the same test set. + """ + results = {model_name: [] for model_name in models} + + for repo in test_repos: + for model_name, generate_fn in models.items(): + # Generate wiki + wiki = generate_fn(repo) + + # Evaluate + scores = { + "traceability": evaluate_source_traceability(wiki, repo), + "diagrams": evaluate_mermaid_diagrams(wiki), + "completeness": evaluate_completeness(wiki), + "standard_metrics": compute_standard_metrics([wiki], [reference]) + } + + results[model_name].append(scores) + + return aggregate_comparison(results) +``` + +### 6.3 Expected Results + +| Model | Source Refs | Diagrams | Completeness | Overall | +|-------|-------------|----------|--------------|---------| +| gpt-oss-20b (base) | 50% | 70% | 60% | 58% | +| **gpt-oss-20b (fine-tuned)** | **92%** | **95%** | **90%** | **85%** | +| Claude Sonnet | 85% | 90% | 88% | 87% | +| Qwen 2.5 Coder 14B | 65% | 75% | 70% | 68% | + +--- + +## 7. Regression Testing + +### 7.1 General Capability Tests + +Ensure fine-tuning doesn't harm general capabilities: + +```python +def test_general_capabilities(model, tokenizer) -> dict: + """ + Test that general capabilities are preserved. + """ + tests = { + "code_completion": test_code_completion(model, tokenizer), + "code_explanation": test_code_explanation(model, tokenizer), + "bug_detection": test_bug_detection(model, tokenizer), + "refactoring": test_refactoring(model, tokenizer), + } + + return tests + +def test_code_completion(model, tokenizer) -> float: + """ + Test code completion on HumanEval-style problems. 
+ """ + from human_eval import evaluate_functional_correctness + + # Generate completions + completions = generate_completions(model, tokenizer, HUMANEVAL_PROBLEMS) + + # Evaluate + results = evaluate_functional_correctness(completions) + + return results["pass@1"] +``` + +### 7.2 Regression Thresholds + +| Capability | Base Score | Min Acceptable | +|------------|------------|----------------| +| HumanEval pass@1 | 45% | >40% | +| MBPP pass@1 | 55% | >50% | +| Code explanation | 80% | >75% | + +--- + +## 8. Evaluation Pipeline + +### 8.1 Directory Structure + +``` +fine-tuning/ +├── evaluation/ +│ ├── scripts/ +│ │ ├── run_standard_metrics.py +│ │ ├── run_codewikibench.py +│ │ ├── run_task_specific.py +│ │ ├── run_e2e.py +│ │ └── run_human_eval.py +│ ├── test_repos/ +│ │ ├── simple-api/ +│ │ ├── react-dashboard/ +│ │ └── ... +│ ├── results/ +│ │ ├── base_model/ +│ │ └── finetuned_model/ +│ └── reports/ +│ └── evaluation_report.md +└── configs/ + └── eval_config.yaml +``` + +### 8.2 Full Evaluation Command + +```bash +# Run complete evaluation pipeline +python evaluation/scripts/run_all_evaluations.py \ + --base-model openai/gpt-oss-20b \ + --finetuned-model ./output/semanticwiki-gpt-oss-merged \ + --test-repos ./evaluation/test_repos \ + --output ./evaluation/results \ + --report ./evaluation/reports/evaluation_report.md +``` + +### 8.3 Evaluation Timeline + +| Phase | Duration | Dependencies | +|-------|----------|--------------| +| Standard metrics | 1-2 hours | Test set ready | +| CodeWikiBench | 2-4 hours | Benchmark setup | +| Task-specific | 2-3 hours | Test repos ready | +| End-to-end | 4-8 hours | Full pipeline | +| Human evaluation | 2-3 days | Evaluators recruited | + +--- + +## 9. Success Criteria + +### 9.1 Minimum Viable Improvement + +The fine-tuned model must demonstrate: + +| Metric | Requirement | +|--------|-------------| +| Source traceability | >85% accuracy | +| Mermaid validity | >90% | +| Wiki completeness | >85% | +| CodeWikiBench score | >65% | +| Human eval (overall) | >4.0/5.0 | +| No capability regression | >90% of base | + +### 9.2 Target Goals + +| Metric | Target | +|--------|--------| +| Source traceability | >92% accuracy | +| Mermaid validity | >95% | +| Wiki completeness | >90% | +| CodeWikiBench score | >70% | +| Human eval (overall) | >4.3/5.0 | +| Generation speed | 50% faster than base | + +--- + +## 10. References + +- [CodeWikiBench](https://github.com/FSoft-AI4Code/CodeWikiBench) +- [CodeWiki Paper](https://arxiv.org/abs/2510.24428) +- [LLM Evaluation Guide](https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation) +- [BERTScore](https://github.com/Tiiiger/bert_score) +- [HumanEval](https://github.com/openai/human-eval) +- [Krippendorff's Alpha](https://github.com/pln-fing-udelar/fast-krippendorff) diff --git a/fine-tuning/README.md b/fine-tuning/README.md new file mode 100644 index 0000000..5672db0 --- /dev/null +++ b/fine-tuning/README.md @@ -0,0 +1,312 @@ +# Fine-Tuning gpt-oss-20b for SemanticWiki + +This directory contains comprehensive planning documents for fine-tuning OpenAI's gpt-oss-20b to become a specialized architectural wiki agent for SemanticWiki in local-only mode. + +## Project Goal + +Create a fine-tuned version of gpt-oss-20b that excels at: +- Generating architectural documentation with source traceability +- Creating accurate Mermaid diagrams +- Using SemanticWiki tools effectively +- Producing complete, well-structured wiki pages + +## Quick Start + +```bash +# 1. 
Prepare dataset +python scripts/prepare_dataset.py + +# 2. Fine-tune model +python scripts/train.py --config configs/train_config.yaml + +# 3. Evaluate +python scripts/evaluate.py --model ./output/merged + +# 4. Use with SemanticWiki +semanticwiki generate -r ./your-repo --full-local --model-path ./output/model.gguf +``` + +## Plan Documents + +| Document | Description | +|----------|-------------| +| [01-DATASET-PREPARATION.md](./01-DATASET-PREPARATION.md) | Dataset collection, synthesis, and formatting | +| [02-FINE-TUNING-EXECUTION.md](./02-FINE-TUNING-EXECUTION.md) | Training procedure, hyperparameters, scripts | +| [03-EVALUATION.md](./03-EVALUATION.md) | Evaluation metrics, benchmarks, success criteria | + +--- + +## Executive Summary + +### Model Selection: gpt-oss-20b + +| Property | Value | +|----------|-------| +| Parameters | 20.9B total, 3.6B active (MoE) | +| Architecture | Mixture-of-Experts Transformer | +| Context Length | 128K tokens | +| Quantization | MXFP4 (fits in 16GB VRAM) | +| License | Apache 2.0 | + +**Why gpt-oss-20b?** +- Strong reasoning capabilities from OpenAI training +- Efficient MoE architecture for fast inference +- Runs on consumer hardware with quantization +- Apache 2.0 license allows commercial use + +### Training Data Strategy + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ Training Data Mix │ +├─────────────────────────────────────────────────────────────────┤ +│ Real Data (40%) │ +│ ├─ CodeWikiBench: 22 repos, ~8K examples │ +│ └─ DeepWiki crawl: 500+ repos, ~15K examples │ +│ │ +│ Synthetic Data (50%) │ +│ ├─ Claude-distilled: ~20K architecture docs │ +│ ├─ Self-instruct: ~5K variations │ +│ └─ Tool-use trajectories: ~5K examples │ +│ │ +│ SemanticWiki-Specific (10%) │ +│ ├─ Source traceability examples: ~3K │ +│ └─ Multi-turn refinement: ~2K │ +├─────────────────────────────────────────────────────────────────┤ +│ Total: 50,000+ high-quality examples │ +└─────────────────────────────────────────────────────────────────┘ +``` + +### Hardware Requirements + +| Configuration | VRAM | Training Time (50K examples) | Estimated Cost | +|---------------|------|------------------------------|----------------| +| H100 80GB | 80GB | 1.5-2 hours | ~$8-12 | +| A100 80GB | 80GB | 2-3 hours | ~$10-15 | +| RTX 4090 (Unsloth) | 24GB | 4-6 hours | Consumer HW | + +### Expected Improvements + +| Metric | Base gpt-oss-20b | Fine-tuned Target | +|--------|------------------|-------------------| +| Source traceability | ~50% | >90% | +| Mermaid diagram validity | ~70% | >95% | +| Wiki completeness | ~60% | >90% | +| CodeWikiBench score | ~55% | >70% | +| Tool use accuracy | ~80% | >95% | +| Generation efficiency | Baseline | 2x faster | + +--- + +## Timeline Overview + +### Phase 1: Data Preparation (3-5 days) + +| Task | Duration | Output | +|------|----------|--------| +| Download CodeWikiBench | 1 hour | Raw dataset | +| Crawl DeepWiki | 1-2 days | 15K+ examples | +| Generate synthetic data | 2-3 days | 30K+ examples | +| Quality filtering | 4-8 hours | Clean dataset | +| Format to Harmony | 1-2 hours | train.jsonl | + +**Estimated cost:** $300-600 (synthetic generation API calls) + +### Phase 2: Fine-Tuning (4-8 hours) + +| Task | Duration | Output | +|------|----------|--------| +| Environment setup | 30 min | Ready to train | +| Training run | 1.5-3 hours | LoRA weights | +| Merge weights | 15 min | Merged model | +| Convert to GGUF | 30 min | Deployable model | + +**Estimated cost:** $10-20 (cloud GPU) + +### Phase 3: Evaluation 
(2-5 days)
+
+| Task | Duration | Output |
+|------|----------|--------|
+| Automated metrics | 2-4 hours | Metric scores |
+| CodeWikiBench | 4-8 hours | Benchmark results |
+| End-to-end tests | 8-12 hours | Wiki samples |
+| Human evaluation | 2-3 days | Expert ratings |
+
+**Estimated cost:** $0-100 (compute + optional human eval)
+
+### Total Timeline
+
+- **Minimum:** 5-7 days
+- **Recommended:** 10-14 days (including iteration)
+- **Total estimated cost:** $350-750
+
+---
+
+## Key Technical Decisions
+
+### 1. LoRA vs Full Fine-Tuning
+
+**Decision:** Use LoRA (Low-Rank Adaptation)
+
+- Reduces VRAM from 300GB+ to 14-44GB
+- Preserves base model capabilities
+- Enables consumer hardware training
+- Faster training iterations
+
+### 2. Data Format: Harmony
+
+**Decision:** Use OpenAI's Harmony response format
+
+```python
+from openai_harmony import (
+    Conversation,
+    HarmonyEncodingName,
+    Message,
+    Role,
+    load_harmony_encoding,
+)
+
+# gpt-oss requires Harmony format for correct behavior
+encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)
+convo = Conversation.from_messages([
+    Message.from_role_and_content(Role.USER, "Document the auth module."),
+])
+tokens = encoding.render_conversation_for_completion(convo, Role.ASSISTANT)
+```
+
+### 3. Quantization: MXFP4 → GGUF
+
+**Decision:** Train in native format, export to GGUF
+
+- Train LoRA adapters on top of the frozen MXFP4 base (gpt-oss native)
+- Export to GGUF for llama.cpp / Ollama compatibility
+- Enables SemanticWiki local-only mode
+
+### 4. Evaluation: Multi-Method Approach
+
+**Decision:** Combine automated + human evaluation
+
+- Standard metrics (BLEU, ROUGE) as baseline
+- Task-specific metrics (source refs, diagrams) as primary
+- CodeWikiBench for standardized comparison
+- Human evaluation for quality assurance
+
+---
+
+## Integration with SemanticWiki
+
+After training, the fine-tuned model integrates seamlessly:
+
+```bash
+# Option 1: Direct GGUF
+semanticwiki generate -r ./my-repo \
+  --full-local \
+  --model-path ~/.semanticwiki/models/semanticwiki-wiki-agent.gguf
+
+# Option 2: Via Ollama
+ollama create semanticwiki-agent -f Modelfile
+semanticwiki generate -r ./my-repo \
+  --full-local --use-ollama --local-model semanticwiki-agent
+```
+
+The model will be loaded by `LocalLlamaProvider` or `OllamaProvider` in the SemanticWiki architecture:
+
+```
+CLI (--model-path)
+    ↓
+createLLMProvider() factory
+    ↓
+LocalLlamaProvider / OllamaProvider
+    ↓
+WikiAgent (uses fine-tuned model)
+```
+
+---
+
+## Success Criteria
+
+### Minimum Viable Product
+
+- [ ] Source traceability accuracy >85%
+- [ ] Mermaid diagram validity >90%
+- [ ] Wiki completeness >85%
+- [ ] No regression on general capabilities
+- [ ] Works in SemanticWiki local mode
+
+### Stretch Goals
+
+- [ ] CodeWikiBench score >70% (beat DeepWiki)
+- [ ] Human eval rating >4.3/5.0
+- [ ] Generation speed 2x faster than base
+- [ ] Support for 10+ programming languages
+
+---
+
+## References
+
+### Primary Sources
+
+- [gpt-oss-20b on HuggingFace](https://huggingface.co/openai/gpt-oss-20b)
+- [OpenAI Cookbook: Fine-tuning gpt-oss](https://cookbook.openai.com/articles/gpt-oss/fine-tune-transfomers)
+- [Harmony Response Format](https://github.com/openai/harmony)
+- [Unsloth Documentation](https://docs.unsloth.ai/models/gpt-oss-how-to-run-and-fine-tune)
+
+### Datasets
+
+- [CodeWikiBench](https://huggingface.co/datasets/anhnh2002/codewikibench)
+- [CodeWiki Paper](https://arxiv.org/abs/2510.24428)
+- [DeepWiki](https://deepwiki.org/)
+- [OpenDeepWiki](https://github.com/AIDotNet/OpenDeepWiki)
+
+### Training Resources
+
+- [TRL SFTTrainer](https://huggingface.co/docs/trl/sft_trainer)
+- [PEFT LoRA Guide](https://huggingface.co/docs/peft/conceptual_guides/lora)
+- [Synthetic Data Survey](https://arxiv.org/abs/2503.14023)
+
+### Evaluation
+
+- 
[LLM Evaluation Guide](https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation) +- [CodeWikiBench GitHub](https://github.com/FSoft-AI4Code/CodeWikiBench) + +--- + +## Directory Structure + +``` +fine-tuning/ +├── README.md # This file +├── 01-DATASET-PREPARATION.md # Dataset plan +├── 02-FINE-TUNING-EXECUTION.md # Training plan +├── 03-EVALUATION.md # Evaluation plan +├── configs/ # Configuration files +│ ├── train_config.yaml +│ ├── data_config.yaml +│ └── eval_config.yaml +├── scripts/ # Implementation scripts +│ ├── prepare_dataset.py +│ ├── train.py +│ ├── evaluate.py +│ └── convert_gguf.py +├── data/ # Training data +│ ├── raw/ +│ ├── processed/ +│ └── quality_reports/ +├── evaluation/ # Evaluation resources +│ ├── test_repos/ +│ ├── results/ +│ └── reports/ +└── output/ # Training outputs + ├── checkpoints/ + ├── merged/ + └── gguf/ +``` + +--- + +## Next Steps + +1. **Review plans** - Ensure all requirements are captured +2. **Set up environment** - Install dependencies, get GPU access +3. **Prepare data** - Run dataset preparation pipeline +4. **Train model** - Execute fine-tuning with monitoring +5. **Evaluate** - Run full evaluation suite +6. **Iterate** - Refine based on results +7. **Deploy** - Integrate with SemanticWiki + +--- + +## Contact + +For questions about this fine-tuning project, refer to the SemanticWiki documentation or open an issue in the repository. From 35764100650450be0ac6a8de28f4f4241dd08b82 Mon Sep 17 00:00:00 2001 From: Claude Date: Mon, 19 Jan 2026 22:33:33 +0000 Subject: [PATCH 2/3] Add quick-start guide for same-night fine-tuning Adds QUICK-START.md with a minimal 3-4 hour plan: - Uses only CodeWikiBench (no crawling/synthetic gen) - Single epoch LoRA training - Simple test script to verify output - Cloud GPU options for those without local hardware --- fine-tuning/QUICK-START.md | 285 +++++++++++++++++++++++++++++++++++++ fine-tuning/README.md | 15 +- 2 files changed, 299 insertions(+), 1 deletion(-) create mode 100644 fine-tuning/QUICK-START.md diff --git a/fine-tuning/QUICK-START.md b/fine-tuning/QUICK-START.md new file mode 100644 index 0000000..94e5c0c --- /dev/null +++ b/fine-tuning/QUICK-START.md @@ -0,0 +1,285 @@ +# Tonight's Plan: Quick Fine-Tune gpt-oss-20b + +A minimal, achievable plan to fine-tune gpt-oss-20b for SemanticWiki in one evening (~3-4 hours). 
+
+## Prerequisites
+
+- [ ] GPU with 24GB+ VRAM (RTX 3090/4090) OR cloud GPU access (RunPod/Lambda)
+- [ ] Python 3.10+
+- [ ] ~$5-10 for cloud GPU (if not using local)
+
+## Timeline
+
+| Phase | Time | Task |
+|-------|------|------|
+| Setup | 20 min | Install deps, download model |
+| Data | 30 min | Download CodeWikiBench, format |
+| Train | 1-2 hrs | Run LoRA fine-tuning |
+| Test | 30 min | Generate wiki, check quality |
+
+---
+
+## Step 1: Environment Setup (20 min)
+
+```bash
+# Create environment
+python -m venv venv
+source venv/bin/activate
+
+# Install dependencies
+pip install torch transformers accelerate peft trl datasets bitsandbytes
+
+# Optional: Unsloth for 2x speed (recommended)
+pip install unsloth
+```
+
+## Step 2: Download & Format Data (30 min)
+
+Create `prepare_data.py`:
+
+```python
+#!/usr/bin/env python3
+"""Quick data prep using CodeWikiBench only."""
+
+import json
+
+from datasets import load_dataset
+from transformers import AutoTokenizer
+
+# Load CodeWikiBench
+print("Downloading CodeWikiBench...")
+dataset = load_dataset("anhnh2002/codewikibench")
+
+# gpt-oss expects its native Harmony chat format; rendering through the
+# tokenizer's chat template avoids hard-coding special tokens
+tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")
+
+# Simple formatting - turn each doc into a chat-templated training example
+examples = []
+for item in dataset["train"]:
+    # Create instruction-response pairs from structured docs
+    if item.get("structured_docs"):
+        messages = [
+            {"role": "system", "content": "You are an expert software architect who creates documentation wikis with source code traceability."},
+            {"role": "user", "content": f"Generate architectural documentation for the {item['repo_name']} repository."},
+            {"role": "assistant", "content": json.dumps(item["structured_docs"], indent=2)[:8000]},
+        ]
+        examples.append({
+            "text": tokenizer.apply_chat_template(messages, tokenize=False)
+        })
+
+print(f"Created {len(examples)} examples")
+
+# Save
+with open("train_data.jsonl", "w") as f:
+    for ex in examples[:5000]:  # Limit to 5K for quick training
+        f.write(json.dumps(ex) + "\n")
+
+print("Saved to train_data.jsonl")
+```
+
+Run it:
+```bash
+python prepare_data.py
+```
+
+## Step 3: Train (1-2 hours)
+
+Create `train.py`:
+
+```python
+#!/usr/bin/env python3
+"""Quick LoRA fine-tuning of gpt-oss-20b."""
+
+from datasets import load_dataset
+from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
+from trl import SFTConfig, SFTTrainer
+import torch
+
+MODEL_ID = "openai/gpt-oss-20b"
+
+def main():
+    print("Loading tokenizer...")
+    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+    tokenizer.pad_token = tokenizer.eos_token
+
+    print("Loading model (4-bit quantized)...")
+    model = AutoModelForCausalLM.from_pretrained(
+        MODEL_ID,
+        quantization_config=BitsAndBytesConfig(
+            load_in_4bit=True,
+            bnb_4bit_quant_type="nf4",
+            bnb_4bit_compute_dtype=torch.bfloat16,
+        ),
+        device_map="auto",
+        trust_remote_code=True,
+    )
+
+    model = prepare_model_for_kbit_training(model)
+
+    print("Adding LoRA adapters...")
+    # NOTE: these module names assume a Llama-style layout; gpt-oss is an
+    # MoE model, so verify the MLP projection names against
+    # model.named_modules() and adjust before training
+    model = get_peft_model(model, LoraConfig(
+        r=16,
+        lora_alpha=32,
+        lora_dropout=0.05,
+        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
+                        "gate_proj", "up_proj", "down_proj"],
+        bias="none",
+        task_type="CAUSAL_LM",
+    ))
+    model.print_trainable_parameters()
+
+    print("Loading dataset...")
+    dataset = load_dataset("json", data_files="train_data.jsonl", split="train")
+
+    print("Starting training...")
+    trainer = SFTTrainer(
+        model=model,
+        args=SFTConfig(
+            output_dir="./output",
+            num_train_epochs=1,  # Just 1 epoch for tonight
+            per_device_train_batch_size=1,
+            gradient_accumulation_steps=4,
+            learning_rate=2e-4,
+            bf16=True,
+            logging_steps=10,
+            save_steps=500,
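+            # Optional knobs (assumptions, not part of the original recipe;
+            # uncomment if you hit OOM or want a gentler start):
+            # gradient_checkpointing=True,  # trades compute for VRAM headroom
+            # warmup_ratio=0.03,            # brief LR warmup for stability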
max_seq_length=4096,
+            dataset_text_field="text",
+        ),
+        train_dataset=dataset,
+        tokenizer=tokenizer,
+    )
+
+    trainer.train()
+    trainer.save_model("./output/final")
+    print("Done! Model saved to ./output/final")
+
+if __name__ == "__main__":
+    main()
+```
+
+Run it:
+```bash
+python train.py
+```
+
+**Expected output:**
+```
+Loading tokenizer...
+Loading model (4-bit quantized)...
+Adding LoRA adapters...
+trainable params: 50,331,648 || all params: 20,900,000,000 || trainable%: 0.24%
+Loading dataset...
+Starting training...
+{'loss': 2.1, 'step': 10}
+{'loss': 1.8, 'step': 20}
+...
+Done! Model saved to ./output/final
+```
+
+## Step 4: Quick Test (30 min)
+
+Create `test.py`:
+
+```python
+#!/usr/bin/env python3
+"""Quick test of fine-tuned model."""
+
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from peft import PeftModel
+import torch
+
+MODEL_ID = "openai/gpt-oss-20b"
+ADAPTER_PATH = "./output/final"
+
+# Load
+tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+base_model = AutoModelForCausalLM.from_pretrained(
+    MODEL_ID,
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+    trust_remote_code=True,
+)
+model = PeftModel.from_pretrained(base_model, ADAPTER_PATH)
+
+# Test prompt, rendered through the tokenizer's chat template (Harmony)
+# to match the format used in prepare_data.py
+messages = [
+    {"role": "system", "content": "You are an expert software architect who creates documentation wikis with source code traceability."},
+    {"role": "user", "content": "Generate an architecture overview for a Node.js Express API with user authentication and a PostgreSQL database."},
+]
+
+inputs = tokenizer.apply_chat_template(
+    messages, add_generation_prompt=True, return_tensors="pt"
+).to(model.device)
+# do_sample=True is required for temperature to take effect
+outputs = model.generate(inputs, max_new_tokens=1000, do_sample=True, temperature=0.7)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
+
+Run:
+```bash
+python test.py
+```
+
+---
+
+## Cloud GPU Option (If No Local GPU)
+
+### RunPod (~$2-3 for tonight)
+
+1. Go to [runpod.io](https://runpod.io)
+2. Launch an "RTX 4090" pod (~$0.44/hr)
+3. Select the PyTorch template
+4. SSH in and run the steps above
+
+### Google Colab (Free but slower)
+
+Use this notebook structure:
+```python
+# Cell 1: Install
+!pip install torch transformers accelerate peft trl datasets bitsandbytes
+
+# Cell 2: Prepare data (copy prepare_data.py)
+
+# Cell 3: Train (copy train.py, reduce to 1000 examples)
+
+# Cell 4: Test (copy test.py)
+```
+
+---
+
+## What You'll Have by Tonight
+
+1. **LoRA adapter** (`./output/final/`) - ~100MB of fine-tuned weights
+2. **Basic validation** - Model generates wiki-style documentation
+3. **Foundation to iterate** - Can improve data/training tomorrow
+
+## Tomorrow's Improvements (Optional)
+
+- [ ] Add synthetic data for source traceability (`file:line` refs)
+- [ ] Train for 3 epochs instead of 1
+- [ ] Run proper evaluation
+- [ ] Convert to GGUF for SemanticWiki integration
+
+---
+
+## Troubleshooting
+
+| Issue | Fix |
+|-------|-----|
+| OOM error | Reduce `max_seq_length` to 2048 |
+| Slow download | Checkpoint is ~13GB, use a fast connection |
+| CUDA error | Update: `pip install torch --upgrade` |
+| Import error | Install missing: `pip install einops` |
+
+## Quick Sanity Check
+
+Before training, verify setup:
+```bash
+python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, GPU: {torch.cuda.get_device_name(0)}')"
+python -c "from transformers import AutoTokenizer; t = AutoTokenizer.from_pretrained('openai/gpt-oss-20b'); print('Tokenizer OK')"
+```
+
+---
+
+That's it! ~3 hours from start to a working fine-tuned model. 
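+
+## Optional: Merge the Adapter Before Bed
+
+Tomorrow's GGUF conversion needs the LoRA weights folded into a standalone checkpoint first. A minimal sketch, assuming PEFT's documented `merge_and_unload` API and the paths used above (the `./output/merged` directory name is an assumption):
+
+```python
+#!/usr/bin/env python3
+"""Merge the LoRA adapter into the base model for later GGUF export."""
+
+import torch
+from peft import PeftModel
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+MODEL_ID = "openai/gpt-oss-20b"
+ADAPTER_PATH = "./output/final"
+
+# Merging needs an unquantized base, so load in bf16; device_map="auto"
+# spills to CPU RAM when the GPU is too small
+base = AutoModelForCausalLM.from_pretrained(
+    MODEL_ID,
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+    trust_remote_code=True,
+)
+merged = PeftModel.from_pretrained(base, ADAPTER_PATH).merge_and_unload()
+
+merged.save_pretrained("./output/merged")
+AutoTokenizer.from_pretrained(MODEL_ID).save_pretrained("./output/merged")
+print("Merged model saved to ./output/merged")
+```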
diff --git a/fine-tuning/README.md b/fine-tuning/README.md index 5672db0..2bc5d57 100644 --- a/fine-tuning/README.md +++ b/fine-tuning/README.md @@ -1,6 +1,19 @@ # Fine-Tuning gpt-oss-20b for SemanticWiki -This directory contains comprehensive planning documents for fine-tuning OpenAI's gpt-oss-20b to become a specialized architectural wiki agent for SemanticWiki in local-only mode. +This directory contains planning documents for fine-tuning OpenAI's gpt-oss-20b to become a specialized architectural wiki agent for SemanticWiki in local-only mode. + +## Quick Start (Tonight!) + +**Want results in 3-4 hours?** See [QUICK-START.md](./QUICK-START.md) for a minimal, achievable plan. + +```bash +pip install torch transformers peft trl datasets bitsandbytes +python prepare_data.py # 30 min - downloads CodeWikiBench +python train.py # 1-2 hrs - LoRA fine-tuning +python test.py # 5 min - verify it works +``` + +--- ## Project Goal From a1a4d6ec81c7e514ece3fc58a4f7a9d974b72ef2 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 20 Jan 2026 00:05:28 +0000 Subject: [PATCH 3/3] Update quick-start to reflect 1K example approach from OpenAI cookbook --- fine-tuning/QUICK-START.md | 312 ++++++++----------------------------- 1 file changed, 62 insertions(+), 250 deletions(-) diff --git a/fine-tuning/QUICK-START.md b/fine-tuning/QUICK-START.md index 94e5c0c..6ba4c59 100644 --- a/fine-tuning/QUICK-START.md +++ b/fine-tuning/QUICK-START.md @@ -1,285 +1,97 @@ -# Tonight's Plan: Quick Fine-Tune gpt-oss-20b +# Tonight's Plan: Fine-Tune gpt-oss-20b (~1 hour) -A minimal, achievable plan to fine-tune gpt-oss-20b for SemanticWiki in one evening (~3-4 hours). +Based on [OpenAI Cookbook guidance](https://cookbook.openai.com/articles/gpt-oss/fine-tune-transfomers): -## Prerequisites - -- [ ] GPU with 24GB+ VRAM (RTX 3090/4090) OR cloud GPU access (RunPod/Lambda) -- [ ] Python 3.10+ -- [ ] ~$5-10 for cloud GPU (if not using local) +> "This is a small dataset of 1,000 examples, but this is usually more than sufficient for models like openai/gpt-oss-20b which have undergone extensive post-training." ## Timeline -| Phase | Time | Task | -|-------|------|------| -| Setup | 20 min | Install deps, download model | -| Data | 30 min | Download CodeWikiBench, format | -| Train | 1-2 hrs | Run LoRA fine-tuning | -| Test | 30 min | Generate wiki, check quality | +| Step | Time | Cost | +|------|------|------| +| Setup | 5 min | Free | +| Data prep | 15 min | Free | +| Synthetic gen (optional) | 15 min | ~$2 | +| Training (3 epochs) | 20-30 min | Free/local | +| Testing | 5 min | Free | +| **Total** | **~1 hour** | **~$2** | ---- +## Use the Dedicated Repo -## Step 1: Environment Setup (20 min) +A complete toolkit has been set up at `../semanticwiki-finetune/`: ```bash -# Create environment -python -m venv venv -source venv/bin/activate - -# Install dependencies -pip install torch transformers accelerate peft trl datasets bitsandbytes - -# Optional: Unsloth for 2x speed (recommended) -pip install unsloth -``` - -## Step 2: Download & Format Data (30 min) - -Create `prepare_data.py`: +cd ../semanticwiki-finetune -```python -#!/usr/bin/env python3 -"""Quick data prep using CodeWikiBench only.""" +# 1. Setup +pip install -r requirements.txt -from datasets import load_dataset -import json +# 2. Prepare ~1,000 high-quality examples from CodeWikiBench +python scripts/prepare_data.py -# Load CodeWikiBench -print("Downloading CodeWikiBench...") -dataset = load_dataset("anhnh2002/codewikibench") +# 3. 
Optional: Add 200 synthetic examples for source traceability (~$2) +export ANTHROPIC_API_KEY=your_key +python scripts/generate_synthetic.py --num-examples 200 -# Simple formatting - just use the docs as-is -examples = [] -for item in dataset["train"]: - # Create instruction-response pairs from structured docs - if item.get("structured_docs"): - examples.append({ - "text": f"""<|im_start|>system -You are an expert software architect who creates documentation wikis with source code traceability. -<|im_end|> -<|im_start|>user -Generate architectural documentation for the {item['repo_name']} repository. -<|im_end|> -<|im_start|>assistant -{json.dumps(item['structured_docs'], indent=2)[:8000]} -<|im_end|>""" - }) +# 4. Train (3 epochs, ~25 min on RTX 4090) +python scripts/train.py -print(f"Created {len(examples)} examples") - -# Save -with open("train_data.jsonl", "w") as f: - for ex in examples[:5000]: # Limit to 5K for quick training - f.write(json.dumps(ex) + "\n") - -print("Saved to train_data.jsonl") +# 5. Test +python scripts/test.py ``` -Run it: -```bash -python prepare_data.py -``` - -## Step 3: Train (1-2 hours) - -Create `train.py`: - -```python -#!/usr/bin/env python3 -"""Quick LoRA fine-tuning of gpt-oss-20b.""" +## What's in the Repo -from datasets import load_dataset -from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig -from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training -from trl import SFTConfig, SFTTrainer -import torch - -MODEL_ID = "openai/gpt-oss-20b" - -def main(): - print("Loading tokenizer...") - tokenizer = AutoTokenizer.from_pretrained(MODEL_ID) - tokenizer.pad_token = tokenizer.eos_token - - print("Loading model (4-bit quantized)...") - model = AutoModelForCausalLM.from_pretrained( - MODEL_ID, - quantization_config=BitsAndBytesConfig( - load_in_4bit=True, - bnb_4bit_quant_type="nf4", - bnb_4bit_compute_dtype=torch.bfloat16, - ), - device_map="auto", - trust_remote_code=True, - ) - - model = prepare_model_for_kbit_training(model) - - print("Adding LoRA adapters...") - model = get_peft_model(model, LoraConfig( - r=16, - lora_alpha=32, - lora_dropout=0.05, - target_modules=["q_proj", "k_proj", "v_proj", "o_proj", - "gate_proj", "up_proj", "down_proj"], - bias="none", - task_type="CAUSAL_LM", - )) - model.print_trainable_parameters() - - print("Loading dataset...") - dataset = load_dataset("json", data_files="train_data.jsonl", split="train") - - print("Starting training...") - trainer = SFTTrainer( - model=model, - args=SFTConfig( - output_dir="./output", - num_train_epochs=1, # Just 1 epoch for tonight - per_device_train_batch_size=1, - gradient_accumulation_steps=4, - learning_rate=2e-4, - bf16=True, - logging_steps=10, - save_steps=500, - max_seq_length=4096, - dataset_text_field="text", - ), - train_dataset=dataset, - tokenizer=tokenizer, - ) - - trainer.train() - trainer.save_model("./output/final") - print("Done! Model saved to ./output/final") - -if __name__ == "__main__": - main() ``` - -Run it: -```bash -python train.py +semanticwiki-finetune/ +├── scripts/ +│ ├── prepare_data.py # CodeWikiBench → training format +│ ├── generate_synthetic.py # Claude API for targeted examples +│ ├── train.py # QLoRA fine-tuning +│ ├── test.py # Validate output quality +│ ├── merge_and_export.py # Merge weights, convert to GGUF +│ └── publish_dataset.py # Upload to HuggingFace +├── requirements.txt +└── README.md # Full documentation ``` -**Expected output:** -``` -Loading tokenizer... -Loading model (4-bit quantized)... 
-Adding LoRA adapters... -trainable params: 50,331,648 || all params: 20,900,000,000 || trainable%: 0.24% -Loading dataset... -Starting training... -{'loss': 2.1, 'step': 10} -{'loss': 1.8, 'step': 20} -... -Done! Model saved to ./output/final -``` +## Hardware -## Step 4: Quick Test (30 min) +| GPU | Time (1.2K examples, 3 epochs) | +|-----|-------------------------------| +| RTX 4090 (24GB) | ~25 min | +| RTX 3090 (24GB) | ~30 min | +| A100 (80GB) | ~15 min | +| **Cloud (RunPod)** | ~25 min, ~$0.20 | -Create `test.py`: +## Expected Results -```python -#!/usr/bin/env python3 -"""Quick test of fine-tuned model.""" +| Metric | Base | After Fine-tuning | +|--------|------|-------------------| +| Source traceability | ~50% | ~85% | +| Mermaid validity | ~70% | ~90% | +| Wiki completeness | ~60% | ~85% | -from transformers import AutoModelForCausalLM, AutoTokenizer -from peft import PeftModel -import torch +## Publish Dataset to HuggingFace -MODEL_ID = "openai/gpt-oss-20b" -ADAPTER_PATH = "./output/final" +After training, share your dataset: -# Load -tokenizer = AutoTokenizer.from_pretrained(MODEL_ID) -base_model = AutoModelForCausalLM.from_pretrained( - MODEL_ID, - torch_dtype=torch.bfloat16, - device_map="auto", - trust_remote_code=True, -) -model = PeftModel.from_pretrained(base_model, ADAPTER_PATH) - -# Test prompt -prompt = """<|im_start|>system -You are an expert software architect who creates documentation wikis with source code traceability. -<|im_end|> -<|im_start|>user -Generate an architecture overview for a Node.js Express API with user authentication and a PostgreSQL database. -<|im_end|> -<|im_start|>assistant -""" - -inputs = tokenizer(prompt, return_tensors="pt").to(model.device) -outputs = model.generate(**inputs, max_new_tokens=1000, temperature=0.7) -print(tokenizer.decode(outputs[0], skip_special_tokens=True)) -``` - -Run: ```bash -python test.py -``` - ---- - -## Cloud GPU Option (If No Local GPU) - -### RunPod (~$2-3 for tonight) - -1. Go to [runpod.io](https://runpod.io) -2. Launch "RTX 4090" template (~$0.44/hr) -3. Select PyTorch template -4. SSH in and run the steps above - -### Google Colab (Free but slower) - -Use this notebook structure: -```python -# Cell 1: Install -!pip install torch transformers accelerate peft trl datasets bitsandbytes - -# Cell 2: Prepare data (copy prepare_data.py) - -# Cell 3: Train (copy train.py, reduce to 1000 examples) - -# Cell 4: Test (copy test.py) +huggingface-cli login +python scripts/publish_dataset.py --repo-id your-username/semanticwiki-data ``` ---- - -## What You'll Have by Tonight - -1. **LoRA adapter** (`./output/final/`) - ~100MB of fine-tuned weights -2. **Basic validation** - Model generates wiki-style documentation -3. 
**Foundation to iterate** - Can improve data/training tomorrow - -## Tomorrow's Improvements (Optional) - -- [ ] Add synthetic data for source traceability (`file:line` refs) -- [ ] Train for 3 epochs instead of 1 -- [ ] Run proper evaluation -- [ ] Convert to GGUF for SemanticWiki integration +## Use with SemanticWiki ---- - -## Troubleshooting - -| Issue | Fix | -|-------|-----| -| OOM error | Reduce `max_seq_length` to 2048 | -| Slow download | Model is ~40GB, use fast connection | -| CUDA error | Update: `pip install torch --upgrade` | -| Import error | Install missing: `pip install einops` | - -## Quick Sanity Check - -Before training, verify setup: ```bash -python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, GPU: {torch.cuda.get_device_name(0)}')" -python -c "from transformers import AutoTokenizer; t = AutoTokenizer.from_pretrained('openai/gpt-oss-20b'); print('Tokenizer OK')" +# Export to GGUF +python scripts/merge_and_export.py --to-gguf --quantize q5_k_m + +# Use with SemanticWiki +semanticwiki generate -r ./your-repo \ + --full-local \ + --model-path output/semanticwiki-wiki-agent-q5_k_m.gguf ``` --- -That's it! ~3 hours from start to a working fine-tuned model. +**That's it!** ~1 hour from start to a fine-tuned wiki documentation agent.
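+
+Prefer Ollama over a raw GGUF path? A minimal Modelfile sketch also works (`FROM` and `PARAMETER` are standard Modelfile directives; the GGUF path matches the export command above, and the 8K context value is an assumption):
+
+```
+FROM ./output/semanticwiki-wiki-agent-q5_k_m.gguf
+PARAMETER temperature 0.7
+PARAMETER num_ctx 8192
+```
+
+```bash
+ollama create semanticwiki-agent -f Modelfile
+semanticwiki generate -r ./your-repo \
+  --full-local --use-ollama --local-model semanticwiki-agent
+```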