From d3fefc9b06f97cee3190fdca839ff7e402285fe3 Mon Sep 17 00:00:00 2001 From: Claude Date: Mon, 19 Jan 2026 21:38:09 +0000 Subject: [PATCH 1/3] Add comprehensive fine-tuning plan for gpt-oss-20b wiki agent MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This adds detailed planning documents for fine-tuning OpenAI's gpt-oss-20b to become a specialized architectural wiki agent for SemanticWiki local mode. Documents include: - Dataset preparation: CodeWikiBench, DeepWiki, synthetic data generation - Fine-tuning execution: LoRA config, hyperparameters, training scripts - Evaluation: automated metrics, CodeWikiBench, task-specific evals Target improvements over base model: - Source traceability: 50% → 90%+ - Mermaid diagram validity: 70% → 95%+ - Wiki completeness: 60% → 90%+ --- fine-tuning/01-DATASET-PREPARATION.md | 534 ++++++++++++++++ fine-tuning/02-FINE-TUNING-EXECUTION.md | 696 +++++++++++++++++++++ fine-tuning/03-EVALUATION.md | 792 ++++++++++++++++++++++++ fine-tuning/README.md | 312 ++++++++++ 4 files changed, 2334 insertions(+) create mode 100644 fine-tuning/01-DATASET-PREPARATION.md create mode 100644 fine-tuning/02-FINE-TUNING-EXECUTION.md create mode 100644 fine-tuning/03-EVALUATION.md create mode 100644 fine-tuning/README.md diff --git a/fine-tuning/01-DATASET-PREPARATION.md b/fine-tuning/01-DATASET-PREPARATION.md new file mode 100644 index 0000000..faf0d62 --- /dev/null +++ b/fine-tuning/01-DATASET-PREPARATION.md @@ -0,0 +1,534 @@ +# Dataset Preparation Plan + +This document outlines the strategy for preparing training data to fine-tune gpt-oss-20b as an architectural wiki agent for SemanticWiki in local-only mode. + +## Overview + +The training dataset will combine three sources: +1. **Real examples** from CodeWikiBench and DeepWiki +2. **Synthetic data** generated via LLM distillation +3. **SemanticWiki-specific examples** from existing wiki generations + +Target dataset size: **10,000-50,000 high-quality examples** + +--- + +## 1. Real Data Sources + +### 1.1 CodeWikiBench Dataset + +**Source:** [HuggingFace - anhnh2002/codewikibench](https://huggingface.co/datasets/anhnh2002/codewikibench) + +CodeWikiBench provides repository-level documentation examples across 22 open-source projects in 6 languages. 
#### Dataset Structure
```python
from datasets import load_dataset

dataset = load_dataset("anhnh2002/codewikibench")
# Each entry contains:
# - repo_name: Repository identifier
# - commit_id: Specific commit hash
# - docs_tree: Original documentation structure
# - structured_docs: Parsed documentation content
# - rubrics: Quality evaluation criteria
```

#### Extraction Strategy
```python
# Extract high-quality documentation examples
examples = []  # accumulate across all repositories

for repo in dataset['train']:
    # Extract architecture documentation
    arch_docs = extract_architecture_sections(repo['structured_docs'])

    # Extract component documentation with source refs
    component_docs = extract_component_docs(repo['structured_docs'])

    # Create training pairs: (code_context, documentation)
    for doc in arch_docs + component_docs:
        examples.append({
            "instruction": generate_instruction(doc),
            "input": extract_code_context(repo, doc),
            "output": doc['content']
        })
```

#### Languages Covered
| Language | Repositories | Examples |
|----------|-------------|----------|
| JavaScript/TypeScript | Chart.js, puppeteer, mermaid, svelte, marktext, storybook | ~3,000 |
| Python | graphrag, rasa, OpenHands | ~1,500 |
| C/C++ | electron, qmk_firmware, libsql, json, x64dbg | ~2,000 |
| C# | FluentValidation, ml-agents, git-credential-manager | ~1,000 |
| Java | logstash, trino, material-components-android | ~1,000 |

### 1.2 DeepWiki Crawled Data

**Source:** [DeepWiki](https://deepwiki.org/) - AI-generated documentation for 30,000+ GitHub repositories

#### Crawling Strategy
```python
import requests
from bs4 import BeautifulSoup

def crawl_deepwiki(repo_owner: str, repo_name: str) -> dict:
    """
    Crawl DeepWiki documentation for a repository.
    Replace 'github.com' with 'deepwiki.com' in URL.
    """
    base_url = f"https://deepwiki.com/{repo_owner}/{repo_name}"

    # Fetch main documentation
    response = requests.get(base_url)
    response.raise_for_status()  # fail fast on 4xx/5xx responses
    soup = BeautifulSoup(response.text, 'html.parser')

    return {
        "overview": extract_section(soup, "overview"),
        "architecture": extract_section(soup, "architecture"),
        "components": extract_section(soup, "components"),
        "data_flow": extract_section(soup, "data-flow")
    }

# Target repositories (popular, well-structured projects)
TARGET_REPOS = [
    ("facebook", "react"),
    ("vuejs", "vue"),
    ("microsoft", "vscode"),
    ("tensorflow", "tensorflow"),
    # ... 500+ curated repositories
]
```

#### Data Quality Filters
- Minimum 1,000 lines of code in repository
- Documentation must include architecture diagrams
- Must have source code references
- Exclude auto-generated API docs (focus on conceptual docs)

A minimal filter implementing these criteria is sketched below.
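The sketch applies the filters to one crawled entry. The metadata field names (`loc`, `has_architecture_diagrams`, `has_source_refs`, `is_api_reference`) are hypothetical placeholders; substitute whatever the crawler actually records.

```python
def passes_quality_filters(repo_meta: dict, doc: dict) -> bool:
    """
    Apply the data quality filters above to one crawled repository.
    NOTE: field names are hypothetical; adapt to the real crawl schema.
    """
    if repo_meta.get("loc", 0) < 1_000:           # minimum repository size
        return False
    if not doc.get("has_architecture_diagrams"):  # must include diagrams
        return False
    if not doc.get("has_source_refs"):            # must cite source code
        return False
    if doc.get("is_api_reference"):               # skip auto-generated API docs
        return False
    return True
```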
### 1.3 OpenDeepWiki (Open Source Alternative)

**Source:** [GitHub - AIDotNet/OpenDeepWiki](https://github.com/AIDotNet/OpenDeepWiki)

For repositories where DeepWiki access is limited, use OpenDeepWiki to generate documentation locally.

---

## 2. Synthetic Data Generation

### 2.1 Distillation from Claude/GPT-4

Use a stronger model to generate high-quality documentation examples.

#### Generation Pipeline
```python
from anthropic import Anthropic

client = Anthropic()

def generate_synthetic_example(code_files: list[str], repo_metadata: dict) -> dict:
    """
    Generate synthetic architectural documentation using Claude.
    """
    prompt = f"""
    You are an expert software architect creating documentation for a wiki.

    Repository: {repo_metadata['name']}
    Language: {repo_metadata['language']}

    Code files:
    {format_code_files(code_files)}

    Generate comprehensive architectural documentation including:
    1. System overview with file:line references
    2. Component descriptions with source traceability
    3. Data flow explanation
    4. Mermaid diagram for architecture

    Format as markdown with `file:line` references for every concept.
    """

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=8000,
        messages=[{"role": "user", "content": prompt}]
    )

    return {
        "instruction": "Generate architectural wiki documentation for this codebase",
        "input": format_code_files(code_files),
        "output": response.content[0].text
    }
```

#### Synthetic Data Categories

| Category | Description | Target Count |
|----------|-------------|--------------|
| Architecture Overview | High-level system design docs | 5,000 |
| Component Documentation | Individual module docs | 10,000 |
| Data Flow Documentation | Request/data lifecycle docs | 3,000 |
| Getting Started Guides | Onboarding documentation | 2,000 |
| Business Domain Mapping | Technical-to-business docs | 2,000 |
| Mermaid Diagram Generation | Architecture diagrams | 5,000 |
| Source Traceability Examples | `file:line` reference patterns | 3,000 |

### 2.2 Self-Instruct Method

Generate instruction-following examples by:
1. Seeding with 100 manually-crafted high-quality examples
2. Using gpt-oss-20b (base) to generate variations
3. Filtering with Claude for quality

```python
import random

def self_instruct_generation(seed_examples: list, num_generate: int = 1000):
    """
    Self-instruct style data augmentation.
    generate_instruction_variation, base_model, and quality_check are
    the helpers described above, sketched rather than implemented here.
    """
    generated = []

    for _ in range(num_generate):
        # Sample seed examples for context
        context_examples = random.sample(seed_examples, k=3)

        # Generate new instruction
        new_instruction = generate_instruction_variation(context_examples)

        # Generate response
        response = base_model.generate(new_instruction)

        # Quality filter with teacher model
        if quality_check(new_instruction, response):
            generated.append({
                "instruction": new_instruction,
                "output": response
            })

    return generated
```

### 2.3 Code-to-Documentation Pairs

Extract from existing well-documented repositories (see the converter sketch after this snippet):

```python
import glob
import os

def extract_code_doc_pairs(repo_path: str) -> list[dict]:
    """
    Extract code-documentation pairs from repositories
    with inline documentation or adjacent .md files.
    """
    pairs = []

    # Find code files with documentation
    for code_file in glob.glob(f"{repo_path}/**/*.ts", recursive=True):
        doc_file = code_file.replace('.ts', '.md')

        if os.path.exists(doc_file):
            pairs.append({
                "code": read_file(code_file),  # read_file: helper returning file contents
                "documentation": read_file(doc_file),
                "file_path": code_file
            })

    return pairs
```
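To feed these pairs into the same pipeline as the other sources, wrap them in the instruction format used throughout this plan. A minimal converter sketch follows; the instruction wording is illustrative, not prescriptive.

```python
def pairs_to_examples(pairs: list[dict]) -> list[dict]:
    """
    Convert raw code-doc pairs into instruction-tuning examples.
    The instruction template below is an illustrative placeholder.
    """
    examples = []
    for pair in pairs:
        examples.append({
            "instruction": (
                "Write architectural documentation for the module "
                f"`{pair['file_path']}`, including `file:line` source references."
            ),
            "input": pair["code"],
            "output": pair["documentation"],
        })
    return examples
```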
---

## 3. SemanticWiki-Specific Data

### 3.1 Tool Use Trajectories

Capture successful wiki generation sessions:

```python
# Format: instruction -> tool calls -> final documentation

TOOL_USE_EXAMPLE = {
    "instruction": "Generate architecture documentation for the authentication module",
    "trajectory": [
        {"tool": "search_codebase", "input": "authentication login", "output": "[results]"},
        {"tool": "read_file", "input": "src/auth/provider.ts", "output": "[code]"},
        {"tool": "analyze_code_structure", "input": "src/auth/", "output": "[analysis]"},
        {"tool": "write_wiki_page", "input": {"path": "auth/overview.md", "content": "..."}}
    ],
    "final_output": "# Authentication Module\n\n..."
}
```

### 3.2 Multi-Turn Conversations

Document iterative refinement patterns:

```python
MULTI_TURN_EXAMPLE = {
    "turns": [
        {"user": "Document the payment processing flow", "assistant": "[initial doc]"},
        {"user": "Add more detail about error handling", "assistant": "[refined doc]"},
        {"user": "Include sequence diagram", "assistant": "[doc with mermaid]"}
    ]
}
```

### 3.3 Source Traceability Training

Explicit training on `file:line` reference generation:

```python
TRACEABILITY_EXAMPLE = {
    "instruction": "Add source references to this documentation",
    "input": """
    The UserService handles user authentication by validating credentials
    against the database and generating JWT tokens.
    """,
    "output": """
    The `UserService` handles user authentication by validating credentials
    against the database ([`src/services/user.ts:45-67`](../src/services/user.ts#L45-L67))
    and generating JWT tokens ([`src/auth/jwt.ts:23-41`](../src/auth/jwt.ts#L23-L41)).
    """
}
```

---

## 4. Data Format

### 4.1 Harmony Format for gpt-oss

gpt-oss models require the [Harmony response format](https://github.com/openai/harmony). The sketch below uses the API names documented in the `openai-harmony` README (`load_harmony_encoding`, `Conversation`, `Message`, `Role`, `SystemContent`, `DeveloperContent`); verify them against the installed version. In Harmony, the "system prompt" in the usual sense belongs in the developer message, and channels apply to assistant turns. Note that `render_conversation_for_completion` appends a trailing assistant header intended for inference prompts; strip it (or use a training-oriented renderer, if your version provides one) when building SFT targets.

```python
import json

from openai_harmony import (
    Conversation,
    DeveloperContent,
    HarmonyEncodingName,
    Message,
    Role,
    SystemContent,
    load_harmony_encoding,
)

encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)

def format_for_harmony(example: dict) -> str:
    """
    Render an example in Harmony format for gpt-oss training.
    """
    messages = [
        Message.from_role_and_content(Role.SYSTEM, SystemContent.new()),
        Message.from_role_and_content(
            Role.DEVELOPER,
            DeveloperContent.new().with_instructions(WIKI_AGENT_SYSTEM_PROMPT),
        ),
        Message.from_role_and_content(Role.USER, example["instruction"]),
    ]

    # Tool-call steps become assistant turns on the commentary channel
    if "trajectory" in example:
        for step in example["trajectory"]:
            messages.append(
                Message.from_role_and_content(Role.ASSISTANT, json.dumps(step))
                .with_channel("commentary")
            )

    # Final response goes on the final channel
    messages.append(
        Message.from_role_and_content(Role.ASSISTANT, example["output"])
        .with_channel("final")
    )

    convo = Conversation.from_messages(messages)
    tokens = encoding.render_conversation_for_completion(convo, Role.ASSISTANT)
    # Decode back to text for the JSONL "text" field (verify the decode
    # method name against your installed openai-harmony version)
    return encoding.decode(tokens)
```

### 4.2 Alternative: ChatML Format (for Ollama/vLLM)

```python
def format_chatml(example: dict) -> str:
    """
    Standard ChatML format for broader compatibility.
    """
    return f"""<|im_start|>system
{WIKI_AGENT_SYSTEM_PROMPT}
<|im_end|>
<|im_start|>user
{example["instruction"]}

{example.get("input", "")}
<|im_end|>
<|im_start|>assistant
{example["output"]}
<|im_end|>"""
```

### 4.3 JSONL Output Format

Final training data format (the `text` field holds the full Harmony- or ChatML-rendered example):

```jsonl
{"text": "", "source": "codewikibench", "category": "architecture"}
{"text": "", "source": "synthetic", "category": "component"}
{"text": "", "source": "deepwiki", "category": "data_flow"}
```

---

## 5. Data Quality Assurance

### 5.1 Automated Quality Checks

```python
import re

def quality_check(example: dict) -> bool:
    """
    Validate training example quality.
+ """ + checks = [ + # Must have source references + has_source_references(example["output"]), + + # Minimum content length + len(example["output"]) >= 500, + + # Valid markdown + is_valid_markdown(example["output"]), + + # No hallucinated file paths + validate_file_references(example), + + # Proper mermaid syntax (if diagrams present) + validate_mermaid_diagrams(example["output"]), + ] + + return all(checks) + +def has_source_references(text: str) -> bool: + """Check for file:line reference patterns.""" + pattern = r'`[a-zA-Z0-9/_.-]+:\d+(-\d+)?`' + return bool(re.search(pattern, text)) +``` + +### 5.2 Human Review Sample + +- Review 5% of synthetic data manually +- Use LLM-as-judge for automated quality scoring +- Track quality metrics per data source + +### 5.3 Deduplication + +```python +from datasketch import MinHash, MinHashLSH + +def deduplicate_dataset(examples: list[dict]) -> list[dict]: + """ + Remove near-duplicate examples using MinHash LSH. + """ + lsh = MinHashLSH(threshold=0.8, num_perm=128) + unique_examples = [] + + for i, example in enumerate(examples): + minhash = compute_minhash(example["output"]) + + if not lsh.query(minhash): + lsh.insert(f"doc_{i}", minhash) + unique_examples.append(example) + + return unique_examples +``` + +--- + +## 6. Dataset Splits + +| Split | Percentage | Purpose | +|-------|------------|---------| +| Train | 90% | Fine-tuning | +| Validation | 5% | Hyperparameter tuning | +| Test | 5% | Final evaluation | + +### Stratification + +Ensure balanced representation across: +- Programming languages (TypeScript, Python, Java, C++, etc.) +- Documentation types (architecture, component, data flow, guides) +- Repository sizes (small <10K LOC, medium 10-100K, large >100K) +- Data sources (real vs synthetic) + +--- + +## 7. Data Pipeline Implementation + +### 7.1 Directory Structure + +``` +fine-tuning/ +├── data/ +│ ├── raw/ +│ │ ├── codewikibench/ +│ │ ├── deepwiki/ +│ │ └── synthetic/ +│ ├── processed/ +│ │ ├── train.jsonl +│ │ ├── validation.jsonl +│ │ └── test.jsonl +│ └── quality_reports/ +├── scripts/ +│ ├── crawl_deepwiki.py +│ ├── process_codewikibench.py +│ ├── generate_synthetic.py +│ ├── format_harmony.py +│ └── quality_check.py +└── configs/ + └── data_config.yaml +``` + +### 7.2 Pipeline Commands + +```bash +# Step 1: Download CodeWikiBench +python scripts/process_codewikibench.py --output data/raw/codewikibench/ + +# Step 2: Crawl DeepWiki (respect rate limits) +python scripts/crawl_deepwiki.py --repos repos.txt --output data/raw/deepwiki/ + +# Step 3: Generate synthetic data +python scripts/generate_synthetic.py \ + --source-repos /path/to/repos \ + --num-examples 20000 \ + --output data/raw/synthetic/ + +# Step 4: Format for Harmony +python scripts/format_harmony.py \ + --input data/raw/ \ + --output data/processed/ + +# Step 5: Quality check and split +python scripts/quality_check.py \ + --input data/processed/ \ + --output data/processed/ \ + --train-ratio 0.9 \ + --val-ratio 0.05 +``` + +--- + +## 8. Estimated Timeline & Resources + +| Phase | Duration | Compute Required | +|-------|----------|------------------| +| CodeWikiBench processing | 2-4 hours | CPU only | +| DeepWiki crawling | 1-2 days | CPU + network | +| Synthetic generation | 2-3 days | API calls (~$200-500) | +| Quality filtering | 4-8 hours | CPU/GPU for embeddings | +| Formatting & splitting | 1-2 hours | CPU only | + +**Total estimated cost:** $300-600 (primarily synthetic generation API costs) + +--- + +## 9. 
References + +- [CodeWikiBench Dataset](https://huggingface.co/datasets/anhnh2002/codewikibench) +- [CodeWiki Paper (arXiv:2510.24428)](https://arxiv.org/abs/2510.24428) +- [DeepWiki](https://deepwiki.org/) +- [OpenDeepWiki](https://github.com/AIDotNet/OpenDeepWiki) +- [Harmony Response Format](https://github.com/openai/harmony) +- [Synthetic Data Generation Survey](https://arxiv.org/abs/2503.14023) +- [LLM-Synthetic-Data Reading List](https://github.com/pengr/LLM-Synthetic-Data) diff --git a/fine-tuning/02-FINE-TUNING-EXECUTION.md b/fine-tuning/02-FINE-TUNING-EXECUTION.md new file mode 100644 index 0000000..0f41124 --- /dev/null +++ b/fine-tuning/02-FINE-TUNING-EXECUTION.md @@ -0,0 +1,696 @@ +# Fine-Tuning Execution Plan + +This document details the procedure for fine-tuning gpt-oss-20b to become a specialized architectural wiki agent for SemanticWiki. + +## Overview + +### Model Specifications + +| Property | Value | +|----------|-------| +| Base Model | [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) | +| Architecture | Mixture-of-Experts (MoE) Transformer | +| Total Parameters | 20.9B | +| Active Parameters | 3.6B per token | +| MoE Experts | 32 experts, Top-4 routing | +| Context Length | 128K tokens (native) | +| Quantization | MXFP4 (4.25 bits per parameter) | +| License | Apache 2.0 | + +### Fine-Tuning Approach + +We will use **LoRA (Low-Rank Adaptation)** for parameter-efficient fine-tuning: + +- **Why LoRA:** Reduces memory from 65GB+ to 14-16GB VRAM +- **Target:** Attention and MoE expert layers +- **Expected improvement:** Task-specific optimization without catastrophic forgetting + +--- + +## 1. Hardware Requirements + +### Recommended Configurations + +| Configuration | GPU | VRAM | Training Time (20K examples) | Cost Estimate | +|---------------|-----|------|------------------------------|---------------| +| **Optimal** | H100 SXM 80GB | 80GB | 17-20 minutes | ~$3-5/run | +| **Good** | A100 80GB | 80GB | 25-35 minutes | ~$4-6/run | +| **Acceptable** | RTX 4090 24GB | 24GB | 60-90 minutes | Consumer HW | +| **Budget** | RTX 3090 24GB | 24GB | 90-120 minutes | Consumer HW | + +### Cloud GPU Options + +```bash +# RunPod (recommended for quick experiments) +# H100 SXM: ~$3.89/hr +runpod create --gpu H100_SXM --template pytorch + +# Lambda Labs +# H100: ~$2.49/hr (when available) + +# AWS (SageMaker) +# ml.p5.xlarge (H100): ~$10.98/hr + +# Google Cloud (Vertex AI) +# a3-highgpu-1g (H100): ~$5.07/hr +``` + +### Memory Requirements by Method + +| Method | VRAM Required | Notes | +|--------|---------------|-------| +| Full Fine-tuning | 300GB+ | Multi-GPU required | +| BF16 LoRA | 44GB | Standard training | +| QLoRA (4-bit) | 14-16GB | Unsloth optimized | +| MXFP4 Native | 16GB | gpt-oss native format | + +--- + +## 2. 
Environment Setup

### 2.1 Dependencies

```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate

# Core dependencies (quote version specifiers so the shell
# does not treat '>' as an output redirect)
pip install "torch>=2.1.0" --index-url https://download.pytorch.org/whl/cu121
pip install "transformers>=4.40.0"
pip install "accelerate>=0.27.0"
pip install "peft>=0.10.0"
pip install "trl>=0.8.0"
pip install "bitsandbytes>=0.43.0"
pip install "datasets>=2.18.0"

# gpt-oss specific
pip install openai-harmony  # Harmony format support

# Optional: Unsloth for memory optimization
pip install unsloth
```

### 2.2 requirements.txt

```text
torch>=2.1.0
transformers>=4.40.0
accelerate>=0.27.0
peft>=0.10.0
trl>=0.8.0
bitsandbytes>=0.43.0
datasets>=2.18.0
openai-harmony>=1.0.0
wandb>=0.16.0
tensorboard>=2.16.0
einops>=0.7.0
flash-attn>=2.5.0
```

### 2.3 Model Download

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

# Download model (will use MXFP4 weights)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)
```

---

## 3. LoRA Configuration

### 3.1 Target Modules

gpt-oss-20b uses MoE architecture. Target both attention and expert layers:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,              # Rank (8, 16, 32, 64 common choices)
    lora_alpha=32,     # Scaling factor (typically 2x rank)
    lora_dropout=0.05, # Dropout for regularization

    # Target modules for gpt-oss MoE
    target_modules=[
        # Attention layers
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",

        # MoE expert layers (critical for task adaptation)
        "gate_proj",
        "up_proj",
        "down_proj",

        # Router (optional, for expert selection tuning)
        # "router",
    ],

    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Expected: trainable params: ~50M / 20.9B total (0.24%)
```

### 3.2 Rank Selection Guide

| LoRA Rank | Trainable Params | VRAM Impact | Use Case |
|-----------|------------------|-------------|----------|
| r=8 | ~25M | Minimal | Quick experiments |
| r=16 | ~50M | Low | **Recommended starting point** |
| r=32 | ~100M | Moderate | Complex task adaptation |
| r=64 | ~200M | Higher | Maximum expressiveness |

---

## 4. 
Training Configuration + +### 4.1 Hyperparameters + +```python +from trl import SFTConfig, SFTTrainer + +training_args = SFTConfig( + # Output + output_dir="./output/semanticwiki-gpt-oss", + run_name="semanticwiki-wiki-agent-v1", + + # Training duration + num_train_epochs=3, + max_steps=-1, # -1 = use epochs + + # Batch size + per_device_train_batch_size=1, # Keep low for long sequences + per_device_eval_batch_size=1, + gradient_accumulation_steps=8, # Effective batch = 8 + + # Learning rate + learning_rate=2e-4, # Higher for LoRA + lr_scheduler_type="cosine_with_min_lr", + lr_scheduler_kwargs={"min_lr": 1e-5}, + warmup_ratio=0.03, + + # Optimization + optim="adamw_torch_fused", + weight_decay=0.01, + max_grad_norm=1.0, + + # Precision + bf16=True, # Use bfloat16 (H100 optimal) + tf32=True, # TensorFloat-32 for matmuls + + # Sequence length + max_seq_length=8192, # Adjust based on VRAM + + # Logging + logging_steps=10, + logging_first_step=True, + report_to=["wandb", "tensorboard"], + + # Evaluation + eval_strategy="steps", + eval_steps=100, + + # Checkpointing + save_strategy="steps", + save_steps=500, + save_total_limit=3, + load_best_model_at_end=True, + metric_for_best_model="eval_loss", + + # Efficiency + gradient_checkpointing=True, + gradient_checkpointing_kwargs={"use_reentrant": False}, + + # Dataset + dataset_text_field="text", + packing=True, # Pack sequences for efficiency +) +``` + +### 4.2 Hyperparameter Tuning Ranges + +| Parameter | Range | Recommended Start | +|-----------|-------|-------------------| +| Learning Rate | 1e-5 to 5e-4 | 2e-4 | +| LoRA Rank | 8 to 64 | 16 | +| LoRA Alpha | 16 to 64 | 32 | +| Batch Size (effective) | 4 to 32 | 8 | +| Epochs | 1 to 5 | 3 | +| Warmup Ratio | 0.01 to 0.1 | 0.03 | +| Weight Decay | 0 to 0.1 | 0.01 | + +--- + +## 5. Training Script + +### 5.1 Full Training Script + +```python +#!/usr/bin/env python3 +""" +Fine-tune gpt-oss-20b for SemanticWiki architectural documentation. 
+""" + +import torch +from datasets import load_dataset +from transformers import ( + AutoModelForCausalLM, + AutoTokenizer, + BitsAndBytesConfig, +) +from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training +from trl import SFTConfig, SFTTrainer +import wandb + +# Configuration +MODEL_ID = "openai/gpt-oss-20b" +DATASET_PATH = "./data/processed/train.jsonl" +OUTPUT_DIR = "./output/semanticwiki-gpt-oss" + +def main(): + # Initialize wandb + wandb.init( + project="semanticwiki-finetuning", + name="gpt-oss-20b-wiki-agent-v1", + config={ + "model": MODEL_ID, + "lora_r": 16, + "learning_rate": 2e-4, + } + ) + + # Load tokenizer + tokenizer = AutoTokenizer.from_pretrained(MODEL_ID) + tokenizer.pad_token = tokenizer.eos_token + tokenizer.padding_side = "right" + + # Quantization config (for lower VRAM) + bnb_config = BitsAndBytesConfig( + load_in_4bit=True, + bnb_4bit_quant_type="nf4", + bnb_4bit_compute_dtype=torch.bfloat16, + bnb_4bit_use_double_quant=True, + ) + + # Load model + model = AutoModelForCausalLM.from_pretrained( + MODEL_ID, + quantization_config=bnb_config, + device_map="auto", + trust_remote_code=True, + attn_implementation="flash_attention_2", + ) + + # Prepare for training + model = prepare_model_for_kbit_training(model) + + # LoRA configuration + lora_config = LoraConfig( + r=16, + lora_alpha=32, + lora_dropout=0.05, + target_modules=[ + "q_proj", "k_proj", "v_proj", "o_proj", + "gate_proj", "up_proj", "down_proj", + ], + bias="none", + task_type="CAUSAL_LM", + ) + + model = get_peft_model(model, lora_config) + model.print_trainable_parameters() + + # Load dataset + dataset = load_dataset("json", data_files={ + "train": DATASET_PATH, + "validation": DATASET_PATH.replace("train", "validation"), + }) + + # Training arguments + training_args = SFTConfig( + output_dir=OUTPUT_DIR, + num_train_epochs=3, + per_device_train_batch_size=1, + gradient_accumulation_steps=8, + learning_rate=2e-4, + lr_scheduler_type="cosine_with_min_lr", + warmup_ratio=0.03, + bf16=True, + logging_steps=10, + eval_strategy="steps", + eval_steps=100, + save_strategy="steps", + save_steps=500, + save_total_limit=3, + load_best_model_at_end=True, + gradient_checkpointing=True, + max_seq_length=8192, + dataset_text_field="text", + packing=True, + report_to=["wandb"], + ) + + # Initialize trainer + trainer = SFTTrainer( + model=model, + args=training_args, + train_dataset=dataset["train"], + eval_dataset=dataset["validation"], + tokenizer=tokenizer, + ) + + # Train + trainer.train() + + # Save final model + trainer.save_model(f"{OUTPUT_DIR}/final") + tokenizer.save_pretrained(f"{OUTPUT_DIR}/final") + + # Merge LoRA weights (optional, for deployment) + merged_model = model.merge_and_unload() + merged_model.save_pretrained(f"{OUTPUT_DIR}/merged") + + wandb.finish() + +if __name__ == "__main__": + main() +``` + +### 5.2 Unsloth Optimized Script (Lower VRAM) + +```python +#!/usr/bin/env python3 +""" +Fine-tune gpt-oss-20b with Unsloth for 80% memory reduction. +Runs on 14GB VRAM (RTX 4070, 3090, etc.) 
+""" + +from unsloth import FastLanguageModel +from datasets import load_dataset +from trl import SFTTrainer, SFTConfig + +# Configuration +MODEL_ID = "openai/gpt-oss-20b" +MAX_SEQ_LENGTH = 8192 + +def main(): + # Load model with Unsloth (native MXFP4 support) + model, tokenizer = FastLanguageModel.from_pretrained( + model_name=MODEL_ID, + max_seq_length=MAX_SEQ_LENGTH, + dtype=None, # Auto-detect + load_in_4bit=True, + ) + + # Add LoRA adapters + model = FastLanguageModel.get_peft_model( + model, + r=16, + lora_alpha=32, + lora_dropout=0.05, + target_modules=[ + "q_proj", "k_proj", "v_proj", "o_proj", + "gate_proj", "up_proj", "down_proj", + ], + bias="none", + use_gradient_checkpointing="unsloth", # 30% more memory efficient + random_state=42, + ) + + # Load dataset + dataset = load_dataset("json", data_files={ + "train": "./data/processed/train.jsonl" + }) + + # Training config + training_args = SFTConfig( + output_dir="./output/semanticwiki-gpt-oss-unsloth", + num_train_epochs=3, + per_device_train_batch_size=2, # Can use larger batch with Unsloth + gradient_accumulation_steps=4, + learning_rate=2e-4, + warmup_ratio=0.03, + bf16=True, + logging_steps=10, + save_strategy="steps", + save_steps=500, + max_seq_length=MAX_SEQ_LENGTH, + dataset_text_field="text", + packing=True, + ) + + # Train + trainer = SFTTrainer( + model=model, + args=training_args, + train_dataset=dataset["train"], + tokenizer=tokenizer, + ) + + trainer.train() + + # Save + model.save_pretrained_merged( + "./output/semanticwiki-gpt-oss-unsloth/merged", + tokenizer, + save_method="merged_16bit", + ) + +if __name__ == "__main__": + main() +``` + +--- + +## 6. Training Monitoring + +### 6.1 Key Metrics to Track + +| Metric | Target | Warning Signs | +|--------|--------|---------------| +| Training Loss | Decreasing steadily | Spikes, plateaus early | +| Validation Loss | Decreasing, close to train | Increasing (overfitting) | +| Learning Rate | Following schedule | N/A | +| GPU Memory | <95% utilization | OOM errors | +| Throughput | Consistent tokens/sec | Degradation | + +### 6.2 Wandb Dashboard Setup + +```python +# Log custom metrics during training +def compute_metrics(eval_preds): + predictions, labels = eval_preds + + # Custom metrics for wiki quality + metrics = { + "has_source_refs": compute_source_ref_ratio(predictions), + "valid_markdown": compute_markdown_validity(predictions), + "mermaid_accuracy": compute_mermaid_accuracy(predictions), + } + + return metrics +``` + +### 6.3 Early Stopping + +```python +from transformers import EarlyStoppingCallback + +trainer = SFTTrainer( + # ... other args ... + callbacks=[ + EarlyStoppingCallback( + early_stopping_patience=3, + early_stopping_threshold=0.001, + ) + ], +) +``` + +--- + +## 7. 
Post-Training Processing + +### 7.1 Merge LoRA Weights + +```python +from peft import PeftModel + +# Load base model +base_model = AutoModelForCausalLM.from_pretrained(MODEL_ID) + +# Load LoRA adapter +model = PeftModel.from_pretrained(base_model, "./output/semanticwiki-gpt-oss/final") + +# Merge weights +merged_model = model.merge_and_unload() + +# Save merged model +merged_model.save_pretrained("./output/semanticwiki-gpt-oss-merged") +``` + +### 7.2 Convert to GGUF (for SemanticWiki local mode) + +```bash +# Clone llama.cpp +git clone https://github.com/ggerganov/llama.cpp +cd llama.cpp + +# Convert to GGUF +python convert_hf_to_gguf.py \ + ../output/semanticwiki-gpt-oss-merged \ + --outfile ../output/semanticwiki-wiki-agent.gguf \ + --outtype f16 + +# Quantize (optional, for smaller size) +./llama-quantize \ + ../output/semanticwiki-wiki-agent.gguf \ + ../output/semanticwiki-wiki-agent-q5_k_m.gguf \ + q5_k_m +``` + +### 7.3 Upload to Hub (Optional) + +```python +from huggingface_hub import HfApi + +api = HfApi() + +# Upload merged model +api.upload_folder( + folder_path="./output/semanticwiki-gpt-oss-merged", + repo_id="your-org/semanticwiki-wiki-agent", + repo_type="model", +) + +# Upload GGUF +api.upload_file( + path_or_fileobj="./output/semanticwiki-wiki-agent-q5_k_m.gguf", + path_in_repo="semanticwiki-wiki-agent-q5_k_m.gguf", + repo_id="your-org/semanticwiki-wiki-agent-gguf", + repo_type="model", +) +``` + +--- + +## 8. Integration with SemanticWiki + +### 8.1 Using Fine-Tuned Model + +After training, use the model with SemanticWiki: + +```bash +# Option 1: GGUF with local-llama-provider +semanticwiki generate -r ./my-project \ + --full-local \ + --model-path ~/.semanticwiki/models/semanticwiki-wiki-agent-q5_k_m.gguf + +# Option 2: Via Ollama +ollama create semanticwiki-agent -f Modelfile +semanticwiki generate -r ./my-project \ + --full-local --use-ollama --local-model semanticwiki-agent +``` + +### 8.2 Modelfile for Ollama + +```dockerfile +# Modelfile +FROM ./semanticwiki-wiki-agent-q5_k_m.gguf + +TEMPLATE """{{ if .System }}<|start|>system<|channel|>final<|end|> +{{ .System }}<|start|>end<|end|>{{ end }}{{ if .Prompt }}<|start|>user<|channel|>final<|end|> +{{ .Prompt }}<|start|>end<|end|>{{ end }}<|start|>assistant<|channel|>final<|end|> +{{ .Response }}<|start|>end<|end|>""" + +PARAMETER temperature 0.7 +PARAMETER top_p 0.9 +PARAMETER num_ctx 32768 +PARAMETER stop "<|start|>end<|end|>" +``` + +--- + +## 9. Training Time Estimates + +### By Dataset Size (H100 80GB) + +| Examples | Epochs | Estimated Time | Tokens Processed | +|----------|--------|----------------|------------------| +| 5,000 | 3 | ~10 minutes | ~50M | +| 10,000 | 3 | ~17 minutes | ~100M | +| 20,000 | 3 | ~30 minutes | ~200M | +| 50,000 | 3 | ~75 minutes | ~500M | + +### By Hardware (20K examples, 3 epochs) + +| GPU | Time | Cost | +|-----|------|------| +| H100 80GB | 30 min | ~$2-3 | +| A100 80GB | 45 min | ~$3-4 | +| RTX 4090 24GB | 90 min | Consumer | +| RTX 3090 24GB | 120 min | Consumer | + +--- + +## 10. 
Troubleshooting + +### Common Issues + +| Issue | Cause | Solution | +|-------|-------|----------| +| OOM Error | Batch too large | Reduce `per_device_train_batch_size`, increase `gradient_accumulation_steps` | +| Loss NaN | Learning rate too high | Reduce `learning_rate` to 1e-4 or 5e-5 | +| No improvement | Data quality issues | Review training data, check format | +| Slow training | No Flash Attention | Install `flash-attn`, use `attn_implementation="flash_attention_2"` | +| Harmony format errors | Incorrect tokenization | Use `openai-harmony` library for formatting | + +### Memory Optimization Checklist + +```python +# 1. Enable gradient checkpointing +gradient_checkpointing=True + +# 2. Use 4-bit quantization +load_in_4bit=True + +# 3. Use Unsloth (if available) +from unsloth import FastLanguageModel + +# 4. Reduce sequence length +max_seq_length=4096 # Instead of 8192 + +# 5. Use smaller LoRA rank +r=8 # Instead of 16 + +# 6. Enable CPU offloading +device_map="auto" # Offloads to CPU when needed +``` + +--- + +## 11. References + +- [gpt-oss-20b on HuggingFace](https://huggingface.co/openai/gpt-oss-20b) +- [OpenAI Cookbook: Fine-tuning gpt-oss](https://cookbook.openai.com/articles/gpt-oss/fine-tune-transfomers) +- [Harmony Response Format](https://github.com/openai/harmony) +- [Unsloth Documentation](https://docs.unsloth.ai/models/gpt-oss-how-to-run-and-fine-tune) +- [TRL SFTTrainer](https://huggingface.co/docs/trl/sft_trainer) +- [PEFT LoRA](https://huggingface.co/docs/peft/conceptual_guides/lora) +- [Analytics Vidhya: Fine-tuning gpt-oss](https://www.analyticsvidhya.com/blog/2025/10/finetuning-gpt-oss/) diff --git a/fine-tuning/03-EVALUATION.md b/fine-tuning/03-EVALUATION.md new file mode 100644 index 0000000..ea31261 --- /dev/null +++ b/fine-tuning/03-EVALUATION.md @@ -0,0 +1,792 @@ +# Evaluation Plan + +This document outlines the evaluation methodology to verify that the fine-tuned gpt-oss-20b model improves over the base model for architectural wiki generation in SemanticWiki. + +## Overview + +### Evaluation Goals + +1. **Demonstrate improvement** over base gpt-oss-20b on wiki generation tasks +2. **Measure task-specific capabilities** (source traceability, diagram generation, etc.) +3. **Ensure no regression** on general capabilities +4. **Benchmark against alternatives** (Claude, Qwen 2.5 Coder, DeepWiki) + +### Evaluation Strategy + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ Evaluation Pipeline │ +├─────────────────────────────────────────────────────────────────┤ +│ 1. Automated Metrics → BLEU, ROUGE, BERTScore, Custom │ +│ 2. CodeWikiBench → Standardized benchmark comparison │ +│ 3. Task-Specific Evals → Source refs, diagrams, tool use │ +│ 4. End-to-End Testing → Full wiki generation on real repos │ +│ 5. Human Evaluation → Expert review of generated wikis │ +└─────────────────────────────────────────────────────────────────┘ +``` + +--- + +## 1. Automated Metrics + +### 1.1 Standard NLG Metrics + +These metrics compare generated documentation against reference documentation. + +```python +from evaluate import load +import numpy as np + +# Load metrics +bleu = load("bleu") +rouge = load("rouge") +bertscore = load("bertscore") + +def compute_standard_metrics(predictions: list[str], references: list[str]) -> dict: + """ + Compute standard NLG metrics for documentation quality. 
+ """ + results = {} + + # BLEU (n-gram precision) + bleu_result = bleu.compute(predictions=predictions, references=references) + results["bleu"] = bleu_result["bleu"] + + # ROUGE (recall-oriented) + rouge_result = rouge.compute(predictions=predictions, references=references) + results["rouge1"] = rouge_result["rouge1"] + results["rouge2"] = rouge_result["rouge2"] + results["rougeL"] = rouge_result["rougeL"] + + # BERTScore (semantic similarity) + bertscore_result = bertscore.compute( + predictions=predictions, + references=references, + lang="en", + model_type="microsoft/deberta-xlarge-mnli" + ) + results["bertscore_f1"] = np.mean(bertscore_result["f1"]) + + return results +``` + +### 1.2 Target Scores + +| Metric | Base gpt-oss-20b | Target (Fine-tuned) | Improvement | +|--------|------------------|---------------------|-------------| +| BLEU | ~0.15 | >0.25 | +67% | +| ROUGE-L | ~0.35 | >0.50 | +43% | +| BERTScore F1 | ~0.70 | >0.80 | +14% | + +### 1.3 Limitations of Standard Metrics + +Standard metrics have known limitations for documentation: +- BLEU penalizes valid paraphrasing +- ROUGE doesn't capture semantic correctness +- BERTScore may miss structural quality + +**Recommendation:** Use standard metrics as a baseline, but rely more heavily on task-specific and human evaluation. + +--- + +## 2. CodeWikiBench Evaluation + +### 2.1 Benchmark Overview + +[CodeWikiBench](https://github.com/FSoft-AI4Code/CodeWikiBench) is the first benchmark for repository-level documentation quality. + +- **Repositories:** 22 projects across 6 languages +- **Rubrics:** Hierarchical quality assessment criteria +- **Baseline:** DeepWiki achieves 68.79% with proprietary models + +### 2.2 Running CodeWikiBench + +```bash +# Clone benchmark +git clone https://github.com/FSoft-AI4Code/CodeWikiBench +cd CodeWikiBench + +# Install dependencies +pip install -r requirements.txt + +# Load dataset +python -c " +from datasets import load_dataset +dataset = load_dataset('anhnh2002/codewikibench') +print(f'Loaded {len(dataset[\"train\"])} repositories') +" +``` + +### 2.3 Evaluation Script + +```python +from datasets import load_dataset +from codewikibench import evaluate_documentation + +def evaluate_on_codewikibench(model, tokenizer, num_repos: int = 5): + """ + Evaluate model on CodeWikiBench subset. + """ + dataset = load_dataset("anhnh2002/codewikibench") + + results = [] + for repo in dataset["train"][:num_repos]: + # Generate documentation + generated_docs = generate_wiki_for_repo( + model, tokenizer, + repo_name=repo["repo_name"], + commit_id=repo["commit_id"] + ) + + # Evaluate against rubrics + scores = evaluate_documentation( + generated=generated_docs, + reference=repo["structured_docs"], + rubrics=repo["rubrics"] + ) + + results.append({ + "repo": repo["repo_name"], + "scores": scores + }) + + return aggregate_results(results) +``` + +### 2.4 Target Performance + +| Model | CodeWikiBench Score | Notes | +|-------|---------------------|-------| +| DeepWiki (baseline) | 68.79% | Proprietary | +| CodeWiki (open) | 64.80% | Open-source | +| gpt-oss-20b (base) | ~55-60% | Estimated | +| **gpt-oss-20b (fine-tuned)** | **>70%** | **Target** | + +--- + +## 3. Task-Specific Evaluation + +### 3.1 Source Traceability Score + +Measures the model's ability to generate accurate `file:line` references. + +```python +import re +from pathlib import Path + +def evaluate_source_traceability( + generated_doc: str, + repo_path: str +) -> dict: + """ + Evaluate source reference accuracy. 
+ """ + # Extract file:line references + pattern = r'`([a-zA-Z0-9/_.-]+):(\d+)(?:-(\d+))?`' + references = re.findall(pattern, generated_doc) + + total_refs = len(references) + valid_refs = 0 + invalid_refs = [] + + for file_path, start_line, end_line in references: + full_path = Path(repo_path) / file_path + + if full_path.exists(): + lines = full_path.read_text().split('\n') + start = int(start_line) + end = int(end_line) if end_line else start + + # Check line numbers are valid + if 1 <= start <= len(lines) and 1 <= end <= len(lines): + valid_refs += 1 + else: + invalid_refs.append(f"{file_path}:{start_line} (line out of range)") + else: + invalid_refs.append(f"{file_path} (file not found)") + + return { + "total_references": total_refs, + "valid_references": valid_refs, + "accuracy": valid_refs / total_refs if total_refs > 0 else 0, + "invalid_refs": invalid_refs, + "has_references": total_refs > 0 + } +``` + +**Target Scores:** + +| Metric | Base | Target | +|--------|------|--------| +| Reference Accuracy | <50% | >90% | +| References per 1K words | ~2 | >10 | +| File existence accuracy | ~60% | >95% | + +### 3.2 Mermaid Diagram Quality + +Evaluate generated architecture diagrams. + +```python +import subprocess +import tempfile + +def evaluate_mermaid_diagrams(generated_doc: str) -> dict: + """ + Extract and validate Mermaid diagrams. + """ + # Extract mermaid blocks + mermaid_pattern = r'```mermaid\n(.*?)```' + diagrams = re.findall(mermaid_pattern, generated_doc, re.DOTALL) + + results = { + "total_diagrams": len(diagrams), + "valid_syntax": 0, + "renders_successfully": 0, + "diagram_types": [], + } + + for diagram in diagrams: + # Check syntax validity + if validate_mermaid_syntax(diagram): + results["valid_syntax"] += 1 + + # Check rendering + if render_mermaid(diagram): + results["renders_successfully"] += 1 + + # Identify diagram type + diagram_type = identify_diagram_type(diagram) + results["diagram_types"].append(diagram_type) + + results["syntax_accuracy"] = ( + results["valid_syntax"] / results["total_diagrams"] + if results["total_diagrams"] > 0 else 0 + ) + + return results + +def validate_mermaid_syntax(diagram: str) -> bool: + """Validate Mermaid diagram syntax using mmdc CLI.""" + try: + with tempfile.NamedTemporaryFile(mode='w', suffix='.mmd') as f: + f.write(diagram) + f.flush() + result = subprocess.run( + ['mmdc', '-i', f.name, '-o', '/dev/null', '--quiet'], + capture_output=True, + timeout=10 + ) + return result.returncode == 0 + except Exception: + return False + +def identify_diagram_type(diagram: str) -> str: + """Identify the type of Mermaid diagram.""" + first_line = diagram.strip().split('\n')[0].lower() + if 'flowchart' in first_line or 'graph' in first_line: + return 'flowchart' + elif 'sequencediagram' in first_line or 'sequence' in first_line: + return 'sequence' + elif 'classdiagram' in first_line or 'class' in first_line: + return 'class' + elif 'erdiagram' in first_line or 'er' in first_line: + return 'er' + elif 'statediagram' in first_line or 'state' in first_line: + return 'state' + else: + return 'unknown' +``` + +**Target Scores:** + +| Metric | Base | Target | +|--------|------|--------| +| Diagrams per wiki | ~0.5 | >3 | +| Syntax validity | ~70% | >95% | +| Renders successfully | ~60% | >90% | + +### 3.3 Tool Use Accuracy + +Evaluate the model's ability to use SemanticWiki tools correctly. + +```python +def evaluate_tool_use( + model_outputs: list[dict], + expected_tools: list[str] +) -> dict: + """ + Evaluate tool calling accuracy. 
+ """ + results = { + "total_tool_calls": 0, + "valid_tool_calls": 0, + "invalid_tool_calls": [], + "tools_used": set(), + "expected_tools_used": 0, + } + + for output in model_outputs: + if "tool_calls" in output: + for call in output["tool_calls"]: + results["total_tool_calls"] += 1 + results["tools_used"].add(call["name"]) + + if validate_tool_call(call): + results["valid_tool_calls"] += 1 + else: + results["invalid_tool_calls"].append(call) + + # Check expected tools were used + for expected in expected_tools: + if expected in results["tools_used"]: + results["expected_tools_used"] += 1 + + results["tool_accuracy"] = ( + results["valid_tool_calls"] / results["total_tool_calls"] + if results["total_tool_calls"] > 0 else 0 + ) + + results["expected_coverage"] = ( + results["expected_tools_used"] / len(expected_tools) + if expected_tools else 1 + ) + + return results + +EXPECTED_WIKI_TOOLS = [ + "search_codebase", + "read_file", + "analyze_code_structure", + "write_wiki_page", + "verify_wiki_completeness" +] +``` + +**Target Scores:** + +| Metric | Base | Target | +|--------|------|--------| +| Tool call validity | ~80% | >95% | +| Expected tools used | ~60% | >90% | +| Tool call efficiency | N/A | <20 calls per page | + +### 3.4 Documentation Completeness + +Check that generated wikis cover all required sections. + +```python +def evaluate_completeness(wiki_output: dict) -> dict: + """ + Evaluate wiki completeness against expected structure. + """ + expected_sections = { + "architecture_overview": False, + "component_docs": False, + "data_flow": False, + "getting_started": False, + "mermaid_diagrams": False, + "source_references": False, + "internal_links": False, + } + + # Check each section + for page in wiki_output.get("pages", []): + content = page.get("content", "") + + if "architecture" in page["path"].lower(): + expected_sections["architecture_overview"] = True + + if "component" in page["path"].lower() or "/components/" in page["path"]: + expected_sections["component_docs"] = True + + if "data" in content.lower() and "flow" in content.lower(): + expected_sections["data_flow"] = True + + if "getting started" in content.lower() or "quickstart" in content.lower(): + expected_sections["getting_started"] = True + + if "```mermaid" in content: + expected_sections["mermaid_diagrams"] = True + + if re.search(r'`[a-zA-Z0-9/_.-]+:\d+', content): + expected_sections["source_references"] = True + + if re.search(r'\[.*?\]\(\.\./.*?\.md\)', content): + expected_sections["internal_links"] = True + + completeness_score = sum(expected_sections.values()) / len(expected_sections) + + return { + "sections": expected_sections, + "completeness_score": completeness_score, + "missing": [k for k, v in expected_sections.items() if not v] + } +``` + +**Target Scores:** + +| Metric | Base | Target | +|--------|------|--------| +| Section completeness | ~60% | >90% | +| All required sections | No | Yes | + +--- + +## 4. 
End-to-End Evaluation + +### 4.1 Test Repository Suite + +Create a diverse set of test repositories: + +| Repository | Language | Size | Complexity | Purpose | +|------------|----------|------|------------|---------| +| simple-api | TypeScript | 2K LOC | Low | Baseline test | +| react-dashboard | TypeScript | 15K LOC | Medium | Frontend patterns | +| fastapi-backend | Python | 10K LOC | Medium | Backend patterns | +| microservices-demo | Go | 25K LOC | High | Distributed systems | +| monorepo-example | Mixed | 50K LOC | High | Large codebase | + +### 4.2 End-to-End Test Script + +```python +import subprocess +import time +from pathlib import Path + +def run_e2e_evaluation( + model_path: str, + test_repos: list[str], + output_dir: str +) -> dict: + """ + Run end-to-end wiki generation evaluation. + """ + results = [] + + for repo_path in test_repos: + repo_name = Path(repo_path).name + wiki_output = Path(output_dir) / repo_name + + # Time the generation + start_time = time.time() + + # Run SemanticWiki with fine-tuned model + result = subprocess.run([ + "semanticwiki", "generate", + "-r", repo_path, + "--full-local", + "--model-path", model_path, + "--output", str(wiki_output) + ], capture_output=True, text=True) + + generation_time = time.time() - start_time + + # Evaluate output + if result.returncode == 0: + wiki_quality = evaluate_wiki_output(wiki_output, repo_path) + else: + wiki_quality = {"error": result.stderr} + + results.append({ + "repo": repo_name, + "success": result.returncode == 0, + "generation_time": generation_time, + "quality": wiki_quality + }) + + return aggregate_e2e_results(results) + +def evaluate_wiki_output(wiki_path: Path, repo_path: str) -> dict: + """ + Comprehensive evaluation of generated wiki. + """ + wiki_content = load_wiki(wiki_path) + + return { + "traceability": evaluate_source_traceability( + wiki_content["full_text"], repo_path + ), + "diagrams": evaluate_mermaid_diagrams(wiki_content["full_text"]), + "completeness": evaluate_completeness(wiki_content), + "broken_links": check_broken_links(wiki_path), + "word_count": count_words(wiki_content["full_text"]), + "page_count": len(wiki_content["pages"]), + } +``` + +### 4.3 Performance Benchmarks + +| Metric | Base gpt-oss-20b | Target (Fine-tuned) | +|--------|------------------|---------------------| +| Generation time (10K LOC) | ~15 min | <10 min | +| Token efficiency | ~50K tokens | <30K tokens | +| Retry rate | ~30% | <10% | +| Success rate | ~80% | >95% | + +--- + +## 5. Human Evaluation + +### 5.1 Evaluation Rubric + +Expert reviewers rate generated documentation on: + +| Criterion | Weight | Description | +|-----------|--------|-------------| +| **Accuracy** | 25% | Technical correctness of descriptions | +| **Completeness** | 20% | Coverage of system architecture | +| **Traceability** | 20% | Quality of source code references | +| **Clarity** | 15% | Readability and organization | +| **Diagrams** | 10% | Quality of visual representations | +| **Usefulness** | 10% | Would a developer find this helpful? | + +### 5.2 Evaluation Protocol + +```markdown +## Human Evaluation Instructions + +For each generated wiki, evaluate on a scale of 1-5: + +### 1. Accuracy (1-5) +- Does the documentation correctly describe the code? +- Are technical details accurate? +- Are there any factual errors? + +### 2. Completeness (1-5) +- Are all major components documented? +- Is the architecture overview comprehensive? +- Are data flows explained? + +### 3. Traceability (1-5) +- Are source references provided? 
+- Do file:line references point to correct locations? +- Can you navigate from docs to code easily? + +### 4. Clarity (1-5) +- Is the writing clear and professional? +- Is the structure logical? +- Is technical jargon explained? + +### 5. Diagrams (1-5) +- Are diagrams relevant and accurate? +- Do they aid understanding? +- Are they properly formatted? + +### 6. Usefulness (1-5) +- Would this help a new developer onboard? +- Does it explain the "why" not just the "what"? +- Would you recommend this documentation? +``` + +### 5.3 Sample Size + +- **Minimum:** 20 wiki generations (4 reviewers × 5 repos) +- **Recommended:** 50 wiki generations (5 reviewers × 10 repos) +- **Statistical power:** Detect 0.5 point improvement with 95% confidence + +### 5.4 Inter-Rater Reliability + +Calculate Krippendorff's alpha to ensure reviewer agreement: + +```python +import krippendorff + +def calculate_inter_rater_reliability(ratings: list[list[float]]) -> float: + """ + Calculate inter-rater reliability using Krippendorff's alpha. + """ + alpha = krippendorff.alpha( + reliability_data=ratings, + level_of_measurement="interval" + ) + return alpha + +# Target: α > 0.7 (acceptable agreement) +``` + +--- + +## 6. Comparison Baselines + +### 6.1 Models to Compare + +| Model | Type | Purpose | +|-------|------|---------| +| gpt-oss-20b (base) | Open | Primary baseline | +| gpt-oss-20b (fine-tuned) | Open | Our model | +| Claude Sonnet | Proprietary | Quality ceiling | +| Qwen 2.5 Coder 14B | Open | Current SemanticWiki local | +| DeepWiki | Proprietary | Specialized baseline | + +### 6.2 Comparison Script + +```python +def compare_models( + test_repos: list[str], + models: dict[str, callable] +) -> dict: + """ + Compare multiple models on the same test set. + """ + results = {model_name: [] for model_name in models} + + for repo in test_repos: + for model_name, generate_fn in models.items(): + # Generate wiki + wiki = generate_fn(repo) + + # Evaluate + scores = { + "traceability": evaluate_source_traceability(wiki, repo), + "diagrams": evaluate_mermaid_diagrams(wiki), + "completeness": evaluate_completeness(wiki), + "standard_metrics": compute_standard_metrics([wiki], [reference]) + } + + results[model_name].append(scores) + + return aggregate_comparison(results) +``` + +### 6.3 Expected Results + +| Model | Source Refs | Diagrams | Completeness | Overall | +|-------|-------------|----------|--------------|---------| +| gpt-oss-20b (base) | 50% | 70% | 60% | 58% | +| **gpt-oss-20b (fine-tuned)** | **92%** | **95%** | **90%** | **85%** | +| Claude Sonnet | 85% | 90% | 88% | 87% | +| Qwen 2.5 Coder 14B | 65% | 75% | 70% | 68% | + +--- + +## 7. Regression Testing + +### 7.1 General Capability Tests + +Ensure fine-tuning doesn't harm general capabilities: + +```python +def test_general_capabilities(model, tokenizer) -> dict: + """ + Test that general capabilities are preserved. + """ + tests = { + "code_completion": test_code_completion(model, tokenizer), + "code_explanation": test_code_explanation(model, tokenizer), + "bug_detection": test_bug_detection(model, tokenizer), + "refactoring": test_refactoring(model, tokenizer), + } + + return tests + +def test_code_completion(model, tokenizer) -> float: + """ + Test code completion on HumanEval-style problems. 
+ """ + from human_eval import evaluate_functional_correctness + + # Generate completions + completions = generate_completions(model, tokenizer, HUMANEVAL_PROBLEMS) + + # Evaluate + results = evaluate_functional_correctness(completions) + + return results["pass@1"] +``` + +### 7.2 Regression Thresholds + +| Capability | Base Score | Min Acceptable | +|------------|------------|----------------| +| HumanEval pass@1 | 45% | >40% | +| MBPP pass@1 | 55% | >50% | +| Code explanation | 80% | >75% | + +--- + +## 8. Evaluation Pipeline + +### 8.1 Directory Structure + +``` +fine-tuning/ +├── evaluation/ +│ ├── scripts/ +│ │ ├── run_standard_metrics.py +│ │ ├── run_codewikibench.py +│ │ ├── run_task_specific.py +│ │ ├── run_e2e.py +│ │ └── run_human_eval.py +│ ├── test_repos/ +│ │ ├── simple-api/ +│ │ ├── react-dashboard/ +│ │ └── ... +│ ├── results/ +│ │ ├── base_model/ +│ │ └── finetuned_model/ +│ └── reports/ +│ └── evaluation_report.md +└── configs/ + └── eval_config.yaml +``` + +### 8.2 Full Evaluation Command + +```bash +# Run complete evaluation pipeline +python evaluation/scripts/run_all_evaluations.py \ + --base-model openai/gpt-oss-20b \ + --finetuned-model ./output/semanticwiki-gpt-oss-merged \ + --test-repos ./evaluation/test_repos \ + --output ./evaluation/results \ + --report ./evaluation/reports/evaluation_report.md +``` + +### 8.3 Evaluation Timeline + +| Phase | Duration | Dependencies | +|-------|----------|--------------| +| Standard metrics | 1-2 hours | Test set ready | +| CodeWikiBench | 2-4 hours | Benchmark setup | +| Task-specific | 2-3 hours | Test repos ready | +| End-to-end | 4-8 hours | Full pipeline | +| Human evaluation | 2-3 days | Evaluators recruited | + +--- + +## 9. Success Criteria + +### 9.1 Minimum Viable Improvement + +The fine-tuned model must demonstrate: + +| Metric | Requirement | +|--------|-------------| +| Source traceability | >85% accuracy | +| Mermaid validity | >90% | +| Wiki completeness | >85% | +| CodeWikiBench score | >65% | +| Human eval (overall) | >4.0/5.0 | +| No capability regression | >90% of base | + +### 9.2 Target Goals + +| Metric | Target | +|--------|--------| +| Source traceability | >92% accuracy | +| Mermaid validity | >95% | +| Wiki completeness | >90% | +| CodeWikiBench score | >70% | +| Human eval (overall) | >4.3/5.0 | +| Generation speed | 50% faster than base | + +--- + +## 10. References + +- [CodeWikiBench](https://github.com/FSoft-AI4Code/CodeWikiBench) +- [CodeWiki Paper](https://arxiv.org/abs/2510.24428) +- [LLM Evaluation Guide](https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation) +- [BERTScore](https://github.com/Tiiiger/bert_score) +- [HumanEval](https://github.com/openai/human-eval) +- [Krippendorff's Alpha](https://github.com/pln-fing-udelar/fast-krippendorff) diff --git a/fine-tuning/README.md b/fine-tuning/README.md new file mode 100644 index 0000000..5672db0 --- /dev/null +++ b/fine-tuning/README.md @@ -0,0 +1,312 @@ +# Fine-Tuning gpt-oss-20b for SemanticWiki + +This directory contains comprehensive planning documents for fine-tuning OpenAI's gpt-oss-20b to become a specialized architectural wiki agent for SemanticWiki in local-only mode. + +## Project Goal + +Create a fine-tuned version of gpt-oss-20b that excels at: +- Generating architectural documentation with source traceability +- Creating accurate Mermaid diagrams +- Using SemanticWiki tools effectively +- Producing complete, well-structured wiki pages + +## Quick Start + +```bash +# 1. 
Prepare dataset +python scripts/prepare_dataset.py + +# 2. Fine-tune model +python scripts/train.py --config configs/train_config.yaml + +# 3. Evaluate +python scripts/evaluate.py --model ./output/merged + +# 4. Use with SemanticWiki +semanticwiki generate -r ./your-repo --full-local --model-path ./output/model.gguf +``` + +## Plan Documents + +| Document | Description | +|----------|-------------| +| [01-DATASET-PREPARATION.md](./01-DATASET-PREPARATION.md) | Dataset collection, synthesis, and formatting | +| [02-FINE-TUNING-EXECUTION.md](./02-FINE-TUNING-EXECUTION.md) | Training procedure, hyperparameters, scripts | +| [03-EVALUATION.md](./03-EVALUATION.md) | Evaluation metrics, benchmarks, success criteria | + +--- + +## Executive Summary + +### Model Selection: gpt-oss-20b + +| Property | Value | +|----------|-------| +| Parameters | 20.9B total, 3.6B active (MoE) | +| Architecture | Mixture-of-Experts Transformer | +| Context Length | 128K tokens | +| Quantization | MXFP4 (fits in 16GB VRAM) | +| License | Apache 2.0 | + +**Why gpt-oss-20b?** +- Strong reasoning capabilities from OpenAI training +- Efficient MoE architecture for fast inference +- Runs on consumer hardware with quantization +- Apache 2.0 license allows commercial use + +### Training Data Strategy + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ Training Data Mix │ +├─────────────────────────────────────────────────────────────────┤ +│ Real Data (40%) │ +│ ├─ CodeWikiBench: 22 repos, ~8K examples │ +│ └─ DeepWiki crawl: 500+ repos, ~15K examples │ +│ │ +│ Synthetic Data (50%) │ +│ ├─ Claude-distilled: ~20K architecture docs │ +│ ├─ Self-instruct: ~5K variations │ +│ └─ Tool-use trajectories: ~5K examples │ +│ │ +│ SemanticWiki-Specific (10%) │ +│ ├─ Source traceability examples: ~3K │ +│ └─ Multi-turn refinement: ~2K │ +├─────────────────────────────────────────────────────────────────┤ +│ Total: 50,000+ high-quality examples │ +└─────────────────────────────────────────────────────────────────┘ +``` + +### Hardware Requirements + +| Configuration | VRAM | Training Time (50K examples) | Estimated Cost | +|---------------|------|------------------------------|----------------| +| H100 80GB | 80GB | 1.5-2 hours | ~$8-12 | +| A100 80GB | 80GB | 2-3 hours | ~$10-15 | +| RTX 4090 (Unsloth) | 24GB | 4-6 hours | Consumer HW | + +### Expected Improvements + +| Metric | Base gpt-oss-20b | Fine-tuned Target | +|--------|------------------|-------------------| +| Source traceability | ~50% | >90% | +| Mermaid diagram validity | ~70% | >95% | +| Wiki completeness | ~60% | >90% | +| CodeWikiBench score | ~55% | >70% | +| Tool use accuracy | ~80% | >95% | +| Generation efficiency | Baseline | 2x faster | + +--- + +## Timeline Overview + +### Phase 1: Data Preparation (3-5 days) + +| Task | Duration | Output | +|------|----------|--------| +| Download CodeWikiBench | 1 hour | Raw dataset | +| Crawl DeepWiki | 1-2 days | 15K+ examples | +| Generate synthetic data | 2-3 days | 30K+ examples | +| Quality filtering | 4-8 hours | Clean dataset | +| Format to Harmony | 1-2 hours | train.jsonl | + +**Estimated cost:** $300-600 (synthetic generation API calls) + +### Phase 2: Fine-Tuning (4-8 hours) + +| Task | Duration | Output | +|------|----------|--------| +| Environment setup | 30 min | Ready to train | +| Training run | 1.5-3 hours | LoRA weights | +| Merge weights | 15 min | Merged model | +| Convert to GGUF | 30 min | Deployable model | + +**Estimated cost:** $10-20 (cloud GPU) + +### Phase 3: Evaluation 
(2-5 days)
+
+| Task | Duration | Output |
+|------|----------|--------|
+| Automated metrics | 2-4 hours | Metric scores |
+| CodeWikiBench | 4-8 hours | Benchmark results |
+| End-to-end tests | 8-12 hours | Wiki samples |
+| Human evaluation | 2-3 days | Expert ratings |
+
+**Estimated cost:** $0-100 (compute + optional human eval)
+
+### Total Timeline
+
+- **Minimum:** 5-7 days
+- **Recommended:** 10-14 days (including iteration)
+- **Total estimated cost:** $350-750
+
+---
+
+## Key Technical Decisions
+
+### 1. LoRA vs Full Fine-Tuning
+
+**Decision:** Use LoRA (Low-Rank Adaptation)
+
+- Reduces VRAM from 300GB+ to 14-44GB
+- Preserves base model capabilities
+- Enables consumer hardware training
+- Faster training iterations
+
+### 2. Data Format: Harmony
+
+**Decision:** Use OpenAI's Harmony response format
+
+```python
+from openai_harmony import (
+    Conversation,
+    HarmonyEncodingName,
+    Message,
+    Role,
+    load_harmony_encoding,
+)
+
+# gpt-oss requires Harmony format for correct behavior
+encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)
+convo = Conversation.from_messages([
+    Message.from_role_and_content(Role.USER, "Document the auth module."),
+])
+tokens = encoding.render_conversation_for_completion(convo, Role.ASSISTANT)
+```
+
+### 3. Quantization: MXFP4 → GGUF
+
+**Decision:** Train in native format, export to GGUF
+
+- Train LoRA adapters on top of the frozen MXFP4 base (gpt-oss native)
+- Export to GGUF for llama.cpp / Ollama compatibility
+- Enables SemanticWiki local-only mode
+
+### 4. Evaluation: Multi-Method Approach
+
+**Decision:** Combine automated + human evaluation
+
+- Standard metrics (BLEU, ROUGE) as baseline
+- Task-specific metrics (source refs, diagrams) as primary
+- CodeWikiBench for standardized comparison
+- Human evaluation for quality assurance
+
+---
+
+## Integration with SemanticWiki
+
+After training, the fine-tuned model integrates seamlessly:
+
+```bash
+# Option 1: Direct GGUF
+semanticwiki generate -r ./my-repo \
+  --full-local \
+  --model-path ~/.semanticwiki/models/semanticwiki-wiki-agent.gguf
+
+# Option 2: Via Ollama
+ollama create semanticwiki-agent -f Modelfile
+semanticwiki generate -r ./my-repo \
+  --full-local --use-ollama --local-model semanticwiki-agent
+```
+
+The model will be loaded by `LocalLlamaProvider` or `OllamaProvider` in the SemanticWiki architecture:
+
+```
+CLI (--model-path)
+    ↓
+createLLMProvider() factory
+    ↓
+LocalLlamaProvider / OllamaProvider
+    ↓
+WikiAgent (uses fine-tuned model)
+```
+
+---
+
+## Success Criteria
+
+### Minimum Viable Product
+
+- [ ] Source traceability accuracy >85%
+- [ ] Mermaid diagram validity >90%
+- [ ] Wiki completeness >85%
+- [ ] No regression on general capabilities
+- [ ] Works in SemanticWiki local mode
+
+### Stretch Goals
+
+- [ ] CodeWikiBench score >70% (beat DeepWiki)
+- [ ] Human eval rating >4.3/5.0
+- [ ] Generation speed 2x faster than base
+- [ ] Support for 10+ programming languages
+
+---
+
+## References
+
+### Primary Sources
+
+- [gpt-oss-20b on HuggingFace](https://huggingface.co/openai/gpt-oss-20b)
+- [OpenAI Cookbook: Fine-tuning gpt-oss](https://cookbook.openai.com/articles/gpt-oss/fine-tune-transfomers)
+- [Harmony Response Format](https://github.com/openai/harmony)
+- [Unsloth Documentation](https://docs.unsloth.ai/models/gpt-oss-how-to-run-and-fine-tune)
+
+### Datasets
+
+- [CodeWikiBench](https://huggingface.co/datasets/anhnh2002/codewikibench)
+- [CodeWiki Paper](https://arxiv.org/abs/2510.24428)
+- [DeepWiki](https://deepwiki.org/)
+- [OpenDeepWiki](https://github.com/AIDotNet/OpenDeepWiki)
+
+### Training Resources
+
+- [TRL SFTTrainer](https://huggingface.co/docs/trl/sft_trainer)
+- [PEFT LoRA Guide](https://huggingface.co/docs/peft/conceptual_guides/lora)
+- [Synthetic Data Survey](https://arxiv.org/abs/2503.14023)
+
+### Evaluation
+
+- 
[LLM Evaluation Guide](https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation) +- [CodeWikiBench GitHub](https://github.com/FSoft-AI4Code/CodeWikiBench) + +--- + +## Directory Structure + +``` +fine-tuning/ +├── README.md # This file +├── 01-DATASET-PREPARATION.md # Dataset plan +├── 02-FINE-TUNING-EXECUTION.md # Training plan +├── 03-EVALUATION.md # Evaluation plan +├── configs/ # Configuration files +│ ├── train_config.yaml +│ ├── data_config.yaml +│ └── eval_config.yaml +├── scripts/ # Implementation scripts +│ ├── prepare_dataset.py +│ ├── train.py +│ ├── evaluate.py +│ └── convert_gguf.py +├── data/ # Training data +│ ├── raw/ +│ ├── processed/ +│ └── quality_reports/ +├── evaluation/ # Evaluation resources +│ ├── test_repos/ +│ ├── results/ +│ └── reports/ +└── output/ # Training outputs + ├── checkpoints/ + ├── merged/ + └── gguf/ +``` + +--- + +## Next Steps + +1. **Review plans** - Ensure all requirements are captured +2. **Set up environment** - Install dependencies, get GPU access +3. **Prepare data** - Run dataset preparation pipeline +4. **Train model** - Execute fine-tuning with monitoring +5. **Evaluate** - Run full evaluation suite +6. **Iterate** - Refine based on results +7. **Deploy** - Integrate with SemanticWiki + +--- + +## Contact + +For questions about this fine-tuning project, refer to the SemanticWiki documentation or open an issue in the repository. From 35764100650450be0ac6a8de28f4f4241dd08b82 Mon Sep 17 00:00:00 2001 From: Claude Date: Mon, 19 Jan 2026 22:33:33 +0000 Subject: [PATCH 2/3] Add quick-start guide for same-night fine-tuning Adds QUICK-START.md with a minimal 3-4 hour plan: - Uses only CodeWikiBench (no crawling/synthetic gen) - Single epoch LoRA training - Simple test script to verify output - Cloud GPU options for those without local hardware --- fine-tuning/QUICK-START.md | 285 +++++++++++++++++++++++++++++++++++++ fine-tuning/README.md | 15 +- 2 files changed, 299 insertions(+), 1 deletion(-) create mode 100644 fine-tuning/QUICK-START.md diff --git a/fine-tuning/QUICK-START.md b/fine-tuning/QUICK-START.md new file mode 100644 index 0000000..94e5c0c --- /dev/null +++ b/fine-tuning/QUICK-START.md @@ -0,0 +1,285 @@ +# Tonight's Plan: Quick Fine-Tune gpt-oss-20b + +A minimal, achievable plan to fine-tune gpt-oss-20b for SemanticWiki in one evening (~3-4 hours). 
+
+## Prerequisites
+
+- [ ] GPU with 24GB+ VRAM (RTX 3090/4090) OR cloud GPU access (RunPod/Lambda)
+- [ ] Python 3.10+
+- [ ] ~$5-10 for cloud GPU (if not using local)
+
+## Timeline
+
+| Phase | Time | Task |
+|-------|------|------|
+| Setup | 20 min | Install deps, download model |
+| Data | 30 min | Download CodeWikiBench, format |
+| Train | 1-2 hrs | Run LoRA fine-tuning |
+| Test | 30 min | Generate wiki, check quality |
+
+---
+
+## Step 1: Environment Setup (20 min)
+
+```bash
+# Create environment
+python -m venv venv
+source venv/bin/activate
+
+# Install dependencies
+pip install torch transformers accelerate peft trl datasets bitsandbytes
+
+# Optional: Unsloth for 2x speed (recommended)
+pip install unsloth
+```
+
+## Step 2: Download & Format Data (30 min)
+
+Create `prepare_data.py`:
+
+```python
+#!/usr/bin/env python3
+"""Quick data prep using CodeWikiBench only."""
+
+import json
+
+from datasets import load_dataset
+from transformers import AutoTokenizer
+
+# Load CodeWikiBench
+print("Downloading CodeWikiBench...")
+dataset = load_dataset("anhnh2002/codewikibench")
+
+# gpt-oss expects its native Harmony chat format; rendering through the
+# tokenizer's chat template avoids hard-coding special tokens
+tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")
+
+# Simple formatting - turn each doc into a chat-templated training example
+examples = []
+for item in dataset["train"]:
+    # Create instruction-response pairs from structured docs
+    if item.get("structured_docs"):
+        messages = [
+            {"role": "system", "content": "You are an expert software architect who creates documentation wikis with source code traceability."},
+            {"role": "user", "content": f"Generate architectural documentation for the {item['repo_name']} repository."},
+            {"role": "assistant", "content": json.dumps(item["structured_docs"], indent=2)[:8000]},
+        ]
+        examples.append({
+            "text": tokenizer.apply_chat_template(messages, tokenize=False)
+        })
+
+print(f"Created {len(examples)} examples")
+
+# Save
+with open("train_data.jsonl", "w") as f:
+    for ex in examples[:5000]:  # Limit to 5K for quick training
+        f.write(json.dumps(ex) + "\n")
+
+print("Saved to train_data.jsonl")
+```
+
+Run it:
+```bash
+python prepare_data.py
+```
+
+## Step 3: Train (1-2 hours)
+
+Create `train.py`:
+
+```python
+#!/usr/bin/env python3
+"""Quick LoRA fine-tuning of gpt-oss-20b."""
+
+from datasets import load_dataset
+from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
+from trl import SFTConfig, SFTTrainer
+import torch
+
+MODEL_ID = "openai/gpt-oss-20b"
+
+def main():
+    print("Loading tokenizer...")
+    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+    tokenizer.pad_token = tokenizer.eos_token
+
+    print("Loading model (4-bit quantized)...")
+    model = AutoModelForCausalLM.from_pretrained(
+        MODEL_ID,
+        quantization_config=BitsAndBytesConfig(
+            load_in_4bit=True,
+            bnb_4bit_quant_type="nf4",
+            bnb_4bit_compute_dtype=torch.bfloat16,
+        ),
+        device_map="auto",
+        trust_remote_code=True,
+    )
+
+    model = prepare_model_for_kbit_training(model)
+
+    print("Adding LoRA adapters...")
+    # NOTE: these module names assume a Llama-style layout; gpt-oss is an
+    # MoE model, so verify the MLP projection names against
+    # model.named_modules() and adjust before training
+    model = get_peft_model(model, LoraConfig(
+        r=16,
+        lora_alpha=32,
+        lora_dropout=0.05,
+        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
+                        "gate_proj", "up_proj", "down_proj"],
+        bias="none",
+        task_type="CAUSAL_LM",
+    ))
+    model.print_trainable_parameters()
+
+    print("Loading dataset...")
+    dataset = load_dataset("json", data_files="train_data.jsonl", split="train")
+
+    print("Starting training...")
+    trainer = SFTTrainer(
+        model=model,
+        args=SFTConfig(
+            output_dir="./output",
+            num_train_epochs=1,  # Just 1 epoch for tonight
+            per_device_train_batch_size=1,
+            gradient_accumulation_steps=4,
+            learning_rate=2e-4,
+            bf16=True,
+            logging_steps=10,
+            save_steps=500,
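+            # Optional knobs (assumptions, not part of the original recipe;
+            # uncomment if you hit OOM or want a gentler start):
+            # gradient_checkpointing=True,  # trades compute for VRAM headroom
+            # warmup_ratio=0.03,            # brief LR warmup for stability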
max_seq_length=4096,
+            dataset_text_field="text",
+        ),
+        train_dataset=dataset,
+        tokenizer=tokenizer,
+    )
+
+    trainer.train()
+    trainer.save_model("./output/final")
+    print("Done! Model saved to ./output/final")
+
+if __name__ == "__main__":
+    main()
+```
+
+Run it:
+```bash
+python train.py
+```
+
+**Expected output:**
+```
+Loading tokenizer...
+Loading model (4-bit quantized)...
+Adding LoRA adapters...
+trainable params: 50,331,648 || all params: 20,900,000,000 || trainable%: 0.24%
+Loading dataset...
+Starting training...
+{'loss': 2.1, 'step': 10}
+{'loss': 1.8, 'step': 20}
+...
+Done! Model saved to ./output/final
+```
+
+## Step 4: Quick Test (30 min)
+
+Create `test.py`:
+
+```python
+#!/usr/bin/env python3
+"""Quick test of fine-tuned model."""
+
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from peft import PeftModel
+import torch
+
+MODEL_ID = "openai/gpt-oss-20b"
+ADAPTER_PATH = "./output/final"
+
+# Load
+tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+base_model = AutoModelForCausalLM.from_pretrained(
+    MODEL_ID,
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+    trust_remote_code=True,
+)
+model = PeftModel.from_pretrained(base_model, ADAPTER_PATH)
+
+# Test prompt, rendered through the tokenizer's chat template (Harmony)
+# to match the format used in prepare_data.py
+messages = [
+    {"role": "system", "content": "You are an expert software architect who creates documentation wikis with source code traceability."},
+    {"role": "user", "content": "Generate an architecture overview for a Node.js Express API with user authentication and a PostgreSQL database."},
+]
+
+inputs = tokenizer.apply_chat_template(
+    messages, add_generation_prompt=True, return_tensors="pt"
+).to(model.device)
+# do_sample=True is required for temperature to take effect
+outputs = model.generate(inputs, max_new_tokens=1000, do_sample=True, temperature=0.7)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
+
+Run:
+```bash
+python test.py
+```
+
+---
+
+## Cloud GPU Option (If No Local GPU)
+
+### RunPod (~$2-3 for tonight)
+
+1. Go to [runpod.io](https://runpod.io)
+2. Launch an "RTX 4090" pod (~$0.44/hr)
+3. Select the PyTorch template
+4. SSH in and run the steps above
+
+### Google Colab (Free but slower)
+
+Use this notebook structure:
+```python
+# Cell 1: Install
+!pip install torch transformers accelerate peft trl datasets bitsandbytes
+
+# Cell 2: Prepare data (copy prepare_data.py)
+
+# Cell 3: Train (copy train.py, reduce to 1000 examples)
+
+# Cell 4: Test (copy test.py)
+```
+
+---
+
+## What You'll Have by Tonight
+
+1. **LoRA adapter** (`./output/final/`) - ~100MB of fine-tuned weights
+2. **Basic validation** - Model generates wiki-style documentation
+3. **Foundation to iterate** - Can improve data/training tomorrow
+
+## Tomorrow's Improvements (Optional)
+
+- [ ] Add synthetic data for source traceability (`file:line` refs)
+- [ ] Train for 3 epochs instead of 1
+- [ ] Run proper evaluation
+- [ ] Convert to GGUF for SemanticWiki integration
+
+---
+
+## Troubleshooting
+
+| Issue | Fix |
+|-------|-----|
+| OOM error | Reduce `max_seq_length` to 2048 |
+| Slow download | Checkpoint is ~13GB, use a fast connection |
+| CUDA error | Update: `pip install torch --upgrade` |
+| Import error | Install missing: `pip install einops` |
+
+## Quick Sanity Check
+
+Before training, verify setup:
+```bash
+python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, GPU: {torch.cuda.get_device_name(0)}')"
+python -c "from transformers import AutoTokenizer; t = AutoTokenizer.from_pretrained('openai/gpt-oss-20b'); print('Tokenizer OK')"
+```
+
+---
+
+That's it! ~3 hours from start to a working fine-tuned model. 
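+
+## Optional: Merge the Adapter Before Bed
+
+Tomorrow's GGUF conversion needs the LoRA weights folded into a standalone checkpoint first. A minimal sketch, assuming PEFT's documented `merge_and_unload` API and the paths used above (the `./output/merged` directory name is an assumption):
+
+```python
+#!/usr/bin/env python3
+"""Merge the LoRA adapter into the base model for later GGUF export."""
+
+import torch
+from peft import PeftModel
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+MODEL_ID = "openai/gpt-oss-20b"
+ADAPTER_PATH = "./output/final"
+
+# Merging needs an unquantized base, so load in bf16; device_map="auto"
+# spills to CPU RAM when the GPU is too small
+base = AutoModelForCausalLM.from_pretrained(
+    MODEL_ID,
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+    trust_remote_code=True,
+)
+merged = PeftModel.from_pretrained(base, ADAPTER_PATH).merge_and_unload()
+
+merged.save_pretrained("./output/merged")
+AutoTokenizer.from_pretrained(MODEL_ID).save_pretrained("./output/merged")
+print("Merged model saved to ./output/merged")
+```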
diff --git a/fine-tuning/README.md b/fine-tuning/README.md index 5672db0..2bc5d57 100644 --- a/fine-tuning/README.md +++ b/fine-tuning/README.md @@ -1,6 +1,19 @@ # Fine-Tuning gpt-oss-20b for SemanticWiki -This directory contains comprehensive planning documents for fine-tuning OpenAI's gpt-oss-20b to become a specialized architectural wiki agent for SemanticWiki in local-only mode. +This directory contains planning documents for fine-tuning OpenAI's gpt-oss-20b to become a specialized architectural wiki agent for SemanticWiki in local-only mode. + +## Quick Start (Tonight!) + +**Want results in 3-4 hours?** See [QUICK-START.md](./QUICK-START.md) for a minimal, achievable plan. + +```bash +pip install torch transformers peft trl datasets bitsandbytes +python prepare_data.py # 30 min - downloads CodeWikiBench +python train.py # 1-2 hrs - LoRA fine-tuning +python test.py # 5 min - verify it works +``` + +--- ## Project Goal From a1a4d6ec81c7e514ece3fc58a4f7a9d974b72ef2 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 20 Jan 2026 00:05:28 +0000 Subject: [PATCH 3/3] Update quick-start to reflect 1K example approach from OpenAI cookbook --- fine-tuning/QUICK-START.md | 312 ++++++++----------------------------- 1 file changed, 62 insertions(+), 250 deletions(-) diff --git a/fine-tuning/QUICK-START.md b/fine-tuning/QUICK-START.md index 94e5c0c..6ba4c59 100644 --- a/fine-tuning/QUICK-START.md +++ b/fine-tuning/QUICK-START.md @@ -1,285 +1,97 @@ -# Tonight's Plan: Quick Fine-Tune gpt-oss-20b +# Tonight's Plan: Fine-Tune gpt-oss-20b (~1 hour) -A minimal, achievable plan to fine-tune gpt-oss-20b for SemanticWiki in one evening (~3-4 hours). +Based on [OpenAI Cookbook guidance](https://cookbook.openai.com/articles/gpt-oss/fine-tune-transfomers): -## Prerequisites - -- [ ] GPU with 24GB+ VRAM (RTX 3090/4090) OR cloud GPU access (RunPod/Lambda) -- [ ] Python 3.10+ -- [ ] ~$5-10 for cloud GPU (if not using local) +> "This is a small dataset of 1,000 examples, but this is usually more than sufficient for models like openai/gpt-oss-20b which have undergone extensive post-training." ## Timeline -| Phase | Time | Task | -|-------|------|------| -| Setup | 20 min | Install deps, download model | -| Data | 30 min | Download CodeWikiBench, format | -| Train | 1-2 hrs | Run LoRA fine-tuning | -| Test | 30 min | Generate wiki, check quality | +| Step | Time | Cost | +|------|------|------| +| Setup | 5 min | Free | +| Data prep | 15 min | Free | +| Synthetic gen (optional) | 15 min | ~$2 | +| Training (3 epochs) | 20-30 min | Free/local | +| Testing | 5 min | Free | +| **Total** | **~1 hour** | **~$2** | ---- +## Use the Dedicated Repo -## Step 1: Environment Setup (20 min) +A complete toolkit has been set up at `../semanticwiki-finetune/`: ```bash -# Create environment -python -m venv venv -source venv/bin/activate - -# Install dependencies -pip install torch transformers accelerate peft trl datasets bitsandbytes - -# Optional: Unsloth for 2x speed (recommended) -pip install unsloth -``` - -## Step 2: Download & Format Data (30 min) - -Create `prepare_data.py`: +cd ../semanticwiki-finetune -```python -#!/usr/bin/env python3 -"""Quick data prep using CodeWikiBench only.""" +# 1. Setup +pip install -r requirements.txt -from datasets import load_dataset -import json +# 2. Prepare ~1,000 high-quality examples from CodeWikiBench +python scripts/prepare_data.py -# Load CodeWikiBench -print("Downloading CodeWikiBench...") -dataset = load_dataset("anhnh2002/codewikibench") +# 3. 
Optional: Add 200 synthetic examples for source traceability (~$2) +export ANTHROPIC_API_KEY=your_key +python scripts/generate_synthetic.py --num-examples 200 -# Simple formatting - just use the docs as-is -examples = [] -for item in dataset["train"]: - # Create instruction-response pairs from structured docs - if item.get("structured_docs"): - examples.append({ - "text": f"""<|im_start|>system -You are an expert software architect who creates documentation wikis with source code traceability. -<|im_end|> -<|im_start|>user -Generate architectural documentation for the {item['repo_name']} repository. -<|im_end|> -<|im_start|>assistant -{json.dumps(item['structured_docs'], indent=2)[:8000]} -<|im_end|>""" - }) +# 4. Train (3 epochs, ~25 min on RTX 4090) +python scripts/train.py -print(f"Created {len(examples)} examples") - -# Save -with open("train_data.jsonl", "w") as f: - for ex in examples[:5000]: # Limit to 5K for quick training - f.write(json.dumps(ex) + "\n") - -print("Saved to train_data.jsonl") +# 5. Test +python scripts/test.py ``` -Run it: -```bash -python prepare_data.py -``` - -## Step 3: Train (1-2 hours) - -Create `train.py`: - -```python -#!/usr/bin/env python3 -"""Quick LoRA fine-tuning of gpt-oss-20b.""" +## What's in the Repo -from datasets import load_dataset -from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig -from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training -from trl import SFTConfig, SFTTrainer -import torch - -MODEL_ID = "openai/gpt-oss-20b" - -def main(): - print("Loading tokenizer...") - tokenizer = AutoTokenizer.from_pretrained(MODEL_ID) - tokenizer.pad_token = tokenizer.eos_token - - print("Loading model (4-bit quantized)...") - model = AutoModelForCausalLM.from_pretrained( - MODEL_ID, - quantization_config=BitsAndBytesConfig( - load_in_4bit=True, - bnb_4bit_quant_type="nf4", - bnb_4bit_compute_dtype=torch.bfloat16, - ), - device_map="auto", - trust_remote_code=True, - ) - - model = prepare_model_for_kbit_training(model) - - print("Adding LoRA adapters...") - model = get_peft_model(model, LoraConfig( - r=16, - lora_alpha=32, - lora_dropout=0.05, - target_modules=["q_proj", "k_proj", "v_proj", "o_proj", - "gate_proj", "up_proj", "down_proj"], - bias="none", - task_type="CAUSAL_LM", - )) - model.print_trainable_parameters() - - print("Loading dataset...") - dataset = load_dataset("json", data_files="train_data.jsonl", split="train") - - print("Starting training...") - trainer = SFTTrainer( - model=model, - args=SFTConfig( - output_dir="./output", - num_train_epochs=1, # Just 1 epoch for tonight - per_device_train_batch_size=1, - gradient_accumulation_steps=4, - learning_rate=2e-4, - bf16=True, - logging_steps=10, - save_steps=500, - max_seq_length=4096, - dataset_text_field="text", - ), - train_dataset=dataset, - tokenizer=tokenizer, - ) - - trainer.train() - trainer.save_model("./output/final") - print("Done! Model saved to ./output/final") - -if __name__ == "__main__": - main() ``` - -Run it: -```bash -python train.py +semanticwiki-finetune/ +├── scripts/ +│ ├── prepare_data.py # CodeWikiBench → training format +│ ├── generate_synthetic.py # Claude API for targeted examples +│ ├── train.py # QLoRA fine-tuning +│ ├── test.py # Validate output quality +│ ├── merge_and_export.py # Merge weights, convert to GGUF +│ └── publish_dataset.py # Upload to HuggingFace +├── requirements.txt +└── README.md # Full documentation ``` -**Expected output:** -``` -Loading tokenizer... -Loading model (4-bit quantized)... 
-Adding LoRA adapters... -trainable params: 50,331,648 || all params: 20,900,000,000 || trainable%: 0.24% -Loading dataset... -Starting training... -{'loss': 2.1, 'step': 10} -{'loss': 1.8, 'step': 20} -... -Done! Model saved to ./output/final -``` +## Hardware -## Step 4: Quick Test (30 min) +| GPU | Time (1.2K examples, 3 epochs) | +|-----|-------------------------------| +| RTX 4090 (24GB) | ~25 min | +| RTX 3090 (24GB) | ~30 min | +| A100 (80GB) | ~15 min | +| **Cloud (RunPod)** | ~25 min, ~$0.20 | -Create `test.py`: +## Expected Results -```python -#!/usr/bin/env python3 -"""Quick test of fine-tuned model.""" +| Metric | Base | After Fine-tuning | +|--------|------|-------------------| +| Source traceability | ~50% | ~85% | +| Mermaid validity | ~70% | ~90% | +| Wiki completeness | ~60% | ~85% | -from transformers import AutoModelForCausalLM, AutoTokenizer -from peft import PeftModel -import torch +## Publish Dataset to HuggingFace -MODEL_ID = "openai/gpt-oss-20b" -ADAPTER_PATH = "./output/final" +After training, share your dataset: -# Load -tokenizer = AutoTokenizer.from_pretrained(MODEL_ID) -base_model = AutoModelForCausalLM.from_pretrained( - MODEL_ID, - torch_dtype=torch.bfloat16, - device_map="auto", - trust_remote_code=True, -) -model = PeftModel.from_pretrained(base_model, ADAPTER_PATH) - -# Test prompt -prompt = """<|im_start|>system -You are an expert software architect who creates documentation wikis with source code traceability. -<|im_end|> -<|im_start|>user -Generate an architecture overview for a Node.js Express API with user authentication and a PostgreSQL database. -<|im_end|> -<|im_start|>assistant -""" - -inputs = tokenizer(prompt, return_tensors="pt").to(model.device) -outputs = model.generate(**inputs, max_new_tokens=1000, temperature=0.7) -print(tokenizer.decode(outputs[0], skip_special_tokens=True)) -``` - -Run: ```bash -python test.py -``` - ---- - -## Cloud GPU Option (If No Local GPU) - -### RunPod (~$2-3 for tonight) - -1. Go to [runpod.io](https://runpod.io) -2. Launch "RTX 4090" template (~$0.44/hr) -3. Select PyTorch template -4. SSH in and run the steps above - -### Google Colab (Free but slower) - -Use this notebook structure: -```python -# Cell 1: Install -!pip install torch transformers accelerate peft trl datasets bitsandbytes - -# Cell 2: Prepare data (copy prepare_data.py) - -# Cell 3: Train (copy train.py, reduce to 1000 examples) - -# Cell 4: Test (copy test.py) +huggingface-cli login +python scripts/publish_dataset.py --repo-id your-username/semanticwiki-data ``` ---- - -## What You'll Have by Tonight - -1. **LoRA adapter** (`./output/final/`) - ~100MB of fine-tuned weights -2. **Basic validation** - Model generates wiki-style documentation -3. 
**Foundation to iterate** - Can improve data/training tomorrow - -## Tomorrow's Improvements (Optional) - -- [ ] Add synthetic data for source traceability (`file:line` refs) -- [ ] Train for 3 epochs instead of 1 -- [ ] Run proper evaluation -- [ ] Convert to GGUF for SemanticWiki integration +## Use with SemanticWiki ---- - -## Troubleshooting - -| Issue | Fix | -|-------|-----| -| OOM error | Reduce `max_seq_length` to 2048 | -| Slow download | Model is ~40GB, use fast connection | -| CUDA error | Update: `pip install torch --upgrade` | -| Import error | Install missing: `pip install einops` | - -## Quick Sanity Check - -Before training, verify setup: ```bash -python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, GPU: {torch.cuda.get_device_name(0)}')" -python -c "from transformers import AutoTokenizer; t = AutoTokenizer.from_pretrained('openai/gpt-oss-20b'); print('Tokenizer OK')" +# Export to GGUF +python scripts/merge_and_export.py --to-gguf --quantize q5_k_m + +# Use with SemanticWiki +semanticwiki generate -r ./your-repo \ + --full-local \ + --model-path output/semanticwiki-wiki-agent-q5_k_m.gguf ``` --- -That's it! ~3 hours from start to a working fine-tuned model. +**That's it!** ~1 hour from start to a fine-tuned wiki documentation agent.
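+
+Prefer Ollama over a raw GGUF path? A minimal Modelfile sketch also works (`FROM` and `PARAMETER` are standard Modelfile directives; the GGUF path matches the export command above, and the 8K context value is an assumption):
+
+```
+FROM ./output/semanticwiki-wiki-agent-q5_k_m.gguf
+PARAMETER temperature 0.7
+PARAMETER num_ctx 8192
+```
+
+```bash
+ollama create semanticwiki-agent -f Modelfile
+semanticwiki generate -r ./your-repo \
+  --full-local --use-ollama --local-model semanticwiki-agent
+```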