\", add_special_tokens=False)) > 0\n",
+ " else None\n",
+ " ),\n",
+ " )\n",
+ "```\n",
+ "\n",
+ "This configuration specifies important generation parameters:\n",
+ "\n",
+ "- **temperature**: Controls randomness in generation. Higher values (e.g., 1.0) produce more diverse outputs, while lower values (e.g., 0.2) make responses more deterministic.\n",
+ "- **top_p**: Controls nucleus sampling: only the smallest set of tokens whose cumulative probability reaches top_p is considered. This helps focus the generation while maintaining diversity.\n",
+ "- **max_new_tokens**: Sets the maximum length of the generated response.\n",
+ "- **suppress_tokens**: Prevents specific token IDs (here, the thinking tags) from being generated in the output.\n",
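+ "\n",
+ "To see how these knobs interact in practice, you can pass different values when calling the `generate_response` helper (used in the testing section below); the exact numbers here are illustrative, not tuned recommendations:\n",
+ "\n",
+ "```python\n",
+ "# More deterministic, shorter answers (good for factual lookups)\n",
+ "generate_response(model, tokenizer, \"What is Bullet Echo?\", temperature=0.2, top_p=0.8, max_new_tokens=128)\n",
+ "\n",
+ "# More varied, longer answers (good for open-ended strategy questions)\n",
+ "generate_response(model, tokenizer, \"What is Bullet Echo?\", temperature=1.0, top_p=0.95, max_new_tokens=256)\n",
+ "```\n",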
+ "\n",
+ "### Implementing thinking tag filtering\n",
+ "\n",
+ "Qwen 3 models sometimes wrap their reasoning in `<think>...</think>` tags to show their thought process. While this is useful for complex problems, we may want to hide this internal reasoning in our final output:\n",
+ "\n",
+ "```python\n",
+ "# Custom text filtering function\n",
+ "def filter_thinking(text):\n",
+ "    # Remove anything between <think> and </think> tags\n",
+ "    text = re.sub(r\"<think>.*?</think>\", \"\", text, flags=re.DOTALL)\n",
+ "    # Remove any remaining <think> or </think> tags\n",
+ "    text = re.sub(r\"<think>|</think>\", \"\", text)\n",
+ "    return text\n",
+ "\n",
+ "# Custom streamer class to filter thinking tags\n",
+ "class FilteredTextStreamer(TextStreamer):\n",
+ "    def on_finalized_text(self, text: str, stream_end: bool = False):\n",
+ "        filtered_text = filter_thinking(text)\n",
+ "        if filtered_text.strip():  # Only print non-empty text\n",
+ "            print(filtered_text, end=\"\", flush=True)\n",
+ "```\n",
+ "\n",
+ "This custom streamer filters out the thinking process in real-time during generation, providing a cleaner user experience.\n",
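+ "\n",
+ "For reference, the streamer is hooked into generation roughly like this (a condensed sketch of the `generate_response` helper's internals; it assumes the `generation_config` built above):\n",
+ "\n",
+ "```python\n",
+ "# Sketch: wire the filtered streamer into model.generate\n",
+ "streamer = FilteredTextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)\n",
+ "\n",
+ "inputs = tokenizer.apply_chat_template(\n",
+ "    [{\"role\": \"user\", \"content\": \"What is Bullet Echo?\"}],\n",
+ "    add_generation_prompt=True,\n",
+ "    return_tensors=\"pt\",\n",
+ ").to(\"cuda\")\n",
+ "\n",
+ "model.generate(\n",
+ "    inputs,\n",
+ "    generation_config=generation_config,  # built earlier with temperature, top_p, etc.\n",
+ "    streamer=streamer,  # filtered text is printed as tokens stream in\n",
+ ")\n",
+ "```\n",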
+ "\n",
+ "### Testing with sample queries\n",
+ "\n",
+ "With our generation function set up, we can test the model with some Bullet Echo game questions:\n",
+ "\n",
+ "```python\n",
+ "# Test the model with sample queries\n",
+ "print(\"\\n--- Testing Model Responses ---\")\n",
+ "\n",
+ "test_queries = [\n",
+ " \"What's the best strategy for Cyclops in Bullet Echo?\",\n",
+ " \"How does the Stalker's invisibility work in the game?\",\n",
+ " \"Which heroes are effective against Bastion in Bullet Echo?\",\n",
+ "]\n",
+ "\n",
+ "for query in test_queries:\n",
+ " generate_response(model, tokenizer, query)\n",
+ "```\n",
+ "\n",
+ "Sample outputs might look like:\n",
+ "\n",
+ "```\n",
+ "User: What's the best strategy for Cyclops in Bullet Echo?\n",
+ "Assistant: The best strategy for Cyclops is to stay hidden, as this is his greatest strength. He excels in ambush tactics, allowing him to surprise enemies and maximize his effectiveness in stealthy encounters.\n",
+ "\n",
+ "User: How does the Stalker's invisibility work in the game?\n",
+ "Assistant: The Stalker uses a special ability called invisibility, which makes the character temporarily undetectable by opponents. This is often used for stealth movements and surprise attacks.\n",
+ "```\n",
+ "\n",
+ "### Adapting inference for different applications\n",
+ "\n",
+ "When implementing inference for your own projects, consider these variations:\n",
+ "\n",
+ "**For interactive applications:**\n",
+ "- Use streaming generation to show responses as they're generated\n",
+ "- Consider lower temperature values (0.3-0.5) for more deterministic answers\n",
+ "- Implement a maximum length cutoff appropriate to your UI\n",
+ "\n",
+ "**For batch processing:**\n",
+ "- Disable streaming for faster processing\n",
+ "- Consider using higher batch sizes if memory allows\n",
+ "- Store complete outputs rather than printing them\n",
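+ "\n",
+ "A minimal non-streaming sketch for batch use might look like this (it processes queries one at a time for simplicity; true batching would pad and stack the prompts, and it reuses the `filter_thinking` helper defined earlier):\n",
+ "\n",
+ "```python\n",
+ "# Sketch: batch inference without streaming, storing outputs instead of printing\n",
+ "def generate_batch(model, tokenizer, queries, max_new_tokens=256):\n",
+ "    results = []\n",
+ "    for query in queries:\n",
+ "        inputs = tokenizer.apply_chat_template(\n",
+ "            [{\"role\": \"user\", \"content\": query}],\n",
+ "            add_generation_prompt=True,\n",
+ "            return_tensors=\"pt\",\n",
+ "        ).to(\"cuda\")\n",
+ "        output = model.generate(inputs, max_new_tokens=max_new_tokens, do_sample=False)\n",
+ "        # Decode only the newly generated tokens and strip any thinking tags\n",
+ "        answer = tokenizer.decode(output[0][inputs.shape[1]:], skip_special_tokens=True)\n",
+ "        results.append(filter_thinking(answer))\n",
+ "    return results\n",
+ "\n",
+ "answers = generate_batch(model, tokenizer, test_queries)\n",
+ "```\n",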
+ "\n",
+ "**For specialized domains:**\n",
+ "- Adjust temperature based on how creative vs. factual responses should be\n",
+ "- Consider adding domain-specific post-processing to validate outputs\n",
+ "- You might want to keep thinking tags visible for complex reasoning tasks\n",
+ "\n",
+ "## Step 9: Saving and Deploying the Model\n",
+ "\n",
+ "Our fine-tuned model is performing well, so it's time to save it for future use and deployment. This step covers saving the model locally, options for sharing via Hugging Face Hub, and verifying that the saved model works correctly.\n",
+ "\n",
+ "### Saving the fine-tuned model locally\n",
+ "\n",
+ "First, we save the fine-tuned model and tokenizer to local storage:\n",
+ "\n",
+ "```python\n",
+ "print(\"\\nSaving fine-tuned model...\")\n",
+ "output_model_name = \"qwen3-bullet-echo-qa-lora\"\n",
+ "model.save_pretrained(output_model_name)\n",
+ "tokenizer.save_pretrained(output_model_name)\n",
+ "print(f\"Model successfully saved to: ./{output_model_name}\")\n",
+ "```\n",
+ "\n",
+ "This creates a directory containing all the necessary files:\n",
+ "- The LoRA adapter weights (much smaller than the full model)\n",
+ "- Configuration files specifying the model architecture and parameters\n",
+ "- Tokenizer files including vocabulary and special token mappings\n",
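+ "\n",
+ "You can sanity-check the directory contents; the exact filenames depend on your peft and transformers versions, but it typically looks like this:\n",
+ "\n",
+ "```python\n",
+ "import os\n",
+ "\n",
+ "print(sorted(os.listdir(output_model_name)))\n",
+ "# Typically includes adapter_config.json, adapter_model.safetensors,\n",
+ "# tokenizer.json, tokenizer_config.json, and special_tokens_map.json\n",
+ "```\n",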
+ "\n",
+ "### Hugging Face Hub integration (optional)\n",
+ "\n",
+ "For sharing your model with others or deploying to production, you can push it to Hugging Face Hub:\n",
+ "\n",
+ "```python\n",
+ "# Optional: Push to Hugging Face Hub\n",
+ "# from huggingface_hub import login\n",
+ "# login()\n",
+ "# hub_model_id = f\"your-hf-username/{output_model_name}\"\n",
+ "# model.push_to_hub(hub_model_id)\n",
+ "# tokenizer.push_to_hub(hub_model_id)\n",
+ "# print(f\"Model pushed to Hugging Face Hub: {hub_model_id}\")\n",
+ "```\n",
+ "\n",
+ "This makes your model accessible to others and integrates with various deployment platforms that support Hugging Face models.\n",
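+ "\n",
+ "Once pushed, others can load the adapter straight from the Hub with the same call we use locally (the repo ID below is a placeholder for your own username):\n",
+ "\n",
+ "```python\n",
+ "# Sketch: load the adapter from the Hub instead of a local directory\n",
+ "hub_model, hub_tokenizer = FastModel.from_pretrained(\n",
+ "    model_name=\"your-hf-username/qwen3-bullet-echo-qa-lora\",  # placeholder repo ID\n",
+ "    max_seq_length=2048,\n",
+ "    load_in_4bit=True,\n",
+ "    full_finetuning=False,\n",
+ ")\n",
+ "```\n",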
+ "\n",
+ "### Loading and verifying the saved model\n",
+ "\n",
+ "To ensure everything was saved correctly, we load the model back and test it:\n",
+ "\n",
+ "```python\n",
+ "print(\"\\n--- Loading Saved Fine-tuned Model ---\")\n",
+ "\n",
+ "# Load the saved model and tokenizer\n",
+ "saved_model_path = output_model_name # \"qwen3-bullet-echo-qa-lora\"\n",
+ "loaded_model, loaded_tokenizer = FastModel.from_pretrained(\n",
+ " model_name=output_model_name,\n",
+ " max_seq_length=2048,\n",
+ " load_in_4bit=True,\n",
+ " full_finetuning=False,\n",
+ ")\n",
+ "\n",
+ "# Enable faster inference\n",
+ "unsloth.FastModel.for_inference(loaded_model)\n",
+ "\n",
+ "print(\"Model successfully loaded for inference!\")\n",
+ "\n",
+ "# Test with new queries\n",
+ "print(\"\\n--- Testing Loaded Model Responses ---\")\n",
+ "\n",
+ "new_test_queries = [\n",
+ " \"What's the best strategy for Cyclops in Bullet Echo?\",\n",
+ " \"How does the Stalker's invisibility work in the game?\",\n",
+ " \"Which heroes are effective against Bastion in Bullet Echo?\",\n",
+ "]\n",
+ "\n",
+ "for query in new_test_queries:\n",
+ " generate_response(loaded_model, loaded_tokenizer, query, temperature=0.2)\n",
+ "```\n",
+ "\n",
+ "Notice we use a lower temperature (0.2) here to get more deterministic responses for easier comparison.\n",
+ "\n",
+ "### Deployment considerations\n",
+ "\n",
+ "When deploying your fine-tuned model for real-world use, consider these approaches:\n",
+ "\n",
+ "**For web applications:**\n",
+ "- Use Hugging Face Inference API for managed hosting\n",
+ "- Deploy as a container with FastAPI or Flask for more control (see the sketch below)\n",
+ "- Consider quantizing to INT8 or INT4 for production efficiency\n",
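+ "\n",
+ "As a rough sketch of the container route, a minimal FastAPI wrapper could look like this (endpoint and field names are illustrative; it assumes a non-streaming helper such as the `generate_batch` sketch shown earlier):\n",
+ "\n",
+ "```python\n",
+ "# Sketch: minimal FastAPI wrapper (illustrative, not production-ready)\n",
+ "from fastapi import FastAPI\n",
+ "from pydantic import BaseModel\n",
+ "\n",
+ "app = FastAPI()\n",
+ "\n",
+ "class Question(BaseModel):\n",
+ "    question: str\n",
+ "\n",
+ "@app.post(\"/ask\")\n",
+ "def ask(payload: Question):\n",
+ "    # generate_batch is the non-streaming helper sketched earlier\n",
+ "    answer = generate_batch(loaded_model, loaded_tokenizer, [payload.question])[0]\n",
+ "    return {\"answer\": answer}\n",
+ "```\n",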
+ "\n",
+ "**For local applications:**\n",
+ "- Export to ONNX format for faster CPU inference\n",
+ "- Use llama.cpp for optimized deployment on edge devices\n",
+ "- Consider merging LoRA weights with the base model for simplified deployment\n",
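+ "\n",
+ "Merging can be done with PEFT; a rough sketch (it assumes the adapter config points at a full-precision base checkpoint, and it needs enough memory to hold the unquantized 14B weights):\n",
+ "\n",
+ "```python\n",
+ "# Sketch: merge the LoRA adapter into the base model with PEFT\n",
+ "import torch\n",
+ "from peft import AutoPeftModelForCausalLM\n",
+ "\n",
+ "merged = AutoPeftModelForCausalLM.from_pretrained(\n",
+ "    \"qwen3-bullet-echo-qa-lora\",  # the adapter directory saved above\n",
+ "    torch_dtype=torch.bfloat16,\n",
+ ")\n",
+ "merged = merged.merge_and_unload()  # fold the LoRA deltas into the base weights\n",
+ "merged.save_pretrained(\"qwen3-bullet-echo-qa-merged\")\n",
+ "tokenizer.save_pretrained(\"qwen3-bullet-echo-qa-merged\")\n",
+ "```\n",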
+ "\n",
+ "**For scaling considerations:**\n",
+ "- Use vLLM or text-generation-inference for higher throughput\n",
+ "- Implement caching for common queries\n",
+ "- Consider distilling into a smaller model for resource-constrained environments\n",
+ "\n",
+ "By following these steps, you've successfully fine-tuned a powerful Qwen 3 model to create a specialized assistant for the Bullet Echo game. The resulting model can now answer domain-specific questions with accuracy and relevance, while maintaining the general capabilities of the base model.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Conclusion\n",
+ "\n",
+ "In this step-by-step guide, we've walked through the complete process of fine-tuning Qwen 3 on a custom dataset. We started by creating a specialized question-answer dataset from the Bullet Echo wiki using Firecrawl's AI-powered extraction capabilities, then prepared our training environment with appropriate hardware and memory optimizations. Through Parameter-Efficient Fine-Tuning with QLoRA, we were able to adapt a 14B parameter model while training only 0.23% of its parameters, making the process feasible on a single GPU. Our implementation of proper validation strategies, optimization techniques, and inference configuration resulted in a model that can accurately answer domain-specific questions about the Bullet Echo game.\n",
+ "\n",
+ "The techniques demonstrated here can be applied to create specialized AI assistants for virtually any domain. Whether you're building a customer support bot, a technical documentation assistant, or a domain-specific knowledge base, the combination of Firecrawl for dataset creation and Unsloth for optimized fine-tuning provides a powerful toolkit for customizing large language models. To create your own custom datasets for fine-tuning, consider exploring [Firecrawl's AI-powered extraction](https://firecrawl.dev) capabilities, which eliminate the need for complex web scraping code and make dataset creation accessible even without extensive technical knowledge. As language models continue to evolve, the ability to efficiently adapt them to specialized domains will remain a key competitive advantage for developers and organizations.\n",
+ "\n",
+ "> Don't forget to check out [the full code](https://github.com/mendableai/firecrawl-app-examples/tree/main/qwen3-fine-tuning) for this article from our GitHub repository."
+ ]
+ }
+ ],
+ "metadata": {
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/qwen3-fine-tuning/pyproject.toml b/qwen3-fine-tuning/pyproject.toml
new file mode 100644
index 00000000..b6da56b1
--- /dev/null
+++ b/qwen3-fine-tuning/pyproject.toml
@@ -0,0 +1,7 @@
+[project]
+name = "qwen3-fine-tuning"
+version = "0.1.0"
+description = "Add your description here"
+readme = "README.md"
+requires-python = ">=3.11"
+dependencies = []
diff --git a/qwen3-fine-tuning/qwen3-fine-tune.ipynb b/qwen3-fine-tuning/qwen3-fine-tune.ipynb
new file mode 100644
index 00000000..1ce1bfe4
--- /dev/null
+++ b/qwen3-fine-tuning/qwen3-fine-tune.ipynb
@@ -0,0 +1,838 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "ce7ee550-9cc3-4725-8487-d9995378affc",
+ "metadata": {},
+ "source": [
+ "# Fine-tuning Qwen 3 on a Custom Dataset"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ec49a9fc-ba24-4583-bc07-9bbdb652d46c",
+ "metadata": {},
+ "source": [
+ "## Imports and setup"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "1247d75d-a193-4b69-8711-09ab4ebca7bc",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%%capture\n",
+ "!pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo\n",
+ "!pip install sentencepiece protobuf datasets huggingface_hub hf_transfer\n",
+ "!pip install --no-deps unsloth\n",
+ "!pip install regex transformers rich"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "842abacc-c719-4d84-97ab-42f0577a243b",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:\n",
+ " PyTorch 2.6.0+cu124 with CUDA 1204 (you have 2.8.0.dev20250319+cu128)\n",
+ " Python 3.11.11 (you have 3.11.11)\n",
+ " Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)\n",
+ " Memory-efficient attention, SwiGLU, sparse and more won't be available.\n",
+ " Set XFORMERS_MORE_DETAILS=1 for more details\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "🦥 Unsloth Zoo will now patch everything to make training faster!\n"
+ ]
+ }
+ ],
+ "source": [
+ "import unsloth\n",
+ "import torch\n",
+ "from unsloth import FastModel\n",
+ "from datasets import load_dataset\n",
+ "from trl import SFTTrainer, SFTConfig\n",
+ "from transformers import TextStreamer, GenerationConfig\n",
+ "import re"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2b315379-d5d2-43fa-ad0c-0bbec0dec8a5",
+ "metadata": {},
+ "source": [
+ "## Load model and tokenizer"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "a2d66215-3349-40ab-882d-b98fc974ca6e",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Loading Qwen3 model and tokenizer...\n",
+ "==((====))== Unsloth 2025.5.2: Fast Qwen3 patching. Transformers: 4.51.3.\n",
+ " \\\\ /| NVIDIA L40S. Num GPUs = 1. Max memory: 44.521 GB. Platform: Linux.\n",
+ "O^O/ \\_/ \\ Torch: 2.8.0.dev20250319+cu128. CUDA: 8.9. CUDA Toolkit: 12.8. Triton: 3.3.0\n",
+ "\\ / Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]\n",
+ " \"-____-\" Free license: http://github.com/unslothai/unsloth\n",
+ "Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!\n"
+ ]
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "2c4727f7ba7b4e129932afaac8fa18b7",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "print(\"Loading Qwen3 model and tokenizer...\")\n",
+ "model, tokenizer = FastModel.from_pretrained(\n",
+ " model_name=\"unsloth/Qwen3-14B\",\n",
+ " max_seq_length=2048, # Choose any for long context\n",
+ " load_in_4bit=True, # 4 bit quantization to reduce memory\n",
+ " full_finetuning=False,\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "81ebc8c8-848a-4dee-843e-393c95f8fd2c",
+ "metadata": {},
+ "source": [
+ "## Load and tokenize dataset"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "e775e705-4c69-4d8e-ab94-c8a6fdc99cfa",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Loading Bullet Echo Wiki QA dataset...\n",
+ "Training examples: 2711, Validation examples: 302\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(\"Loading Bullet Echo Wiki QA dataset...\")\n",
+ "dataset_name = \"bexgboost/bullet-echo-wiki-qa\"\n",
+ "full_dataset = load_dataset(dataset_name, trust_remote_code=True)\n",
+ "\n",
+ "# Split dataset into training and validation sets (90% train, 10% validation)\n",
+ "train_val_split = full_dataset[\"train\"].train_test_split(\n",
+ " test_size=0.1, seed=42, shuffle=True\n",
+ ")\n",
+ "train_dataset = train_val_split[\"train\"]\n",
+ "val_dataset = train_val_split[\"test\"] # This becomes our validation set\n",
+ "\n",
+ "print(\n",
+ " f\"Training examples: {len(train_dataset)}, Validation examples: {len(val_dataset)}\"\n",
+ ")\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "036cbacd-819a-4e97-ba98-81da1149fcfd",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Formatting datasets with Qwen3 chat template...\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(\"Formatting datasets with Qwen3 chat template...\")\n",
+ "EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN\n",
+ "\n",
+ "\n",
+ "def format_data(example):\n",
+ " # Qwen3 uses a chat template, so we'll format it accordingly\n",
+ " messages = [\n",
+ " {\"role\": \"user\", \"content\": example[\"question\"]},\n",
+ " {\"role\": \"assistant\", \"content\": example[\"answer\"] + EOS_TOKEN},\n",
+ " ]\n",
+ " # The tokenizer.apply_chat_template handles special tokens for Qwen3\n",
+ " return {\"text\": tokenizer.apply_chat_template(messages, tokenize=False)}\n",
+ "\n",
+ "\n",
+ "# Format both training and validation datasets\n",
+ "formatted_train_dataset = train_dataset.map(format_data)\n",
+ "formatted_val_dataset = val_dataset.map(format_data)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "f14c3bb9-8e49-49b5-aa84-8d6dff20bb55",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Tokenizing datasets...\n"
+ ]
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "29abb3a926674946a0ae988a57ed532f",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Tokenizing training dataset: 0%| | 0/2711 [00:00<?, ? examples/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "f8c5dd25ccfc4176911ad5cd4664b2d7",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Tokenizing validation dataset: 0%| | 0/302 [00:00<?, ? examples/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "print(\"Tokenizing datasets...\")\n",
+ "\n",
+ "\n",
+ "def tokenize_function(examples):\n",
+ " # padding=False because SFTTrainer will handle padding\n",
+ " return tokenizer(\n",
+ " examples[\"text\"],\n",
+ " padding=False,\n",
+ " truncation=True,\n",
+ " max_length=model.config.max_position_embeddings,\n",
+ " )\n",
+ "\n",
+ "\n",
+ "# Process both datasets\n",
+ "processed_train_dataset = formatted_train_dataset.map(\n",
+ " tokenize_function,\n",
+ " batched=True,\n",
+ " remove_columns=[\"id\", \"question\", \"answer\", \"text\"],\n",
+ " desc=\"Tokenizing training dataset\",\n",
+ ")\n",
+ "\n",
+ "processed_val_dataset = formatted_val_dataset.map(\n",
+ " tokenize_function,\n",
+ " batched=True,\n",
+ " remove_columns=[\"id\", \"question\", \"answer\", \"text\"],\n",
+ " desc=\"Tokenizing validation dataset\",\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2f699444-0772-4c98-9c19-04ae1789e7db",
+ "metadata": {},
+ "source": [
+ "## Setup PEFT Model"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "ad2b9621-b407-400f-b3c2-ab66a283c26d",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Setting up PEFT model with LoRA...\n",
+ "Unsloth: Making `model.base_model.model.model` require gradients\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(\"Setting up PEFT model with LoRA...\")\n",
+ "model = FastModel.get_peft_model(\n",
+ " model,\n",
+ " r=8,\n",
+ " target_modules=[\n",
+ " \"q_proj\",\n",
+ " \"k_proj\",\n",
+ " \"v_proj\",\n",
+ " \"o_proj\",\n",
+ " \"gate_proj\",\n",
+ " \"up_proj\",\n",
+ " \"down_proj\",\n",
+ " ],\n",
+ " finetune_vision_layers=False, # Turn off for just text!\n",
+ " finetune_language_layers=True,\n",
+ " finetune_attention_modules=True,\n",
+ " finetune_mlp_modules=True,\n",
+ " lora_alpha=8,\n",
+ " lora_dropout=0,\n",
+ " bias=\"none\",\n",
+ " use_gradient_checkpointing=\"unsloth\",\n",
+ " random_state=1000,\n",
+ " use_rslora=False,\n",
+ ")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "15dd8aee-70e6-4b3b-acfe-c58d2873a658",
+ "metadata": {},
+ "source": [
+ "## Train the model"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "id": "6e39ccac-529a-4b39-853b-e2af4054393c",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Configuring SFTTrainer with evaluation...\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(\"Configuring SFTTrainer with evaluation...\")\n",
+ "trainer = SFTTrainer(\n",
+ " model=model,\n",
+ " tokenizer=tokenizer,\n",
+ " train_dataset=processed_train_dataset,\n",
+ " eval_dataset=processed_val_dataset, # Add validation dataset\n",
+ " args=SFTConfig(\n",
+ " dataset_text_field=\"text\",\n",
+ " per_device_train_batch_size=2,\n",
+ " per_device_eval_batch_size=2, # Batch size for evaluation\n",
+ " gradient_accumulation_steps=4,\n",
+ " warmup_steps=5,\n",
+ " num_train_epochs=3,\n",
+ " # max_steps=100, # For quick testing\n",
+ " learning_rate=2e-4,\n",
+ " logging_steps=200,\n",
+ " optim=\"adamw_8bit\",\n",
+ " weight_decay=0.01,\n",
+ " lr_scheduler_type=\"linear\",\n",
+ " seed=3407,\n",
+ " output_dir=\"outputs\",\n",
+ " eval_strategy=\"steps\",\n",
+ " eval_steps=200, # Evaluate every 200 steps\n",
+ " save_strategy=\"steps\", # Save checkpoints based on evaluation\n",
+ " save_steps=200, # Save every 200 steps\n",
+ " load_best_model_at_end=True, # Load best model at the end of training\n",
+ " metric_for_best_model=\"eval_loss\", # Use evaluation loss to determine best model\n",
+ " greater_is_better=False, # Lower loss is better\n",
+ " save_total_limit=3, # Keep only the 3 best checkpoints\n",
+ " ),\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "cce0efed-ccaa-4e2e-9cbb-5dda1a636a16",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Starting fine-tuning process with validation...\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "==((====))== Unsloth - 2x faster free finetuning | Num GPUs used = 1\n",
+ " \\\\ /| Num examples = 2,711 | Num Epochs = 3 | Total steps = 1,017\n",
+ "O^O/ \\_/ \\ Batch size per device = 2 | Gradient accumulation steps = 4\n",
+ "\\ / Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8\n",
+ " \"-____-\" Trainable parameters = 32,112,640/14,000,000,000 (0.23% trained)\n",
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Unsloth: Will smartly offload gradients to save VRAM!\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "    <div>\n",
+ "      <progress value='1017' max='1017' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
+ "      [1017/1017 24:13, Epoch 3/3]\n",
+ "    </div>\n",
+ "    <table border=\"1\" class=\"dataframe\">\n",
+ "      <thead>\n",
+ "        <tr style=\"text-align: left;\">\n",
+ "          <th>Step</th>\n",
+ "          <th>Training Loss</th>\n",
+ "          <th>Validation Loss</th>\n",
+ "        </tr>\n",
+ "      </thead>\n",
+ "      <tbody>\n",
+ "        <tr>\n",
+ "          <td>200</td>\n",
+ "          <td>1.571900</td>\n",
+ "          <td>1.287612</td>\n",
+ "        </tr>\n",
+ "        <tr>\n",
+ "          <td>400</td>\n",
+ "          <td>1.211400</td>\n",
+ "          <td>1.214332</td>\n",
+ "        </tr>\n",
+ "        <tr>\n",
+ "          <td>600</td>\n",
+ "          <td>1.081000</td>\n",
+ "          <td>1.182069</td>\n",
+ "        </tr>\n",
+ "        <tr>\n",
+ "          <td>800</td>\n",
+ "          <td>0.960400</td>\n",
+ "          <td>1.207953</td>\n",
+ "        </tr>\n",
+ "        <tr>\n",
+ "          <td>1000</td>\n",
+ "          <td>0.879500</td>\n",
+ "          <td>1.197931</td>\n",
+ "        </tr>\n",
+ "      </tbody>\n",
+ "    </table>"
+ ],
+ "text/plain": [
+ "<IPython.core.display.HTML object>"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Unsloth: Not an error, but Qwen3ForCausalLM does not accept `num_items_in_batch`.\n",
+ "Using gradient accumulation will be very slightly less accurate.\n",
+ "Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Training completed!\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(\"Starting fine-tuning process with validation...\")\n",
+ "training_results = trainer.train()\n",
+ "\n",
+ "# Print evaluation metrics\n",
+ "print(\"Training completed!\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "id": "4c8b5924-bb35-44b5-8c08-a25b745bb83b",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Final training metrics: {'train_runtime': 1461.829, 'train_samples_per_second': 5.564, 'train_steps_per_second': 0.696, 'total_flos': 5.716715497264128e+16, 'train_loss': 1.136481964013804}\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(f\"Final training metrics: {training_results.metrics}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5afb684e-c9fe-4501-be87-99867236582b",
+ "metadata": {},
+ "source": [
+ "## Model inference"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "acc1f03a-9ab1-4c16-86d6-6d1a1867132e",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Setting up model for inference...\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(\"Setting up model for inference...\")\n",
+ "unsloth.FastModel.for_inference(model) # Enable native 2x faster inference\n",
+ "\n",
+ "\n",
+ "def generate_response(\n",
+ " model, tokenizer, query, temperature=0.7, top_p=0.9, max_new_tokens=256\n",
+ "):\n",
+ " \"\"\"\n",
+ " Generate a response from the fine-tuned model.\n",
+ "\n",
+ " Args:\n",
+ " model: The fine-tuned model\n",
+ " tokenizer: The tokenizer\n",
+ " query: The user query/question\n",
+ " temperature: Controls randomness in generation (lower = more deterministic)\n",
+ " top_p: Nucleus sampling parameter (lower = more focused)\n",
+ " max_new_tokens: Maximum new tokens to generate\n",
+ "\n",
+ " Returns:\n",
+ " Generated response text\n",
+ " \"\"\"\n",
+ " # Format the query as a chat message\n",
+ " messages = [{\"role\": \"user\", \"content\": query}]\n",
+ "\n",
+ " # Prepare model inputs\n",
+ " inputs = tokenizer.apply_chat_template(\n",
+ " messages, add_generation_prompt=True, return_tensors=\"pt\"\n",
+ " ).to(\"cuda\")\n",
+ "\n",
+ " # Create attention mask (all 1s) with the same shape as inputs\n",
+ " attention_mask = torch.ones_like(inputs).to(\"cuda\")\n",
+ "\n",
+ " # Configure generation parameters\n",
+ " generation_config = GenerationConfig(\n",
+ " temperature=temperature,\n",
+ " top_p=top_p,\n",
+ " do_sample=True,\n",
+ " max_new_tokens=max_new_tokens,\n",
+ " pad_token_id=tokenizer.pad_token_id,\n",
+ " eos_token_id=tokenizer.eos_token_id,\n",
+ " remove_invalid_values=True,\n",
+ " # Disable thinking tags\n",
+ " suppress_tokens=(\n",
+ " [\n",
+ " tokenizer.encode(\"<think>\", add_special_tokens=False)[0],\n",
+ " tokenizer.encode(\"</think>\", add_special_tokens=False)[0],\n",
+ " ]\n",
+ " if len(tokenizer.encode(\"<think>\", add_special_tokens=False)) > 0\n",
+ " else None\n",
+ " ),\n",
+ " )\n",
+ "\n",
+ " # Custom text filtering function\n",
+ " def filter_thinking(text):\n",
+ " # Remove anything between <think> and </think> tags\n",
+ " text = re.sub(r\"<think>.*?</think>\", \"\", text, flags=re.DOTALL)\n",
+ " # Remove any remaining <think> or </think> tags\n",
+ " text = re.sub(r\"<think>|</think>\", \"\", text)\n",
+ " return text\n",
+ "\n",
+ " # Custom streamer class to filter thinking tags\n",
+ " class FilteredTextStreamer(TextStreamer):\n",
+ " def on_finalized_text(self, text: str, stream_end: bool = False):\n",
+ " filtered_text = filter_thinking(text)\n",
+ " if filtered_text.strip(): # Only print non-empty text\n",
+ " print(filtered_text, end=\"\", flush=True)\n",
+ "\n",
+ " # Initialize filtered text streamer\n",
+ " streamer = FilteredTextStreamer(\n",
+ " tokenizer, skip_prompt=True, skip_special_tokens=True\n",
+ " )\n",
+ "\n",
+ " # Display query\n",
+ " print(f\"User: {query}\")\n",
+ " print(\"Assistant:\")\n",
+ "\n",
+ " # Generate response\n",
+ " output = model.generate(\n",
+ " inputs,\n",
+ " attention_mask=attention_mask,\n",
+ " generation_config=generation_config,\n",
+ " streamer=streamer,\n",
+ " return_dict_in_generate=True,\n",
+ " output_scores=False,\n",
+ " )\n",
+ "\n",
+ " # For non-streaming use (optional):\n",
+ " # output_text = tokenizer.decode(output.sequences[0], skip_special_tokens=True)\n",
+ " # return filter_thinking(output_text)\n",
+ "\n",
+ " print(\"\\n\") # Add a newline after generation\n",
+ " return None # Since we're streaming, we don't return the output"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "id": "5cbd2416-8b12-4c8e-86c0-a671d2b8f7a4",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "--- Testing Model Responses ---\n",
+ "User: What's the best strategy for Cyclops in Bullet Echo?\n",
+ "Assistant:\n",
+ "The best strategy for Cyclops is to stay hidden, as this is his greatest strength. He excels in ambush tactics, allowing him to surprise enemies and maximize his effectiveness in stealthy encounters.\n",
+ "\n",
+ "User: How does the Stalker's invisibility work in the game?\n",
+ "Assistant:\n",
+ "The Stalker uses a special ability called invisibility, which makes the character temporarily undetectable by opponents. This is often used for stealth movements and surprise attacks.\n",
+ "\n",
+ "User: Which heroes are effective against Bastion in Bullet Echo?\n",
+ "Assistant:\n",
+ "Heroes with high damage and quick movement, such as Levi, Blot, and Lynx, are effective against Bastion due to their ability to deal quick hits and outmaneuver his slow but powerful attacks.\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Test the model with sample queries\n",
+ "print(\"\\n--- Testing Model Responses ---\")\n",
+ "\n",
+ "test_queries = [\n",
+ " \"What's the best strategy for Cyclops in Bullet Echo?\",\n",
+ " \"How does the Stalker's invisibility work in the game?\",\n",
+ " \"Which heroes are effective against Bastion in Bullet Echo?\",\n",
+ "]\n",
+ "\n",
+ "for query in test_queries:\n",
+ " generate_response(model, tokenizer, query)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "85578bd7-3e0c-4424-a4ec-ac58385f32d8",
+ "metadata": {},
+ "source": [
+ "## Save model"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "992a07bb-2d65-4fd4-acbc-cca8eafdeae7",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "Saving fine-tuned model...\n",
+ "Model successfully saved to: ./qwen3-bullet-echo-qa-lora\n",
+ "\n",
+ "🦥 Fine-tuning script completed successfully! 🦥\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(\"\\nSaving fine-tuned model...\")\n",
+ "output_model_name = \"qwen3-bullet-echo-qa-lora\"\n",
+ "model.save_pretrained(output_model_name)\n",
+ "tokenizer.save_pretrained(output_model_name)\n",
+ "print(f\"Model successfully saved to: ./{output_model_name}\")\n",
+ "\n",
+ "# Optional: Push to Hugging Face Hub\n",
+ "# from huggingface_hub import login\n",
+ "# login()\n",
+ "# hub_model_id = f\"your-hf-username/{output_model_name}\"\n",
+ "# model.push_to_hub(hub_model_id)\n",
+ "# tokenizer.push_to_hub(hub_model_id)\n",
+ "# print(f\"Model pushed to Hugging Face Hub: {hub_model_id}\")\n",
+ "\n",
+ "print(\"\\n🦥 Fine-tuning script completed successfully! 🦥\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "91a89ca4-a37b-4f0f-aebf-e0a9202ea45b",
+ "metadata": {},
+ "source": [
+ "## Load saved model"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "id": "26bcfd58-7dd0-458b-a80d-4fd5dccf33ec",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "--- Loading Saved Fine-tuned Model ---\n",
+ "==((====))== Unsloth 2025.5.2: Fast Qwen3 patching. Transformers: 4.51.3.\n",
+ " \\\\ /| NVIDIA L40S. Num GPUs = 1. Max memory: 44.521 GB. Platform: Linux.\n",
+ "O^O/ \\_/ \\ Torch: 2.8.0.dev20250319+cu128. CUDA: 8.9. CUDA Toolkit: 12.8. Triton: 3.3.0\n",
+ "\\ / Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]\n",
+ " \"-____-\" Free license: http://github.com/unslothai/unsloth\n",
+ "Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!\n"
+ ]
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "f2c5052a639e46c7913e00107d193279",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Model successfully loaded for inference!\n",
+ "\n",
+ "--- Testing Loaded Model Responses ---\n",
+ "User: What's the best strategy for Cyclops in Bullet Echo?\n",
+ "Assistant:\n",
+ "The best strategy for Cyclops is to stay hidden and avoid direct confrontation, using his stealth and invisibility to ambush enemies or escape dangerous situations.\n",
+ "\n",
+ "User: How does the Stalker's invisibility work in the game?\n",
+ "Assistant:\n",
+ "The Stalker can become invisible for a limited time, allowing it to move undetected across the map. This invisibility can be used to avoid enemies, set up ambushes, or escape dangerous situations.\n",
+ "\n",
+ "User: Which heroes are effective against Bastion in Bullet Echo?\n",
+ "Assistant:\n",
+ "Heroes with high damage and mobility, such as Lynx, Slayer, and Stalker, are effective against Bastion due to his low health and slow movement speed.\n",
+ "\n",
+ "\n",
+ "🦥 Model loading and inference testing completed! 🦥\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(\"\\n--- Loading Saved Fine-tuned Model ---\")\n",
+ "\n",
+ "# Load the saved model and tokenizer\n",
+ "saved_model_path = output_model_name # \"qwen3-bullet-echo-qa-lora\"\n",
+ "loaded_model, loaded_tokenizer = FastModel.from_pretrained(\n",
+ " model_name=output_model_name,\n",
+ " max_seq_length=2048,\n",
+ " load_in_4bit=True,\n",
+ " full_finetuning=False,\n",
+ ")\n",
+ "\n",
+ "# Enable faster inference\n",
+ "unsloth.FastModel.for_inference(loaded_model)\n",
+ "\n",
+ "print(\"Model successfully loaded for inference!\")\n",
+ "\n",
+ "# Test with new queries\n",
+ "print(\"\\n--- Testing Loaded Model Responses ---\")\n",
+ "\n",
+ "new_test_queries = [\n",
+ " \"What's the best strategy for Cyclops in Bullet Echo?\",\n",
+ " \"How does the Stalker's invisibility work in the game?\",\n",
+ " \"Which heroes are effective against Bastion in Bullet Echo?\",\n",
+ "]\n",
+ "\n",
+ "for query in new_test_queries:\n",
+ " generate_response(loaded_model, loaded_tokenizer, query, temperature=0.2)\n",
+ "\n",
+ "print(\"\\n🦥 Model loading and inference testing completed! 🦥\")\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e9be2fc4-7817-4be7-98e1-7e43b131ed80",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.11"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}