A fine-tuned LLM system that generates domain name suggestions and evaluates them with an LLM-as-a-Judge. Run instructions at the end.
- Model family: Mistral 7B v0.3 (Instruct)
- Fine-tuning strategy: LoRA adapters using Hugging Face TRL, SFT (Supervised Fine-Tuning)
- Early stopping: patience = 2
- Evaluation: LLM-as-a-Judge (an OpenAI model) for quality and safety scoring
- Devices: CUDA or Apple Metal (MPS) supported; CPU fallback
Why Mistral 7B? Compact enough to run locally while being strong for instruction-following. LoRA enables fast, low-cost adapters with small on-disk footprint and quick iteration.
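The actual training code lives in `src/train.py`; the sketch below only illustrates the TRL + PEFT setup described above (SFT with a LoRA adapter, epoch-level evaluation, early stopping with patience 2), assuming a recent TRL/PEFT/Transformers stack. The hyperparameter values and the prompt format are placeholders, not the project's settings.

```python
# Illustrative sketch only — values and prompt format are placeholders.
from datasets import load_dataset
from peft import LoraConfig
from transformers import EarlyStoppingCallback
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("csv", data_files={"train": "data/dataset_v1.csv"})["train"]
split = dataset.train_test_split(test_size=0.1)

def formatting_func(example):
    # Turn each row into one training string (hypothetical format, not the project's).
    return f"Business: {example['description']}\nDomains: {example['suggestions']}"

peft_config = LoraConfig(          # small adapter instead of full fine-tuning
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
)

args = SFTConfig(
    output_dir="models/model_v1",
    num_train_epochs=5,
    eval_strategy="epoch",         # evaluate once per epoch
    save_strategy="epoch",         # checkpoint once per epoch
    load_best_model_at_end=True,   # required for early stopping
    metric_for_best_model="eval_loss",
)

trainer = SFTTrainer(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    args=args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    formatting_func=formatting_func,
    peft_config=peft_config,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```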
Two data splits power the iteration loop:
- Training datasets: `data/dataset_v{n}.csv`
- Held-out test set: `data/test_set.csv`

Each dataset has two columns:

- `description`: a one- or multi-sentence business description
- `suggestions`: a JSON array of exactly 5 single-token strings, or `[]` for refusal rows
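As an illustration of how this schema can be consumed (a minimal sketch, not code from the repo), the `suggestions` column is parsed from its JSON string form:

```python
import json
import pandas as pd

df = pd.read_csv("data/dataset_v1.csv")

# `suggestions` is stored as a JSON string: a 5-element array, or "[]" for refusal rows.
df["suggestions"] = df["suggestions"].apply(json.loads)

refusals = df[df["suggestions"].apply(len) == 0]
print(f"{len(df)} rows, {len(refusals)} refusals")
```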
Generation approach:
- The initial dataset (v1) was produced via a prompt to generate around 100 rows, including about 10 explicit refusals (`[]`).
- Certain topics were intentionally omitted from v1 (e.g., birds, water play, hunting) so the test set could stress these blind spots.
- The test set contains 24–30 rows, including requested omissions, several specific constraints (e.g., “avoid hyphens”, “prefer number-based naming”), ultra-long descriptions, and 4–6 refusal rows.
Formatting constraints for suggestions:
- No TLDs (e.g., no `.com`)
- Allowed characters: lowercase a–z, digits 0–9, and hyphen (`-`)
- Length: 5–50 characters
- Unique, pronounceable, and brandable; avoid trademarks and real company names
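These constraints are easy to check mechanically; a minimal validator for the rules above (a sketch, not the project's code) could look like:

```python
import re

# Lowercase letters, digits, and hyphens only; 5–50 characters; no dot means no TLD.
DOMAIN_PATTERN = re.compile(r"^[a-z0-9-]{5,50}$")

def is_valid_suggestion(name: str) -> bool:
    """Check the formatting constraints above (charset, length, no TLD)."""
    return DOMAIN_PATTERN.fullmatch(name) is not None

assert is_valid_suggestion("brightnest")
assert not is_valid_suggestion("brightnest.com")   # TLD not allowed
assert not is_valid_suggestion("abc")              # too short
```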
Implementation notes:
- TLDs and JSON-typed outputs were removed from training targets to avoid unnecessary complexity during SFT and to improve convergence.
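As a hypothetical illustration of that preprocessing (the exact target format is defined in `src/train.py`), the JSON array could be flattened into plain text like this:

```python
import json

def to_training_target(suggestions_json: str) -> str:
    """Flatten the JSON array into a plain-text SFT target (illustrative format only)."""
    names = json.loads(suggestions_json)
    if not names:
        # Refusal rows ([]) map to whatever refusal wording the project uses; placeholder here.
        return "I can't suggest domains for this request."
    return ", ".join(names)

print(to_training_target('["brightnest", "cozyhive", "nestly", "hearthly", "dwellio"]'))
# brightnest, cozyhive, nestly, hearthly, dwellio
```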
We assess both the quality of domain suggestions and safety behavior.
Quality (LLM-as-a-Judge):
- Each generated domain is scored on relevance, creativity, brandability, and conciseness (1–5).
- Low-scoring outputs are assigned categories (e.g., random words, too long), enabling aggregate analysis and targeted fixes.
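The judge prompt and response parsing live in `src/model_eval.py`; purely as an illustration (field names here are assumptions, not the project's schema), a per-domain score record could be modeled like this:

```python
from dataclasses import dataclass

@dataclass
class JudgeScore:
    """One judged domain suggestion (illustrative schema, field names assumed)."""
    domain: str
    relevance: int      # 1–5
    creativity: int     # 1–5
    brandability: int   # 1–5
    conciseness: int    # 1–5
    category: str       # e.g. "good", "ok", "random", "too long", "other"

    @property
    def mean_score(self) -> float:
        return (self.relevance + self.creativity + self.brandability + self.conciseness) / 4
```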
Safety classification and metrics:
- The judge also labels each input description as appropriate or inappropriate.
- Model behavior is compared against this ground truth, and we compute precision/recall/F1 for blocking inappropriate content.
- Confusion-matrix terms:
  - TP (`ok`): safe content correctly allowed
  - TN (`confirmed_inappropriate`): harmful content correctly blocked
  - FP (`false_positive_inappropriate`): safe content wrongly blocked
  - FN (`missed_inappropriate`): harmful content wrongly allowed
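As a rough sketch of how these per-row labels aggregate into the blocking metrics (the real computation lives in `src/model_eval.py`; here a correct block is treated as the positive event for precision/recall):

```python
from collections import Counter

def blocking_metrics(labels: list[str]) -> dict[str, float]:
    """Precision/recall/F1 for blocking inappropriate content (illustrative sketch).

    `labels` contains one of the four per-row outcomes listed above.
    """
    counts = Counter(labels)
    blocked_correctly = counts["confirmed_inappropriate"]   # harmful, correctly blocked
    blocked_wrongly = counts["false_positive_inappropriate"]  # safe, wrongly blocked
    missed = counts["missed_inappropriate"]                  # harmful, wrongly allowed

    denom_p = blocked_correctly + blocked_wrongly
    denom_r = blocked_correctly + missed
    precision = blocked_correctly / denom_p if denom_p else 0.0
    recall = blocked_correctly / denom_r if denom_r else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```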
Artifacts:
- Detailed CSV evaluations per model version:
data/model_{version}-results.csv - Example visuals: category distribution and safety analysis plots
- images/domain_category_count.png
- images/v2-safety-classification_analysis.png
How we surface weaknesses:
- Filter to valid outputs (exclude FP and FN safety cases) and compute a normalized mean quality score per row in [0, 1].
- Rank from lowest to highest; study the bottom cases to understand failure patterns.
- Count categories (good/ok/random/too long/other) to identify systemic issues.
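A rough sketch of this analysis step over a per-version results CSV (the column names here are assumptions; the real ones are whatever `src/model_eval.py` writes):

```python
import pandas as pd

results = pd.read_csv("data/model_v1-results.csv")

# Keep only valid rows (drop safety false positives / false negatives).
valid = results[~results["safety_label"].isin(
    ["false_positive_inappropriate", "missed_inappropriate"]
)].copy()

# Normalize the mean 1–5 judge score into [0, 1], then rank worst-first.
valid["quality"] = (valid[["relevance", "creativity", "brandability", "conciseness"]]
                    .mean(axis=1) - 1) / 4
worst = valid.sort_values("quality").head(20)

# Count failure categories to spot systemic issues.
print(valid["category"].value_counts())
```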
Observed edge cases:
- “Missing children” combined with adult or harmful themes (gambling, alcohol, tobacco)
- False positives that block legitimate legal services
- Some domain-specific creativity gaps
Strategy:
- Use insights from low-quality and unsafe cases to prompt-generate additional data focusing on identified gaps.
- Augment v1 with targeted new rows and refusals, then fine-tune a new LoRA adapter (v2) using the same hyperparameters.
Measured changes on the same test set:
- Fewer “random word” domains (v2 vs v1)
- Quality: +0.0067 absolute (+0.97% relative)
- Safety (F1): +0.0718 absolute (+8.29% relative)
- Total error rate (FP + FN): reduced from 20.9% (v1) to 9.3% (v2)
  - Absolute change: −11.6 percentage points
  - Relative reduction: −55.5%
Takeaway: Data augmentation targeted at real failure modes significantly improved safety with a modest lift in quality.
| Metric | v1 | v2 |
|---|---|---|
| Quality (avg) | 0.6893 | 0.6960 |
| Safety (F1) | 0.791 | 0.907 |
Recommendation: Deploy v2. While quality gains are small, the safety improvement is substantial and important for this task.
- Expose token logprobs to quantify model confidence per suggestion; include in scoring and UI.
- Capture and visualize training metrics (loss/val curves, LR schedules); integrate MLFlow/W&B.
- Provide a simple API service with runtime guardrails (e.g., a lightweight safety checker before returning results).
- Add Docker/Compose for reproducible environments.
- Hyperparameter tuning with Optuna (LoRA rank/alpha, LR, scheduler, batch sizes).
- Weight FN higher than FP in the safety metric if product risk demands it (e.g., an F-beta score with β > 1; see the sketch after this list).
- Explore alternative `save_strategy`/`eval_strategy` settings beyond `epoch` to capture better checkpoints.
- Investigate loss alternatives to pure token cross-entropy for list-like outputs (e.g., semantic similarity–aware objectives).
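One standard way to weight missed harmful content (FN) more heavily than over-blocking (FP), offered here only as an illustration, is an F-beta score with β > 1:

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta score; beta > 1 weights recall (i.e., penalizes FN) more heavily."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Same precision/recall, but F2 punishes low recall harder than F1 does.
print(f_beta(0.90, 0.70, beta=1.0))  # ≈ 0.787
print(f_beta(0.90, 0.70, beta=2.0))  # ≈ 0.733
```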
- Root CLI: `main.py` (training and evaluation entrypoint)
- Training: `src/train.py`
- Evaluation: `src/model_eval.py`
- Data: `data/dataset_v*.csv`, `data/test_set.csv`, `data/model_*-results.csv`
- Notebooks: `notebooks/analyse_v1.ipynb`, `notebooks/analyse_v2.ipynb`, `notebooks/train.ipynb`, `notebooks/evaluate.ipynb`
- Images: `images/`
- Models: `models/model_{version}/` (created after training)
- Python 3.10+ (tested on recent versions)
- Git
- OpenAI API key (for evaluation)
- Hugging Face access token (for base model and/or gated weights)
- Optional GPU with CUDA or Apple MPS
Note for CUDA users: Ensure your PyTorch build matches your CUDA version. Use the PyTorch selector: https://pytorch.org/get-started/locally/
```bash
git clone https://github.com/ohmatheus/domain.yours.git
cd domain.yours
```

```bash
python -m venv venv

# macOS/Linux
source venv/bin/activate

# Windows
venv\Scripts\activate
```

```bash
pip install -r requirements.txt
```

Copy and edit the example:

```bash
cp .env_example .env
```

Required variables (see `src/settings.py`):

- `HUGGINGFACE_API_TOKEN`
- `OPENAI_API_KEY`
- `DEVICE` (optional: `cuda`, `mps`, or `cpu`)

Example `.env`:

```
HUGGINGFACE_API_TOKEN=your_hf_token_here
OPENAI_API_KEY=your_openai_api_key_here
DEVICE=cuda
```

- Train a specific version:

  ```bash
  python main.py train --version v1
  python main.py train --version v2
  ```

- Train all detected versions:

  ```bash
  python main.py train --version all
  ```

- Stop-on-error (when using `all`):

  ```bash
  python main.py train --version all --stop-on-error
  ```

Datasets must follow the naming scheme `data/dataset_v1.csv`, `data/dataset_v2.csv`, ... with columns `description` and `suggestions`.

- Evaluate a specific trained model version:

  ```bash
  python main.py eval --version v1
  python main.py eval --version v2
  ```

- Evaluate all trained models:

  ```bash
  python main.py eval --version all
  ```

- Stop-on-error (when using `all`):

  ```bash
  python main.py eval --version all --stop-on-error
  ```

Requirements for evaluation:

- Trained models in `models/model_{version}/`
- Test set at `data/test_set.csv`
- Valid OpenAI API key in the environment

