This repository contains the data preprocessing pipelines, supervised fine-tuning (SFT) scripts, and evaluation framework used in the VectorGym benchmark paper.
VectorGym benchmarks large multimodal language models on SVG code generation tasks. The primary task is text-to-SVG: given a natural language description, generate valid, visually accurate SVG code.
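For illustration, a task instance pairs a short description with SVG markup, and a generation only counts if it parses as well-formed XML. The prompt and SVG below are hypothetical examples, not drawn from the dataset:

```python
import xml.etree.ElementTree as ET

# Hypothetical example pair (illustrative only, not from the benchmark data).
prompt = "A red circle centered on a white background"
svg = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="64" height="64">'
    '<rect width="64" height="64" fill="white"/>'
    '<circle cx="32" cy="32" r="20" fill="red"/>'
    "</svg>"
)

# A minimal validity check: the output must parse as well-formed XML
# with an <svg> root element.
root = ET.fromstring(svg)
```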
The pipeline covers:
- Data annotation — generating captions for SVG images using a vision-language model
- Fine-tuning — LoRA-based SFT of Qwen2.5-VL-32B-Instruct
- Evaluation — multi-metric validation (L2, LPIPS, SSIM, FID, CLIP Score, DINO Score, token length)
```
svg-research/
├── configs/
│   ├── generation/
│   │   └── text2svg.yaml          # Inference/validation config
│   └── sft/lora/qwen_2.5vl_instruct_32b/
│       └── text2svg_config.yaml   # LoRA fine-tuning config
├── eval/
│   ├── metrics/                   # Individual metric implementations
│   ├── svg_validator_base.py      # Abstract validator base class
│   ├── svg_validator_hf.py        # HuggingFace inference backend
│   ├── vllm_svg_validator.py      # vLLM inference backend
│   └── validate.py                # Validation CLI entry point
├── train/
│   ├── train.py                   # SFT training script
│   └── util.py                    # Model loading and data preprocessing
├── utils/
│   ├── annotate.py                # SVG caption generation (vLLM)
│   ├── dataset_utils.py           # Dataset splitting and Hub upload
│   ├── svg_util.py                # SVG rendering and processing
│   └── utils.py                   # Shared helpers
├── scripts/
│   └── upload_lora_adapters.py    # Upload adapters to Hugging Face Hub
└── pyproject.toml
```
Python >= 3.11 is required.
```
git clone https://github.com/alys28/svg-research.git
cd svg-research
pip install -e .
```

For training with DeepSpeed and WandB logging:
```
pip install -e ".[train]"
```

utils/annotate.py uses a vLLM-hosted vision-language model to generate natural language descriptions for SVG images. These captions serve as the text prompts during SFT.
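The shape of such a captioning request can be sketched as an OpenAI-compatible chat payload with a base64-encoded rendering of the SVG. This is an assumption-laden illustration; the actual prompt wording and client code in utils/annotate.py may differ:

```python
import base64


def build_caption_request(image_png: bytes,
                          model: str = "Qwen/Qwen2.5-VL-32B-Instruct") -> dict:
    """Build an OpenAI-compatible chat payload asking a VLM to caption a
    rendered SVG. Illustrative sketch only: the real annotate.py prompt,
    model choice, and sampling settings are assumptions here."""
    b64 = base64.b64encode(image_png).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                # Rendered SVG image, inlined as a data URL.
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                # Instruction asking for a caption usable as a text-to-SVG prompt.
                {"type": "text",
                 "text": "Describe this image in one concise sentence "
                         "suitable as a text-to-SVG prompt."},
            ],
        }],
        "temperature": 0.2,
    }
```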
utils/dataset_utils.py provides utilities for chunking large datasets and uploading splits to the Hugging Face Hub:
```python
from utils.dataset_utils import split_into_one_repo, copy_splits_to_repo
```

The training data is hosted at svg-hub/svg-stack-annotated-sample.
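The chunking idea behind these helpers can be sketched with a small standalone function; the actual signatures of split_into_one_repo and copy_splits_to_repo are not shown here, and the helper below is a hypothetical illustration:

```python
def chunk_dataset(records: list, num_shards: int) -> list[list]:
    """Split records into num_shards near-equal contiguous shards,
    e.g. for uploading a large dataset to the Hub split by split.
    Illustrative sketch; not the repo's actual implementation."""
    base, rem = divmod(len(records), num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        # The first `rem` shards absorb one extra record each.
        size = base + (1 if i < rem else 0)
        shards.append(records[start:start + size])
        start += size
    return shards
```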
Training uses TRL's SFTTrainer with LoRA (via PEFT) applied to all attention and MLP projection layers of Qwen2.5-VL-32B-Instruct.
Edit configs/sft/lora/qwen_2.5vl_instruct_32b/text2svg_config.yaml to set training hyperparameters. Key defaults:
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2.5-VL-32B-Instruct |
| LoRA rank | 64 |
| LoRA alpha | 32 |
| Learning rate | 5e-5 |
| Batch size (per device) | 4 |
| Gradient accumulation | 8 |
| Max sequence length | 16384 |
| Precision | bfloat16 |
| LR scheduler | cosine |
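Expressed as a YAML fragment, the defaults above might look like the following. The key names are an assumption based on common TRL/PEFT conventions and may not match the repo's actual config schema:

```yaml
# Hypothetical key names; check text2svg_config.yaml for the real schema.
model_name_or_path: Qwen/Qwen2.5-VL-32B-Instruct
lora_r: 64
lora_alpha: 32
learning_rate: 5.0e-5
per_device_train_batch_size: 4
gradient_accumulation_steps: 8
max_seq_length: 16384
bf16: true
lr_scheduler_type: cosine
```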
```
python train/train.py --config configs/sft/lora/qwen_2.5vl_instruct_32b/text2svg_config.yaml
```

Checkpoints are saved every 250 steps. Training metrics (loss, gradient norm, learning rate) are logged to WandB and exported to plots/.
After training, upload LoRA adapters to the Hugging Face Hub:
```
python scripts/upload_lora_adapters.py \
    --folder_path outputs/checkpoint-2500 \
    --repo_id svg-hub/qwen_2.5vl_instruct_text2svg_ckpt_2500 \
    --token $HF_TOKEN
```

Edit configs/generation/text2svg.yaml. Key settings:
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2.5-VL-32B-Instruct |
| PEFT adapter | svg-hub/qwen_2.5vl_instruct_text2svg_ckpt_2500 |
| Dataset split | test_00 |
| Rasterize / render size | 512 × 512 |
| Temperature | 0.7 |
| Max new tokens | 16384 |
```
python eval/validate.py --config configs/generation/text2svg.yaml
```

CLI flags override any config value, e.g.:

```
python eval/validate.py --config configs/generation/text2svg.yaml --batch_size 8 --temperature 0.9
```

| Metric | Description |
|---|---|
| L2 | Pixel-level Euclidean distance |
| LPIPS | Learned perceptual image patch similarity |
| SSIM | Structural similarity index |
| FID | Fréchet Inception Distance |
| CLIP Score | Semantic text–image alignment |
| DINO Score | Feature-based image similarity |
| Token Length | SVG code token count |
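As a concrete example of the pixel-level metrics, the L2 distance can be sketched as the mean per-pixel Euclidean distance between the rendered and reference images. This is an illustrative implementation; the repo's eval/metrics code may normalize or aggregate differently:

```python
import numpy as np


def l2_distance(img_a: np.ndarray, img_b: np.ndarray) -> float:
    """Mean per-pixel Euclidean (L2) distance between two H x W x C images
    with values in [0, 1]. Sketch only; the benchmark's exact normalization
    is an assumption here."""
    diff = img_a.astype(np.float64) - img_b.astype(np.float64)
    # Euclidean norm across channels, then averaged over all pixels.
    return float(np.sqrt((diff ** 2).sum(axis=-1)).mean())
```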
Results are logged to WandB and written to the output directory specified in the config.
Two backends are supported via the validator registry:
- SVGHFValidator — HuggingFace transformers (default)
- VLLMValidator — vLLM for faster batch inference
Set generation_engine in the generation config to switch backends.
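The registry pattern this implies can be sketched as a name-to-class mapping keyed by the generation_engine value. The class names mirror the repo, but the registry keys and wiring below are assumptions:

```python
# Minimal sketch of a validator registry; the repo's actual registration
# mechanism and engine key names may differ.
VALIDATOR_REGISTRY: dict[str, type] = {}


def register_validator(name: str):
    """Decorator that registers a validator class under an engine name."""
    def wrap(cls: type) -> type:
        VALIDATOR_REGISTRY[name] = cls
        return cls
    return wrap


@register_validator("hf")
class SVGHFValidator:
    """HuggingFace transformers backend (stub for illustration)."""


@register_validator("vllm")
class VLLMValidator:
    """vLLM backend for faster batch inference (stub for illustration)."""


def get_validator(engine: str) -> type:
    """Look up the validator class for the configured generation_engine."""
    return VALIDATOR_REGISTRY[engine]
```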
Create a .env file in the project root with:
```
HF_TOKEN=your_huggingface_token
WANDB_API_KEY=your_wandb_key
```
VectorGym: A Multi-Task Benchmark for SVG Code Generation and Manipulation