This repository contains scripts and pipelines for training and fine-tuning models, as well as generating and pruning synthetic datasets. The project is organized into two main folders:
- `finetune/` - Contains scripts for fine-tuning models using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO).
- `generating_data/` - Contains scripts for generating synthetic datasets, pruning them, and preparing them for model training.
This folder includes training scripts that focus on optimizing models using different methodologies:
- SFT (`sft_math_or_code.py`): Implements supervised fine-tuning, where the model is trained on labeled math/code datasets.
- DPO (`dp.py`): Implements Direct Preference Optimization, which fine-tunes models on pairs of positive and negative (chosen and rejected) examples.
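As a rough illustration of what DPO optimizes (a minimal scalar sketch, not the actual code in `dp.py`), the loss rewards the policy for widening its log-probability margin between the chosen and rejected responses relative to a frozen reference model:

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Scalar sketch of the DPO objective.

    Each argument is the summed log-probability of a response under
    the trainable policy or the frozen reference model.
    """
    pi_margin = policy_chosen_lp - policy_rejected_lp
    ref_margin = ref_chosen_lp - ref_rejected_lp
    # -log sigmoid(beta * (margin difference)): the loss shrinks as the
    # policy prefers the chosen response more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-beta * (pi_margin - ref_margin))))
```

In practice the margins are computed per batch over token-level log-probs, but the shape of the objective is the same.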
This folder contains scripts for synthetic data generation and dataset pruning:
- `generate_data.py`: Generates synthetic data using pre-trained models; the three source datasets live in `generating_data/qsall`.
- `runcc.py`, `ev_math.py`, `runleet.py`: Filter out failing examples from each of the generated datasets to improve training quality.
- `mkjson.py`: Prepares synthetic data into a structured format compatible with the fine-tuning pipeline.
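The pruning step can be sketched as follows. This is a hypothetical illustration (the field names `solution` and `test` are assumptions, not taken from the repo): each generated sample is executed against its test snippet in a subprocess, and only passing samples are kept.

```python
import subprocess
import sys
import tempfile

def passes(sample: dict, timeout: float = 5.0) -> bool:
    """Run a generated solution against its test snippet in a subprocess.

    `sample` is assumed (hypothetically) to hold 'solution' and 'test'
    fields of Python source; a non-zero exit code or timeout means failure.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(sample["solution"] + "\n" + sample["test"])
        path = f.name
    try:
        return subprocess.run([sys.executable, path], timeout=timeout,
                              capture_output=True).returncode == 0
    except subprocess.TimeoutExpired:
        return False

def prune(samples):
    # Keep only samples whose solution passes its own test.
    return [s for s in samples if passes(s)]
```

The real pruning scripts apply domain-specific checks (math answers, LeetCode tests, etc.), but the filter-by-execution idea is the same.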
To train a model using SFT:
```
python finetune/sft_math_or_code.py --model_name google/gemma-2-2b-it --train_data_path data/math2b_4k.json --eval_data_path data/math2bev.json --learning_rate 1e-5 --output_dir out --proj myproj --math True
```

To train a model using DPO:

```
python finetune/dp.py --model_name_or_path google/gemma-2-2b-it --train_data_path data/dpo_cc_train.json --eval_data_path data/dpo_cc_eval.json --learning_rate 1e-5 --output_dir output --proj myproj
```

To generate domain-specific synthetic data, include only one of the `--cc`, `--math`, or `--leetcode` flags (the example below lists all three dataset paths for reference):

```
python generate_data.py --model_name google/gemma-2-2b-it --num_samples 10 --cc qsall/train6k.json --leetcode qsall/lctrain.json --math qsall/gsm8ktrain.json --output_dir samples
```

Ensure you have the required dependencies installed before running any scripts:

```
pip install -r requirements.txt
```

- Fine-tuning scripts support multiple models (e.g., LLaMA, Gemma).
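The exact schema of the JSON training files is defined by the scripts themselves; as a purely hypothetical illustration (field names assumed, not taken from the repo), an SFT record might pair a prompt with a target completion, while a DPO record pairs one prompt with a chosen and a rejected response:

```python
import json

# Hypothetical record shapes; the real field names are determined by
# sft_math_or_code.py / dp.py and may differ.
sft_record = {
    "prompt": "What is 12 * 7?",
    "completion": "12 * 7 = 84. The answer is 84.",
}
dpo_record = {
    "prompt": "What is 12 * 7?",
    "chosen": "12 * 7 = 84. The answer is 84.",
    "rejected": "12 * 7 = 74. The answer is 74.",
}

# Training files are lists of such records serialized as JSON.
with open("train.json", "w") as f:
    json.dump([sft_record], f, indent=2)
```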
- Synthetic data generation uses LLMs with specific prompt engineering techniques to improve quality.
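One common prompt-engineering technique for synthetic data generation is few-shot prompting: worked examples are prepended so the model imitates their reasoning format. The template below is a hypothetical sketch, not the actual prompts used in `generate_data.py`:

```python
# Hypothetical few-shot examples for math-data generation; the actual
# prompts used by the generation scripts may differ.
FEW_SHOT = [
    ("What is 3 + 4?", "3 + 4 = 7. The answer is 7."),
    ("A book costs $5. How much do 3 books cost?",
     "3 * $5 = $15. The answer is 15."),
]

def build_prompt(question: str) -> str:
    # Show worked examples first so the model imitates the answer format,
    # then leave the final "A:" open for the model to complete.
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in FEW_SHOT)
    return f"{shots}\n\nQ: {question}\nA:"
```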