caiasprojects/SyntheticDataTraining

Overview

This repository contains scripts and pipelines for training and fine-tuning models, as well as generating and pruning synthetic datasets. The project is organized into two main folders:

  1. finetune/ - Contains scripts for fine-tuning models using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO).
  2. generating_data/ - Contains scripts for generating synthetic datasets, pruning them, and preparing them for model training.

Folder Structure

finetune/

This folder includes training scripts that focus on optimizing models using different methodologies:

  • SFT (sft_math_or_code.py): Implements supervised fine-tuning, training the model on labeled math/code datasets.
  • DPO (dp.py): Implements Direct Preference Optimization, which fine-tunes models on paired positive/negative (chosen vs. rejected) examples.
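DPO trains on preference pairs rather than single labels: for each prompt it pushes the policy's log-probability of the chosen response above the rejected one, relative to a frozen reference model. A minimal sketch of the per-pair loss (the function name and the beta default are illustrative, not taken from dp.py):

```python
import math

def dpo_pair_loss(policy_chosen_lp, policy_rejected_lp,
                  ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    All arguments are summed log-probabilities of the chosen/rejected
    responses under the policy and the frozen reference model.
    """
    margin = beta * ((policy_chosen_lp - ref_chosen_lp)
                     - (policy_rejected_lp - ref_rejected_lp))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the policy favors the chosen response more strongly
# than the reference model does; with no margin it equals log(2).
```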

generating_data/

This folder contains scripts for synthetic data generation and dataset pruning:

  • generate_data.py: Generates synthetic data using pre-trained models together with the three seed datasets in generating_data/qsall/.
  • runcc.py, ev_math.py, runleet.py: Filter out failing examples from each of the generated datasets to improve training quality.
  • mkjson.py: Prepares synthetic data into a structured format compatible with the fine-tuning pipeline.
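The pruning step keeps only generated samples whose code actually passes its test cases. The exact checks in runcc.py, ev_math.py, and runleet.py are not shown here; the sketch below illustrates the idea for code samples (function names and record fields are hypothetical):

```python
def passes_tests(code: str, test_cases: list, entry_point: str) -> bool:
    """Execute a generated solution and check every (args, expected) pair."""
    namespace: dict = {}
    try:
        exec(code, namespace)          # run the candidate solution
        fn = namespace[entry_point]
        return all(fn(*args) == expected for args, expected in test_cases)
    except Exception:
        return False                   # any crash counts as a failure

def prune(samples: list) -> list:
    """Keep only samples whose 'code' passes their 'tests'."""
    return [s for s in samples
            if passes_tests(s["code"], s["tests"], s["entry_point"])]

good = {"code": "def add(a, b):\n    return a + b",
        "tests": [((1, 2), 3), ((0, 0), 0)], "entry_point": "add"}
bad = {"code": "def add(a, b):\n    return a - b",
       "tests": [((1, 2), 3)], "entry_point": "add"}
kept = prune([good, bad])   # only the passing sample survives
```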

Usage

Fine-Tuning

To train a model using SFT:

python finetune/sft_math_or_code.py --model_name google/gemma-2-2b-it --train_data_path data/math2b_4k.json --eval_data_path data/math2bev.json --learning_rate 1e-5 --output_dir out --proj myproj --math True
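The schema of data/math2b_4k.json is not documented here; SFT pipelines commonly consume prompt/completion records like the sketch below (the field names and content are an assumption, not read from the actual files):

```python
import json

# Hypothetical SFT record layout -- adjust field names to match the real files.
records = [
    {"prompt": "Q: What is 12 * 7?\nA:",
     "completion": " 12 * 7 = 84. The answer is 84."},
]

# Round-trip through JSON, as the training script would read it from disk.
blob = json.dumps(records, indent=2)
parsed = json.loads(blob)
```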

To train a model using DPO:

python finetune/dp.py --model_name_or_path google/gemma-2-2b-it --train_data_path data/dpo_cc_train.json --eval_data_path data/dpo_cc_eval.json --learning_rate 1e-5 --output_dir output --proj myproj
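The DPO data files (e.g. data/dpo_cc_train.json) should contain preference pairs. The prompt/chosen/rejected field names below follow a common convention and are an assumption, not taken from dp.py:

```python
import json

# Hypothetical preference-pair layout: one chosen and one rejected
# completion per prompt.
pairs = [
    {"prompt": "Write a function that reverses a string.",
     "chosen": "def rev(s):\n    return s[::-1]",
     "rejected": "def rev(s):\n    return s"},
]

blob = json.dumps(pairs, indent=2)
parsed = json.loads(blob)
```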

Generating Synthetic Data

To generate domain-specific synthetic data, include only one of the --cc, --math, or --leetcode flags per run, e.g. for code-contest data:

python generate_data.py --model_name google/gemma-2-2b-it --num_samples 10 --cc qsall/train6k.json --output_dir samples

(Use --math qsall/gsm8ktrain.json or --leetcode qsall/lctrain.json for the other domains.)

Dependencies

Ensure you have the required dependencies installed before running any scripts:

pip install -r requirements.txt

Notes

  • Fine-tuning scripts support multiple models (e.g., LLaMA, Gemma).
  • Synthetic data generation uses LLMs with specific prompt engineering techniques to improve quality.
