Llamdex provides a complete pipeline for customizing large language models to structured data domains. This release bundles training code, preprocessing utilities, analysis scripts, and reproducible baseline implementations so you can reproduce our experiments or adapt the workflow to new datasets.
baseline/            Reference implementations for comparison baselines
  dp-opt/            Differentially private OPT baseline with training + sweeps
  mink-plus-plus/    External MINK++ implementation (git submodule)
  MICO/              Model-based inference baseline (git submodule)
  tablediffusion/    Differentially private diffusion models for tables
conf/                YAML configuration presets for end-to-end runs
src/                 Llamdex source code
  analysis/          Plotting, reporting, and DP evaluation utilities
  dataset/           Downloaders, metadata, and synthetic data recipes
  evaluate/          Model evaluation entry points
  fine_tune/         LoRA fine-tuning utilities for Llama/Mistral backbones
  model/             Core model definitions and expert routing modules
  preprocess/        Data cleaning, feature generation, and text synthesis scripts
  script/            Orchestrated experiment runners for training and evaluation
  synthesis/         Synthetic data generation pipelines
  train/             Training entry scripts for different regimes
- Create a Python 3.10+ environment.
- Install the core dependencies:
pip install -r baseline/requirements.txt
pip install accelerate deepspeed peft sentencepiece torch torchvision torchaudio
pip install pandas scikit-learn tqdm transformers xgboost gorilla tensorboard
GPU support: use the PyTorch wheels that match your CUDA/ROCm stack as documented on pytorch.org.
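For example, a CUDA build can be installed through the PyTorch wheel index; the cu121 tag below is illustrative, so check pytorch.org for the command matching your actual CUDA/ROCm version:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121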
Download raw datasets using the scripts in src/dataset/:
bash src/dataset/bank_marketing/download.sh
bash src/dataset/titanic/download.sh
bash src/dataset/wine_quality/download.sh
bash src/dataset/nursery/download.sh
Each dataset follows the same preprocessing flow:
dataset=bank_marketing
python src/preprocess/clean/clean_${dataset}.py                        # clean the raw tables
python src/preprocess/syn/syn_${dataset}.py                            # generate synthetic records
python src/preprocess/expert/${dataset}_mlp.py                         # train the MLP expert model
python src/preprocess/gentext/gentext_dataset.py --dataset ${dataset}  # synthesize training text
Synthetic generation recipes (src/preprocess/syn/*.py) and expert models (src/preprocess/expert/*.py) can be customized per domain. To automate the pipeline for multiple datasets, run:
DATASETS="bank_marketing titanic" bash src/script/prepare_data.sh
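Internally, such a driver amounts to looping the four preprocessing steps over each dataset. A minimal sketch of that loop is shown below; the shipped prepare_data.sh may differ in details such as logging and error handling:
# Hypothetical sketch of the core loop in src/script/prepare_data.sh
for dataset in ${DATASETS:-bank_marketing titanic}; do
  python src/preprocess/clean/clean_${dataset}.py
  python src/preprocess/syn/syn_${dataset}.py
  python src/preprocess/expert/${dataset}_mlp.py
  python src/preprocess/gentext/gentext_dataset.py --dataset ${dataset}
done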
The release ships lean entry points in src/script/ that run sequentially without device-specific scheduling. Override DATASETS, SEEDS, or LAYERS to focus on particular runs:
bash src/script/train_llamdex.sh # Train Llamdex models
bash src/script/evaluate_llamdex.sh # Evaluate saved checkpoints
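For example, to restrict training to a single dataset and two seeds (the values below are illustrative):
DATASETS="titanic" SEEDS="0 1" bash src/script/train_llamdex.sh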
Adjust hyperparameters via environment variables (see the script headers) or edit the underlying trainers in src/train/.
Use src/analysis/ to regenerate plots and tables, including DP trade-off curves (plot_dp.py), expert ablations (plot_ablation_expert_weight.py), and membership inference studies (mia_mico_style.py). The utilities read logs produced by the runner scripts and emit publication-ready figures and LaTeX tables.
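For instance, once a run has produced logs, the DP trade-off figure can typically be regenerated with a direct invocation; consult each script's header for any required arguments, as none are assumed here:
python src/analysis/plot_dp.py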
- baseline/dp-opt: Differentially private OPT fine-tuning with ready-to-run sweep configurations.
- baseline/tablediffusion: DP table diffusion models with GAN, VAE, and SAINT back-ends.
- baseline/MICO and baseline/mink-plus-plus: External repositories vendored as submodules; initialize them with git submodule update --init --recursive before use.
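To fetch the submodules, either clone recursively or initialize them after a plain clone (replace the repository URL placeholder with the actual remote):
git clone --recursive <repo-url>         # fetches submodules during the clone
git submodule update --init --recursive  # or initialize after a plain clone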
If you use Llamdex in your research, please cite:
@article{wu2024model,
  title={Model-based Large Language Model Customization as Service},
  author={Wu, Zhaomin and Guo, Jizhou and Hou, Junyi and He, Bingsheng and Fan, Lixin and Yang, Qiang},
  journal={EMNLP},
  year={2025}
}
This release is distributed under the Apache License 2.0. By contributing or using the software, you agree to the terms of that license.