Training pipeline for the Luciole series of multilingual (French-English) large language models, part of the OpenLLM-France initiative.
This repository covers the full LLM lifecycle: data preparation, tokenizer training, pretraining, fine-tuning, and evaluation. It is designed to run on the Jean-Zay HPC cluster (NVIDIA H100 GPUs) via SLURM.
```
┌──────────────┐    ┌───────────────┐    ┌──────────────┐    ┌──────────────┐
│ 1. Data      │───▶│ 2. Tokenizer  │───▶│ 3. Tokenize  │───▶│ 4. Datamix   │
│ Processing   │    │ Training      │    │ Datasets     │    │ Creation     │
│ data/        │    │ tokenizer_    │    │ data/        │    │ pretrain/    │
│ processing/  │    │ train/        │    │ tokenization/│    │ datamix/     │
└──────────────┘    └───────────────┘    └──────────────┘    └──────┬───────┘
                                                                    │
                    ┌───────────────┐    ┌──────────────┐    ┌──────▼───────┐
                    │ 8. Evaluation │◀───│ 7. Convert   │◀───│ 5. Pretrain  │
                    │ evaluation/   │    │ to HF format │    │ pretrain/    │
                    └───────┬───────┘    │ pretrain/    │    │ train/       │
                            │           │ conversion/  │    └──────────────┘
                            │           └──────────────┘
                    ┌───────▼───────┐
                    │ 6. Fine-tune  │
                    │ (optional)    │
                    │ finetune/     │
                    └───────────────┘
```
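Stage 4 (datamix creation) assigns sampling weights to the tokenized datasets before pretraining. As a rough, hypothetical sketch of the underlying idea only (not the repository's actual datamix format or code, which lives in `pretrain/datamix/`), temperature-scaled sampling weights can be computed like this:

```python
def mixture_weights(token_counts, temperature=0.7):
    """Compute sampling weights for a data mixture.

    Raising token counts to a power < 1 upweights small datasets
    relative to pure proportional sampling.
    """
    scaled = {name: count ** temperature for name, count in token_counts.items()}
    total = sum(scaled.values())
    return {name: s / total for name, s in scaled.items()}

# Hypothetical per-dataset token counts (in billions of tokens).
counts = {"fineweb2_fr": 300, "fineweb_en": 500, "starcoder": 100}
weights = mixture_weights(counts)
```

With `temperature=1.0` this reduces to proportional sampling; lower values flatten the mixture toward uniform.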
| Directory | Description | Documentation |
|---|---|---|
| `data/processing/` | 50+ dataset-specific preprocessing scripts (FineWeb, DCLM, Gallica, StarCoder, ...) | `data/README.md` |
| `data/tokenization/` | Tokenize processed datasets and compute statistics | `data/README.md` |
| `data/synthetic_generation/` | Generate synthetic training data using LLMs | `data/synthetic_generation/README.md` |
| `tokenizer_train/` | Train and evaluate custom BPE tokenizers | `tokenizer_train/README.md` |
| `pretrain/train/` | Distributed pretraining via NeMo + SLURM | `pretrain/README.md` |
| `pretrain/conversion/` | Convert NeMo checkpoints to HuggingFace format | `pretrain/conversion/README.md` |
| `finetune/` | SFT and RL fine-tuning with NeMo | `finetune/README.md` |
| `evaluation/` | Benchmark evaluation with lighteval | `evaluation/README.md` |
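The `tokenizer_train/` stage trains BPE tokenizers with the repository's own tooling; as a self-contained illustration of the merge-learning idea behind BPE (not the repository's implementation, and with a toy corpus), the classic algorithm looks like this:

```python
from collections import Counter

def learn_bpe(corpus_words, num_merges):
    """Learn BPE merges from a {word: frequency} dict (à la Sennrich et al.)."""
    # Represent each word as a tuple of symbols, starting from characters.
    vocab = {tuple(word): freq for word, freq in corpus_words.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge everywhere it occurs.
        merged = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges

merges = learn_bpe({"low": 5, "lower": 2, "lowest": 2}, num_merges=3)
```

The real tokenizers are trained on far larger corpora and with byte-level pre-tokenization; see `tokenizer_train/README.md`.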
- Access to Jean-Zay HPC cluster (or similar SLURM-based cluster with H100 GPUs)
- Set the following environment variables in your `~/.bash_profile`:

```bash
export OpenLLM_OUTPUT=/path/to/shared/output   # Shared output directory
export SLURM_ACCOUNT_GPU="your_account@h100"   # SLURM GPU account
export SLURM_ACCOUNT_CPU="your_account@cpu"    # SLURM CPU account
```

```bash
# 1. Process data
cd data/ && source set_env.sh
python processing/fineweb2 --local   # test locally
python processing/fineweb2           # run on SLURM

# 2. Tokenize
python tokenization/run_tokenization.py datasets.yaml $OpenLLM_OUTPUT/data/tokens \
    --tokenizer_name OpenLLM-BPI/tokenizer_128k-arab-regional_v2

# 3. Train
cd ../pretrain/train/
python slurm_launcher.py --output_dir $OpenLLM_OUTPUT/pretrain/my_exp \
    --arch nemotron1b --mode phase1 --datamix datamixes/luciole_phase1.json

# 4. Convert checkpoints to HuggingFace format
cd ../conversion/
sbatch convert.slurm $OpenLLM_OUTPUT/pretrain/my_exp

# 5. Evaluate
cd ../../evaluation/
python evaluate_experiment.py $OpenLLM_OUTPUT/pretrain/my_exp tasks/en.txt
```

Recipes in `pretrain/train/recipes/`:
| Family | Sizes |
|---|---|
| LLaMA | 1B, 3B, 8B, 21B, 24B, 70B |
| Mistral | 12B, Small 24B |
| Mixtral | 8x7B |
| Nemotron | 1B, 4B, 8B, 22B, 23B + variants |
| NemotronH | 8B, 47B (hybrid Mamba/Transformer) |
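The sizes above can be sanity-checked with a rough parameter-count formula for decoder-only transformers. The sketch below assumes a LLaMA-style architecture (grouped-query attention, gated MLP, untied embeddings); the configuration numbers are illustrative, not taken from the recipes:

```python
def transformer_params(vocab, d_model, n_layers, d_ff, n_heads, n_kv_heads,
                       tied_embeddings=False):
    """Approximate parameter count for a LLaMA-style decoder (ignores norms and biases)."""
    head_dim = d_model // n_heads
    d_kv = n_kv_heads * head_dim
    attn = 2 * d_model * d_model + 2 * d_model * d_kv  # Q, O and K, V projections
    mlp = 3 * d_model * d_ff                           # gate, up, down projections
    emb = vocab * d_model * (1 if tied_embeddings else 2)
    return n_layers * (attn + mlp) + emb

# Illustrative ~8B configuration (LLaMA-3-like numbers, used only as an example).
n = transformer_params(vocab=128256, d_model=4096, n_layers=32,
                       d_ff=14336, n_heads=32, n_kv_heads=8)
```

This lands at roughly 8B parameters for the example configuration; the exact recipe configs are in `pretrain/train/recipes/`.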
Each pipeline stage uses its own environment. See the README in each subdirectory for setup instructions:
- Data processing: `data/README.md` — conda `datatrove-env` (Python 3.10) + datatrove fork
- Pretraining: `pretrain/README.md` — NeMo module on Jean-Zay + `zarr`
- Fine-tuning: `finetune/README.md` — NeMo-RL with `uv`
- Evaluation: `evaluation/README.md` — conda `eval-env` + lighteval fork
| Variable | Description |
|---|---|
| `OpenLLM_OUTPUT` | Shared output directory for all pipeline stages |
| `SLURM_ACCOUNT_GPU` | SLURM account for GPU jobs (e.g. `wuh@h100`) |
| `SLURM_ACCOUNT_CPU` | SLURM account for CPU jobs (e.g. `qgz@cpu`) |
| `HF_HOME` | HuggingFace cache directory |
| `HF_MODELS_CACHE` | Local mirror of HuggingFace models |
| `HF_DATASETS_MIRROR` | Local mirror of HuggingFace datasets |
```bash
pre-commit run --all-files

# Or directly:
ruff check . --fix
ruff format .
```

This project is licensed under the GNU General Public License v3 — see LICENSE for details.
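For reference, the Ruff hooks above can be wired into pre-commit with a minimal `.pre-commit-config.yaml` along these lines (a sketch; the `rev` pin is illustrative, so check the release you actually use):

```yaml
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.9   # illustrative pin
    hooks:
      - id: ruff
        args: [--fix]
      - id: ruff-format
```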