Code for the paper "Do Generalisation Results Generalise?", accepted to Findings of ACL 2026.
Authors: Matteo Boglioni¹², Andrea Sgobbi¹, Gabriel Tavernini¹, Francesco Rita¹, Marius Mosbach²³, Tiago Pimentel¹
¹ETH Zürich ²Mila – Québec AI Institute ³McGill University
Out-of-distribution (OOD) generalisation is typically measured by evaluating an LLM on a single OOD dataset. That's cheap, but it doesn't tell you whether the model's generalisation is a stable trait or an artefact of the dataset choice. This project asks whether OOD generalisation results generalise: does doing well on one OOD testset predict doing well on another?
We fine-tune three model families — OLMo2 (7B / 13B / 32B), OPT (2.7b / 6.7b / 13b / 30b), and SmolLM (3B) — on MNLI or SNLI, and evaluate them throughout training on eight NLI testsets (SNLI, MNLI, SciTail, RTE, HANS, ANLI, WNLI, PAWS). We then compute partial correlations between per-step performance on any pair of OOD testsets, regressing out in-domain performance. The headline finding: there is no overarching trend — the sign and magnitude of the correlation between any two OOD testsets depend strongly on which model you analyse.
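For intuition, the per-model analysis boils down to something like the following sketch (toy numbers, names like `partial_corr` are made up for illustration; the actual analysis lives in notebooks/wrapped_plotting.ipynb and may differ in detail):

```python
import numpy as np
from scipy import stats

def partial_corr(ood_a, ood_b, in_domain):
    """Correlate two OOD accuracy curves after regressing out in-domain accuracy."""
    X = np.column_stack([np.ones_like(in_domain), in_domain])
    # Residualise each OOD curve against the in-domain curve (OLS fit).
    resid_a = ood_a - X @ np.linalg.lstsq(X, ood_a, rcond=None)[0]
    resid_b = ood_b - X @ np.linalg.lstsq(X, ood_b, rcond=None)[0]
    return stats.pearsonr(resid_a, resid_b)

# Toy per-checkpoint accuracies for a single fine-tuning run.
mnli = np.array([0.55, 0.68, 0.74, 0.79, 0.82])  # in-domain
hans = np.array([0.50, 0.52, 0.58, 0.63, 0.61])  # OOD testset A
paws = np.array([0.46, 0.49, 0.47, 0.52, 0.55])  # OOD testset B
r, p = partial_corr(hans, paws, mnli)
```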
If your cluster uses environment modules, load the CUDA stack first:

```bash
module load stack/2024-06 cuda/11.8.0 python_cuda
```

Create a conda environment and install the dependencies:

```bash
conda create -n DL2024 python=3.11.10
conda activate DL2024
conda install cudatoolkit==11.8 -c nvidia
pip install torch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 --index-url https://download.pytorch.org/whl/cu118
conda install -c "nvidia/label/cuda-11.8.0" cuda-nvcc
conda install packaging
pip install -r requirements.txt
```

All cluster-specific paths and credentials have been factored out of the configs. Set these before running any training:
| Variable | Used for |
|---|---|
| `HF_CACHE_DIR` | HuggingFace models + datasets cache |
| `OUTPUT_DIR` | Trainer checkpoint / log output root |
| `WANDB_ENTITY` | wandb team/user (read natively by `wandb.init`) |
| `WANDB_PROJECT` | wandb project name (read natively by `wandb.init`) |
finetune.sh exports these at the top — edit that file (or override on the command line) to point them at your cluster paths and wandb project.
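If you launch outside of finetune.sh, a quick sanity check like the following (illustrative only, not part of the repository) fails fast when any of them is unset:

```python
import os

# Variables from the table above; fail early if any are missing.
required = ["HF_CACHE_DIR", "OUTPUT_DIR", "WANDB_ENTITY", "WANDB_PROJECT"]
missing = [v for v in required if not os.environ.get(v)]
if missing:
    raise SystemExit(f"Set these environment variables first: {', '.join(missing)}")
```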
```
.
├── configs/ # 144 TOML configs, {model}_{train_ds}_{seed}_{shot}.toml
│ ├── accelerate_{olmo,opt}.yaml
│ ├── olmo_{mnli,snli}_{0,42,99}_{32,64,128}.toml
│ ├── olmo13_... olmo32_... # same scheme for 13B / 32B
│ ├── opt_... opt6,7_... opt13_... opt30_...
│ └── smol_...
├── src/
│ ├── data/NLIDataset.py # HF dataset wrappers for the 8 NLI testsets
│ ├── models/{OLMOModel,OPTModel,SmolLMModel}.py
│ ├── train.py # DeepSpeed training entrypoint
│ ├── train_accelerate.py # Accelerate alternative
│ └── utils.py
├── notebooks/
│ ├── wrapped_plotting.ipynb # Reproduces every Overleaf figure/table
│ ├── cache.pkl # wandb run cache (auto-created)
│ └── assets/ # Plot output; mirrors the Overleaf assets/ layout
├── finetune.sh # SLURM launcher (DeepSpeed or Accelerate)
├── requirements.txt
└── README.md
```
Every TOML in `configs/` follows the naming scheme `{model}_{train_dataset}_{seed}_{few_shot}.toml`:

- `model` ∈ {`olmo`, `olmo13`, `olmo32`, `opt`, `opt6,7`, `opt13`, `opt30`, `smol`}
- `train_dataset` ∈ {`mnli`, `snli`}
- `seed` ∈ {`0`, `42`, `99`}
- `few_shot` ∈ {`32`, `64`, `128`}

This gives 8 × 2 × 3 × 3 = 144 files.
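To sanity-check the scheme, the full list of 144 filenames can be enumerated programmatically (an illustrative snippet, not part of the repository; model keys copied verbatim from the list above):

```python
from itertools import product

# Model keys, training sets, seeds, and few-shot sizes as listed above.
models = ["olmo", "olmo13", "olmo32", "opt", "opt6,7", "opt13", "opt30", "smol"]
datasets = ["mnli", "snli"]
seeds = [0, 42, 99]
shots = [32, 64, 128]

names = [f"{m}_{d}_{s}_{k}.toml" for m, d, s, k in product(models, datasets, seeds, shots)]
assert len(names) == 8 * 2 * 3 * 3 == 144
print(names[0])  # olmo_mnli_0_32.toml
```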
Edit `finetune.sh` to set `CONFIG` to the TOML you want and update the `#SBATCH --job-name` (used for the log filename), then submit:

```bash
sbatch finetune.sh
```

The script defaults to DeepSpeed on 4 GPUs; an Accelerate backend is available as a commented-out alternative. Wandb logging is on by default; set `WANDB_MODE=offline` (or `report_to='none'` in the TOML) to disable it.
For a local dry run without SLURM:
```bash
python src/train.py --config configs/smol_snli_0_128.toml
```

All figures and LaTeX tables in the paper are generated by a single notebook:

```bash
jupyter lab notebooks/wrapped_plotting.ipynb
```

Top-to-bottom execution produces `notebooks/assets/`, a directory laid out identically to the Overleaf `assets/` folder. The only slow cell is the wandb fetch, which caches to `notebooks/cache.pkl` and is skipped on subsequent runs (delete the file to force a refresh). See the markdown sections inside the notebook for a per-figure guide.
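The caching step follows a standard fetch-then-pickle pattern, roughly as below (a simplified sketch using the wandb public API; the notebook's actual code may differ):

```python
import os
import pickle

import wandb

CACHE = "notebooks/cache.pkl"

if os.path.exists(CACHE):
    # Reuse cached run histories; delete cache.pkl to force a fresh fetch.
    with open(CACHE, "rb") as f:
        histories = pickle.load(f)
else:
    api = wandb.Api()
    runs = api.runs(f"{os.environ['WANDB_ENTITY']}/{os.environ['WANDB_PROJECT']}")
    # One DataFrame of logged metrics per fine-tuning run.
    histories = {run.name: run.history() for run in runs}
    with open(CACHE, "wb") as f:
        pickle.dump(histories, f)
```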
If you use this code, please cite:
```bibtex
@misc{boglioni2025generalisationresultsgeneralise,
  title={Do Generalisation Results Generalise?},
  author={Matteo Boglioni and Andrea Sgobbi and Gabriel Tavernini and Francesco Rita and Marius Mosbach and Tiago Pimentel},
  year={2025},
  eprint={2512.07832},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2512.07832},
}
```