Code for the paper "Do Generalisation Results Generalise?", accepted to Findings of ACL 2026.
Authors: Matteo Boglioni¹², Andrea Sgobbi¹, Gabriel Tavernini¹, Francesco Rita¹, Marius Mosbach²³, Tiago Pimentel¹
¹ETH Zürich ²Mila – Québec AI Institute ³McGill University
Out-of-distribution (OOD) generalisation is typically measured by evaluating an LLM on a single OOD dataset. That's cheap, but it doesn't tell you whether the model's generalisation is a stable trait or an artefact of the dataset choice. This project asks whether OOD generalisation results generalise: does doing well on one OOD testset predict doing well on another?
We fine-tune three model families — OLMo2 (7B / 13B / 32B), OPT (2.7b / 6.7b / 13b / 30b), and SmolLM (3B) — on MNLI or SNLI, and evaluate them throughout training on eight NLI testsets (SNLI, MNLI, SciTail, RTE, HANS, ANLI, WNLI, PAWS). We then compute partial correlations between per-step performance on any pair of OOD testsets, regressing out in-domain performance. The headline finding: there is no overarching trend — the sign and magnitude of the correlation between any two OOD testsets depend strongly on which model you analyse.
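For intuition, the per-model analysis boils down to something like the following sketch (toy numbers, names like `partial_corr` are made up for illustration; the actual analysis lives in notebooks/wrapped_plotting.ipynb and may differ in detail):

```python
import numpy as np
from scipy import stats

def partial_corr(ood_a, ood_b, in_domain):
    """Correlate two OOD accuracy curves after regressing out in-domain accuracy."""
    X = np.column_stack([np.ones_like(in_domain), in_domain])
    # Residualise each OOD curve against the in-domain curve (OLS fit).
    resid_a = ood_a - X @ np.linalg.lstsq(X, ood_a, rcond=None)[0]
    resid_b = ood_b - X @ np.linalg.lstsq(X, ood_b, rcond=None)[0]
    return stats.pearsonr(resid_a, resid_b)

# Toy per-checkpoint accuracies for a single fine-tuning run.
mnli = np.array([0.55, 0.68, 0.74, 0.79, 0.82])  # in-domain
hans = np.array([0.50, 0.52, 0.58, 0.63, 0.61])  # OOD testset A
paws = np.array([0.46, 0.49, 0.47, 0.52, 0.55])  # OOD testset B
r, p = partial_corr(hans, paws, mnli)
```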
If your cluster uses environment modules, load the CUDA stack first:

```bash
module load stack/2024-06 cuda/11.8.0 python_cuda
```

Create a conda environment and install the dependencies:

```bash
conda create -n DL2024 python=3.11.10
conda activate DL2024
conda install cudatoolkit==11.8 -c nvidia
pip install torch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 --index-url https://download.pytorch.org/whl/cu118
conda install -c "nvidia/label/cuda-11.8.0" cuda-nvcc
conda install packaging
pip install -r requirements.txt
```

All cluster-specific paths and credentials have been factored out of the configs. Set these before running any training:
| Variable | Used for |
|---|---|
| `HF_CACHE_DIR` | HuggingFace models + datasets cache |
| `OUTPUT_DIR` | Trainer checkpoint / log output root |
| `WANDB_ENTITY` | wandb team/user (read natively by `wandb.init`) |
| `WANDB_PROJECT` | wandb project name (read natively by `wandb.init`) |
finetune.sh exports these at the top — edit that file (or override on the command line) to point them at your cluster paths and wandb project.
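If you launch outside of finetune.sh, a quick sanity check like the following (illustrative only, not part of the repository) fails fast when any of them is unset:

```python
import os

# Variables from the table above; fail early if any are missing.
required = ["HF_CACHE_DIR", "OUTPUT_DIR", "WANDB_ENTITY", "WANDB_PROJECT"]
missing = [v for v in required if not os.environ.get(v)]
if missing:
    raise SystemExit(f"Set these environment variables first: {', '.join(missing)}")
```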
```
.
├── configs/ # 144 TOML configs, {model}_{train_ds}_{seed}_{shot}.toml
│ ├── accelerate_{olmo,opt}.yaml
│ ├── olmo_{mnli,snli}_{0,42,99}_{32,64,128}.toml
│ ├── olmo13_... olmo32_... # same scheme for 13B / 32B
│ ├── opt_... opt6,7_... opt13_... opt30_...
│ └── smol_...
├── src/
│ ├── data/NLIDataset.py # HF dataset wrappers for the 8 NLI testsets
│ ├── models/{OLMOModel,OPTModel,SmolLMModel}.py
│ ├── train.py # DeepSpeed training entrypoint
│ ├── train_accelerate.py # Accelerate alternative
│ └── utils.py
├── notebooks/
│ ├── wrapped_plotting.ipynb # Reproduces every Overleaf figure/table
│ ├── cache.pkl # wandb run cache (auto-created)
│ └── assets/ # Plot output; mirrors the Overleaf assets/ layout
├── finetune.sh # SLURM launcher (DeepSpeed or Accelerate)
├── requirements.txt
└── README.md
```
Every TOML in `configs/` follows the naming scheme `{model}_{train_dataset}_{seed}_{few_shot}.toml`:

- `model` ∈ {`olmo`, `olmo13`, `olmo32`, `opt`, `opt6,7`, `opt13`, `opt30`, `smol`}
- `train_dataset` ∈ {`mnli`, `snli`}
- `seed` ∈ {`0`, `42`, `99`}
- `few_shot` ∈ {`32`, `64`, `128`}

This gives 8 × 2 × 3 × 3 = 144 files.
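To sanity-check the scheme, the full list of 144 filenames can be enumerated programmatically (an illustrative snippet, not part of the repository; model keys copied verbatim from the list above):

```python
from itertools import product

# Model keys, training sets, seeds, and few-shot sizes as listed above.
models = ["olmo", "olmo13", "olmo32", "opt", "opt6,7", "opt13", "opt30", "smol"]
datasets = ["mnli", "snli"]
seeds = [0, 42, 99]
shots = [32, 64, 128]

names = [f"{m}_{d}_{s}_{k}.toml" for m, d, s, k in product(models, datasets, seeds, shots)]
assert len(names) == 8 * 2 * 3 * 3 == 144
print(names[0])  # olmo_mnli_0_32.toml
```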
Edit `finetune.sh` to set `CONFIG` to the TOML you want and update the `#SBATCH --job-name` (used for the log filename), then submit:

```bash
sbatch finetune.sh
```

The script defaults to DeepSpeed on 4 GPUs; an Accelerate backend is available as a commented-out alternative. Wandb logging is on by default; set `WANDB_MODE=offline` (or `report_to='none'` in the TOML) to disable it.
For a local dry run without SLURM:
```bash
python src/train.py --config configs/smol_snli_0_128.toml
```

All figures and LaTeX tables in the paper are generated by a single notebook:

```bash
jupyter lab notebooks/wrapped_plotting.ipynb
```

Top-to-bottom execution produces `notebooks/assets/`, a directory laid out identically to the Overleaf `assets/` folder. The only slow cell is the wandb fetch, which caches to `notebooks/cache.pkl` and is skipped on subsequent runs (delete the file to force a refresh). See the markdown sections inside the notebook for a per-figure guide.
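The caching step follows a standard fetch-then-pickle pattern, roughly as below (a simplified sketch using the wandb public API; the notebook's actual code may differ):

```python
import os
import pickle

import wandb

CACHE = "notebooks/cache.pkl"

if os.path.exists(CACHE):
    # Reuse cached run histories; delete cache.pkl to force a fresh fetch.
    with open(CACHE, "rb") as f:
        histories = pickle.load(f)
else:
    api = wandb.Api()
    runs = api.runs(f"{os.environ['WANDB_ENTITY']}/{os.environ['WANDB_PROJECT']}")
    # One DataFrame of logged metrics per fine-tuning run.
    histories = {run.name: run.history() for run in runs}
    with open(CACHE, "wb") as f:
        pickle.dump(histories, f)
```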
If you use this code, please cite:
```bibtex
@misc{boglioni2025generalisationresultsgeneralise,
  title={Do Generalisation Results Generalise?},
  author={Matteo Boglioni and Andrea Sgobbi and Gabriel Tavernini and Francesco Rita and Marius Mosbach and Tiago Pimentel},
  year={2025},
  eprint={2512.07832},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2512.07832},
}
```