Do Generalisation Results Generalise?

Code for the paper "Do Generalisation Results Generalise?", accepted to Findings of ACL 2026.

Authors: Matteo Boglioni¹², Andrea Sgobbi¹, Gabriel Tavernini¹, Francesco Rita¹, Marius Mosbach²³, Tiago Pimentel¹

¹ETH Zürich    ²Mila – Québec AI Institute    ³McGill University

📄 arXiv:2512.07832


1. Overview

Out-of-distribution (OOD) generalisation is typically measured by evaluating an LLM on a single OOD dataset. That's cheap, but it doesn't tell you whether the model's generalisation is a stable trait or an artefact of the dataset choice. This project asks whether OOD generalisation results generalise: does doing well on one OOD testset predict doing well on another?

We fine-tune three model families — OLMo2 (7B / 13B / 32B), OPT (2.7b / 6.7b / 13b / 30b), and SmolLM (3B) — on MNLI or SNLI, and evaluate them throughout training on eight NLI testsets (SNLI, MNLI, SciTail, RTE, HANS, ANLI, WNLI, PAWS). We then compute partial correlations between per-step performance on any pair of OOD testsets, regressing out in-domain performance. The headline finding: there is no overarching trend — the sign and magnitude of the correlation between any two OOD testsets depends strongly on which model you analyse.
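The partial-correlation analysis can be sketched as follows. This is an illustrative reimplementation, not the paper's code: `partial_corr` and the toy accuracy series are hypothetical, and the real analysis runs over logged checkpoint evaluations rather than synthetic data.

```python
import numpy as np

def partial_corr(x, y, z):
    """Pearson correlation between x and y after regressing out z
    (ordinary least squares with an intercept) from both."""
    Z = np.column_stack([np.ones_like(z), z])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

# Toy data: per-step accuracy on two OOD testsets (x, y) and in-domain (z).
rng = np.random.default_rng(0)
z = np.linspace(0.5, 0.9, 50)             # in-domain accuracy over training
x = 0.8 * z + rng.normal(0, 0.01, 50)     # OOD testset A, driven mostly by z
y = 0.7 * z + rng.normal(0, 0.01, 50)     # OOD testset B, driven mostly by z

# Raw correlation is inflated by the shared in-domain trend; the partial
# correlation removes it, typically landing near zero for this toy data.
print(np.corrcoef(x, y)[0, 1], partial_corr(x, y, z))
```

The key point the toy data illustrates: two OOD curves can look strongly correlated simply because both track in-domain progress, which is why the paper regresses out in-domain performance first.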

2. Setup

On a cluster with environment modules, first load the CUDA toolchain:

```bash
module load stack/2024-06 cuda/11.8.0 python_cuda
```

Create a conda environment and install the dependencies:

```bash
conda create -n DL2024 python=3.11.10
conda activate DL2024
conda install cudatoolkit==11.8 -c nvidia
pip install torch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 --index-url https://download.pytorch.org/whl/cu118
conda install -c "nvidia/label/cuda-11.8.0" cuda-nvcc
conda install packaging
pip install -r requirements.txt
```

Environment variables

All cluster-specific paths and credentials have been factored out of the configs. Set these before running any training:

| Variable | Used for |
| --- | --- |
| `HF_CACHE_DIR` | HuggingFace models + datasets cache |
| `OUTPUT_DIR` | Trainer checkpoint / log output root |
| `WANDB_ENTITY` | wandb team/user (read natively by `wandb.init`) |
| `WANDB_PROJECT` | wandb project name (read natively by `wandb.init`) |

finetune.sh exports these at the top — edit that file (or override on the command line) to point them at your cluster paths and wandb project.
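For example, with hypothetical values that you should replace with your own storage paths and wandb account:

```shell
# Hypothetical values: point these at your own storage and wandb account.
export HF_CACHE_DIR="/scratch/$USER/hf_cache"   # HuggingFace models + datasets cache
export OUTPUT_DIR="/scratch/$USER/runs"         # trainer checkpoints and logs
export WANDB_ENTITY="my-team"                   # wandb team or username
export WANDB_PROJECT="generalisation-nli"       # wandb project name
```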

3. Repository structure

```
.
├── configs/                       # 144 TOML configs, {model}_{train_ds}_{seed}_{shot}.toml
│   ├── accelerate_{olmo,opt}.yaml
│   ├── olmo_{mnli,snli}_{0,42,99}_{32,64,128}.toml
│   ├── olmo13_...  olmo32_...    # same scheme for 13B / 32B
│   ├── opt_...  opt6,7_...  opt13_...  opt30_...
│   └── smol_...
├── src/
│   ├── data/NLIDataset.py         # HF dataset wrappers for the 8 NLI testsets
│   ├── models/{OLMOModel,OPTModel,SmolLMModel}.py
│   ├── train.py                   # DeepSpeed training entrypoint
│   ├── train_accelerate.py        # Accelerate alternative
│   └── utils.py
├── notebooks/
│   ├── wrapped_plotting.ipynb     # Reproduces every Overleaf figure/table
│   ├── cache.pkl                  # wandb run cache (auto-created)
│   └── assets/                    # Plot output; mirrors the Overleaf assets/ layout
├── finetune.sh                    # SLURM launcher (DeepSpeed or Accelerate)
├── requirements.txt
└── README.md
```

Config naming

Every TOML in configs/ follows

```
{model}_{train_dataset}_{seed}_{few_shot}.toml
```

- `model`: olmo, olmo13, olmo32, opt, opt6,7, opt13, opt30, smol
- `train_dataset`: mnli, snli
- `seed`: 0, 42, 99
- `few_shot`: 32, 64, 128

This gives 8 × 2 × 3 × 3 = 144 files.
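As a sanity check, the full grid can be enumerated from the naming scheme; this is a sketch, not a helper shipped in the repo:

```python
from itertools import product

# Enumerate the expected config file names from the naming scheme.
models   = ["olmo", "olmo13", "olmo32", "opt", "opt6,7", "opt13", "opt30", "smol"]
datasets = ["mnli", "snli"]
seeds    = [0, 42, 99]
shots    = [32, 64, 128]

configs = [f"{m}_{d}_{s}_{k}.toml"
           for m, d, s, k in product(models, datasets, seeds, shots)]
print(len(configs))   # 144
print(configs[0])     # olmo_mnli_0_32.toml
```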

4. Training

Edit finetune.sh to set CONFIG to the TOML you want and update the #SBATCH --job-name (used for the log filename), then submit:

```bash
sbatch finetune.sh
```

The script defaults to DeepSpeed on 4 GPUs; an Accelerate backend is available as a commented alternative. Wandb logging is on by default — set WANDB_MODE=offline (or report_to='none' in the TOML) to disable.

For a local dry run without SLURM:

```bash
python src/train.py --config configs/smol_snli_0_128.toml
```

5. Reproducing the paper's figures and tables

All figures and LaTeX tables in the paper are generated by a single notebook:

```bash
jupyter lab notebooks/wrapped_plotting.ipynb
```

Top-to-bottom execution produces notebooks/assets/ — a directory laid out identically to the Overleaf assets/ folder. The only slow cell is the wandb fetch, which caches to notebooks/cache.pkl and is skipped on subsequent runs (delete the file to force a refresh). See the markdown sections inside the notebook for a per-figure guide.

6. Citation

If you use this code, please cite:

```bibtex
@misc{boglioni2025generalisationresultsgeneralise,
      title={Do Generalisation Results Generalise?},
      author={Matteo Boglioni and Andrea Sgobbi and Gabriel Tavernini and Francesco Rita and Marius Mosbach and Tiago Pimentel},
      year={2025},
      eprint={2512.07832},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.07832},
}
```
