MOSAIC: Multilingual, Taxonomy-Agnostic, and Computationally Efficient Radiological Report Classification

MOSAIC is a framework for efficient radiological report classification that is:

🌐 Multilingual: Works across different languages
🎯 Taxonomy-Agnostic: Adapts to various classification schemes
⚡ Computationally Efficient: Optimized for resource usage

What’s in this repo

mosaic/core/finetune.py: Unsloth + TRL SFT training with optional LoRA, early stopping, and WANDB logging.
mosaic/core/inference.py: VLLM-based evaluation that scores predictions with weighted F1 and saves CSVs.
mosaic/core/translate.py: Translate report datasets into multiple languages using VLLM.
mosaic/core/perplexity.py: Perplexity utilities for MOSAIC and SIB-200 corpora.
mosaic/core/preprocess_data.py: CLI wrappers that mirror the notebook preprocessing blocks.
config/: Model, dataset, and experiment settings (see config/exp/README.md for experiment tags).
scripts/: Thin shell wrappers to run training/evaluation with sensible defaults.

Setup

conda env create -f environment.yaml
conda activate mosaic
pip install -e .

Update config/paths.yaml (the default base path is /home/alice/work/mosaic) and config/datasets.yaml so that the paths point to your local Hugging Face datasets, i.e. directories that datasets.load_from_disk can read.

Data layout

Each dataset entry in config/datasets.yaml should point to a datasets-format directory containing train/, val/, and test/ splits.
Splits are expected to include report (text), labels (stringified dict of finding → class), classes (list of class ids), findings (list of strings), and optional fs_examples for few-shot prompts.
To recreate datasets from raw files, use the notebook-aligned CLI:
```
python -m mosaic.core.preprocess_data --function mimic \
  --input-dir /path/to/raw/data \
  --output-dir data/mimic
```
Available functions include mimic, padchest, casia, danskcxr, reflacx, and variants listed inside mosaic/core/preprocess_data.py.
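
The record layout described above can be checked with a few lines of Python. This is a minimal sketch, not code from the repo: the example row is made up, parse_labels is a hypothetical helper, and it assumes the stringified labels dict is a Python literal with integer class ids. If your datasets store JSON strings instead, json.loads would be the analogous call.

```python
import ast

# Hypothetical example row: field names follow the README, values are made up.
record = {
    "report": "No focal consolidation. Small left pleural effusion.",
    "labels": "{'Pleural Effusion': 1, 'Consolidation': 0}",
    "classes": [0, 1],
    "findings": ["Pleural Effusion", "Consolidation"],
}

REQUIRED_FIELDS = ("report", "labels", "classes", "findings")


def parse_labels(row):
    """Parse the stringified labels dict and check it against findings/classes."""
    missing = [name for name in REQUIRED_FIELDS if name not in row]
    if missing:
        raise KeyError(f"missing fields: {missing}")
    # Assumption: labels is a Python-literal string, so literal_eval can read it.
    labels = ast.literal_eval(row["labels"])
    if not isinstance(labels, dict):
        raise ValueError("labels must decode to a dict")
    unknown_findings = set(labels) - set(row["findings"])
    if unknown_findings:
        raise ValueError(f"unknown findings in labels: {sorted(unknown_findings)}")
    bad_classes = {value for value in labels.values() if value not in row["classes"]}
    if bad_classes:
        raise ValueError(f"unknown classes in labels: {sorted(bad_classes)}")
    return labels


parsed = parse_labels(record)
print(parsed)
```

Running a check like this on a handful of rows per split before training may catch mismatches between labels, findings, and classes early.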

Training

You can call the module directly or use scripts/run_training.sh (which activates the mosaic conda env and checks data paths).

# --model_name: key from config/models.yaml
# --config_tag: experiment folder in config/exp/
# --train_dataset_names: space-separated keys from config/datasets.yaml
# --project_name: set to a WANDB project name to enable logging
python -m mosaic.core.finetune \
  --model_name medgemma-4b \
  --config_tag m \
  --train_dataset_names "mimic" \
  --valid_dataset_names "mimic" \
  --output_dir outputs \
  --project_name None

Outputs are saved under <output_dir>/models/<experiment_name>/ and checkpoints under <output_dir>/checkpoints/<experiment_name>/.
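
If you need to locate these folders from your own scripts, the layout above can be reproduced with pathlib. This is only a sketch: output_paths is a hypothetical helper, and the experiment name used below is made up, since how the repo derives <experiment_name> is not described here.

```python
from pathlib import Path


def output_paths(output_dir, experiment_name):
    """Return (model_dir, checkpoint_dir) following the layout described above."""
    base = Path(output_dir)
    return base / "models" / experiment_name, base / "checkpoints" / experiment_name


# "my_experiment" is a placeholder, not a name produced by the training script.
model_dir, checkpoint_dir = output_paths("outputs", "my_experiment")
print(model_dir.as_posix())
print(checkpoint_dir.as_posix())
```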

Evaluation and inference

Evaluation runs via VLLM and writes prediction/eval CSVs.

# --model_name: same key used for training
# --zeroshot: one of zeroshot | fewshot | ft-fewshot | off
# --models_folder: where your trained models live
python -m mosaic.core.inference \
  --model_name medgemma-4b \
  --zeroshot off \
  --train_dataset_names "mimic" \
  --test_dataset_names "mimic padchest_EN" \
  --models_folder outputs \
  --output_dir outputs/eval \
  --experiment_tag _m

The --zeroshot flag selects the evaluation mode: zeroshot and fewshot run the base model, ft-fewshot adds few-shot prompts to a fine-tuned model, and off uses the fine-tuned checkpoints as-is.
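
For readers unfamiliar with the metric, weighted F1 is the per-class F1 score averaged with weights proportional to each class's number of true instances. The sketch below is a plain-Python illustration of that standard definition, not the code in mosaic/core/inference.py, which may rely on a library and may aggregate per finding; the function name and the toy labels are assumptions.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    support = Counter(y_true)  # number of true instances per class
    total = len(y_true)
    score = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        n_pred = sum(1 for p in y_pred if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n_true / total) * f1
    return score


# Toy labels for illustration only.
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0, 2]
print(round(weighted_f1(y_true, y_pred), 4))  # 0.6667
```

Because each class is weighted by its support, frequent classes dominate the score, which is worth keeping in mind when comparing results across datasets with different label distributions.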

Translate datasets

python -m mosaic.core.translate \
  --model_tag medgemma-4b \
  --dataset_name mimic \
  --models_folder outputs \
  --output_folder data/mimic_translated \
  --source_language eng_Latn \
  --target_languages "dan_Latn spa_Latn fra_Latn" \
  --train_only True

Languages and VLLM sampling defaults come from config/languages.yaml and config/vllm.yaml.
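
The language codes in the command above appear to follow the <language>_<script> pattern (for example dan_Latn), and --target_languages takes them as one space-separated string. The sketch below shows one way to split and validate such a string; parse_language_codes is a hypothetical helper, not part of mosaic/core/translate.py, and the set of codes actually accepted is whatever config/languages.yaml defines.

```python
def parse_language_codes(arg):
    """Split a space-separated string of codes into (language, script) pairs."""
    pairs = []
    for code in arg.split():
        language, sep, script = code.partition("_")
        if not sep or not language or not script:
            raise ValueError(f"expected <language>_<script>, got {code!r}")
        pairs.append((language, script))
    return pairs


targets = parse_language_codes("dan_Latn spa_Latn fra_Latn")
print(targets)
```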

Perplexity utilities

python -m mosaic.core.perplexity \
  -m unsloth/gemma-3-4b-it \
  -d mosaic \
  -o outputs/ppl \
  --debug True

Use -d sib for the SIB-200 benchmark. Results are saved as CSVs.
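
As a reminder of what is being measured, perplexity is the exponential of the negative mean token log-probability. The sketch below illustrates that standard formula on made-up numbers; it is not the implementation in mosaic/core/perplexity.py, and it assumes natural-log probabilities for each token are already available.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean of natural-log token probabilities)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Made-up log-probabilities for a four-token sequence.
logprobs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
print(perplexity(logprobs))  # about 2.83, i.e. 2 ** 1.5
```

Lower values mean the model assigns higher probability to the text.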

Citation

If you use MOSAIC in your research, please cite:

@misc{schiavone2025mosaicmultilingualtaxonomyagnosticcomputationally,
      title={MOSAIC: A Multilingual, Taxonomy-Agnostic, and Computationally Efficient Approach for Radiological Report Classification}, 
      author={Alice Schiavone and Marco Fraccaro and Lea Marie Pehrson and Silvia Ingala and Rasmus Bonnevie and Michael Bachmann Nielsen and Vincent Beliveau and Melanie Ganz and Desmond Elliott},
      year={2025},
      eprint={2509.04471},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.04471}, 
}

License

This project is licensed under the MIT License - see the LICENSE file for details.
