License: This repository is licensed under the MIT License.
SlideFlame-Vanilla is a Flamingo-inspired [4] vision-language model tailored for digital pathology. It integrates a pretrained language model (BioGPT-Large) with visual context from whole-slide image (WSI) features using gated cross-attention layers.
We implement a vision-language architecture inspired by recent models such as PRISM [1] and HistoGPT [2]. A pretrained language model (BioGPT) [5] is augmented with cross-attention layers to receive context from WSI-derived image features.
Rather than using raw image pixels, we extract patch-level features using the CONCHv1.5 [3] encoder. These are processed in a multiple instance learning (MIL) setup before being passed to the language model.
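The README does not spell out which MIL aggregator is used; one common choice for pooling patch-level features into a slide-level embedding is attention-based MIL pooling (the function and weight names below are illustrative, not the repo's actual code):

```python
import numpy as np

def attention_mil_pool(patch_feats, w, v):
    """Attention-based MIL pooling over patch features.

    This is a generic sketch of one common aggregator; the repo's
    actual MIL setup may differ.
    patch_feats: (n_patches, d) array of encoder features.
    """
    # Per-patch attention scores: a_i = w^T tanh(V h_i)
    scores = np.tanh(patch_feats @ v.T) @ w       # (n_patches,)
    # Softmax over patches (numerically stabilized)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Slide-level embedding: attention-weighted sum of patch features
    return weights @ patch_feats                  # (d,)

rng = np.random.default_rng(0)
d, hidden = 8, 4
feats = rng.normal(size=(16, d))   # e.g. 16 patch features from the encoder
v = rng.normal(size=(hidden, d))
w = rng.normal(size=hidden)
slide_emb = attention_mil_pool(feats, w, v)
```

The slide-level embedding produced this way is what the language model then attends to through the cross-attention layers.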
- Learnable gates: We retain the gated cross-attention modules (i.e., `attn_gate`, `ff_gate`) from Flamingo. Unlike the original Flamingo implementation, we initialize `attn_gate` to 0.55, allowing partial vision-language interaction at the start of training.
- Custom parameter grouping: Gated parameters are trained with a separate learning rate (`gate_lr`) using a custom optimizer grouping strategy.
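The two mechanisms above can be sketched as follows. The gating math (Flamingo-style `tanh` gates on the cross-attention residual) and the name-based parameter grouping are illustrative assumptions about how `attn_gate` and `gate_lr` interact, not the repo's exact implementation:

```python
import numpy as np

# Flamingo-style gated residual: y = x + tanh(attn_gate) * cross_attn_out.
# Flamingo initializes its gates to 0 (vision contributes nothing at first);
# here attn_gate starts at 0.55, so some vision signal flows immediately.
attn_gate = 0.55

def gated_residual(x, cross_attn_out, gate):
    return x + np.tanh(gate) * cross_attn_out

# Custom parameter grouping (illustrative): parameters whose names contain
# "gate" get their own learning rate gate_lr; all others use the base rate.
def build_param_groups(named_params, base_lr, gate_lr):
    gates  = [p for name, p in named_params if "gate" in name]
    others = [p for name, p in named_params if "gate" not in name]
    return [{"params": gates,  "lr": gate_lr},
            {"params": others, "lr": base_lr}]

named_params = [("layer0.attn_gate", attn_gate),
                ("layer0.ff_gate", 0.0),
                ("lm.weight", np.zeros(3))]
groups = build_param_groups(named_params, base_lr=1e-4, gate_lr=1e-3)
```

A list of such groups is exactly the shape that optimizers like `torch.optim.AdamW` accept, which is how a separate `gate_lr` can be applied without touching the rest of the model's schedule.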
```bash
git clone https://github.com/KatherLab/slideFlame_Vanilla.git
cd slideFlame_Vanilla
pip install .
```