Multimodal explainable AI framework combining CLIP and CNNs to reveal concept-level bias and make deep vision models interpretable.
Official implementation of the paper *Caption-Driven Explainability: Probing CNNs for Bias via CLIP*:
Patrick Koller¹
Amil Dravid² (also published as Amil V. Dravid)
Guido M. Schuster³
Aggelos K. Katsaggelos¹

¹Northwestern University | ²UC Berkeley | ³Eastern Switzerland University of Applied Sciences
🏔️ Presented at IEEE ICIP 2025, Anchorage, Alaska
Deep neural networks have transformed computer vision, achieving remarkable accuracy in recognition, detection, and classification tasks.
However, understanding why a network makes a specific decision remains one of the central challenges in AI.
This repository introduces a multimodal explainable AI (XAI) framework that bridges vision and language using OpenAI's CLIP.
Through a process called network surgery, it reveals the semantic concepts driving model predictions and exposes hidden biases within learned representations.
💡 Unlike pixel-based saliency methods, our approach:
- Explains what concept drives a prediction, not just where the model looked
- Identifies spurious correlations such as color or texture bias
- Provides quantitative insight into robustness and covariate shift

Conceptual overview: bridging CLIP and a standalone model to uncover the semantics behind decisions.
This repository contains:
- ✅ Full inference pipeline for caption-driven XAI
- ✅ CLIP-based probing utilities
- ✅ Network surgery implementation
- ✅ Bias visualization assets
- ✅ Example datasets & scripts
We integrate a standalone model to be explained (for example, a ResNet-50) into CLIP by aligning their activation maps.
CLIP’s text encoder then serves as a semantic probe, describing what the model has truly learned.
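To ground this, the snippet below exposes intermediate activation maps of both networks with PyTorch forward hooks. It is a minimal sketch, not the repository's pipeline: the layer pair (`layer3` on both sides), the input file name, and the shared preprocessing are illustrative assumptions.

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image
from torchvision.models import resnet50, ResNet50_Weights

device = "cuda" if torch.cuda.is_available() else "cpu"

# Standalone model to be explained, plus CLIP's ResNet-50 visual encoder.
standalone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2).to(device).eval()
clip_model, preprocess = clip.load("RN50", device=device)

activations = {}

def cache(name):
    """Return a forward hook that stores the layer's output under `name`."""
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# Illustrative layer pair; at 224x224 input both stages emit (N, 1024, 14, 14).
standalone.layer3.register_forward_hook(cache("standalone.layer3"))
clip_model.visual.layer3.register_forward_hook(cache("clip.layer3"))

# One shared preprocessing pipeline for brevity; in practice each model
# should see its own normalization.
image = preprocess(Image.open("example.png")).unsqueeze(0).to(device)
with torch.no_grad():
    standalone(image)
    clip_model.encode_image(image)

print({name: act.shape for name, act in activations.items()})
```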
- Network surgery – Swap correlated activation maps between the standalone model and CLIP
- Activation matching – Compute cross-layer correlations to identify equivalent feature spaces (both steps are sketched below)
- Caption-based inference – Use natural-language captions (e.g. “red digit”, “green digit”, “round shape”) to interpret dominant concepts
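One plausible reading of the first two steps in code, continuing the sketch above: the correlation measure and the greedy channel pairing are our illustrative choices, assuming both layers emit maps of identical shape; the repository may pair channels differently.

```python
import torch

def channel_correlation(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Pearson correlation between per-channel activation maps.
    a: (N, Ca, H, W), b: (N, Cb, H, W) -> (Ca, Cb)."""
    a = a.transpose(0, 1).reshape(a.shape[1], -1).float()  # (Ca, N*H*W)
    b = b.transpose(0, 1).reshape(b.shape[1], -1).float()  # (Cb, N*H*W)
    a = (a - a.mean(1, keepdim=True)) / (a.std(1, keepdim=True) + 1e-8)
    b = (b - b.mean(1, keepdim=True)) / (b.std(1, keepdim=True) + 1e-8)
    return (a @ b.T) / a.shape[1]

# Activation matching: for each CLIP channel, find the most correlated
# standalone channel (ideally estimated over a large probe batch).
corr = channel_correlation(activations["standalone.layer3"],
                           activations["clip.layer3"])
match = corr.argmax(dim=0)  # shape (C_clip,)

# Network surgery: overwrite CLIP's activation maps with the matched
# standalone maps, so CLIP's text side now "reads out" the standalone
# model's features. The standalone model must be run on the same batch
# first, so that its maps are cached.
def surgery(module, inputs, output):
    swapped = activations["standalone.layer3"][:, match]
    return swapped.to(output.dtype)

clip_model.visual.layer3.register_forward_hook(surgery)
```

Returning a tensor from a PyTorch forward hook replaces the module's output, which is what turns the matched channels into an actual surgery on CLIP's forward pass.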

Activation matching aligns internal feature spaces for interpretable concept fusion.
Both Grad-CAM and Caption-Driven XAI offer valuable insights, but they answer different questions.
| Method | Explains | Handles overlapping features | Quantitative concept analysis | Human-readable output |
|---|---|---|---|---|
| Grad-CAM | Spatial importance (where) | ❌ | ❌ | ❌ |
| Caption-Driven XAI | Conceptual semantics (what) | ✅ | ✅ | ✅ |
Grad-CAM highlights the region of attention, while Caption-Driven XAI uncovers the reason, bridging visual focus with linguistic meaning.
Quantitative concept analysis refers to measuring how strongly each linguistic concept (e.g. “red”, “round”) influences a model’s prediction, based on similarity in CLIP’s multimodal embedding space.
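Continuing the same sketch, such concept scores can be read off as CLIP similarities between the post-surgery image embedding and a set of candidate captions; the captions and the softmax scaling below are illustrative, not the paper's exact protocol.

```python
# Candidate concepts phrased as natural-language captions.
captions = ["a red digit", "a green digit", "a round shape"]
text = clip.tokenize(captions).to(device)

with torch.no_grad():
    # The surgery hook is active, so this embedding reflects the
    # standalone model's features; its layer3 maps for `image` were
    # cached by the earlier forward pass.
    image_features = clip_model.encode_image(image)
    text_features = clip_model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    scores = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for caption, score in zip(captions, scores[0].tolist()):
    print(f"{caption}: {score:.3f}")
```

A color-biased model would, for instance, concentrate most of its probability mass on “a red digit” even as the digit's shape varies.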
If you use this repository, please cite:
```bibtex
@inproceedings{koller2025captionxai,
  title={Caption-Driven Explainability: Probing CNNs for Bias via CLIP},
  author={Koller, Patrick and Dravid, Amil V. and Schuster, Guido M. and Katsaggelos, Aggelos K.},
  booktitle={IEEE International Conference on Image Processing (ICIP) – Satellite Workshop on Generative AI for World Simulations and Communications},
  year={2025},
  note={Preprint available at arXiv:2510.22035}
}
```

- 📄 arXiv preprint: https://arxiv.org/abs/2510.22035
- 🧪 Zenodo archive (v1.0.0): https://doi.org/10.5281/zenodo.17546054
- 👤 Personal website: https://patch0816.github.io
- 🎓 Google Scholar: https://scholar.google.com/citations?user=jMiy9HQAAAAJ&hl=en
This research was conducted at the AIM-IVPL Lab (Northwestern University),
in collaboration with UC Berkeley and OST/ICAI Switzerland.