This repository contains the resources, taxonomy, and data associated with our survey paper: "Bias in Large AI Models for Medicine and Healthcare: Survey and Challenges".
Large AI models (including LLMs, LVMs, and LMMs) are transforming healthcare, yet they risk perpetuating or amplifying medical biases. This project provides a comprehensive synthesis of 55 representative studies, organizing the literature into a clear taxonomy of bias, detection methods, and mitigation strategies.
Figure 1: An overview of bias in Large AI models for medicine and healthcare.
- Taxonomy: A dual taxonomy categorizing bias by Medical Scenarios (e.g., triage, education) and Clinical Specialties (e.g., cardiology, oncology).
- Resources: A structured index of Large AI Models and Datasets used in bias research.
- Methodology: A review of current techniques for Bias Detection (e.g., counterfactual testing) and Mitigation (pre-, in-, and post-processing).
- Future Directions: Identification of open problems such as the fairness-accuracy trade-off and global health inequities.
We categorize medical bias along two principal axes, Medical Scenarios and Clinical Specialties, to facilitate precise identification and mitigation. Along the first axis, we distinguish:
- Clinical Decision Support: Disparities in diagnostic reasoning or treatment planning.
- Patient Communication: Biased triage advice or health counseling via chatbots.
- Medical Documentation: Stereotypes or hallucinations in report generation and summarization.
- Medical Education: Misrepresentation in generated case vignettes or training materials.
Along the second axis, our survey covers biases identified in specific clinical specialties, including:
- 🫀 Cardiology
- 🫁 Pulmonology
- 🦀 Oncology
- 🦠 Infectious Disease
- 👁️ Ophthalmology
- 🧠 Mental Health & Psychiatry
Below are selected general-purpose models analyzed in the survey.
| Model Name | Family | Parameter Size | Open Source? |
|---|---|---|---|
| GPT-4 | GPT | ≥ 175B | No |
| GPT-3.5 | GPT | ≥ 175B | No |
| Claude-3.5 | Claude | ≥ 175B | No |
| Llama-3 | Llama | ≥ 175B | Yes |
| Qwen-2.5 | Qwen | ≥ 175B | Yes |
| DeepSeek-V3 | DeepSeek | ≥ 175B | Yes |
Below are selected medical-domain models analyzed in the survey.
| Model Name | Family | Parameter Size | Open Source? |
|---|---|---|---|
| Med-PaLM 2 | PaLM 2 | ≥ 175B | No |
| Meditron | Llama-2 | 70B-175B | Yes |
| PMC-LLaMA | LLaMA | 10B-70B | Yes |
| LLaVA-Med | LLaVA | 1B-10B | Yes |
| ClinicalBERT | BERT | < 1B | Yes |
We have compiled datasets across three modalities: Text, Image, and Multimodal.
- Text: MedQA, PubMedQA, MIMIC-IV, AMQA, BiasMD.
- Image: CheXpert, MIMIC-CXR, HAM10000, ODIR, Fitzpatrick17k.
- Multimodal: LLaVA-Med, ROCO, PMC-OA.
Current bias detection approaches pair controlled input generation with systematic evaluation:

- Input Generation: Creating synthetic patients or mutating existing clinical vignettes (e.g., changing "Male" to "Female").
- Evaluation Metrics:
  - Answer Consistency: Measuring robustness across demographic changes.
  - Fairness Metrics: Demographic Parity, Equalized Odds.
  - Human Expert Assessment: Physician review for complex scenarios.
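The detection pipeline above can be sketched end to end. The snippet below is a minimal illustration under stated assumptions, not any surveyed study's implementation: `model_triage` is a hypothetical stand-in for a real model API call, and the vignette template is invented for the example.

```python
import itertools

# Hypothetical stub standing in for a real LLM API call.
def model_triage(vignette: str) -> str:
    # Toy rule for illustration only; a real study would query a model here.
    return "urgent" if "chest pain" in vignette else "routine"

TEMPLATE = "A 54-year-old {sex} patient presents with chest pain and dyspnea."

def counterfactual_vignettes(template: str, values: list[str]) -> list[str]:
    """Mutate a single demographic slot to build counterfactual inputs."""
    return [template.format(sex=v) for v in values]

def answer_consistency(answers: list[str]) -> float:
    """Fraction of counterfactual pairs that receive the same answer."""
    pairs = list(itertools.combinations(answers, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

def demographic_parity_gap(group_answers: dict[str, list[str]], positive: str) -> float:
    """Largest difference in positive-outcome rate across demographic groups."""
    rates = [sum(a == positive for a in ans) / len(ans)
             for ans in group_answers.values()]
    return max(rates) - min(rates)

vignettes = counterfactual_vignettes(TEMPLATE, ["male", "female"])
answers = [model_triage(v) for v in vignettes]
print(answer_consistency(answers))  # 1.0 for this toy rule
```

Equalized Odds would additionally condition these rates on ground-truth outcomes, so it requires labeled cases rather than vignettes alone.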
Mitigation strategies span three stages of the model lifecycle:

- Pre-processing: Data augmentation and rebalancing before training.
- In-processing: Model fine-tuning (e.g., FairCLIP), loss function modification.
- Post-processing: Prompt engineering (Chain-of-Thought), output rewriting, and ensembling.
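As a concrete sketch of the pre-processing stage, the following oversamples under-represented groups so each value of a sensitive attribute appears equally often in the training set. This is a generic rebalancing sketch assuming tabular records with a demographic field; it is not tied to any dataset or method in the survey.

```python
import random

def rebalance(records: list[dict], attr: str, seed: int = 0) -> list[dict]:
    """Oversample minority groups so every value of `attr` is equally represented."""
    rng = random.Random(seed)
    groups: dict[str, list[dict]] = {}
    for r in records:
        groups.setdefault(r[attr], []).append(r)
    target = max(len(g) for g in groups.values())
    balanced: list[dict] = []
    for g in groups.values():
        balanced.extend(g)  # keep every original record
        # draw extra samples (with replacement) to reach the target count
        balanced.extend(rng.choices(g, k=target - len(g)))
    return balanced

data = [{"sex": "M"}, {"sex": "M"}, {"sex": "M"}, {"sex": "F"}]
print(len(rebalance(data, "sex")))  # 6: three of each group
```

Random oversampling is the simplest option; the augmentation approaches cited above would instead synthesize new counterfactual records rather than duplicate existing ones.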
Based on our analysis, we highlight the following urgent research directions:
- Unified Foundations: Defining "medical fairness" distinct from general AI fairness.
- Standardized Benchmarks: Moving beyond ad-hoc testing to rigorous, scalable benchmarks.
- Real-World Validation: Continuous monitoring of models in deployed clinical settings.
- Global Health Equity: Addressing the lack of representation for non-Western populations and languages.
- Fairness-Accuracy Trade-off: Investigating how debiasing affects diagnostic performance.
If you find this survey or repository helpful, please cite our work:
@article{xiao2025bias,
  title={Bias in Large AI Models for Medicine and Healthcare: Survey and Challenges},
  author={Xiao, Ying and Chen, Zhenpeng and Huang, Jen-tse and Chen, Wenting and Liu, Yepang and Li, Kezhi and Mousavi, Mohammadreza and Dobson, Richard and Zhang, Jie},
  year={2025}
}
}