Official code for the paper "Fusion Mamba for Explainable Speech-Based Detection of Mild Cognitive Impairment" (IEEE JCSSE 2026).
Key Finding: Linguistic features statistically dominate predictive performance across all corpora (mean attention weight 88.1% vs 11.9% for audio). Multimodal fusion's primary value is clinical interpretability — per-prediction modality attribution — rather than accuracy gains.
This repository implements a multimodal speech-based dementia detection pipeline combining:
- Whisper Large-v3 for clinical-grade automatic transcription with hallucination filtering
- Mamba-130M as a frozen linguistic encoder for efficient sequence modelling
- eGeMAPS v02 acoustic biomarkers (88 features) extracted via OpenSMILE
- Cross-modal attention fusion for per-prediction modality attribution
- Multi-level XAI suite with MC word-level perturbation, feature-category analysis, and FDR-corrected biomarker testing
Evaluated across three benchmark corpora: Pitt (DementiaBank), ADReSS 2020, and TAUKADIAL.
Full cross-corpus Weighted F1 scores across all 5 architectures and 4 training conditions (mean over 5-fold stratified CV grouped by participant ID). Bold = best per row.
| Training Source | Eval Target | Linguistic | Acoustic | Interaction | Gated Fusion | Attn. Fusion |
|---|---|---|---|---|---|---|
| Pitt | Pitt | 0.928 | 0.667 | 0.923 | 0.919 | **0.932** |
| Pitt | ADReSS | 0.812 | 0.563 | 0.813 | **0.825** | 0.814 |
| Pitt | TAUKADIAL | 0.519 | 0.513 | 0.491 | 0.503 | **0.556** |
| ADReSS | Pitt | **0.624** | 0.460 | 0.437 | 0.602 | 0.619 |
| ADReSS | ADReSS | 0.918 | 0.654 | 0.811 | 0.897 | **0.945** |
| ADReSS | TAUKADIAL | 0.432 | 0.385 | 0.357 | **0.520** | 0.484 |
| TAUKADIAL | Pitt | 0.490 | 0.425 | **0.494** | 0.463 | **0.494** |
| TAUKADIAL | ADReSS | 0.492 | 0.524 | **0.543** | 0.429 | 0.382 |
| TAUKADIAL | TAUKADIAL | 0.832 | 0.631 | 0.809 | **0.913** | 0.793 |
| Unified Pool | Pitt | **0.946** | 0.615 | 0.934 | 0.937 | 0.936 |
| Unified Pool | ADReSS | **0.974** | 0.716 | 0.968 | 0.955 | 0.968 |
| Unified Pool | TAUKADIAL | **0.919** | 0.681 | 0.914 | 0.899 | 0.908 |
Key takeaways:
- Acoustic-only consistently underperforms across all conditions, confirming acoustic features alone are insufficient
- Single-source cross-corpus transfer degrades severely (e.g. ADReSS→TAUKADIAL: 0.432–0.520)
- Unified Pool training achieves the highest and most stable results across all targets, confirming data diversity generalises more robustly than architectural complexity
- Linguistic-only and Attention Fusion are statistically indistinguishable (all pairwise p>0.05), supporting fusion as an interpretability contribution rather than an accuracy gain
```
mci_detection/
│
├── data/                          # Place your datasets here (not included)
│   ├── Pitt/
│   ├── ADReSS/
│   └── TAUKADIAL/
│
├── processed_data/                # Generated by preprocessing scripts
│   ├── master_metadata.csv
│   ├── master_acoustic_features.csv
│   ├── transcripts_cache.json
│   └── cleaned_transcripts.json
│
├── logs/                          # SLURM job logs
│
├── preprocess_unified.py          # Stage 1: Audio segmentation + dataset merging
├── transcribe_all.py              # Stage 2: Whisper transcription pipeline
├── validate_asr.py                # Stage 3: ASR validation + hallucination filtering
├── pipeline_validator.py          # Stage 3b: WER evaluation + data cleaning
├── acoustic_extractor.py          # Stage 4a: Custom acoustic feature extraction
├── master_feature_extractor.py    # Stage 4b: eGeMAPS v02 feature extraction (primary)
│
├── train.py                       # Mamba linguistic-only training
├── acoustic_only.py               # eGeMAPS acoustic-only baseline
├── mamba_attention_fusion.py      # Main fusion model training
├── benchmark.py                   # 5-fold cross-corpus benchmark (all 5 architectures)
│
├── fusion_predict.py              # Inference on new audio files
├── explainability_suite.py        # XAI analysis (modality weights, word importance, etc.)
├── generate_master_report.py      # Statistical report generation
│
├── *.slurm                        # SLURM job scripts for HPC cluster
├── requirements.txt
└── README.md
```
- Python 3.8+
- CUDA-capable GPU (tested on RTX 3060, RTX 4070 Ti)
- ffmpeg (required for audio processing)
```bash
git clone https://github.com/Santhanu7Z/mci_detection.git
cd mci_detection
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

Install ffmpeg:

```bash
# Ubuntu/Debian
sudo apt install ffmpeg

# macOS
brew install ffmpeg

# Or download the binary and place it in ~/bin/
```

This project uses three publicly available clinical datasets. You must request access individually:
| Dataset | Task | Access |
|---|---|---|
| Pitt Corpus (DementiaBank) | Cookie Theft picture description | Request via TalkBank |
| ADReSS 2020 | Cookie Theft picture description | Request via TalkBank |
| TAUKADIAL | Connected speech (Mandarin/English) | Contact dataset authors |
Once downloaded, place them under data/ following the structure above.
Run the stages in order. Each stage depends on the previous.
Isolates participant-only speech from interviewer scaffolding using CHAT timestamps. Standardises all audio to 16kHz mono.
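For reference, CHAT transcripts encode millisecond `onset_offset` spans between `\x15` (NAK) delimiters on each speaker tier, which is what makes participant-only segmentation possible. A minimal parser sketch follows; the tier layout and `participant_segments` helper are illustrative assumptions, not code from this repo:

```python
import re

# CHAT media bullets store "onset_offset" in milliseconds between \x15
# delimiters; speaker tiers start with "*CODE:". These format details are
# assumptions about the corpus files, not taken from this repository.
BULLET = re.compile(r"\x15(\d+)_(\d+)\x15")

def participant_segments(chat_text: str, speaker: str = "PAR"):
    """Yield (onset_ms, offset_ms) pairs for the given speaker's tiers."""
    for line in chat_text.splitlines():
        if line.startswith(f"*{speaker}:"):
            for onset, offset in BULLET.findall(line):
                yield int(onset), int(offset)

demo = (
    "*INV:\tand what do you see ? \x150_2100\x15\n"
    "*PAR:\tthe boy is on the stool . \x152100_5300\x15\n"
    "*PAR:\tthe water is running . \x155300_8000\x15\n"
)
print(list(participant_segments(demo)))  # [(2100, 5300), (5300, 8000)]
```

The interviewer (`*INV:`) tier is skipped entirely, so only participant speech spans are cut from the audio.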
```bash
python preprocess_unified.py --datasets pitt adress taukadial --output_dir processed_data
```

Transcribes all segmented audio using Whisper Large-v3 with beam search. Resumes automatically if interrupted.
```bash
python transcribe_all.py \
    --metadata processed_data/master_metadata.csv \
    --output processed_data/transcripts_cache.json \
    --model large-v3
```

Validates transcriptions against CHAT ground truth, detects trigram loops and low unique-token-ratio outputs, and exports a clean transcript cache for training.
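The loop/ratio checks can be sketched in a few lines of plain Python; the specific thresholds below (a trigram repeated more than 3 times, unique-token ratio under 0.3) are illustrative defaults, not the repo's tuned values:

```python
from collections import Counter

def trigram_loop(tokens, max_repeats=3):
    """Flag transcripts where some trigram repeats more than max_repeats times."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return any(count > max_repeats for count in trigrams.values())

def unique_token_ratio(tokens):
    return len(set(tokens)) / max(len(tokens), 1)

def is_hallucinated(text, ratio_floor=0.3, max_repeats=3):
    # Whisper hallucinations typically surface as short phrases looped
    # verbatim, which both checks catch from different angles.
    tokens = text.lower().split()
    return trigram_loop(tokens, max_repeats) or unique_token_ratio(tokens) < ratio_floor

print(is_hallucinated("thank you thank you thank you thank you thank you"))  # True
print(is_hallucinated("the boy reaches for the cookie jar on the shelf"))    # False
```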
```bash
python validate_asr.py
```

Extracts 88 eGeMAPS v02 features using OpenSMILE across all 868 samples.
```bash
pip install opensmile
python master_feature_extractor.py
```

Acoustic-only baseline:
```bash
python acoustic_only.py --epochs 40 --batch_size 16 --lr 3e-4
```

Mamba linguistic-only baseline:
```bash
python train.py --epochs 20 --batch_size 8 --lr 5e-4 --max_len 512
```

Attention Fusion model (primary):
```bash
python mamba_attention_fusion.py --epochs 20 --batch_size 8 --lr 5e-4
```

Runs all 5 architectures × 4 training sources × 5-fold CV. Supports resumption if interrupted.
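A resumable sweep over that 5 × 4 × 5 grid can be sketched as below; the checkpoint filename and the `train_one` callback are hypothetical, chosen only to illustrate the resume-on-interrupt pattern:

```python
import itertools
import json
import pathlib

ARCHS   = ["linguistic", "acoustic", "interaction", "gated", "attention"]
SOURCES = ["pitt", "adress", "taukadial", "unified"]
FOLDS   = range(5)
DONE    = pathlib.Path("benchmark_done.json")  # hypothetical checkpoint file

def run_grid(train_one):
    """Run every (arch, source, fold) cell once, skipping cells already logged."""
    done = set(map(tuple, json.loads(DONE.read_text()))) if DONE.exists() else set()
    for cell in itertools.product(ARCHS, SOURCES, FOLDS):
        if cell in done:
            continue
        train_one(*cell)          # train + evaluate one grid cell
        done.add(cell)
        DONE.write_text(json.dumps(sorted(done)))  # checkpoint after each cell
```

Because the checkpoint is rewritten after every cell, a killed job restarts exactly where it stopped: `run_grid(my_train_fn)` simply skips the 100-cell grid's completed entries.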
```bash
python benchmark.py
```

Runs the full explainability suite: modality attribution, MC word-level perturbation, feature-category perturbation, and linguistic biomarker testing.
```bash
python explainability_suite.py
```

Results are saved to `xai_results/`.
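The FDR correction applied in the biomarker testing is presumably the standard Benjamini–Hochberg step-up procedure; a dependency-free sketch of that procedure (not the repo's own implementation):

```python
def benjamini_hochberg(pvals, q=0.05):
    """Boolean reject decisions controlling the false discovery rate at level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    # Find the largest rank k with p_(k) <= (k/m) * q ...
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            k_max = rank
    # ... and reject every hypothesis ranked at or below k.
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject

p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
print(benjamini_hochberg(p))
# [True, True, False, False, False, False, False, False]
```

With 88 acoustic features tested per corpus, this kind of correction is what keeps the reported biomarker significances honest under multiple comparisons.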
```bash
python generate_master_report.py
```

Run prediction on a single WAV file:
```bash
python fusion_predict.py path/to/audio.wav --whisper small.en --threshold 0.5
```

Batch prediction on a directory:
```bash
python fusion_predict.py path/to/audio_dir/ --batch --whisper small.en
```

Output includes:
- Dementia / Control classification with confidence scores
- Modality attribution (linguistic % vs acoustic %)
- Acoustic biomarker analysis (jitter, shimmer, HNR flags)
- Linguistic markers (disfluency rate, response length)
- Full transcript
- JSON log saved to `prediction_results.json`
SLURM scripts are provided for each stage. Tested on a cluster with RTX 3060 and RTX 4070 Ti partitions.
```bash
sbatch transcribe_all.slurm    # Whisper transcription (8h, 24GB)
sbatch train_mamba.slurm       # Mamba training (12h, 32GB)
sbatch train_acoustic.slurm    # Acoustic baseline (4h, 16GB)
sbatch train_fusion.slurm      # Fusion training (6h, 16GB)
sbatch run_cv_benchmark.slurm  # Full benchmark (12h, 32GB)
sbatch run_xai.slurm           # XAI suite (2h, 32GB)
```

```
Clinical Audio
   │
   ├──► Whisper Large-v3 ──► Transcript ──► GPT-NeoX Tokenizer
   │                                              │
   │                                     Mamba-130M (frozen)
   │                                     Mean Pooling [768d]
   │                                              │
   │                                   Linear-LayerNorm-GELU
   │                                           [256d]
   │                                              │
   └──► OpenSMILE eGeMAPS v02 [88d] ──► 2-layer MLP
                                          [256d]  │
                                             │    │
                                             └─┬──┘
                                               │
                                       Stack [2 × 256d]
                                               │
                                  4-head Multi-Head Attention
                                               │
                                   Concat [512d] + Classifier
                                               │
                                     Control / Dementia
```
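A minimal NumPy sketch of the fusion head above (2-token stack, 4 heads, 256-d per modality, 512-d concat). Weights are random and untrained, purely to show the shapes involved and how per-modality attention weights fall out; the real model lives in `mamba_attention_fusion.py`:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fusion_head(ling, acou, rng, d=256, heads=4):
    """Self-attention over the 2-token [linguistic, acoustic] stack.

    Returns (fused 512-d vector, per-modality attention weights)."""
    x = np.stack([ling, acou])                              # [2, d]
    dh = d // heads
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q = (x @ Wq).reshape(2, heads, dh).transpose(1, 0, 2)   # [heads, 2, dh]
    k = (x @ Wk).reshape(2, heads, dh).transpose(1, 0, 2)
    v = (x @ Wv).reshape(2, heads, dh).transpose(1, 0, 2)
    att = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))   # [heads, 2, 2]
    out = (att @ v).transpose(1, 0, 2).reshape(2, d)        # [2, d]
    fused = out.reshape(-1)                                 # concat -> [512]
    # Modality attribution: attention mass each token receives,
    # averaged over heads and query positions (sums to 1).
    weights = att.mean(axis=(0, 1))                         # [ling_w, acou_w]
    return fused, weights

rng = np.random.default_rng(0)
fused, w = fusion_head(rng.standard_normal(256), rng.standard_normal(256), rng)
print(fused.shape)  # (512,)
```

The `weights` vector is the quantity behind the paper's headline 88.1% vs 11.9% modality attribution: averaged over a trained model's predictions rather than random projections as here.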
- Frozen Mamba backbone — justified by the bias-variance tradeoff for N=868 samples; prevents overfitting while preserving pre-trained linguistic representations
- Participant-only segmentation — CHAT timestamps used to strip interviewer speech, preventing the model from learning clinician-dependent patterns
- Hallucination filtering — trigram loop detection and unique-token-ratio thresholds remove corrupted ASR outputs before training
- Leak-free standardisation — acoustic feature scaling is fit exclusively on the training partition of each fold
- Grouped CV — `StratifiedGroupKFold` on participant ID ensures no speaker appears in both train and eval splits
If you use this code, please cite:
```bibtex
@inproceedings{santhanuajithkumar2026fusionmamba,
  title     = {Fusion Mamba for Explainable Speech-Based Detection of Mild Cognitive Impairment},
  author    = {Santhanu Ajith Kumar and Nuttanart Muansuwan and Princy Raj},
  booktitle = {Proceedings of the International Joint Conference on Computer Science and Software Engineering (JCSSE)},
  year      = {2026}
}
```

This project is licensed under the MIT License — see LICENSE for details.
- Mamba — Gu & Dao, 2023
- Whisper — Radford et al., 2022
- OpenSMILE — Eyben et al., 2010
- DementiaBank / Pitt Corpus — Becker et al., 1994
- ADReSS 2020 — Luz et al., 2020
- TAUKADIAL — Luz et al., 2024