Official code for the paper "Fusion Mamba for Explainable Speech-Based Detection of Mild Cognitive Impairment" (IEEE JCSSE 2026).
Key Finding: Linguistic features statistically dominate predictive performance across all corpora (mean attention weight 88.1% vs 11.9% for audio). Multimodal fusion's primary value is clinical interpretability — per-prediction modality attribution — rather than accuracy gains.
This repository implements a multimodal speech-based dementia detection pipeline combining:
- Whisper Large-v3 for clinical-grade automatic transcription with hallucination filtering
- Mamba-130M as a frozen linguistic encoder for efficient sequence modelling
- eGeMAPS v02 acoustic biomarkers (88 features) extracted via OpenSMILE
- Cross-modal attention fusion for per-prediction modality attribution
- Multi-level XAI suite with MC word-level perturbation, feature-category analysis, and FDR-corrected biomarker testing
Evaluated across three benchmark corpora: Pitt (DementiaBank), ADReSS 2020, and TAUKADIAL.
Full cross-corpus Weighted F1 scores across all 5 architectures and 4 training conditions (mean over 5-fold stratified CV grouped by participant ID). Bold = best per row.
| Training Source | Eval Target | Linguistic | Acoustic | Interaction | Gated Fusion | Attn. Fusion |
|---|---|---|---|---|---|---|
| Pitt | Pitt | 0.928 | 0.667 | 0.923 | 0.919 | **0.932** |
| Pitt | ADReSS | 0.812 | 0.563 | 0.813 | **0.825** | 0.814 |
| Pitt | TAUKADIAL | 0.519 | 0.513 | 0.491 | 0.503 | **0.556** |
| ADReSS | Pitt | **0.624** | 0.460 | 0.437 | 0.602 | 0.619 |
| ADReSS | ADReSS | 0.918 | 0.654 | 0.811 | 0.897 | **0.945** |
| ADReSS | TAUKADIAL | 0.432 | 0.385 | 0.357 | **0.520** | 0.484 |
| TAUKADIAL | Pitt | 0.490 | 0.425 | **0.494** | 0.463 | **0.494** |
| TAUKADIAL | ADReSS | 0.492 | 0.524 | **0.543** | 0.429 | 0.382 |
| TAUKADIAL | TAUKADIAL | 0.832 | 0.631 | 0.809 | **0.913** | 0.793 |
| Unified Pool | Pitt | **0.946** | 0.615 | 0.934 | 0.937 | 0.936 |
| Unified Pool | ADReSS | **0.974** | 0.716 | 0.968 | 0.955 | 0.968 |
| Unified Pool | TAUKADIAL | **0.919** | 0.681 | 0.914 | 0.899 | 0.908 |
Key takeaways:
- Acoustic-only consistently underperforms across all conditions, confirming acoustic features alone are insufficient
- Single-source cross-corpus transfer degrades severely (e.g. ADReSS→TAUKADIAL: 0.432–0.520)
- Unified Pool training achieves the highest and most stable results across all targets, confirming data diversity generalises more robustly than architectural complexity
- Linguistic-only and Attention Fusion are statistically indistinguishable (all pairwise p>0.05), supporting fusion as an interpretability contribution rather than an accuracy gain
```
mci_detection/
│
├── data/                          # Place your datasets here (not included)
│   ├── Pitt/
│   ├── ADReSS/
│   └── TAUKADIAL/
│
├── processed_data/                # Generated by preprocessing scripts
│   ├── master_metadata.csv
│   ├── master_acoustic_features.csv
│   ├── transcripts_cache.json
│   └── cleaned_transcripts.json
│
├── logs/                          # SLURM job logs
│
├── preprocess_unified.py          # Stage 1: Audio segmentation + dataset merging
├── transcribe_all.py              # Stage 2: Whisper transcription pipeline
├── validate_asr.py                # Stage 3: ASR validation + hallucination filtering
├── pipeline_validator.py          # Stage 3b: WER evaluation + data cleaning
├── acoustic_extractor.py          # Stage 4a: Custom acoustic feature extraction
├── master_feature_extractor.py    # Stage 4b: eGeMAPS v02 feature extraction (primary)
│
├── train.py                       # Mamba linguistic-only training
├── acoustic_only.py               # eGeMAPS acoustic-only baseline
├── mamba_attention_fusion.py      # Main fusion model training
├── benchmark.py                   # 5-fold cross-corpus benchmark (all 5 architectures)
│
├── fusion_predict.py              # Inference on new audio files
├── explainability_suite.py        # XAI analysis (modality weights, word importance, etc.)
├── generate_master_report.py      # Statistical report generation
│
├── *.slurm                        # SLURM job scripts for HPC cluster
├── requirements.txt
└── README.md
```
- Python 3.8+
- CUDA-capable GPU (tested on RTX 3060, RTX 4070 Ti)
- ffmpeg (required for audio processing)
```bash
git clone https://github.com/Santhanu7Z/mci_detection.git
cd mci_detection
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

Install ffmpeg:

```bash
# Ubuntu/Debian
sudo apt install ffmpeg

# macOS
brew install ffmpeg

# Or download the binary and place it in ~/bin/
```

This project uses three publicly available clinical datasets. You must request access individually:
| Dataset | Task | Access |
|---|---|---|
| Pitt Corpus (DementiaBank) | Cookie Theft picture description | Request via TalkBank |
| ADReSS 2020 | Cookie Theft picture description | Request via TalkBank |
| TAUKADIAL | Connected speech (Mandarin/English) | Contact dataset authors |
Once downloaded, place them under data/ following the structure above.
Run the stages in order. Each stage depends on the previous.
Isolates participant-only speech from interviewer scaffolding using CHAT timestamps. Standardises all audio to 16kHz mono.
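For reference, CHAT transcripts encode millisecond `onset_offset` spans between `\x15` (NAK) delimiters on each speaker tier, which is what makes participant-only segmentation possible. A minimal parser sketch follows; the tier layout and `participant_segments` helper are illustrative assumptions, not code from this repo:

```python
import re

# CHAT media bullets store "onset_offset" in milliseconds between \x15
# delimiters; speaker tiers start with "*CODE:". These format details are
# assumptions about the corpus files, not taken from this repository.
BULLET = re.compile(r"\x15(\d+)_(\d+)\x15")

def participant_segments(chat_text: str, speaker: str = "PAR"):
    """Yield (onset_ms, offset_ms) pairs for the given speaker's tiers."""
    for line in chat_text.splitlines():
        if line.startswith(f"*{speaker}:"):
            for onset, offset in BULLET.findall(line):
                yield int(onset), int(offset)

demo = (
    "*INV:\tand what do you see ? \x150_2100\x15\n"
    "*PAR:\tthe boy is on the stool . \x152100_5300\x15\n"
    "*PAR:\tthe water is running . \x155300_8000\x15\n"
)
print(list(participant_segments(demo)))  # [(2100, 5300), (5300, 8000)]
```

The interviewer (`*INV:`) tier is skipped entirely, so only participant speech spans are cut from the audio.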
```bash
python preprocess_unified.py --datasets pitt adress taukadial --output_dir processed_data
```

Transcribes all segmented audio using Whisper Large-v3 with beam search. Resumes automatically if interrupted.
```bash
python transcribe_all.py \
    --metadata processed_data/master_metadata.csv \
    --output processed_data/transcripts_cache.json \
    --model large-v3
```

Validates transcriptions against CHAT ground truth, detects trigram loops and low unique-token-ratio outputs, and exports a clean transcript cache for training.
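The loop/ratio checks can be sketched in a few lines of plain Python; the specific thresholds below (a trigram repeated more than 3 times, unique-token ratio under 0.3) are illustrative defaults, not the repo's tuned values:

```python
from collections import Counter

def trigram_loop(tokens, max_repeats=3):
    """Flag transcripts where some trigram repeats more than max_repeats times."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return any(count > max_repeats for count in trigrams.values())

def unique_token_ratio(tokens):
    return len(set(tokens)) / max(len(tokens), 1)

def is_hallucinated(text, ratio_floor=0.3, max_repeats=3):
    # Whisper hallucinations typically surface as short phrases looped
    # verbatim, which both checks catch from different angles.
    tokens = text.lower().split()
    return trigram_loop(tokens, max_repeats) or unique_token_ratio(tokens) < ratio_floor

print(is_hallucinated("thank you thank you thank you thank you thank you"))  # True
print(is_hallucinated("the boy reaches for the cookie jar on the shelf"))    # False
```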
```bash
python validate_asr.py
```

Extracts 88 eGeMAPS v02 features using OpenSMILE across all 868 samples.
```bash
pip install opensmile
python master_feature_extractor.py
```

Acoustic-only baseline:
```bash
python acoustic_only.py --epochs 40 --batch_size 16 --lr 3e-4
```

Mamba linguistic-only baseline:
```bash
python train.py --epochs 20 --batch_size 8 --lr 5e-4 --max_len 512
```

Attention Fusion model (primary):
```bash
python mamba_attention_fusion.py --epochs 20 --batch_size 8 --lr 5e-4
```

Runs all 5 architectures × 4 training sources × 5-fold CV. Supports resumption if interrupted.
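A resumable sweep over that 5 × 4 × 5 grid can be sketched as below; the checkpoint filename and the `train_one` callback are hypothetical, chosen only to illustrate the resume-on-interrupt pattern:

```python
import itertools
import json
import pathlib

ARCHS   = ["linguistic", "acoustic", "interaction", "gated", "attention"]
SOURCES = ["pitt", "adress", "taukadial", "unified"]
FOLDS   = range(5)
DONE    = pathlib.Path("benchmark_done.json")  # hypothetical checkpoint file

def run_grid(train_one):
    """Run every (arch, source, fold) cell once, skipping cells already logged."""
    done = set(map(tuple, json.loads(DONE.read_text()))) if DONE.exists() else set()
    for cell in itertools.product(ARCHS, SOURCES, FOLDS):
        if cell in done:
            continue
        train_one(*cell)          # train + evaluate one grid cell
        done.add(cell)
        DONE.write_text(json.dumps(sorted(done)))  # checkpoint after each cell
```

Because the checkpoint is rewritten after every cell, a killed job restarts exactly where it stopped: `run_grid(my_train_fn)` simply skips the 100-cell grid's completed entries.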
```bash
python benchmark.py
```

Runs the full explainability suite: modality attribution, MC word-level perturbation, feature-category perturbation, and linguistic biomarker testing.
```bash
python explainability_suite.py
```

Results are saved to `xai_results/`.
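The FDR correction applied in the biomarker testing is presumably the standard Benjamini–Hochberg step-up procedure; a dependency-free sketch of that procedure (not the repo's own implementation):

```python
def benjamini_hochberg(pvals, q=0.05):
    """Boolean reject decisions controlling the false discovery rate at level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    # Find the largest rank k with p_(k) <= (k/m) * q ...
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            k_max = rank
    # ... and reject every hypothesis ranked at or below k.
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject

p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
print(benjamini_hochberg(p))
# [True, True, False, False, False, False, False, False]
```

With 88 acoustic features tested per corpus, this kind of correction is what keeps the reported biomarker significances honest under multiple comparisons.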
```bash
python generate_master_report.py
```

Run prediction on a single WAV file:
```bash
python fusion_predict.py path/to/audio.wav --whisper small.en --threshold 0.5
```

Batch prediction on a directory:
```bash
python fusion_predict.py path/to/audio_dir/ --batch --whisper small.en
```

Output includes:
- Dementia / Control classification with confidence scores
- Modality attribution (linguistic % vs acoustic %)
- Acoustic biomarker analysis (jitter, shimmer, HNR flags)
- Linguistic markers (disfluency rate, response length)
- Full transcript
- JSON log saved to `prediction_results.json`
SLURM scripts are provided for each stage. Tested on a cluster with RTX 3060 and RTX 4070 Ti partitions.
```bash
sbatch transcribe_all.slurm    # Whisper transcription (8h, 24GB)
sbatch train_mamba.slurm       # Mamba training (12h, 32GB)
sbatch train_acoustic.slurm    # Acoustic baseline (4h, 16GB)
sbatch train_fusion.slurm      # Fusion training (6h, 16GB)
sbatch run_cv_benchmark.slurm  # Full benchmark (12h, 32GB)
sbatch run_xai.slurm           # XAI suite (2h, 32GB)
```

```
Clinical Audio
   │
   ├──► Whisper Large-v3 ──► Transcript ──► GPT-NeoX Tokenizer
   │                                              │
   │                                     Mamba-130M (frozen)
   │                                     Mean Pooling [768d]
   │                                              │
   │                                   Linear-LayerNorm-GELU
   │                                           [256d]
   │                                              │
   └──► OpenSMILE eGeMAPS v02 [88d] ──► 2-layer MLP
                                          [256d]  │
                                             │    │
                                             └─┬──┘
                                               │
                                       Stack [2 × 256d]
                                               │
                                  4-head Multi-Head Attention
                                               │
                                   Concat [512d] + Classifier
                                               │
                                     Control / Dementia
```
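A minimal NumPy sketch of the fusion head above (2-token stack, 4 heads, 256-d per modality, 512-d concat). Weights are random and untrained, purely to show the shapes involved and how per-modality attention weights fall out; the real model lives in `mamba_attention_fusion.py`:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fusion_head(ling, acou, rng, d=256, heads=4):
    """Self-attention over the 2-token [linguistic, acoustic] stack.

    Returns (fused 512-d vector, per-modality attention weights)."""
    x = np.stack([ling, acou])                              # [2, d]
    dh = d // heads
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q = (x @ Wq).reshape(2, heads, dh).transpose(1, 0, 2)   # [heads, 2, dh]
    k = (x @ Wk).reshape(2, heads, dh).transpose(1, 0, 2)
    v = (x @ Wv).reshape(2, heads, dh).transpose(1, 0, 2)
    att = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))   # [heads, 2, 2]
    out = (att @ v).transpose(1, 0, 2).reshape(2, d)        # [2, d]
    fused = out.reshape(-1)                                 # concat -> [512]
    # Modality attribution: attention mass each token receives,
    # averaged over heads and query positions (sums to 1).
    weights = att.mean(axis=(0, 1))                         # [ling_w, acou_w]
    return fused, weights

rng = np.random.default_rng(0)
fused, w = fusion_head(rng.standard_normal(256), rng.standard_normal(256), rng)
print(fused.shape)  # (512,)
```

The `weights` vector is the quantity behind the paper's headline 88.1% vs 11.9% modality attribution: averaged over a trained model's predictions rather than random projections as here.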
- Frozen Mamba backbone — justified by the bias-variance tradeoff for N=868 samples; prevents overfitting while preserving pre-trained linguistic representations
- Participant-only segmentation — CHAT timestamps used to strip interviewer speech, preventing the model from learning clinician-dependent patterns
- Hallucination filtering — trigram loop detection and unique-token-ratio thresholds remove corrupted ASR outputs before training
- Leak-free standardisation — acoustic feature scaling is fit exclusively on the training partition of each fold
- Grouped CV — `StratifiedGroupKFold` on participant ID ensures no speaker appears in both train and eval splits
If you use this code, please cite:
```bibtex
@inproceedings{santhanuajithkumar2026fusionmamba,
  title     = {Fusion Mamba for Explainable Speech-Based Detection of Mild Cognitive Impairment},
  author    = {Santhanu Ajith Kumar and Nuttanart Muansuwan and Princy Raj},
  booktitle = {Proceedings of the International Joint Conference on Computer Science and Software Engineering (JCSSE)},
  year      = {2026}
}
```

This project is licensed under the MIT License — see LICENSE for details.
- Mamba — Gu & Dao, 2023
- Whisper — Radford et al., 2022
- OpenSMILE — Eyben et al., 2010
- DementiaBank / Pitt Corpus — Becker et al., 1994
- ADReSS 2020 — Luz et al., 2020
- TAUKADIAL — Luz et al., 2024