Fusion Mamba for Explainable Speech-Based Detection of Dementia

License: MIT · Python 3.8+ · IEEE JCSSE

Official code for the paper "Fusion Mamba for Explainable Speech-Based Detection of Mild Cognitive Impairment" (IEEE JCSSE 2026).

Key Finding: Linguistic features statistically dominate predictive performance across all corpora (mean attention weight 88.1% vs 11.9% for audio). Multimodal fusion's primary value is clinical interpretability — per-prediction modality attribution — rather than accuracy gains.


Overview

This repository implements a multimodal speech-based dementia detection pipeline combining:

  • Whisper Large-v3 for clinical-grade automatic transcription with hallucination filtering
  • Mamba-130M as a frozen linguistic encoder for efficient sequence modelling
  • eGeMAPS v02 acoustic biomarkers (88 features) extracted via OpenSMILE
  • Cross-modal attention fusion for per-prediction modality attribution
  • Multi-level XAI suite with MC word-level perturbation, feature-category analysis, and FDR-corrected biomarker testing

Evaluated across three benchmark corpora: Pitt (DementiaBank), ADReSS 2020, and TAUKADIAL.


Results

Full cross-corpus Weighted F1 scores across all 5 architectures and 4 training conditions (mean over 5-fold stratified CV grouped by participant ID). Bold = best per row.

| Training Source | Eval Target | Linguistic | Acoustic | Interaction | Gated Fusion | Attn. Fusion |
|---|---|---|---|---|---|---|
| Pitt | Pitt | 0.928 | 0.667 | 0.923 | 0.919 | **0.932** |
| Pitt | ADReSS | 0.812 | 0.563 | 0.813 | **0.825** | 0.814 |
| Pitt | TAUKADIAL | 0.519 | 0.513 | 0.491 | 0.503 | **0.556** |
| ADReSS | Pitt | **0.624** | 0.460 | 0.437 | 0.602 | 0.619 |
| ADReSS | ADReSS | 0.918 | 0.654 | 0.811 | 0.897 | **0.945** |
| ADReSS | TAUKADIAL | 0.432 | 0.385 | 0.357 | **0.520** | 0.484 |
| TAUKADIAL | Pitt | 0.490 | 0.425 | **0.494** | 0.463 | **0.494** |
| TAUKADIAL | ADReSS | 0.492 | 0.524 | **0.543** | 0.429 | 0.382 |
| TAUKADIAL | TAUKADIAL | 0.832 | 0.631 | 0.809 | **0.913** | 0.793 |
| Unified Pool | Pitt | **0.946** | 0.615 | 0.934 | 0.937 | 0.936 |
| Unified Pool | ADReSS | **0.974** | 0.716 | 0.968 | 0.955 | 0.968 |
| Unified Pool | TAUKADIAL | **0.919** | 0.681 | 0.914 | 0.899 | **0.908** |

Key takeaways:

  • Acoustic-only consistently underperforms across all conditions, confirming acoustic features alone are insufficient
  • Single-source cross-corpus transfer degrades severely (e.g. ADReSS→TAUKADIAL: 0.432–0.520)
  • Unified Pool training achieves the highest and most stable results across all targets, confirming data diversity generalises more robustly than architectural complexity
  • Linguistic-only and Attention Fusion are statistically indistinguishable (all pairwise p>0.05), supporting fusion as an interpretability contribution rather than an accuracy gain
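The "statistically indistinguishable" comparison above can be reproduced with a paired test over per-fold scores. A minimal sketch using SciPy's Wilcoxon signed-rank test; the fold scores below are illustrative placeholders, not the paper's actual values:

```python
from scipy.stats import wilcoxon

# Illustrative per-fold Weighted F1 scores (NOT the paper's actual folds)
linguistic_f1 = [0.93, 0.91, 0.94, 0.92, 0.94]
attn_fusion_f1 = [0.94, 0.92, 0.93, 0.93, 0.94]

# Paired test: the same folds are scored by both architectures
stat, p = wilcoxon(linguistic_f1, attn_fusion_f1)
print(f"p = {p:.3f}")  # p > 0.05 here -> no significant difference
```

With only five folds the test has little power, which is why small score gaps between linguistic-only and fusion models come out non-significant.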

Repository Structure

mci_detection/
│
├── data/                          # Place your datasets here (not included)
│   ├── Pitt/
│   ├── ADReSS/
│   └── TAUKADIAL/
│
├── processed_data/                # Generated by preprocessing scripts
│   ├── master_metadata.csv
│   ├── master_acoustic_features.csv
│   ├── transcripts_cache.json
│   └── cleaned_transcripts.json
│
├── logs/                          # SLURM job logs
│
├── preprocess_unified.py          # Stage 1: Audio segmentation + dataset merging
├── transcribe_all.py              # Stage 2: Whisper transcription pipeline
├── validate_asr.py                # Stage 3: ASR validation + hallucination filtering
├── pipeline_validator.py          # Stage 3b: WER evaluation + data cleaning
├── acoustic_extractor.py          # Stage 4a: Custom acoustic feature extraction
├── master_feature_extractor.py    # Stage 4b: eGeMAPS v02 feature extraction (primary)
│
├── train.py                       # Mamba linguistic-only training
├── acoustic_only.py               # eGeMAPS acoustic-only baseline
├── mamba_attention_fusion.py      # Main fusion model training
├── benchmark.py                   # 5-fold cross-corpus benchmark (all 5 architectures)
│
├── fusion_predict.py              # Inference on new audio files
├── explainability_suite.py        # XAI analysis (modality weights, word importance, etc.)
├── generate_master_report.py      # Statistical report generation
│
├── *.slurm                        # SLURM job scripts for HPC cluster
├── requirements.txt
└── README.md

Installation

Requirements

  • Python 3.8+
  • CUDA-capable GPU (tested on RTX 3060, RTX 4070 Ti)
  • ffmpeg (required for audio processing)

Setup

git clone https://github.com/Santhanu7Z/mci_detection.git
cd mci_detection

python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

pip install -r requirements.txt

Install ffmpeg (if not already installed)

# Ubuntu/Debian
sudo apt install ffmpeg

# macOS
brew install ffmpeg

# Or download the binary and place it in ~/bin/

Datasets

This project uses three publicly available clinical datasets. You must request access individually:

| Dataset | Task | Access |
|---|---|---|
| Pitt Corpus (DementiaBank) | Cookie Theft picture description | Request via TalkBank |
| ADReSS 2020 | Cookie Theft picture description | Request via TalkBank |
| TAUKADIAL | Connected speech (Mandarin/English) | Contact dataset authors |

Once downloaded, place them under data/ following the structure above.


Pipeline

Run the stages in order. Each stage depends on the previous.

Stage 1 — Preprocess & Segment Audio

Isolates participant-only speech from interviewer scaffolding using CHAT timestamps. Standardises all audio to 16kHz mono.

python preprocess_unified.py --datasets pitt adress taukadial --output_dir processed_data
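Participant-only segmentation relies on CHAT media bullets: each utterance in a `.cha` file ends with a NAK-delimited `start_end` offset in milliseconds. A minimal parsing sketch under that convention (function name and tier handling are illustrative, not the repo's actual code):

```python
import re

# CHAT media bullets: \x15<start_ms>_<end_ms>\x15 after each utterance
BULLET = re.compile(r"\x15(\d+)_(\d+)\x15")

def participant_segments(cha_text):
    """Return (start_s, end_s) spans for *PAR: (participant) utterances only."""
    segments = []
    in_par = False
    for line in cha_text.splitlines():
        if line.startswith("*"):             # main tier, e.g. *PAR: or *INV:
            in_par = line.startswith("*PAR")
        elif not line.startswith("\t"):      # @ headers and % tiers end the block
            in_par = False
        if in_par:
            for m in BULLET.finditer(line):
                start_ms, end_ms = int(m.group(1)), int(m.group(2))
                segments.append((start_ms / 1000.0, end_ms / 1000.0))
    return segments

demo = "*PAR:\tthe boy is on the stool .\x15100_2500\x15\n*INV:\tmhm .\x152600_3000\x15\n"
print(participant_segments(demo))  # [(0.1, 2.5)] -- interviewer span dropped
```

The resulting spans can then be sliced out of the source audio and resampled to 16 kHz mono with ffmpeg or librosa.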

Stage 2 — Transcribe with Whisper

Transcribes all segmented audio using Whisper Large-v3 with beam search. Resumes automatically if interrupted.

python transcribe_all.py \
    --metadata processed_data/master_metadata.csv \
    --output processed_data/transcripts_cache.json \
    --model large-v3

Stage 3 — Validate ASR & Filter Hallucinations

Validates transcriptions against CHAT ground truth, detects trigram loops and low unique-token-ratio outputs, and exports a clean transcript cache for training.

python validate_asr.py
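The two hallucination signals are simple to compute: a high fraction of repeated trigrams flags Whisper's degenerate repetition loops, and a low unique-token ratio flags low-information output. A self-contained sketch; the thresholds here are illustrative, not the paper's calibrated values:

```python
def trigram_loop_ratio(text):
    """Fraction of repeated trigrams; loops repeat the same short phrase."""
    tokens = text.lower().split()
    trigrams = [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]
    if not trigrams:
        return 0.0
    return 1.0 - len(set(trigrams)) / len(trigrams)

def unique_token_ratio(text):
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def is_hallucination(text, loop_thresh=0.5, utr_thresh=0.2):
    # Illustrative thresholds, not the pipeline's actual cut-offs
    return trigram_loop_ratio(text) > loop_thresh or unique_token_ratio(text) < utr_thresh

print(is_hallucination("thank you " * 30))   # True: degenerate repetition loop
print(is_hallucination("the boy is stealing cookies from the jar"))  # False
```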

Stage 4 — Extract Acoustic Features

Extracts 88 eGeMAPS v02 features using OpenSMILE across all 868 samples.

pip install opensmile
python master_feature_extractor.py

Stage 5 — Train Models

Acoustic-only baseline:

python acoustic_only.py --epochs 40 --batch_size 16 --lr 3e-4

Mamba linguistic-only baseline:

python train.py --epochs 20 --batch_size 8 --lr 5e-4 --max_len 512

Attention Fusion model (primary):

python mamba_attention_fusion.py --epochs 20 --batch_size 8 --lr 5e-4

Stage 6 — Cross-Corpus Benchmark

Runs all 5 architectures × 4 training sources × 5-fold CV. Supports resumption if interrupted.

python benchmark.py

Stage 7 — XAI Analysis

Runs the full explainability suite: modality attribution, MC word-level perturbation, feature-category perturbation, and linguistic biomarker testing.

python explainability_suite.py

Results are saved to xai_results/.

Stage 8 — Generate Report

python generate_master_report.py

Inference on New Audio

Run prediction on a single WAV file:

python fusion_predict.py path/to/audio.wav --whisper small.en --threshold 0.5

Batch prediction on a directory:

python fusion_predict.py path/to/audio_dir/ --batch --whisper small.en

Output includes:

  • Dementia / Control classification with confidence scores
  • Modality attribution (linguistic % vs acoustic %)
  • Acoustic biomarker analysis (jitter, shimmer, HNR flags)
  • Linguistic markers (disfluency rate, response length)
  • Full transcript
  • JSON log saved to prediction_results.json

HPC / SLURM

SLURM scripts are provided for each stage. Tested on a cluster with RTX 3060 and RTX 4070 Ti partitions.

sbatch transcribe_all.slurm      # Whisper transcription (8h, 24GB)
sbatch train_mamba.slurm         # Mamba training (12h, 32GB)
sbatch train_acoustic.slurm      # Acoustic baseline (4h, 16GB)
sbatch train_fusion.slurm        # Fusion training (6h, 16GB)
sbatch run_cv_benchmark.slurm    # Full benchmark (12h, 32GB)
sbatch run_xai.slurm             # XAI suite (2h, 32GB)

Model Architecture

Clinical Audio
      │
      ├──► Whisper Large-v3 ──► Transcript ──► GPT-NeoX Tokenizer
      │                                               │
      │                                        Mamba-130M (frozen)
      │                                        Mean Pooling [768d]
      │                                               │
      │                                        Linear-LayerNorm-GELU
      │                                               │ [256d]
      │                                               │
      └──► OpenSMILE eGeMAPS v02 [88d] ──► 2-layer MLP
                                                      │ [256d]
                                                      │
                                    ┌─────────────────┘
                                    │
                             Stack [2 × 256d]
                                    │
                         4-head Multi-Head Attention
                                    │
                         Concat [512d] + Classifier
                                    │
                          Control / Dementia
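The fusion head in the diagram can be sketched in PyTorch as follows. Dimensions come from the diagram; the class and layer names are illustrative, not the repo's actual module:

```python
import torch
import torch.nn as nn

class AttentionFusionHead(nn.Module):
    """Cross-modal attention over stacked 256-d linguistic/acoustic embeddings."""
    def __init__(self, dim=256, heads=4, n_classes=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, n_classes)  # concat of both modality slots

    def forward(self, ling, acou):
        # ling, acou: [batch, 256] projected modality embeddings
        x = torch.stack([ling, acou], dim=1)           # [batch, 2, 256]
        attended, weights = self.attn(x, x, x)         # weights give per-prediction modality attribution
        logits = self.classifier(attended.flatten(1))  # [batch, 512] -> [batch, n_classes]
        return logits, weights

head = AttentionFusionHead()
logits, w = head(torch.randn(3, 256), torch.randn(3, 256))
print(logits.shape, w.shape)  # torch.Size([3, 2]) torch.Size([3, 2, 2])
```

The returned attention weights over the two modality slots are what the paper reports as modality attribution (e.g. the 88.1% mean linguistic weight).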

Key Design Choices

  • Frozen Mamba backbone — justified by the bias-variance tradeoff for N=868 samples; prevents overfitting while preserving pre-trained linguistic representations
  • Participant-only segmentation — CHAT timestamps used to strip interviewer speech, preventing the model from learning clinician-dependent patterns
  • Hallucination filtering — trigram loop detection and unique-token-ratio thresholds remove corrupted ASR outputs before training
  • Leak-free standardisation — acoustic feature scaling is fit exclusively on the training partition of each fold
  • Grouped CV — StratifiedGroupKFold on participant ID ensures no speaker appears in both train and eval splits

Citation

If you use this code, please cite:

@inproceedings{santhanuajithkumar2026fusionmamba,
  title     = {Fusion Mamba for Explainable Speech-Based Detection of Mild Cognitive Impairment},
  author    = {Santhanu Ajith Kumar and Nuttanart Muansuwan and Princy Raj},
  booktitle = {Proceedings of the International Joint Conference on Computer Science and Software Engineering (JCSSE)},
  year      = {2026}
}

License

This project is licensed under the MIT License — see LICENSE for details.

