CarloMengucci/WISDoM
WISDoM

Wishart Distributed Matrices Multiple Order Classification Method, Pipeline and Utilities

Publication: https://doi.org/10.3389/fninf.2020.611762

In this work we introduce the Wishart Distributed Matrices Multiple Order Classification (WISDoM) method. The WISDoM classification method is a pipeline for single-feature analysis, supervised learning, cross-validation and classification for any problem whose elements can be tied to a symmetric positive-definite matrix representation. The general idea is that information about the properties of a system contained in a symmetric positive-definite matrix representation (e.g. covariance and correlation matrices) can be extracted by modelling an estimated distribution for the expected classes of a given problem. The application to fMRI data classification and clustering follows naturally: the WISDoM classification method has been tested on the ADNI2 (Alzheimer's Disease Neuroimaging Initiative) database. The goal was to achieve good classification performance between patients diagnosed with Alzheimer's Disease (AD) and Normal Control (NC) subjects, while retaining information on which features were the most informative decision-wise. In our work, the information about topological properties contained in the ADNI2 functional correlation matrices is extracted by modelling an estimated Wishart distribution for the expected diagnostic groups AD and NC, allowing a complete separation between the two groups.

The Method

The main idea behind the WISDoM classifier is to use the free parameters of the Wishart distribution to estimate the distribution for a certain class of elements, each represented by a symmetric positive-definite matrix, and then assign a single element to a given class by computing a "distance" between the element being analyzed and each class. Furthermore, if we assume that the matrices are representative of the features of the studied system (e.g. covariance matrices), a score can be assigned to each feature by estimating its weight in terms of log-likelihood ratio. In other words, a score can be assigned to each feature by analyzing the variation in log-likelihood caused by its deletion. If deleting a feature causes a significant increase (or decrease) in the log-likelihood computed with respect to the estimated class distributions, that feature is highly representative of the system under analysis.
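The class-assignment idea above can be sketched with `scipy.stats.wishart`. This is a minimal illustration, not the actual WISDoM estimation procedure: the two class scale matrices and the degrees-of-freedom value are toy assumptions chosen for the example.

```python
# Sketch: estimate a Wishart distribution per class and score a test
# matrix by its log-likelihood ratio between the two classes.
# Scale matrices and df are illustrative assumptions, not fitted values.
import numpy as np
from scipy.stats import wishart

p, df = 4, 20  # matrix dimension and degrees of freedom (df >= p)

# Toy "class" scale matrices: uncorrelated vs. strongly correlated
scale_a = np.eye(p)
scale_b = 0.5 * np.eye(p) + 0.5 * np.ones((p, p))

# A test element drawn from class A's distribution
S = wishart.rvs(df=df, scale=scale_a, random_state=0)

# Log-likelihood ratio between the two estimated class distributions:
# positive values favour class A, negative values favour class B
llr = wishart.logpdf(S, df=df, scale=scale_a) - wishart.logpdf(S, df=df, scale=scale_b)
```

The same machinery supports the feature-scoring step: recompute the log-likelihood after deleting one row/column from the matrix and the corresponding entries of the scale matrix, and take the difference as that feature's score.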

For a complete mathematical description of WISDoM and the Wishart Distribution see: WISDoM-Complete

Snakemake Pipeline

The main tool used to develop a parallel and optimized pipeline is the Snakemake Workflow Management System, a Python-based interface created to build reproducible and scalable data analyses and machine-learning routines. To briefly sum up the advantages of using such tools and structures, a Snakemake workflow can be described as a set of rules that denote how to create output files from input files. The workflow is implied by the dependencies between rules that arise when one rule needs the output file of another as an input file.
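The rule/dependency structure described above can be illustrated with a minimal Snakefile sketch (rule names, file paths and scripts here are hypothetical, not taken from ADNI_Snakefile): the `train` rule consumes the output of `preprocess`, which implicitly defines the execution order.

```
# Illustrative Snakefile fragment: the dependency between the two rules
# is inferred from matching output/input files, no explicit ordering needed.
rule preprocess:
    input: "data/raw.hdf"
    output: "data/clean.hdf"
    script: "scripts/preprocess.py"

rule train:
    input: "data/clean.hdf"
    output: "results/scores.csv"
    script: "scripts/train.py"
```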

General Pipeline Summary Visualization:


Sample Pipeline Rules Execution DAG:


To see the complete pipeline for the ADNI Database analysis developed with Snakemake please look at: ADNI_Snakefile

Running the Snakefile

Snakemake offers simple multi-core running options. Please note that some of the scipy.stats functions used tend to exploit all available resources, conflicting with Snakemake's attempt to manage threads. To avoid this, force single-threading by setting the following environment variable:

export OMP_NUM_THREADS=1

After moving to the folder containing the Snakefile (i.e. ADNI_Snakefile), a dry run to check if the workflow is properly defined can be performed by using:

snakemake -n

To launch the pipeline over the desired number of cores use:

snakemake -j n_cores

Snakemake will then use up to n_cores cores and solve a binary knapsack problem to optimize the scheduling of jobs. To visualize the pipeline DAG (Directed Acyclic Graph) and save it in the desired format through the graphviz dot tool, use:

snakemake --dag | dot -Tpdf > dag.pdf

Data Formats

The WISDoM pipeline is compatible with .hdf tabulated data. Each row must be an entry of the database and each column an element of the upper (or lower) triangle of the symmetric positive-definite matrix associated with that entry, excluding the diagonal elements. The head_wrap.py module contains useful functions and wrappers to associate labels to entries from existing .csv files, while the gen_fs.py module contains functions to reconstruct the NumPy tensor from a Pandas DataFrame obtained by reading the .hdf file.
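A row in this layout can be turned back into a full matrix as sketched below. The function name and the unit diagonal (correlation-style) are assumptions made for illustration; the actual reconstruction logic lives in gen_fs.py.

```python
# Sketch: rebuild a symmetric matrix from one database row storing the
# strict upper triangle (diagonal excluded), as described above.
import numpy as np

def row_to_matrix(row, p):
    """Turn a flat vector of length p*(p-1)//2 holding the strict upper
    triangle into a p x p symmetric matrix with ones on the diagonal."""
    m = np.eye(p)
    iu = np.triu_indices(p, k=1)   # indices of the strict upper triangle
    m[iu] = row
    m[iu[::-1]] = row              # mirror into the lower triangle
    return m

# Example: a 3x3 correlation matrix stored as its 3 off-diagonal entries
mat = row_to_matrix(np.array([0.2, 0.5, -0.1]), p=3)
```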

Requisites

In order to successfully use the WISDoM pipeline, the Snakemake environment must be correctly set up. Furthermore, the modules rely on scipy.stats for Wishart sampling and on scikit-learn for training and classification; Pandas is also required for DataFrame creation and handling.
