GPML

A benchmarking pipeline for evaluating protein variant effect prediction tools against MAVE (Multiplex Assay of Variant Effect) gold standard data.

Overview

GPML computes Spearman rank correlations between experimental MAVE measurements and scores from 9 computational prediction tools. The MAVE gold standard includes 36 proteins from several organisms (human, bacteria including E. coli, virus, yeast), but the analysis is restricted to human proteins because dbNSFP, the source of the tool scores, only contains annotations for human variants. The pipeline accounts for training set overlap, so that tools are not rewarded for variants they were trained on, and generates publication-ready plots and summary tables.
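As a rough illustration of the comparison described here, the sketch below computes a Spearman correlation twice on toy data: once on all variants and once after dropping variants flagged as part of a tool's training set. The column names (mave_score, tool_score, in_training) and the helper spearman are hypothetical and are not GPML's actual schema or API. The coefficient is computed as the Pearson correlation of average ranks, which is the same value that scipy.stats.spearmanr reports.

```python
import pandas as pd


def spearman(df, score_col, mave_col="mave_score"):
    """Spearman rho, computed as the Pearson correlation of average ranks."""
    # Drop variants where either the assay value or the tool score is missing.
    pair = df[[mave_col, score_col]].dropna()
    return pair[mave_col].rank().corr(pair[score_col].rank())


# Toy data only; real GPML column names may differ (assumption).
toy = pd.DataFrame({
    "mave_score":  [0.10, 0.40, 0.35, 0.80, 0.95, 0.20],
    "tool_score":  [0.20, 0.30, 0.50, 0.70, 0.90, None],
    "in_training": [False, True, False, False, True, False],
})

# All scored variants, regardless of training set membership.
correlation_all = spearman(toy, "tool_score")

# Only variants the tool has not seen during training.
correlation_strict = spearman(toy[~toy["in_training"]], "tool_score")

print(f"all: {correlation_all:.3f}  strict: {correlation_strict:.3f}")
```

A large gap between the two values would suggest that a tool's apparent performance depends on variants it was trained on.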

Prediction tools evaluated: MutPred, REVEL, EVE, AlphaMissense, DEOGEN2, ClinPred, PrimateAI, FATHMM, MutationTaster

Project Structure

GPML/
├── Snakefile                               # Pipeline entry point
├── install.sh                              # Database download and setup script
├── gpml_env.yml                            # Full monolithic conda environment
│
├── config/
│   └── config.yaml                         # Pipeline paths and parameters
│
├── src/                                    # Core library modules
│   ├── constants.py                        # Global config, column names, protein mappings
│   ├── utils.py                            # Shared utility functions
│   ├── dataframe_preprocessor.py           # MAVE data processing and tool score integration
│   ├── baseline_calculation.py             # PSSM baseline computation (LOPO cross-validation)
│   ├── pssm_baseline.py                    # PSSM matrix operations
│   ├── corelation_calculator.py            # Spearman correlation computations
│   ├── dbNSFP_preprocessor.py              # dbNSFP database query handling
│   ├── plot_graphs.py                      # Visualization generation
│   ├── tables_generator.py                 # Summary table generation
│   └── parallel_script.py                  # Parallel processing utilities
│
├── workflow/
│   ├── rules/                              # Snakemake rule modules
│   │   ├── preprocess.smk                  # Step 1: Data integration
│   │   ├── baseline.smk                    # Step 2: PSSM baseline setup + calculation
│   │   ├── correlation.smk                 # Step 3: Spearman correlations
│   │   ├── plots.smk                       # Step 4: Visualization
│   │   └── tables.smk                      # Step 5: Summary tables
│   ├── envs/                               # Per-step conda environments
│   │   ├── preprocess.yaml
│   │   ├── baseline.yaml
│   │   ├── correlation.yaml
│   │   ├── plots.yaml
│   │   └── tables.yaml
│   └── scripts/                            # Python scripts called by Snakemake rules
│       ├── main.py                         # Path definitions and data loading
│       ├── main_dataframe_preprocessor.py  # Step 1: Build master DataFrame
│       ├── main_pssm_baseline.py           # Step 2a: Add PSSM baseline columns
│       ├── main_baseline_calculation.py    # Step 2b: Compute PSSM baseline scores
│       ├── main_corelation_calculator.py   # Step 3: Calculate Spearman correlations
│       ├── main_plot_graphs.py             # Step 4: Generate comparison plots
│       ├── main_tables.py                  # Step 5: Generate summary tables
│       ├── main_dbNSFP.py                  # dbNSFP score integration and analysis
│       ├── main_mutepred_overlap.py        # MutPred training overlap analysis
│       ├── parallel_main.py                # Parallel dbNSFP processing
│       ├── all_runner.py                   # Legacy sequential runner (superseded by Snakefile)
│       └── testing_code.py                 # Development and testing scripts
│
├── Data/                                   # Input and cached data (git-ignored)
│   ├── mave_gs_data/                       # MAVE gold standard FASTA files
│   ├── dbNSFP/                             # dbNSFP database files (from install.sh)
│   ├── dbNSFP_input_dir/                   # dbNSFP query inputs
│   ├── dbNSFP_output_dir/                  # dbNSFP tool score outputs
│   ├── mutepred_scores/                    # MutPred prediction outputs
│   ├── pickled_dataframes/                 # Cached intermediate DataFrames
│   └── HandPickedMSAs/                     # Multiple sequence alignments
│
├── Plots/                                  # Generated figures (git-ignored)
└── Tables/                                 # Generated summary tables (git-ignored)

Pipeline

The pipeline runs in sequential steps, each producing a cached DataFrame used by the next step:

1. Data integration (main_dataframe_preprocessor.py): loads MAVE gold standard data, integrates dbNSFP scores for all 9 tools, and flags mutations that appear in tool training sets (DEOGEN2, ClinVar, MutPred). Outputs gold_std_df.pkl.
2. PSSM baseline setup (main_pssm_baseline.py): filters down to human proteins only (non-human proteins are excluded because dbNSFP has no annotations for them) and adds PSSM columns. Outputs pssm_base.pkl.
3. Baseline calculation (main_baseline_calculation.py): computes PSSM baseline scores using Leave-One-Protein-Out (LOPO) cross-validation.
4. Correlation analysis (main_corelation_calculator.py): calculates Spearman correlations for each tool, both including and excluding training set mutations (correlation_all vs correlation_strict). Outputs gold_std_df_only_human_with_baseline_corelation.pkl.
5. Visualization (main_plot_graphs.py): generates bar plots comparing tool performance, sorted by PSSM baseline correlation. Outputs to Plots/.
6. Tables (main_tables.py): generates summary tables of human proteins with SNP counts and UniProt IDs. Outputs to Tables/.
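The Leave-One-Protein-Out scheme used for the baseline can be sketched as follows. This is a minimal illustration of the splitting logic only: each protein is held out in turn and scored using information from the remaining proteins. The per-substitution mean used as the "model" is a deliberately simple stand-in, not GPML's PSSM baseline, and the column names (protein, substitution, mave_score) and function names are hypothetical.

```python
import pandas as pd


def lopo_splits(df, protein_col="protein"):
    """Yield (held_out_protein, train_df, test_df), one split per protein."""
    for protein in df[protein_col].unique():
        held_out = df[protein_col] == protein
        yield protein, df[~held_out], df[held_out]


def lopo_baseline(df, feature_col="substitution", target_col="mave_score"):
    """Score each variant with the mean target of its substitution type,
    estimated only from the other proteins (stand-in for a real model)."""
    scores = pd.Series(float("nan"), index=df.index, dtype=float)
    for _, train, test in lopo_splits(df):
        # "Fit" on every protein except the held-out one.
        lookup = train.groupby(feature_col)[target_col].mean()
        # Substitutions never seen in training stay NaN.
        scores.loc[test.index] = test[feature_col].map(lookup)
    return scores


# Toy data only; real GPML column names may differ (assumption).
toy = pd.DataFrame({
    "protein":      ["P1", "P1", "P2", "P2", "P3", "P3"],
    "substitution": ["A>V", "G>R", "A>V", "G>R", "A>V", "L>P"],
    "mave_score":   [0.9, 0.2, 0.7, 0.4, 0.8, 0.1],
})
toy["baseline"] = lopo_baseline(toy)
print(toy)
```

Because the held-out protein never contributes to its own scores, the baseline correlation for that protein is not inflated by the protein's own measurements.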

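To run the full pipeline with Snakemake from the repository root (the Snakefile is the pipeline entry point; --use-conda lets Snakemake create the per-step environments from workflow/envs/):

snakemake --cores 1 --use-conda

The legacy sequential runner is still in the repository, but it has been superseded by the Snakefile:

python workflow/scripts/all_runner.py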

To run dbNSFP queries in parallel (the script lives in workflow/scripts/):

python workflow/scripts/parallel_main.py

Setup

conda env create -f gpml_env.yml
conda activate gpml_env

Key dependencies: Python 3.9, pandas 1.4, numpy 1.23, scipy 1.9, scikit-learn 1.0, matplotlib 3.5, tensorflow 2.12
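After activating the environment, a quick check like the one below can confirm that the installed packages match the versions listed under key dependencies. It is a minimal sketch using only the standard library; the helper name check_versions is hypothetical and not part of GPML, and the expected prefixes are copied from the dependency list in this README.

```python
from importlib.metadata import PackageNotFoundError, version

# Version prefixes copied from the "Key dependencies" line of this README.
EXPECTED = {
    "pandas": "1.4",
    "numpy": "1.23",
    "scipy": "1.9",
    "scikit-learn": "1.0",
    "matplotlib": "3.5",
    "tensorflow": "2.12",
}


def check_versions(expected):
    """Map each package to (installed_version_or_None, matches_expected_prefix)."""
    report = {}
    for name, prefix in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        # Match "1.4" or "1.4.x", but not "1.40".
        ok = installed is not None and (
            installed == prefix or installed.startswith(prefix + ".")
        )
        report[name] = (installed, ok)
    return report


for name, (installed, ok) in check_versions(EXPECTED).items():
    status = "ok" if ok else "check"
    print(f"{name}: {installed or 'not installed'} [{status}]")
```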
