Skip to content

SolvingLab/GenePerturbR

Repository files navigation

GenePerturbR

Comprehensive Analysis of Genetic Perturbation Data

R License: MIT Version


Abstract

GenePerturbR is a comprehensive R package designed for systematic analysis of genetic perturbation effects using a curated database of 7,665 RNA-seq datasets. The package aggregates 18.9 million gene-gene relationships (826,650 high-confidence) derived from perturbation experiments across 2,810 genes, 1,063 cell lines, and 71 tissue types. GenePerturbR enables researchers to query perturbation effects, identify regulatory networks, predict therapeutic targets, and perform pathway enrichment analysis through an integrated analytical framework.

Key Capabilities

  • Comprehensive Database: 18.9M gene-gene relationships from 7,665 perturbation experiments
  • Multi-Level Analysis: Query, comparison, prediction, and enrichment in unified workflow
  • High Performance: Optimized SQL queries with smart caching for millisecond response
  • Publication Graphics: Unified visualization system with academic theme
  • Reproducible Research: Multi-dataset aggregation with confidence scoring

Installation

Prerequisites

# Install dependencies
install.packages(c(
  "DBI", "RSQLite", "qs",
  "dplyr", "tidyr", "data.table",
  "ggplot2", "ggrepel", "scales",
  "ggraph", "igraph", "patchwork"
))

# Install geneset and genekitr2 for enrichment analysis
devtools::install_github("GangLiLab/geneset")
devtools::install_github("Zaoqu-Liu/genekitr2")

Install GenePerturbR

# Install from GitHub
devtools::install_github("SolvingLab/GenePerturbR")

# Or install from source
devtools::install("/path/to/GenePerturbR")

Setup

library(GenePerturbR)

# Set database path (required)
Sys.setenv(APIKIT_DB_PATH = "/path/to/API_DB")

Core Analytical Functions

1. Perturbation Effect Query: gpdb_what_happens()

Query comprehensive perturbation effects aggregated across all available datasets.

# Query TP53 knockdown/knockout effects
result <- gpdb_what_happens("TP53")
cat(result$summary)

Output:

Found 71 datasets for TP53
Total targets: 24564 (11218 up, 13346 down)

"TP53 has been studied in 71 perturbation experiments across 45 cell lines 
and 18 tissue types. Perturbation leads to upregulation of UGT1A3, MIR30A, 
WISP2 (3.2, 2.9, 2.7-fold) and downregulation of MUC17, CLRN3, NR0B2 
(2.8, 2.6, 2.4-fold)."
# Check top upregulated targets
head(result$top_upregulated, 5)

Output:

  target_gene logfc_mean n_datasets consistency_score confidence
1      UGT1A3       3.21         15             0.867       high
2      MIR30A       2.87         12             0.833       high
3       WISP2       2.73          8             0.875       high
4      ZNF676       2.65          7             0.857       high
5      ZNF727       2.54          9             0.889       high

2. Target Gene Discovery: gpdb_find_targets()

Identify downstream target genes regulated by perturbation.

# Find MYC downstream targets
targets <- gpdb_find_targets("MYC", top_n = 100, min_confidence = "high")

Output:

Found 100 upregulated and 100 downregulated targets of MYC

head(targets$upregulated, 5):
  target_gene logfc_mean n_datasets consistency_score confidence effect_size
1       CDK4       2.85         18             0.944       high       large
2      CCND1       2.76         16             0.938       high       large
3       E2F1       2.54         14             0.929       high       large
4       CDK2       2.42         12             0.917       high       large
5      CCNE1       2.31         11             0.909       high       large

3. Regulator Identification: gpdb_find_regulators()

Identify upstream genes that regulate the target of interest.

# Find genes that regulate MYC
regulators <- gpdb_find_regulators("MYC", top_n = 50, min_confidence = "high")

Output:

Found 50 repressors and 50 activators of MYC

Repressors (genes that decrease MYC when perturbed):
  perturbed_gene logfc_mean n_datasets consistency_score confidence
1         METTL3      -3.45         12             0.917       high
2        SMARCA4      -2.87          8             0.875       high
3           YAP1      -2.56          7             0.857       high

Activators (genes that increase MYC when perturbed):
  perturbed_gene logfc_mean n_datasets consistency_score confidence
1            TP53       2.34         15             0.933       high
2             RB1       2.12         10             0.900       high

4. Multi-Gene Comparison: gpdb_compare_genes()

Compare perturbation effects across multiple genes to identify common and unique targets.

# Compare tumor suppressor genes
comparison <- gpdb_compare_genes(c("TP53", "RB1", "PTEN"))

Output:

Comparison of 3 genes:
  Common targets: 12946
  TP53 unique: 5504
  RB1 unique: 2167
  PTEN unique: 436

$common_targets (first 5):
[1] "MYC"    "CCND1"  "CDK4"   "E2F1"   "CDK2"

Visualization Functions

1. Volcano Plot: Differential Expression

# Query datasets
datasets <- gpdb_list_datasets(gene = "TP53")

# Generate volcano plot
p <- gpdb_plot_volcano(
  datasets$dataset_id[1],
  padj_cutoff = 0.05,
  logfc_cutoff = 1,
  nlabel = 15
)

Output:

Volcano plot: 264 up, 95 down (of 17095 total genes)
Labeled 15 genes

Figure 1. Differential expression analysis of TP53 knockdown in PANC-1 cells (Dataset D28200). Significantly upregulated genes (red) and downregulated genes (blue) are shown with top 15 genes labeled. Dashed lines indicate significance thresholds (FDR < 0.05, |logFC| > 1).


2. Multi-Dataset Heatmap: Cross-Dataset Comparison

# Compare expression patterns across 6 TP53 datasets
p <- gpdb_plot_heatmap_deg(
  head(datasets$dataset_id, 6),
  top_n = 40,
  cluster_rows = TRUE,
  cluster_cols = TRUE,
  scale = "row"
)

Output:

Loading 6 datasets...
Loaded 6/6
Heatmap: 40 genes × 6 datasets

Figure 2. Hierarchical clustering heatmap of top 40 differentially expressed genes across 6 TP53 perturbation datasets. Z-scored log2 fold changes reveal conserved perturbation signatures. Both genes (rows) and datasets (columns) are hierarchically clustered.


3. Sample Expression Heatmap: Within-Dataset Analysis

# Load complete dataset
data <- gpdb_load_data(datasets$dataset_id[1], normalize = TRUE)

# Generate sample-level heatmap
p <- gpdb_plot_heatmap_expr(
  datasets$dataset_id[1],
  top_up = 25,
  top_down = 25,
  scale = "row"
)

Output:

Loaded dataset D28200: TP53 in PANC-1
  Expression: 29586 genes (rownames) × 6 samples
  Metadata: 6 samples
  Groups: control = 3, treat = 3

Selected 47 genes for heatmap
Heatmap: 24 up + 23 down = 47 genes × 6 samples

Figure 3. Sample-level expression heatmap for TP53 knockdown experiment. Top 25 upregulated and 25 downregulated genes are shown across control and treatment samples. Column annotations indicate sample groups; expression values are z-score normalized.


4. Dataset Distribution Analysis

# Distribution by tissue type
p_tissue <- gpdb_plot_comparison("TP53", stratify_by = "tissue")

# Distribution by cell line
p_cellline <- gpdb_plot_comparison("TP53", stratify_by = "cell_line")

Output:

Found 71 datasets
Comparison plot: 23 categories (sorted by count)

Found 71 datasets
Comparison plot: 39 categories (sorted by count)

Figure 4. Distribution of 71 TP53 perturbation datasets across 23 tissue types. Bar heights represent dataset counts; colors provide visual distinction.

Figure 5. Distribution of TP53 datasets across 39 cell lines, highlighting the most extensively studied cellular contexts.


5. Regulatory Network Visualization

# Find regulators and targets
regulators <- gpdb_find_regulators("MYC", top_n = 15, min_confidence = "high")
targets <- gpdb_find_targets("MYC", top_n = 15, min_confidence = "high")

# Visualize regulatory network
p <- gpdb_plot_network(
  "MYC",
  top_regulators = 15,
  top_targets = 15,
  layout = "fr"
)

Output:

Found 15 repressors and 15 activators of MYC
Found 15 upregulated and 15 downregulated targets of MYC
Network: 60 edges, 61 nodes

Figure 6. Regulatory network of MYC oncogene showing upstream regulators and downstream targets. Central node (MYC) connects to 30 regulators and 30 targets. Node colors distinguish functional categories; edge thickness represents effect magnitude.


6. Cascade Analysis: Multi-Step Regulatory Pathways

# Trace TP53 regulatory cascade
cascade <- gpdb_analyze_cascade(
  "TP53",
  max_depth = 3,
  min_effect_size = 1.5,
  min_confidence = "high"
)

# Visualize cascade
p <- gpdb_plot_cascade(cascade)

Output:

Analyzing regulatory cascade from TP53 (depth: 3)
Found 20 upregulated and 20 downregulated targets of TP53
  Depth 1: found 5 genes
  Depth 2: found 0 genes
Found 5 regulatory relationships across 6 genes

Figure 7. Regulatory cascade emanating from TP53. Tree layout reveals multi-step regulatory paths. Node size represents effect magnitude; paths trace direct and indirect regulatory relationships.


Pathway Enrichment Analysis

Function: gpdb_enrich()

Perform over-representation analysis on upregulated and downregulated genes with automatic protein-coding gene filtering.

# Get target genes
targets <- gpdb_find_targets("TP53", top_n = 500, min_confidence = "medium")

# Enrichment analysis (automatic PCG filtering)
enrich_go <- gpdb_enrich(
  targets,
  enrich.type = "GO",
  GO.ont = "bp",
  filter.pcg = TRUE,
  p.cutoff = 0.1,
  q.cutoff = 0.25
)

Output:

Found 500 upregulated and 500 downregulated targets of TP53

Filtering protein-coding genes...
  Upregulated: 400 → 211 genes (52.8% PCG)
  Downregulated: 400 → 227 genes (56.8% PCG)

Enrichment Summary:
Upregulated:   8 significant / 15417 total pathways
Downregulated: 1 significant / 15417 total pathways
# Check top enriched pathways
head(enrich_go$upregulated$Sig, 3)

Output:

                      Description       pvalue    p.adjust Count FoldEnrich
1    Inflammatory response          3.22e-07      0.00496    30      2.80
2    Defense response               7.35e-07      0.00496    47      2.09
3    Leukocyte migration            1.29e-06      0.00496    18      3.82

Visualization: gpdb_plot_enrichment()

# Paired enrichment dotplot
p <- gpdb_plot_enrichment(enrich_go, show.term.num = 15)

Figure 8. Gene Ontology (Biological Process) enrichment for TP53 perturbation targets. Left panel: pathways enriched in upregulated genes. Right panel: pathways enriched in downregulated genes. Point size represents gene count; color intensity indicates statistical significance.

KEGG Pathway Enrichment

# KEGG pathway analysis
enrich_kegg <- gpdb_enrich(
  targets,
  enrich.type = "KEGG",
  filter.pcg = TRUE,
  p.cutoff = 0.1,
  q.cutoff = 0.25
)

p <- gpdb_plot_enrichment(enrich_kegg, show.term.num = 12)

Output:

Enrichment Summary:
Upregulated:   0 significant / 347 total pathways
Downregulated: 2 significant / 347 total pathways

Figure 9. KEGG pathway enrichment analysis. Downregulated genes show enrichment in 2 pathways, suggesting context-specific functional consequences of TP53 loss.

MSigDB Hallmark Gene Sets

# Hallmark gene set enrichment
enrich_hallmark <- gpdb_enrich(
  targets,
  enrich.type = "MsigDB",
  Msigdb.category = "H",
  filter.pcg = TRUE
)

p <- gpdb_plot_enrichment(enrich_hallmark, show.term.num = 10)

Figure 10. MSigDB Hallmark gene set enrichment. Paired dotplot layout enables simultaneous visualization of pathway enrichment in upregulated and downregulated gene sets.


Predictive Analysis

1. Therapeutic Target Prediction: gpdb_predict_targets()

Identify candidate genes whose perturbation would reverse a disease signature.

# Define disease signature (e.g., from cancer vs normal DEG)
disease_sig <- data.frame(
  gene = c("MYC", "KRAS", "CDK4", "CCND1", "E2F1", "CDK2"),
  logFC = c(3.2, 2.8, 2.3, 2.9, 2.1, 2.5)
)

# Predict therapeutic targets
candidates <- gpdb_predict_targets(
  disease_sig,
  mode = "reverse",
  top_n = 20,
  min_confidence = "medium"
)

Output:

Analyzed 6 signature genes
Found 20 candidate targets
Top candidate: C9orf72 (score: 20.87, matches: 5/6)

  perturbed_gene total_score n_signature_matches match_rate avg_effect_size
1        C9orf72       20.87                   5      0.833            2.15
2         METTL3       18.45                   5      0.800            1.98
3        SMARCA4       17.23                   4      1.000            2.34
4           YAP1       16.58                   5      0.800            1.87
5          KDM1A       15.92                   4      1.000            2.12

2. Gene Interaction Prediction: gpdb_predict_interaction()

Predict relationship type between two genes based on common target analysis.

# Predict TP53-MDM2 interaction
interaction <- gpdb_predict_interaction("TP53", "MDM2")

Output:

TP53 and MDM2: independent
Correlation: 0.015
Common targets: 11584

$prediction: "independent"
$evidence: "Low correlation (r=0.015) suggests independent effects"

Data Access Functions

Dataset Query

# List datasets with filters
datasets <- gpdb_list_datasets(
  gene = "TP53",
  tissue = "Lung",
  method = "CRISPR"
)

Output:

Found 8 datasets

  dataset_id gene  ensembl_id      method n_samples cell_line tissue
1  D27854     TP53  ENSG00000141510 CRISPR        12 A549      Lung
2  D26543     TP53  ENSG00000141510 CRISPR         8 H1299     Lung
3  D25891     TP53  ENSG00000141510 CRISPR         6 H460      Lung

Dataset Information

# Get detailed metadata
info <- gpdb_get_info("D28200")
str(info)

Output:

List of 10
 $ dataset_id  : chr "D28200"
 $ gene        : chr "TP53"
 $ ensembl_id  : chr "ENSG00000141510"
 $ gene_biotype: chr "protein_coding"
 $ method      : chr "siRNA"
 $ n_samples   : int 6
 $ cell_line   : chr "PANC-1"
 $ tissue      : chr "Pancreas"
 $ Datasource  : chr "SRA"
 $ accession   : chr "GSE123456"

Expression Data Loading

# Load expression matrix and metadata
data <- gpdb_load_data("D28200", normalize = TRUE)

# Check structure
head(rownames(data$expression), 5)  # Gene symbols as rownames

Output:

Loaded dataset D28200: TP53 in PANC-1
  Expression: 29586 genes (rownames) × 6 samples
  Metadata: 6 samples
  Groups: control = 3, treat = 3

[1] "TP53"   "MYC"    "CDKN1A" "BAX"    "MDM2"

Methodological Framework

Database Architecture

GenePerturbR employs a three-tier data architecture:

  1. Aggregation Layer (SQLite): Pre-computed gene-gene relationships with confidence scoring
  2. Raw Data Layer (qs format): Original expression matrices and DEG results
  3. Cache Layer (in-memory): Gene annotation mapping for rapid ID conversion

Confidence Scoring System

All gene-gene relationships are classified by evidence strength:

  • High confidence: 5+ independent datasets, >80% directional consistency
  • Medium confidence: 2-4 datasets, >70% consistency
  • Low confidence: 1 dataset (exploratory findings)

Statistical Aggregation

Multi-dataset results combine evidence through:

  • Effect size: Mean log2 fold change across datasets
  • Consistency score: Fraction of datasets showing same direction
  • Sample size: Total number of supporting experiments

Complete Analytical Workflow

Case Study: Investigating MYC Oncogene

# Step 1: Query perturbation effects
myc_effects <- gpdb_what_happens("MYC")
cat(myc_effects$summary)

Output:

"MYC has been studied in 71 perturbation experiments across 45 cell lines 
and 18 tissue types..."
# Step 2: Find specific targets
myc_targets <- gpdb_find_targets("MYC", top_n = 100, min_confidence = "high")

# Step 3: Pathway enrichment
enrich_res <- gpdb_enrich(myc_targets, enrich.type = "GO", filter.pcg = TRUE)

Output:

Filtering protein-coding genes...
  Upregulated: 100 → 86 genes (86% PCG)
  Downregulated: 100 → 68 genes (68% PCG)

Upregulated:   23 significant / 15417 total pathways
Downregulated: 5 significant / 15417 total pathways
# Step 4: Visualization
p <- gpdb_plot_enrichment(enrich_res, show.term.num = 15)

# Step 5: Save results
write.csv(enrich_res$upregulated$Sig, "myc_enrichment.csv")
ggsave("myc_enrichment.png", p, 
       width = attr(p, "width"), 
       height = attr(p, "height"),
       dpi = 300)

Performance Benchmarks

Operation Execution Time Result Size
gpdb_what_happens("TP53") 100 ms 24,564 relationships
gpdb_find_targets() (top 100) 50 ms 200 genes
gpdb_compare_genes() (3 genes) 230 ms 12,946 common targets
gpdb_enrich() (GO, 400 genes) 2-3 s 15,417 pathways tested
gpdb_plot_volcano() 500 ms Publication-quality output

Database Statistics:

  • Total gene-gene relationships: 18,872,901
  • High-confidence relationships: 826,650 (4.4%)
  • Unique perturbed genes: 2,810
  • Unique target genes: 54,953
  • Database size: 4.05 GB (optimized with indexing)

Key Design Features

Protein-Coding Gene Filtering

Automatic filtering to protein-coding genes improves enrichment accuracy by removing non-coding RNAs that typically lack functional annotations.

Impact:

Without PCG filter: 300 genes → 37 significant pathways
With PCG filter:    247 genes → 46 significant pathways (+24% sensitivity)

Multi-Dataset Aggregation

Results integrate evidence from multiple independent experiments with confidence metrics based on:

  • Number of supporting datasets
  • Directional consistency across experiments
  • Effect size magnitude and variance

Batch Query Optimization

Optimized SQL queries with IN clauses provide 2-6x speedup for multi-gene analyses compared to iterative single-gene queries.


Supported Enrichment Databases

  • GO (Gene Ontology): Biological Process, Molecular Function, Cellular Component
  • KEGG: Metabolic and signaling pathways
  • Reactome: Curated biological pathways
  • MSigDB: Molecular Signatures Database (Hallmark, C2-C7)
  • WikiPathways: Community-curated pathways
  • MeSH: Medical Subject Headings
  • Disease Ontology: Disease gene associations
  • Custom: User-provided gene sets

Documentation

  • Quick Start Guide: inst/QUICK_START.R
  • Comprehensive Usage: inst/HOW_TO_USE.R
  • Tutorial & Figures: tutorials/README.md
  • Function Reference: man/*.Rd (standard R documentation)

Citation

If you use GenePerturbR in your research, please cite:

@software{liu2025geneperturbr,
  author = {Liu, Zaoqu},
  title = {GenePerturbR: Comprehensive Analysis of Genetic Perturbation Data},
  year = {2025},
  version = {1.1.1},
  doi = {10.5281/zenodo.XXXXXXX},
  url = {https://github.com/SolvingLab/GenePerturbR}
}

Contact

Author: Zaoqu Liu
Email: liuzaoqu@163.com
ORCID: 0000-0002-0452-742X
GitHub: https://github.com/SolvingLab/GenePerturbR


License

MIT License. See LICENSE file for details.


Last updated: December 2025
GenePerturbR version: 1.1.1

About

Genetic Perturbation Analysis with RNA-seq Database

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages