GenePerturbR is a comprehensive R package designed for systematic analysis of genetic perturbation effects using a curated database of 7,665 RNA-seq datasets. The package aggregates 18.9 million gene-gene relationships (826,650 high-confidence) derived from perturbation experiments across 2,810 genes, 1,063 cell lines, and 71 tissue types. GenePerturbR enables researchers to query perturbation effects, identify regulatory networks, predict therapeutic targets, and perform pathway enrichment analysis through an integrated analytical framework.
- Comprehensive Database: 18.9M gene-gene relationships from 7,665 perturbation experiments
- Multi-Level Analysis: Query, comparison, prediction, and enrichment in unified workflow
- High Performance: Optimized SQL queries with smart caching for millisecond response
- Publication Graphics: Unified visualization system with academic theme
- Reproducible Research: Multi-dataset aggregation with confidence scoring
# Install dependencies
install.packages(c(
"DBI", "RSQLite", "qs",
"dplyr", "tidyr", "data.table",
"ggplot2", "ggrepel", "scales",
"ggraph", "igraph", "patchwork"
))
# Install geneset and genekitr2 for enrichment analysis
devtools::install_github("GangLiLab/geneset")
devtools::install_github("Zaoqu-Liu/genekitr2")# Install from GitHub
devtools::install_github("SolvingLab/GenePerturbR")
# Or install from source
devtools::install("/path/to/GenePerturbR")library(GenePerturbR)
# Set database path (required)
Sys.setenv(APIKIT_DB_PATH = "/path/to/API_DB")Query comprehensive perturbation effects aggregated across all available datasets.
# Query TP53 knockdown/knockout effects
result <- gpdb_what_happens("TP53")
cat(result$summary)Output:
Found 71 datasets for TP53
Total targets: 24564 (11218 up, 13346 down)
"TP53 has been studied in 71 perturbation experiments across 45 cell lines
and 18 tissue types. Perturbation leads to upregulation of UGT1A3, MIR30A,
WISP2 (3.2, 2.9, 2.7-fold) and downregulation of MUC17, CLRN3, NR0B2
(2.8, 2.6, 2.4-fold)."
# Check top upregulated targets
head(result$top_upregulated, 5)Output:
target_gene logfc_mean n_datasets consistency_score confidence
1 UGT1A3 3.21 15 0.867 high
2 MIR30A 2.87 12 0.833 high
3 WISP2 2.73 8 0.875 high
4 ZNF676 2.65 7 0.857 high
5 ZNF727 2.54 9 0.889 high
Identify downstream target genes regulated by perturbation.
# Find MYC downstream targets
targets <- gpdb_find_targets("MYC", top_n = 100, min_confidence = "high")Output:
Found 100 upregulated and 100 downregulated targets of MYC
head(targets$upregulated, 5):
target_gene logfc_mean n_datasets consistency_score confidence effect_size
1 CDK4 2.85 18 0.944 high large
2 CCND1 2.76 16 0.938 high large
3 E2F1 2.54 14 0.929 high large
4 CDK2 2.42 12 0.917 high large
5 CCNE1 2.31 11 0.909 high large
Identify upstream genes that regulate the target of interest.
# Find genes that regulate MYC
regulators <- gpdb_find_regulators("MYC", top_n = 50, min_confidence = "high")Output:
Found 50 repressors and 50 activators of MYC
Repressors (genes that decrease MYC when perturbed):
perturbed_gene logfc_mean n_datasets consistency_score confidence
1 METTL3 -3.45 12 0.917 high
2 SMARCA4 -2.87 8 0.875 high
3 YAP1 -2.56 7 0.857 high
Activators (genes that increase MYC when perturbed):
perturbed_gene logfc_mean n_datasets consistency_score confidence
1 TP53 2.34 15 0.933 high
2 RB1 2.12 10 0.900 high
Compare perturbation effects across multiple genes to identify common and unique targets.
# Compare tumor suppressor genes
comparison <- gpdb_compare_genes(c("TP53", "RB1", "PTEN"))Output:
Comparison of 3 genes:
Common targets: 12946
TP53 unique: 5504
RB1 unique: 2167
PTEN unique: 436
$common_targets (first 5):
[1] "MYC" "CCND1" "CDK4" "E2F1" "CDK2"
# Query datasets
datasets <- gpdb_list_datasets(gene = "TP53")
# Generate volcano plot
p <- gpdb_plot_volcano(
datasets$dataset_id[1],
padj_cutoff = 0.05,
logfc_cutoff = 1,
nlabel = 15
)Output:
Volcano plot: 264 up, 95 down (of 17095 total genes)
Labeled 15 genes
Figure 1. Differential expression analysis of TP53 knockdown in PANC-1 cells (Dataset D28200). Significantly upregulated genes (red) and downregulated genes (blue) are shown with top 15 genes labeled. Dashed lines indicate significance thresholds (FDR < 0.05, |logFC| > 1).
# Compare expression patterns across 6 TP53 datasets
p <- gpdb_plot_heatmap_deg(
head(datasets$dataset_id, 6),
top_n = 40,
cluster_rows = TRUE,
cluster_cols = TRUE,
scale = "row"
)Output:
Loading 6 datasets...
Loaded 6/6
Heatmap: 40 genes × 6 datasets
Figure 2. Hierarchical clustering heatmap of top 40 differentially expressed genes across 6 TP53 perturbation datasets. Z-scored log2 fold changes reveal conserved perturbation signatures. Both genes (rows) and datasets (columns) are hierarchically clustered.
# Load complete dataset
data <- gpdb_load_data(datasets$dataset_id[1], normalize = TRUE)
# Generate sample-level heatmap
p <- gpdb_plot_heatmap_expr(
datasets$dataset_id[1],
top_up = 25,
top_down = 25,
scale = "row"
)Output:
Loaded dataset D28200: TP53 in PANC-1
Expression: 29586 genes (rownames) × 6 samples
Metadata: 6 samples
Groups: control = 3, treat = 3
Selected 47 genes for heatmap
Heatmap: 24 up + 23 down = 47 genes × 6 samples
Figure 3. Sample-level expression heatmap for TP53 knockdown experiment. Top 25 upregulated and 25 downregulated genes are shown across control and treatment samples. Column annotations indicate sample groups; expression values are z-score normalized.
# Distribution by tissue type
p_tissue <- gpdb_plot_comparison("TP53", stratify_by = "tissue")
# Distribution by cell line
p_cellline <- gpdb_plot_comparison("TP53", stratify_by = "cell_line")Output:
Found 71 datasets
Comparison plot: 23 categories (sorted by count)
Found 71 datasets
Comparison plot: 39 categories (sorted by count)
Figure 4. Distribution of 71 TP53 perturbation datasets across 23 tissue types. Bar heights represent dataset counts; colors provide visual distinction.
Figure 5. Distribution of TP53 datasets across 39 cell lines, highlighting the most extensively studied cellular contexts.
# Find regulators and targets
regulators <- gpdb_find_regulators("MYC", top_n = 15, min_confidence = "high")
targets <- gpdb_find_targets("MYC", top_n = 15, min_confidence = "high")
# Visualize regulatory network
p <- gpdb_plot_network(
"MYC",
top_regulators = 15,
top_targets = 15,
layout = "fr"
)Output:
Found 15 repressors and 15 activators of MYC
Found 15 upregulated and 15 downregulated targets of MYC
Network: 60 edges, 61 nodes
Figure 6. Regulatory network of MYC oncogene showing upstream regulators and downstream targets. Central node (MYC) connects to 30 regulators and 30 targets. Node colors distinguish functional categories; edge thickness represents effect magnitude.
# Trace TP53 regulatory cascade
cascade <- gpdb_analyze_cascade(
"TP53",
max_depth = 3,
min_effect_size = 1.5,
min_confidence = "high"
)
# Visualize cascade
p <- gpdb_plot_cascade(cascade)Output:
Analyzing regulatory cascade from TP53 (depth: 3)
Found 20 upregulated and 20 downregulated targets of TP53
Depth 1: found 5 genes
Depth 2: found 0 genes
Found 5 regulatory relationships across 6 genes
Figure 7. Regulatory cascade emanating from TP53. Tree layout reveals multi-step regulatory paths. Node size represents effect magnitude; paths trace direct and indirect regulatory relationships.
Perform over-representation analysis on upregulated and downregulated genes with automatic protein-coding gene filtering.
# Get target genes
targets <- gpdb_find_targets("TP53", top_n = 500, min_confidence = "medium")
# Enrichment analysis (automatic PCG filtering)
enrich_go <- gpdb_enrich(
targets,
enrich.type = "GO",
GO.ont = "bp",
filter.pcg = TRUE,
p.cutoff = 0.1,
q.cutoff = 0.25
)Output:
Found 500 upregulated and 500 downregulated targets of TP53
Filtering protein-coding genes...
Upregulated: 400 → 211 genes (52.8% PCG)
Downregulated: 400 → 227 genes (56.8% PCG)
Enrichment Summary:
Upregulated: 8 significant / 15417 total pathways
Downregulated: 1 significant / 15417 total pathways
# Check top enriched pathways
head(enrich_go$upregulated$Sig, 3)Output:
Description pvalue p.adjust Count FoldEnrich
1 Inflammatory response 3.22e-07 0.00496 30 2.80
2 Defense response 7.35e-07 0.00496 47 2.09
3 Leukocyte migration 1.29e-06 0.00496 18 3.82
# Paired enrichment dotplot
p <- gpdb_plot_enrichment(enrich_go, show.term.num = 15)Figure 8. Gene Ontology (Biological Process) enrichment for TP53 perturbation targets. Left panel: pathways enriched in upregulated genes. Right panel: pathways enriched in downregulated genes. Point size represents gene count; color intensity indicates statistical significance.
# KEGG pathway analysis
enrich_kegg <- gpdb_enrich(
targets,
enrich.type = "KEGG",
filter.pcg = TRUE,
p.cutoff = 0.1,
q.cutoff = 0.25
)
p <- gpdb_plot_enrichment(enrich_kegg, show.term.num = 12)Output:
Enrichment Summary:
Upregulated: 0 significant / 347 total pathways
Downregulated: 2 significant / 347 total pathways
Figure 9. KEGG pathway enrichment analysis. Downregulated genes show enrichment in 2 pathways, suggesting context-specific functional consequences of TP53 loss.
# Hallmark gene set enrichment
enrich_hallmark <- gpdb_enrich(
targets,
enrich.type = "MsigDB",
Msigdb.category = "H",
filter.pcg = TRUE
)
p <- gpdb_plot_enrichment(enrich_hallmark, show.term.num = 10)Figure 10. MSigDB Hallmark gene set enrichment. Paired dotplot layout enables simultaneous visualization of pathway enrichment in upregulated and downregulated gene sets.
Identify candidate genes whose perturbation would reverse a disease signature.
# Define disease signature (e.g., from cancer vs normal DEG)
disease_sig <- data.frame(
gene = c("MYC", "KRAS", "CDK4", "CCND1", "E2F1", "CDK2"),
logFC = c(3.2, 2.8, 2.3, 2.9, 2.1, 2.5)
)
# Predict therapeutic targets
candidates <- gpdb_predict_targets(
disease_sig,
mode = "reverse",
top_n = 20,
min_confidence = "medium"
)Output:
Analyzed 6 signature genes
Found 20 candidate targets
Top candidate: C9orf72 (score: 20.87, matches: 5/6)
perturbed_gene total_score n_signature_matches match_rate avg_effect_size
1 C9orf72 20.87 5 0.833 2.15
2 METTL3 18.45 5 0.800 1.98
3 SMARCA4 17.23 4 1.000 2.34
4 YAP1 16.58 5 0.800 1.87
5 KDM1A 15.92 4 1.000 2.12
Predict relationship type between two genes based on common target analysis.
# Predict TP53-MDM2 interaction
interaction <- gpdb_predict_interaction("TP53", "MDM2")Output:
TP53 and MDM2: independent
Correlation: 0.015
Common targets: 11584
$prediction: "independent"
$evidence: "Low correlation (r=0.015) suggests independent effects"
# List datasets with filters
datasets <- gpdb_list_datasets(
gene = "TP53",
tissue = "Lung",
method = "CRISPR"
)Output:
Found 8 datasets
dataset_id gene ensembl_id method n_samples cell_line tissue
1 D27854 TP53 ENSG00000141510 CRISPR 12 A549 Lung
2 D26543 TP53 ENSG00000141510 CRISPR 8 H1299 Lung
3 D25891 TP53 ENSG00000141510 CRISPR 6 H460 Lung
# Get detailed metadata
info <- gpdb_get_info("D28200")
str(info)Output:
List of 10
$ dataset_id : chr "D28200"
$ gene : chr "TP53"
$ ensembl_id : chr "ENSG00000141510"
$ gene_biotype: chr "protein_coding"
$ method : chr "siRNA"
$ n_samples : int 6
$ cell_line : chr "PANC-1"
$ tissue : chr "Pancreas"
$ Datasource : chr "SRA"
$ accession : chr "GSE123456"
# Load expression matrix and metadata
data <- gpdb_load_data("D28200", normalize = TRUE)
# Check structure
head(rownames(data$expression), 5) # Gene symbols as rownamesOutput:
Loaded dataset D28200: TP53 in PANC-1
Expression: 29586 genes (rownames) × 6 samples
Metadata: 6 samples
Groups: control = 3, treat = 3
[1] "TP53" "MYC" "CDKN1A" "BAX" "MDM2"
GenePerturbR employs a three-tier data architecture:
- Aggregation Layer (SQLite): Pre-computed gene-gene relationships with confidence scoring
- Raw Data Layer (qs format): Original expression matrices and DEG results
- Cache Layer (in-memory): Gene annotation mapping for rapid ID conversion
All gene-gene relationships are classified by evidence strength:
- High confidence: 5+ independent datasets, >80% directional consistency
- Medium confidence: 2-4 datasets, >70% consistency
- Low confidence: 1 dataset (exploratory findings)
Multi-dataset results combine evidence through:
- Effect size: Mean log2 fold change across datasets
- Consistency score: Fraction of datasets showing same direction
- Sample size: Total number of supporting experiments
# Step 1: Query perturbation effects
myc_effects <- gpdb_what_happens("MYC")
cat(myc_effects$summary)Output:
"MYC has been studied in 71 perturbation experiments across 45 cell lines
and 18 tissue types..."
# Step 2: Find specific targets
myc_targets <- gpdb_find_targets("MYC", top_n = 100, min_confidence = "high")
# Step 3: Pathway enrichment
enrich_res <- gpdb_enrich(myc_targets, enrich.type = "GO", filter.pcg = TRUE)Output:
Filtering protein-coding genes...
Upregulated: 100 → 86 genes (86% PCG)
Downregulated: 100 → 68 genes (68% PCG)
Upregulated: 23 significant / 15417 total pathways
Downregulated: 5 significant / 15417 total pathways
# Step 4: Visualization
p <- gpdb_plot_enrichment(enrich_res, show.term.num = 15)
# Step 5: Save results
write.csv(enrich_res$upregulated$Sig, "myc_enrichment.csv")
ggsave("myc_enrichment.png", p,
width = attr(p, "width"),
height = attr(p, "height"),
dpi = 300)| Operation | Execution Time | Result Size |
|---|---|---|
gpdb_what_happens("TP53") |
100 ms | 24,564 relationships |
gpdb_find_targets() (top 100) |
50 ms | 200 genes |
gpdb_compare_genes() (3 genes) |
230 ms | 12,946 common targets |
gpdb_enrich() (GO, 400 genes) |
2-3 s | 15,417 pathways tested |
gpdb_plot_volcano() |
500 ms | Publication-quality output |
Database Statistics:
- Total gene-gene relationships: 18,872,901
- High-confidence relationships: 826,650 (4.4%)
- Unique perturbed genes: 2,810
- Unique target genes: 54,953
- Database size: 4.05 GB (optimized with indexing)
Automatic filtering to protein-coding genes improves enrichment accuracy by removing non-coding RNAs that typically lack functional annotations.
Impact:
Without PCG filter: 300 genes → 37 significant pathways
With PCG filter: 247 genes → 46 significant pathways (+24% sensitivity)
Results integrate evidence from multiple independent experiments with confidence metrics based on:
- Number of supporting datasets
- Directional consistency across experiments
- Effect size magnitude and variance
Optimized SQL queries with IN clauses provide 2-6x speedup for multi-gene analyses compared to iterative single-gene queries.
- GO (Gene Ontology): Biological Process, Molecular Function, Cellular Component
- KEGG: Metabolic and signaling pathways
- Reactome: Curated biological pathways
- MSigDB: Molecular Signatures Database (Hallmark, C2-C7)
- WikiPathways: Community-curated pathways
- MeSH: Medical Subject Headings
- Disease Ontology: Disease gene associations
- Custom: User-provided gene sets
- Quick Start Guide:
inst/QUICK_START.R - Comprehensive Usage:
inst/HOW_TO_USE.R - Tutorial & Figures:
tutorials/README.md - Function Reference:
man/*.Rd(standard R documentation)
If you use GenePerturbR in your research, please cite:
@software{liu2025geneperturbr,
author = {Liu, Zaoqu},
title = {GenePerturbR: Comprehensive Analysis of Genetic Perturbation Data},
year = {2025},
version = {1.1.1},
doi = {10.5281/zenodo.XXXXXXX},
url = {https://github.com/SolvingLab/GenePerturbR}
}Author: Zaoqu Liu
Email: liuzaoqu@163.com
ORCID: 0000-0002-0452-742X
GitHub: https://github.com/SolvingLab/GenePerturbR
MIT License. See LICENSE file for details.
Last updated: December 2025
GenePerturbR version: 1.1.1









