RAPTOR

RNA-seq Analysis Pipeline Testing and Optimization Resource

Making free science for everybody around the world 🌍

Quick Start • Features • Installation • Documentation • Pipelines • Citation

⚠️ Important Notice - Repository Under Active Development

🚧 This repository is currently undergoing major refactoring and debugging. Please do not clone or use in production at this time.

We are implementing a new, cleaner architecture that will be:

✅ More maintainable and easier to understand
✅ Better organized with clear module structure
✅ Fully tested and production-ready

Expected completion: end of February.

We appreciate your patience as we work to deliver a high-quality, stable release. Please check back soon!

🆕 What's New in v2.1.2

Adaptive Threshold Optimizer (ATO)

Stop using arbitrary thresholds! The new Adaptive Threshold Optimizer determines data-driven significance cutoffs for differential expression analysis.

from raptor.threshold_optimizer import optimize_thresholds
import pandas as pd

df = pd.read_csv('deseq2_results.csv')
result = optimize_thresholds(df, goal='discovery')

print(f"Optimal logFC: {result.logfc_threshold:.2f}")
print(f"Significant genes: {result.n_significant}")
print(f"\n{result.methods_text}")  # Publication-ready!

Key Features:

Multiple p-value adjustment methods (BH, BY, Storey q-value, Holm, Bonferroni)
Five logFC optimization methods (MAD, mixture model, power-based, percentile, consensus)
π₀ estimation for true null proportion
Three analysis goals: discovery, balanced, validation
Auto-generated publication methods text
Interactive dashboard integration
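
To make the first item in the list above concrete, here is a minimal, dependency-free sketch of the Benjamini-Hochberg (BH) adjustment. It is not RAPTOR's implementation; the function name `benjamini_hochberg` and the example p-values are illustrative assumptions.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order (illustrative helper)."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values, for illustration only
pvals = [0.01, 0.04, 0.03, 0.005]
padj = benjamini_hochberg(pvals)
print([round(p, 4) for p in padj])
```

Each adjusted value is the raw p-value scaled by m/rank, then capped by the adjusted value of the next larger p-value, which is what keeps the result monotone.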

What is RAPTOR?

RAPTOR is a comprehensive framework for benchmarking and optimizing RNA-seq differential expression analysis pipelines. Instead of guessing which pipeline works best for your data, RAPTOR provides evidence-based, ML-powered recommendations through systematic comparison of 8 popular pipelines.

Why RAPTOR?

Challenge	RAPTOR Solution
Which pipeline should I use?	✅ ML recommendations with 87% accuracy
What thresholds should I use?	✅ Adaptive Threshold Optimizer (NEW!)
Is my data quality good enough?	✅ Quality assessment with batch effect detection
How do I know results are reliable?	✅ Ensemble analysis combining multiple pipelines
What resources do I need?	✅ Resource monitoring with predictions
How do I present results?	✅ Automated, publication-ready reports

Features

Adaptive Threshold Optimizer (NEW!)

Data-driven logFC and p-value thresholds
Multiple statistical methods
Publication-ready methods text
Interactive dashboard page
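
As a rough illustration of what a data-driven logFC threshold can look like, the sketch below derives a cutoff from the median absolute deviation (MAD) of the observed log fold changes. It assumes the cutoff is `k` scaled-MAD units, which is a common robust-scale convention and not necessarily how the ATO computes it; `mad_logfc_threshold`, `k`, and the sample values are hypothetical.

```python
from statistics import median


def mad_logfc_threshold(logfcs, k=2.0):
    """Robust |logFC| cutoff: k scaled-MAD units (hypothetical helper)."""
    center = median(logfcs)
    mad = median(abs(x - center) for x in logfcs)
    # 1.4826 makes the MAD comparable to a standard deviation under normality
    return k * 1.4826 * mad


# Made-up log fold changes: mostly near zero, two clear changes
logfcs = [-0.2, -0.1, 0.0, 0.1, 0.2, 3.0, -2.5]
threshold = mad_logfc_threshold(logfcs)
selected = [x for x in logfcs if abs(x) > threshold]
print(f"threshold={threshold:.3f}, selected={selected}")
```

Because the MAD ignores the tails, a handful of strongly changing genes does not inflate the cutoff the way a standard deviation would.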

ML-Based Recommendations

87% prediction accuracy
Confidence scoring (0-100%)
Learns from 10,000+ analyses
Explains its reasoning

Quality Assessment

6-component quality scoring
Batch effect detection
Outlier identification
Actionable recommendations
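
One simple form of outlier identification is to flag samples whose sequencing depth is far from the rest. The sketch below is an assumption-laden illustration, not RAPTOR's quality module: `flag_library_size_outliers`, the `n_mads` cutoff, and the sample names and read counts are all hypothetical.

```python
import math
from statistics import median


def flag_library_size_outliers(library_sizes, n_mads=3.0):
    """Return sample names whose log10 library size is far from the median."""
    logs = {name: math.log10(size) for name, size in library_sizes.items()}
    center = median(logs.values())
    mad = median(abs(v - center) for v in logs.values())
    scale = 1.4826 * mad  # scaled MAD, comparable to a standard deviation
    if scale == 0:
        return []  # all samples identical: nothing to flag
    return [name for name, v in logs.items() if abs(v - center) > n_mads * scale]


# Made-up total read counts per sample
sizes = {
    "ctrl_1": 21_000_000, "ctrl_2": 19_500_000, "ctrl_3": 20_200_000,
    "treat_1": 22_100_000, "treat_2": 1_200_000, "treat_3": 20_800_000,
}
outliers = flag_library_size_outliers(sizes)
print(outliers)
```

A median-based rule is used here because, with only a few samples, one extreme library can distort a mean and standard deviation enough to hide itself.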

Ensemble Analysis

5 combination methods
33% fewer false positives
High-confidence gene lists
Consensus validation
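
The idea behind a consensus gene list can be sketched as a simple vote: keep a gene only if enough pipelines call it significant. This is a minimal sketch under that assumption and does not reproduce any of RAPTOR's five combination methods; `consensus_genes` and the gene sets are made up, while `min_agreement` mirrors the parameter name used in the ensemble example in this README.

```python
from collections import Counter


def consensus_genes(gene_sets, min_agreement=2):
    """Genes called significant by at least `min_agreement` pipelines."""
    # Each pipeline contributes at most one vote per gene
    votes = Counter(gene for genes in gene_sets.values() for gene in set(genes))
    return sorted(gene for gene, n in votes.items() if n >= min_agreement)


# Made-up significant-gene calls from three DE tools
calls = {
    "deseq2": {"TP53", "MYC", "EGFR"},
    "edger": {"TP53", "MYC", "KRAS"},
    "limma": {"TP53", "BRCA1"},
}
high_confidence = consensus_genes(calls, min_agreement=2)
print(high_confidence)
```

Raising `min_agreement` trades sensitivity for confidence: fewer genes survive, but each is supported by more tools.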

Interactive Dashboard

Web-based interface (no coding!)
Real-time visualizations
Drag-and-drop data upload
One-click reports

Resource Monitoring

Real-time CPU/memory tracking
<1% performance overhead
Resource predictions
Cloud cost estimation
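
For readers who want a feel for what resource tracking involves, here is a standard-library sketch that measures wall time, CPU time, and peak Python memory around a single call. It is not RAPTOR's monitoring module: `run_with_monitoring` is a hypothetical helper, and `tracemalloc` only sees Python-level allocations and adds noticeable overhead of its own, so it says nothing about the overhead figure quoted above.

```python
import time
import tracemalloc


def run_with_monitoring(func, *args, **kwargs):
    """Run func and report wall time, CPU time, and peak Python memory."""
    tracemalloc.start()
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    try:
        result = func(*args, **kwargs)
    finally:
        # Collect measurements even if func raises
        wall = time.perf_counter() - wall_start
        cpu = time.process_time() - cpu_start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    stats = {"wall_s": wall, "cpu_s": cpu, "peak_mib": peak / 2**20}
    return result, stats


# Stand-in workload; a real use would wrap a pipeline step instead
result, stats = run_with_monitoring(lambda n: sum(i * i for i in range(n)), 100_000)
print(stats)
```

Monitoring external tools such as aligners would need process-level measurements instead, which this sketch does not attempt.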

Quick Start

Option 1: Interactive Dashboard (Recommended)

# Install
pip install raptor-rnaseq

# Launch dashboard
raptor dashboard

# Opens at http://localhost:8501
# Upload data → Get ML recommendation → Use 🎯 Threshold Optimizer → Done!

Option 2: Command Line

# Profile your data and get ML recommendation
raptor profile --counts counts.csv --metadata metadata.csv --use-ml

# Run recommended pipeline
raptor run --pipeline 3 --data fastq/ --output results/

# Optimize thresholds (NEW!)
raptor optimize-thresholds --input results.csv --goal balanced

# Generate report
raptor report --results results/ --output report.html

Option 3: Python API

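import pandas as pd
from raptor import RNAseqDataProfiler, MLPipelineRecommender
from raptor.threshold_optimizer import optimize_thresholds

# Load the count matrix and sample metadata
counts = pd.read_csv('counts.csv', index_col=0)
metadata = pd.read_csv('metadata.csv')

# Profile your data
profiler = RNAseqDataProfiler(counts, metadata)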
profile = profiler.run_full_profile()

# Get ML recommendation
recommender = MLPipelineRecommender()
recommendation = recommender.recommend(profile)

print(f"Recommended: Pipeline {recommendation['pipeline_id']}")
print(f"Confidence: {recommendation['confidence']:.1%}")

# After running pipeline, optimize thresholds (NEW!)
de_results = pd.read_csv('de_results.csv')
result = optimize_thresholds(de_results, goal='balanced')
print(f"Optimal |logFC|: {result.logfc_threshold:.2f}")
print(result.methods_text)

Installation

Requirements

Python: 3.8 or higher
R: 4.0 or higher (for DE analysis)
RAM: 8GB minimum (16GB recommended)
Disk: 10GB free space

Install from PyPI (Recommended)

pip install raptor-rnaseq

With optional dependencies:

# With dashboard support (quotes keep shells such as zsh from expanding the brackets)
pip install "raptor-rnaseq[dashboard]"

# With all features
pip install "raptor-rnaseq[all]"

Install from GitHub

# Clone repository
git clone https://github.com/AyehBlk/RAPTOR.git
cd RAPTOR

# Install Python dependencies
pip install -r requirements.txt

# Verify installation
python install.py

Conda Environment

conda env create -f environment.yml
conda activate raptor

Pipelines

RAPTOR benchmarks 8 RNA-seq analysis pipelines:

ID	Pipeline	Aligner	Quantifier	DE Tool	Speed	ML Rank
1	STAR-RSEM-DESeq2	STAR	RSEM	DESeq2	⭐⭐	#2
2	HISAT2-StringTie-Ballgown	HISAT2	StringTie	Ballgown	⭐⭐⭐	#5
3	Salmon-edgeR ⭐	Salmon	Salmon	edgeR	⭐⭐⭐⭐⭐	#1
4	Kallisto-Sleuth	Kallisto	Kallisto	Sleuth	⭐⭐⭐⭐⭐	#3
5	STAR-HTSeq-limma	STAR	HTSeq	limma-voom	⭐⭐	#4
6	STAR-featureCounts-NOISeq	STAR	featureCounts	NOISeq	⭐⭐	#6
7	Bowtie2-RSEM-EBSeq	Bowtie2	RSEM	EBSeq	⭐⭐	#7
8	HISAT2-Cufflinks-Cuffdiff	HISAT2	Cufflinks	Cuffdiff	⭐	#8

⭐ Pipeline 3 (Salmon-edgeR) is the ML model's most frequently recommended pipeline due to its optimal speed/accuracy balance.

Repository Structure

RAPTOR/
├── raptor/                 # Core Python package
│   ├── profiler.py         # Data profiling
│   ├── recommender.py      # Rule-based recommendations
│   ├── ml_recommender.py   # ML recommendations
│   ├── threshold_optimizer/ # 🆕 Adaptive Threshold Optimizer (v2.1.2)
│   │   ├── __init__.py
│   │   ├── ato.py          # Core ATO class
│   │   └── visualization.py # ATO visualizations
│   ├── data_quality_assessment.py
│   ├── ensemble_analysis.py
│   ├── resource_monitoring.py
│   └── ...
├── dashboard/              # Interactive web dashboard
├── pipelines/              # Pipeline configurations (8 pipelines)
├── scripts/                # Workflow scripts (00-10)
├── examples/               # Example scripts & demos
├── tests/                  # Test suite
├── docs/                   # Documentation
├── config/                 # Configuration templates
├── install.py              # Master installer
├── launch_dashboard.py     # Dashboard launcher
├── requirements.txt        # Python dependencies
└── setup.py                # Package setup

Documentation

Getting Started

Document	Description
INSTALLATION.md	Detailed installation guide
QUICK_START.md	5-minute quick start
DASHBOARD.md	Interactive dashboard guide

Core Features

Document	Description
THRESHOLD_OPTIMIZER.md	🆕 Adaptive threshold optimization
PROFILE_RECOMMEND.md	Data profiling & recommendations
QUALITY_ASSESSMENT.md	Quality scoring & batch effects
BENCHMARKING.md	Pipeline benchmarking

Advanced Features

Document	Description
ENSEMBLE.md	Multi-pipeline ensemble analysis
RESOURCE_MONITORING.md	Resource tracking
CLOUD_DEPLOYMENT.md	AWS/GCP/Azure deployment

Reference

Document	Description
PIPELINES.md	Pipeline details & selection guide
API.md	Python API reference
FAQ.md	Frequently asked questions
CHANGELOG.md	Version history

Usage Examples

Example 1: Quick Threshold Optimization (NEW!)

from raptor.threshold_optimizer import optimize_thresholds
import pandas as pd

# Load DE results
df = pd.read_csv('deseq2_results.csv')

# Optimize thresholds
result = optimize_thresholds(df, goal='balanced')

print(f"Optimal |logFC|: {result.logfc_threshold:.3f}")
print(f"Significant genes: {result.n_significant}")
print(f"π₀ estimate: {result.pi0:.3f}")

# Get publication methods text
print(result.methods_text)

# Save results
result.results_df.to_csv('optimized_results.csv')
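
The π₀ value printed above is the estimated proportion of genes that are truly null. As a minimal sketch of how such an estimate can be obtained, the code below applies Storey's estimator at a single fixed λ; whether RAPTOR uses a fixed λ or a smoothed version is not stated here, and `estimate_pi0`, `lam`, and the p-values are illustrative assumptions.

```python
def estimate_pi0(pvalues, lam=0.5):
    """Storey's pi0 estimate at a fixed lambda, capped at 1 (illustrative)."""
    m = len(pvalues)
    # Null p-values are uniform, so the region above lambda is mostly nulls
    n_above = sum(1 for p in pvalues if p > lam)
    return min(1.0, n_above / ((1.0 - lam) * m))


# Made-up p-values: a cluster of small ones plus a roughly uniform tail
pvals = [0.001, 0.004, 0.02, 0.03, 0.2, 0.45, 0.55, 0.7, 0.8, 0.95]
pi0 = estimate_pi0(pvals)
print(f"pi0 estimate: {pi0:.2f}")
```

A π₀ close to 1 suggests few genes are changing, while lower values indicate a larger share of true differential expression.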

Example 2: Full Workflow

from raptor import RNAseqDataProfiler, MLPipelineRecommender
from raptor.threshold_optimizer import optimize_thresholds
import pandas as pd

# 1. Profile data
counts = pd.read_csv('counts.csv', index_col=0)
metadata = pd.read_csv('metadata.csv')

profiler = RNAseqDataProfiler(counts, metadata, use_ml=True)
profile = profiler.profile(quality_check=True)
print(f"Quality Score: {profile['quality_score']}/100")

# 2. Get ML recommendation
recommender = MLPipelineRecommender()
recommendations = recommender.recommend(profile, n=3)
print(f"Recommended: {recommendations[0]['pipeline_name']}")

# 3. [Run recommended pipeline - produces DE results]
# raptor run --pipeline 3 ...

# 4. Optimize thresholds (NEW in v2.1.2)
# Column names below match DESeq2 output; edgeR tables use 'logFC' and 'PValue' instead
de_results = pd.read_csv('deseq2_results.csv')
result = optimize_thresholds(
    de_results,
    logfc_col='log2FoldChange',
    pvalue_col='pvalue',
    goal='balanced'
)

print(f"\n🎯 Optimized Thresholds:")
print(f"   LogFC: |{result.logfc_threshold:.3f}|")
print(f"   Significant: {result.n_significant} genes")

# 5. Save results with methods text
result.results_df.to_csv('final_results.csv')
with open('methods.txt', 'w') as f:
    f.write(result.methods_text)

Example 3: Ensemble Analysis with ATO

from raptor.ensemble_analysis import EnsembleAnalyzer
from raptor.threshold_optimizer import optimize_thresholds

# df1, df2, df3: DE result tables (pandas DataFrames), one per DE tool
# Combine results from multiple pipelines
analyzer = EnsembleAnalyzer()
consensus = analyzer.combine_results(
    results_dict={'deseq2': df1, 'edger': df2, 'limma': df3},
    method='weighted_vote',
    min_agreement=2
)

# Use ATO for uniform thresholds across ensemble
result = optimize_thresholds(consensus['combined'], goal='balanced')
print(f"Consensus DE genes: {result.n_significant}")

Performance

ML Recommendation Accuracy

Metric	Value
Overall Accuracy	87%
Top-3 Accuracy	96%
Prediction Time	<0.1s
Training Data	10,000+ analyses

Threshold Optimizer Benefits

Metric	Traditional	With ATO
Threshold justification	Arbitrary	Data-driven
Methods text	Manual	Auto-generated
False positives	Higher	Optimized
Reproducibility	Variable	Standardized

Contributing

We welcome contributions! RAPTOR is open-source and aims to make free science accessible to everyone.

# Fork and clone
git clone https://github.com/YOUR_USERNAME/RAPTOR.git

# Create feature branch
git checkout -b feature/amazing-feature

# Make changes and test
pytest tests/

# Submit pull request

See CONTRIBUTING.md for guidelines.

Citation

If you use RAPTOR in your research, please cite:

@software{bolouki2025raptor,
  author       = {Bolouki, Ayeh},
  title        = {RAPTOR: RNA-seq Analysis Pipeline Testing and Optimization Resource},
  year         = {2025},
  version      = {2.1.1},
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.17607161},
  url          = {https://github.com/AyehBlk/RAPTOR}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

MIT License
Copyright (c) 2025 Ayeh Bolouki

Contact

Ayeh Bolouki

🏛️ GIGA, University of Liège, Belgium
📧 Email: ayehbolouki1988@gmail.com
🐙 GitHub: @AyehBlk
🔬 Research: Computational Biology, Bioinformatics, Multi-omics Analysis

Acknowledgments

The Bioconductor community for the R package ecosystem
All users who provided feedback

⭐ Star this repository if you find RAPTOR useful!

RAPTOR v2.1.2 - Making pipeline selection evidence-based, not guesswork 🦖
