Making free science for everybody around the world
Quick Start • Features • Installation • Documentation • Pipelines • Citation
🚧 This repository is currently undergoing major refactoring and debugging. Please do not clone or use in production at this time.
We are implementing a new, cleaner architecture that will be:
- ✅ More maintainable and easier to understand
- ✅ Better organized with clear module structure
- ✅ Fully tested and production-ready
Expected completion: end of February.
We appreciate your patience as we work to deliver a high-quality, stable release. Please check back soon!
Stop using arbitrary thresholds! The new Adaptive Threshold Optimizer determines data-driven significance cutoffs for differential expression analysis.
```python
import pandas as pd
from raptor.threshold_optimizer import optimize_thresholds

# Load differential expression results (e.g. DESeq2 output)
df = pd.read_csv('deseq2_results.csv')

# Pick data-driven thresholds for a discovery-oriented analysis
result = optimize_thresholds(df, goal='discovery')

print(f"Optimal logFC: {result.logfc_threshold:.2f}")
print(f"Significant genes: {result.n_significant}")
print(f"\n{result.methods_text}")  # Publication-ready!
```

Key Features:
- Multiple p-value adjustment methods (BH, BY, Storey q-value, Holm, Bonferroni)
- Five logFC optimization methods (MAD, mixture model, power-based, percentile, consensus)
- π₀ estimation for the true null proportion
- Three analysis goals: discovery, balanced, validation
- Auto-generated publication methods text
- Interactive dashboard integration
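
To make two of the listed ideas concrete, here is a minimal NumPy sketch of Benjamini-Hochberg (BH) adjustment and a single-λ, Storey-style π₀ estimate. It illustrates the textbook procedures only and is not RAPTOR's implementation; the function names `bh_adjust` and `estimate_pi0`, the choice `lam=0.5`, and the toy p-values are assumptions made for this example.

```python
import numpy as np


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (hypothetical helper, not RAPTOR API)."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by n / rank.
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity starting from the largest p-value, then cap at 1.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted


def estimate_pi0(pvalues, lam=0.5):
    """Storey-style null proportion at one fixed lambda (assumed value 0.5)."""
    p = np.asarray(pvalues, dtype=float)
    # Under the null, p-values are uniform, so the share above lam is
    # roughly pi0 * (1 - lam).
    return min(1.0, float(np.mean(p > lam)) / (1.0 - lam))


# Toy p-values, invented purely for illustration.
pvals = np.array([0.001, 0.008, 0.039, 0.041, 0.60])
padj = bh_adjust(pvals)
pi0 = estimate_pi0(pvals)
print(padj, pi0)
```

A production estimator would typically smooth π₀ over a grid of λ values rather than rely on a single one; the fixed λ here only keeps the sketch short.
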
RAPTOR is a comprehensive framework for benchmarking and optimizing RNA-seq differential expression analysis pipelines. Instead of guessing which pipeline works best for your data, RAPTOR provides evidence-based, ML-powered recommendations through systematic comparison of 8 popular pipelines.
| Challenge | RAPTOR Solution |
|---|---|
| Which pipeline should I use? | ✅ ML recommendations with 87% accuracy |
| What thresholds should I use? | ✅ Adaptive Threshold Optimizer (NEW!) |
| Is my data quality good enough? | ✅ Quality assessment with batch effect detection |
| How do I know results are reliable? | ✅ Ensemble analysis combining multiple pipelines |
| What resources do I need? | ✅ Resource monitoring with predictions |
| How do I present results? | ✅ Automated, publication-ready reports |

```bash
# Install
pip install raptor-rnaseq

# Launch dashboard
raptor dashboard
# Opens at http://localhost:8501
# Upload data → Get ML recommendation → Use 🎯 Threshold Optimizer → Done!
```

```bash
# Profile your data and get ML recommendation
raptor profile --counts counts.csv --metadata metadata.csv --use-ml

# Run recommended pipeline
raptor run --pipeline 3 --data fastq/ --output results/

# Optimize thresholds (NEW!)
raptor optimize-thresholds --input results.csv --goal balanced

# Generate report
raptor report --results results/ --output report.html
```

```python
import pandas as pd
from raptor import RNAseqDataProfiler, MLPipelineRecommender
from raptor.threshold_optimizer import optimize_thresholds

# Load the count matrix and sample metadata
counts = pd.read_csv('counts.csv', index_col=0)
metadata = pd.read_csv('metadata.csv')

# Profile your data
profiler = RNAseqDataProfiler(counts, metadata)
profile = profiler.run_full_profile()

# Get ML recommendation
recommender = MLPipelineRecommender()
recommendation = recommender.recommend(profile)
print(f"Recommended: Pipeline {recommendation['pipeline_id']}")
print(f"Confidence: {recommendation['confidence']:.1%}")

# After running pipeline, optimize thresholds (NEW!)
de_results = pd.read_csv('de_results.csv')
result = optimize_thresholds(de_results, goal='balanced')
print(f"Optimal |logFC|: {result.logfc_threshold:.2f}")
print(result.methods_text)
```

- Python: 3.8 or higher
- R: 4.0 or higher (for DE analysis)
- RAM: 8GB minimum (16GB recommended)
- Disk: 10GB free space

```bash
pip install raptor-rnaseq
```

With optional dependencies:

```bash
# With dashboard support (quoted so shells such as zsh do not expand the brackets)
pip install "raptor-rnaseq[dashboard]"

# With all features
pip install "raptor-rnaseq[all]"
```

```bash
# Clone repository
git clone https://github.com/AyehBlk/RAPTOR.git
cd RAPTOR

# Install Python dependencies
pip install -r requirements.txt

# Verify installation
python install.py
```

```bash
conda env create -f environment.yml
conda activate raptor
```

RAPTOR benchmarks 8 RNA-seq analysis pipelines:

| ID | Pipeline | Aligner | Quantifier | DE Tool | Speed | ML Rank |
|---|---|---|---|---|---|---|
| 1 | STAR-RSEM-DESeq2 | STAR | RSEM | DESeq2 | ⭐⭐ | #2 |
| 2 | HISAT2-StringTie-Ballgown | HISAT2 | StringTie | Ballgown | ⭐⭐⭐ | #5 |
| 3 | Salmon-edgeR ⭐ | Salmon | Salmon | edgeR | ⭐⭐⭐⭐⭐ | #1 |
| 4 | Kallisto-Sleuth | Kallisto | Kallisto | Sleuth | ⭐⭐⭐⭐⭐ | #3 |
| 5 | STAR-HTSeq-limma | STAR | HTSeq | limma-voom | ⭐⭐ | #4 |
| 6 | STAR-featureCounts-NOISeq | STAR | featureCounts | NOISeq | ⭐⭐ | #6 |
| 7 | Bowtie2-RSEM-EBSeq | Bowtie2 | RSEM | EBSeq | ⭐⭐ | #7 |
| 8 | HISAT2-Cufflinks-Cuffdiff | HISAT2 | Cufflinks | Cuffdiff | ⭐ | #8 |

⭐ Pipeline 3 (Salmon-edgeR) is the ML model's most frequently recommended pipeline due to its optimal speed/accuracy balance.

```
RAPTOR/
├── raptor/                        # Core Python package
│   ├── profiler.py                # Data profiling
│   ├── recommender.py             # Rule-based recommendations
│   ├── ml_recommender.py          # ML recommendations
│   ├── threshold_optimizer/       # 🆕 Adaptive Threshold Optimizer (v2.1.2)
│   │   ├── __init__.py
│   │   ├── ato.py                 # Core ATO class
│   │   └── visualization.py       # ATO visualizations
│   ├── data_quality_assessment.py
│   ├── ensemble_analysis.py
│   ├── resource_monitoring.py
│   └── ...
├── dashboard/                     # Interactive web dashboard
├── pipelines/                     # Pipeline configurations (8 pipelines)
├── scripts/                       # Workflow scripts (00-10)
├── examples/                      # Example scripts & demos
├── tests/                         # Test suite
├── docs/                          # Documentation
├── config/                        # Configuration templates
├── install.py                     # Master installer
├── launch_dashboard.py            # Dashboard launcher
├── requirements.txt               # Python dependencies
└── setup.py                       # Package setup
```

| Document | Description |
|---|---|
| INSTALLATION.md | Detailed installation guide |
| QUICK_START.md | 5-minute quick start |
| DASHBOARD.md | Interactive dashboard guide |
| THRESHOLD_OPTIMIZER.md | 🆕 Adaptive threshold optimization |
| PROFILE_RECOMMEND.md | Data profiling & recommendations |
| QUALITY_ASSESSMENT.md | Quality scoring & batch effects |
| BENCHMARKING.md | Pipeline benchmarking |
| ENSEMBLE.md | Multi-pipeline ensemble analysis |
| RESOURCE_MONITORING.md | Resource tracking |
| CLOUD_DEPLOYMENT.md | AWS/GCP/Azure deployment |
| PIPELINES.md | Pipeline details & selection guide |
| API.md | Python API reference |
| FAQ.md | Frequently asked questions |
| CHANGELOG.md | Version history |

```python
from raptor.threshold_optimizer import optimize_thresholds
import pandas as pd

# Load DE results
df = pd.read_csv('deseq2_results.csv')

# Optimize thresholds
result = optimize_thresholds(df, goal='balanced')
print(f"Optimal |logFC|: {result.logfc_threshold:.3f}")
print(f"Significant genes: {result.n_significant}")
print(f"π₀ estimate: {result.pi0:.3f}")

# Get publication methods text
print(result.methods_text)

# Save results
result.results_df.to_csv('optimized_results.csv')
```

```python
from raptor import RNAseqDataProfiler, MLPipelineRecommender
from raptor.threshold_optimizer import optimize_thresholds
import pandas as pd

# 1. Profile data
counts = pd.read_csv('counts.csv', index_col=0)
metadata = pd.read_csv('metadata.csv')
profiler = RNAseqDataProfiler(counts, metadata, use_ml=True)
profile = profiler.profile(quality_check=True)
print(f"Quality Score: {profile['quality_score']}/100")

# 2. Get ML recommendation
recommender = MLPipelineRecommender()
recommendations = recommender.recommend(profile, n=3)
print(f"Recommended: {recommendations[0]['pipeline_name']}")

# 3. [Run recommended pipeline - produces DE results]
# raptor run --pipeline 3 ...

# 4. Optimize thresholds (NEW in v2.1.2)
de_results = pd.read_csv('deseq2_results.csv')
result = optimize_thresholds(
    de_results,
    logfc_col='log2FoldChange',
    pvalue_col='pvalue',
    goal='balanced'
)
print(f"\n🎯 Optimized Thresholds:")
print(f"  LogFC: |{result.logfc_threshold:.3f}|")
print(f"  Significant: {result.n_significant} genes")

# 5. Save results with methods text
result.results_df.to_csv('final_results.csv')
with open('methods.txt', 'w') as f:
    f.write(result.methods_text)
```

```python
from raptor.ensemble_analysis import EnsembleAnalyzer
from raptor.threshold_optimizer import optimize_thresholds

# df1, df2, df3: DE result tables from three pipelines, loaded beforehand

# Combine results from multiple pipelines
analyzer = EnsembleAnalyzer()
consensus = analyzer.combine_results(
    results_dict={'deseq2': df1, 'edger': df2, 'limma': df3},
    method='weighted_vote',
    min_agreement=2
)

# Use ATO for uniform thresholds across ensemble
result = optimize_thresholds(consensus['combined'], goal='balanced')
print(f"Consensus DE genes: {result.n_significant}")
```

| Metric | Value |
|---|---|
| Overall Accuracy | 87% |
| Top-3 Accuracy | 96% |
| Prediction Time | <0.1s |
| Training Data | 10,000+ analyses |
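
The README does not spell out how these accuracy figures are computed. Assuming the conventional definition, top-k accuracy is the fraction of datasets for which the truly best pipeline appears among the first k recommendations. The sketch below shows that calculation on invented toy data; `top_k_accuracy` and both input lists are hypothetical and not part of RAPTOR.

```python
def top_k_accuracy(ranked_recommendations, best_pipelines, k=3):
    """Fraction of cases whose true best pipeline is in the top-k list.

    Hypothetical helper for illustration; not a RAPTOR function.
    """
    hits = sum(
        best in ranked[:k]
        for ranked, best in zip(ranked_recommendations, best_pipelines)
    )
    return hits / len(best_pipelines)


# Toy data (invented): ranked pipeline IDs per dataset, and the pipeline
# that actually performed best on that dataset.
ranked = [[3, 1, 4], [1, 3, 5], [4, 3, 1], [3, 4, 1]]
best = [3, 5, 2, 4]

top1 = top_k_accuracy(ranked, best, k=1)
top3 = top_k_accuracy(ranked, best, k=3)
print(f"Top-1: {top1:.0%}, Top-3: {top3:.0%}")
```

Top-3 accuracy is never lower than top-1 accuracy, which fits the table above, where the top-3 figure exceeds the overall one.
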
| Metric | Traditional | With ATO |
|---|---|---|
| Threshold justification | Arbitrary | Data-driven |
| Methods text | Manual | Auto-generated |
| False positives | Higher | Optimized |
| Reproducibility | Variable | Standardized |
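
To illustrate the difference between an arbitrary and a data-driven cutoff, the sketch below compares a fixed |logFC| > 1 filter with a cutoff derived from the median absolute deviation (MAD) of the fold changes. This is not RAPTOR's algorithm: `mad_logfc_threshold`, the multiplier `k=2.0`, the toy table, and the `padj` column name are all assumptions for this example. The constant 1.4826 is the usual factor that makes the MAD comparable to a standard deviation under normality.

```python
import numpy as np
import pandas as pd


def mad_logfc_threshold(logfc, k=2.0):
    """|logFC| cutoff set at k robust standard deviations (hypothetical helper)."""
    x = np.asarray(logfc, dtype=float)
    mad = np.median(np.abs(x - np.median(x)))
    return k * 1.4826 * mad


# Toy DE table, invented for illustration only.
de = pd.DataFrame({
    "log2FoldChange": [-0.2, 0.1, 0.0, 0.8, -0.1, 2.5, -3.0, 0.2],
    "padj": [0.80, 0.50, 0.90, 0.04, 0.70, 0.001, 0.0005, 0.03],
})

# Traditional approach: a fixed cutoff chosen by convention.
fixed = de[(de["log2FoldChange"].abs() > 1.0) & (de["padj"] < 0.05)]

# Data-driven approach: the cutoff scales with the spread of the fold changes.
cutoff = mad_logfc_threshold(de["log2FoldChange"])
adaptive = de[(de["log2FoldChange"].abs() > cutoff) & (de["padj"] < 0.05)]

print(f"Fixed cutoff 1.0: {len(fixed)} genes")
print(f"MAD cutoff {cutoff:.3f}: {len(adaptive)} genes")
```

On this toy table the MAD-based cutoff is lower than 1, so it retains one gene that the fixed filter drops. With a noisier dataset the cutoff would rise instead.
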
We welcome contributions! RAPTOR is open-source and aims to make free science accessible to everyone.

```bash
# Fork and clone
git clone https://github.com/YOUR_USERNAME/RAPTOR.git
cd RAPTOR

# Create feature branch
git checkout -b feature/amazing-feature

# Make changes and test
pytest tests/

# Submit pull request
```

See CONTRIBUTING.md for guidelines.

If you use RAPTOR in your research, please cite:

```bibtex
@software{bolouki2025raptor,
  author = {Bolouki, Ayeh},
  title = {RAPTOR: RNA-seq Analysis Pipeline Testing and Optimization Resource},
  year = {2025},
  version = {2.1.1},
  publisher = {Zenodo},
  doi = {10.5281/zenodo.17607161},
  url = {https://github.com/AyehBlk/RAPTOR}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.

MIT License
Copyright (c) 2025 Ayeh Bolouki
Ayeh Bolouki
- 🏛️ GIGA, University of Liège, Belgium
- 📧 Email: ayehbolouki1988@gmail.com
- GitHub: @AyehBlk
- 🔬 Research: Computational Biology, Bioinformatics, Multi-omics Analysis
- The Bioconductor community for the R package ecosystem
- All users who provided feedback
⭐ Star this repository if you find RAPTOR useful!
RAPTOR v2.1.2 - Making pipeline selection evidence-based, not guesswork