Single molecule RNA FISH probe design software from the Raj Lab.
ProbeDesign creates oligonucleotide probes for single molecule RNA FISH experiments. It selects optimal probe sequences based on:
- Thermodynamic properties: Targets specific Gibbs free energy values for RNA-DNA hybridization
- Sequence masking: Avoids cross-hybridization by masking regions that align to pseudogenes or repetitive genomic sequences
- Optimal spacing: Uses dynamic programming to find the best probe placements
# Clone the repository
git clone https://github.com/arjunrajlab/ProbeDesign.git
cd ProbeDesign
# Install with pip (requires Python 3.8+)
pip install -e .
# Verify installation
probedesign --versionMasking requires Bowtie 1 and genome indexes. See BOWTIE.md for detailed setup instructions.
Important: Install bowtie via conda/mamba (NOT homebrew, which has a different tool with the same name).
# Install bowtie via conda/mamba
mamba install bowtie
# Verify it's the bioinformatics bowtie
bowtie --version # Should show "bowtie-align-s version 1.x.x"
# Create indexes directory
mkdir -p bowtie_indexes && cd bowtie_indexes
# Build pseudogene index from included FASTA files
bowtie-build ../probedesign/pseudogeneDBs/human.fasta humanPseudo
# Download human genome index (~3GB)
curl -O ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/GRCh38_no_alt.zip
unzip GRCh38_no_alt.zipFor other species, build the corresponding pseudogene index:
bowtie-build ../probedesign/pseudogeneDBs/mouse.fasta mousePseudo
bowtie-build ../probedesign/pseudogeneDBs/elegans.fasta celegansPseudo
bowtie-build ../probedesign/pseudogeneDBs/drosophila.fasta drosophilaPseudo# Basic usage (no masking)
probedesign design input.fa
# With pseudogene and genome masking (recommended)
probedesign design input.fa --pseudogene-mask --genome-mask --index-dir bowtie_indexes
# With automatic repeat masking (requires RepeatMasker installed locally)
probedesign design input.fa --repeatmask
# With manual repeat masking (using RepeatMasker output file)
probedesign design input.fa --repeatmask-file input_repeatmasked.fa
# Full example with all options
probedesign design input.fa \
--probes 48 \
--oligo-length 20 \
--spacer-length 2 \
--target-gibbs -23 \
--allowable-gibbs -26,-20 \
--repeatmask \
--pseudogene-mask \
--genome-mask \
--species human \
--index-dir bowtie_indexes \
--output MyProbesprobedesign design [OPTIONS] INPUT_FILE
Arguments:
INPUT_FILE FASTA file containing target sequence
Options:
-n, --probes INTEGER Number of probes to design (default: 48)
-l, --oligo-length INT Length of each oligonucleotide (default: 20)
-s, --spacer-length INT Minimum gap between probes (default: 2)
-g, --target-gibbs FLOAT Target Gibbs free energy in kcal/mol (default: -23)
--allowable-gibbs TEXT Allowable Gibbs FE range as min,max (default: -26,-20)
-o, --output TEXT Output file prefix (default: input filename)
-q, --quiet Suppress output to stdout
--species [human|mouse|elegans|drosophila|rat]
Species for masking databases (default: human)
--repeatmask Run RepeatMasker automatically (requires local install)
--repeatmask-file PATH FASTA file with N's marking repeat regions (manual)
--pseudogene-mask Mask regions that align to pseudogenes
--genome-mask Mask repetitive genomic regions
--index-dir PATH Directory containing bowtie indexes
ProbeDesign generates two output files:
Tab-separated file with probe details:
Index GC% Tm Gibbs Sequence Name
1 55 66.2 -24.7 gtctcagaagctgcgattcg MyProbes_1
2 65 63.2 -26.0 gaatgctgggcgcgcgaaag MyProbes_2
...
Visual alignment of probes against the input sequence, showing masked regions and probe positions.
Masks regions that align to known pseudogene sequences. Uses 16-mer alignment with bowtie. Short runs (<18bp) are removed to avoid spurious masking.
Masks repetitive genomic regions using multiple alignment stringencies:
- 12-mer with threshold 4000 hits (very repetitive)
- 14-mer with threshold 500 hits
- 16-mer with threshold 20 hits
If you have RepeatMasker installed locally, ProbeDesign can run it automatically:
probedesign design input.fa --repeatmask --probes 32Quick Setup (Human/Mouse):
# Install RepeatMasker
mamba install -c bioconda -c conda-forge repeatmasker
# Download Mammalia database (~8.9GB compressed, ~56GB extracted)
cd /opt/homebrew/Caskroom/miniforge/base/share/RepeatMasker/Libraries/famdb
curl -O https://www.dfam.org/releases/current/families/FamDB/dfam39_full.7.h5.gz
gunzip dfam39_full.7.h5.gzNote: The database requires ~65GB disk space. See REPEATMASKER.md for detailed setup instructions, troubleshooting, and the web-based alternative if disk space is limited.
For sequences with repetitive elements, you can provide a pre-masked file:
Option A: Use RepeatMasker's website
- Upload your FASTA to RepeatMasker's website
- Download the masked output file (contains N's where repeats are)
- Use the
--repeatmask-fileoption:
Option B: Run RepeatMasker locally
- Run:
RepeatMasker -species human input.fa - Use the output:
probedesign design input.fa --repeatmask-file input.fa.masked
probedesign design input.fa --repeatmask-file input_repeatmasked.fa --output MyProbesNote: You cannot use both --repeatmask and --repeatmask-file together.
Masking databases are available for: human, mouse, elegans, drosophila, rat
ProbeDesign uses Sugimoto 1995 parameters for RNA-DNA hybrid thermodynamics:
- Gibbs Free Energy: Calculated at 37°C
- Melting Temperature: Calculated with salt=0.33M, primer concentration=50µM
- Default target: -23 kcal/mol (range: -26 to -20)
For HCR probes (longer oligos), use different parameters:
probedesign design input.fa -l 52 --target-gibbs -60 --allowable-gibbs -80,-40The original MATLAB implementation is in probedesign/findprobesLocal.m. It requires MATLAB and has additional features like repeat masking via RepeatMasker.
% Basic usage
findprobesLocal('input.fa', 48)
% With options
findprobesLocal('input.fa', 32, ...
'outfilename', 'MyProbes', ...
'pseudogenemask', true, ...
'genomemask', true, ...
'repeatmask', false, ...
'species', 'human')Python CLI (recommended):
Option 1 - Automatic (requires local RepeatMasker):
probedesign design mCherry.fasta --repeatmask --probes 32Option 2 - Manual (using pre-masked file):
probedesign design mCherry.fasta --repeatmask-file mCherry_repeatmasked.fa --output mCherry_probesSee REPEATMASKER.md for RepeatMasker installation instructions.
MATLAB:
findprobesLocal('mCherry.fasta', 32, ...
'repeatmask', true, ...
'repeatmaskmanual', true, ...
'repeatmaskfile', 'mCherry_repeatmasked.fa', ...
'species', 'mouse')See TEST.md for validation instructions and test cases.
ProbeDesign/
├── src/probedesign/ # Python package
│ ├── cli.py # Command-line interface
│ ├── core.py # Main probe design algorithm
│ ├── thermodynamics.py # Gibbs FE and Tm calculations
│ ├── masking.py # Bowtie-based sequence masking
│ ├── sequence.py # Sequence utilities
│ ├── fasta.py # FASTA I/O
│ └── output.py # Output file generation
├── probedesign/ # MATLAB code
│ ├── findprobesLocal.m # Main MATLAB function
│ └── pseudogeneDBs/ # Pseudogene FASTA files
├── DesignServer/ # Legacy Python server code
├── test_cases/ # Validation test cases
├── bowtie_indexes/ # Genome indexes (gitignored)
├── pyproject.toml # Python package config
├── BOWTIE.md # Bowtie setup instructions
├── REPEATMASKER.md # RepeatMasker setup instructions
└── TEST.md # Testing documentation
- Read input: Parse FASTA file, clean sequence
- Calculate badness: For each position, compute squared distance from target Gibbs FE
- Apply masking: Mark regions aligning to pseudogenes/repetitive sequences as infinite badness
- Dynamic programming: Find optimal probe placements minimizing average badness
- Generate output: Create oligo and sequence alignment files
MIT License - See LICENSE file for details.
If you use this software, please cite the Raj Lab.