A Python package for designing specific molecular assays targeting particular taxa while excluding others.
- Automated Gene Selection: Intelligently selects the most suitable gene markers based on sequence availability, copy number, and evolutionary properties
- Phylogenetic-Aware Exclusion: Automatically identifies related taxa for exclusion using phylogenetic distance weighting
- Flexible Exclusion Strategies: Choose from genus-level, family-level, order-level, or custom exclusion approaches
- Multi-Domain Support: Automatic detection and handling of Bacteria, Archaea, Eukaryota, and Viruses
- Gene Suitability Scoring: Evaluates candidate genes based on sequence availability, copy number stability, HGT resistance, and optimal length
- Fetch sequence data from NCBI: Retrieve genomic and gene sequences for any taxonomic ID
- Find conserved marker regions: Identify taxa-specific conserved regions
- Design and validate primers: Automated primer design for marker regions
- Command-line interface: Easy-to-use CLI with extensive customization options
pip install assay_design

Or, to install from source:
git clone https://github.com/yourusername/assay_design.git
cd assay_design
pip install -e .

Design an assay for a specific organism (using taxid 689 for Vibrio mediterranei as an example):
# Fully automated: auto-select gene AND exclusion taxa
assay-design --inclusion 689 --email your.email@example.com --auto-gene --exclusion-strategy intelligent --output-dir ./results
# Use family-level exclusion strategy (broader exclusion scope)
assay-design --inclusion 689 --email your.email@example.com --auto-gene --exclusion-strategy family --output-dir ./results
# Genus-level (sibling species) exclusion
assay-design --inclusion 689 --email your.email@example.com --auto-gene --exclusion-strategy siblings --output-dir ./results

# Let the system choose the best gene automatically
assay-design --inclusion 689 --email your.email@example.com --auto-gene --exclusion-strategy intelligent --output-dir ./results
# Customize gene selection for specific use cases
assay-design --inclusion 689 --email your.email@example.com --auto-gene --gene-use-case quantification --output-dir ./results
# View available genes and their suitability scores
assay-design --inclusion 689 --email your.email@example.com --suggest-genes --output-dir ./results

# Intelligent multi-level exclusion (recommended) - phylogenetic-aware selection
assay-design --inclusion 689 --email your.email@example.com --exclusion-strategy intelligent --max-exclusion-taxa 10 --output-dir ./results
# Genus-level exclusion (siblings) - excludes sibling species in the same genus
assay-design --inclusion 689 --email your.email@example.com --exclusion-strategy siblings --output-dir ./results
# Family-level exclusion - excludes related genera in the same family
assay-design --inclusion 689 --email your.email@example.com --exclusion-strategy family --output-dir ./results
# Order-level exclusion - excludes related families in the same order
assay-design --inclusion 689 --email your.email@example.com --exclusion-strategy order --output-dir ./results
# Manual exclusion - manually specify taxa to exclude
assay-design --inclusion 689 --exclusion 717,670,672 --email your.email@example.com --output-dir ./results

# Specify a particular gene
assay-design --inclusion 689 --gene "rpoB" --email your.email@example.com --exclusion-strategy intelligent --output-dir ./results
# Specify gene with custom exclusion
assay-design --inclusion 689 --gene "16S ribosomal RNA" --exclusion 717,670,672 --email your.email@example.com --output-dir ./results

# Disable LSH optimization for traditional k-mer comparison
assay-design --inclusion 689 --email your.email@example.com --auto-gene --exclusion-strategy intelligent --no-lsh --output-dir ./results
# Combine multiple options for fine-tuned control
assay-design --inclusion 689 --email your.email@example.com \
    --auto-gene \
    --gene-use-case phylogeny \
    --exclusion-strategy family \
    --max-exclusion-taxa 15 \
    --output-dir ./results
# Control gene selection fallback behavior
assay-design --inclusion 689 --email your.email@example.com \
    --auto-gene \
    --max-genes-to-try 5 \
    --min-gene-score 0.6 \
    --output-dir ./results

The same workflow can be driven from Python:
from assay_design.gene_selection import (
    auto_select_gene_for_taxon,
    GeneSelectionCriteria,
)
from assay_design.hierarchical_search import intelligent_exclusion_selection
from assay_design.data_retrieval import fetch_gene_sequences
from assay_design.target_identification import find_conserved_marker
from assay_design.primer_design import design_primers
# Step 1: Automatically select the best gene
gene_result = auto_select_gene_for_taxon(
    taxid="689",  # Vibrio mediterranei
    email="your.email@example.com",
)
selected_gene = gene_result['gene']
print(f"Selected gene: {selected_gene} (score: {gene_result['total_score']:.2f})")

# Step 2: Intelligently select exclusion taxa using phylogenetic distance weighting
exclusion_taxa = intelligent_exclusion_selection(
    taxid="689",
    email="your.email@example.com",
    strategy="genus",  # or "family", "order"
    phylo_distance_weight=0.8,
    max_exclusion_taxa=5,
)
exclusion_taxids = [t['taxid'] for t in exclusion_taxa]

# Step 3: Fetch sequences for inclusion and exclusion
inclusion_sequences = fetch_gene_sequences(
    taxid="689",
    gene_name=selected_gene,
    email="your.email@example.com",
    max_records=10,
)
exclusion_sequences = []
for taxid in exclusion_taxids:
    sequences = fetch_gene_sequences(
        taxid=taxid,
        gene_name=selected_gene,
        email="your.email@example.com",
        max_records=5,
    )
    exclusion_sequences.extend(sequences)

# Step 4: Find conserved markers
marker_info = find_conserved_marker(
    inclusion_sequences=inclusion_sequences,
    exclusion_sequences=exclusion_sequences,
)

# Step 5: Design primers
primers = design_primers(marker_info)
print(f"Designed primers: {primers}")

For finer control, candidate genes can be scored against custom criteria and ranked explicitly:
from assay_design.gene_selection import (
    rank_candidate_genes,
    evaluate_gene_suitability,
    GeneSelectionCriteria,
    BACTERIA_GENES,
)
# Define custom gene selection criteria
custom_criteria = GeneSelectionCriteria(
    min_sequence_count=20,
    ideal_length_range=(1000, 1800),
    single_copy_preferred=True,
    hgt_resistant_preferred=True,
)

# Evaluate all candidate genes
gene_scores = []
for gene_name, gene_info in BACTERIA_GENES.items():
    result = evaluate_gene_suitability(
        taxid="689",
        gene_name=gene_name,
        email="your.email@example.com",
        criteria=custom_criteria,
    )
    gene_scores.append(result)

# Rank genes by suitability
ranked_genes = rank_candidate_genes(gene_scores)

# Use the top-ranked gene
best_gene = ranked_genes[0]
print(f"Best gene: {best_gene['gene']}")
print(f"Score: {best_gene['total_score']:.2f}")
print(f"Sequences available: {best_gene['sequence_count']}")

A lower-level workflow skips gene selection and fetches sequences for each taxid directly:
from assay_design.data_retrieval import fetch_sequences_for_taxid, get_related_taxa
from assay_design.target_identification import find_conserved_marker
from assay_design.primer_design import design_primers
# Fetch sequences for inclusion taxid
inclusion_sequences = fetch_sequences_for_taxid(
    taxid="689",  # Vibrio mediterranei
    email="your.email@example.com",
    max_records=10,
)

# Get related taxa for exclusion with phylogenetic distance weighting
related_taxa = get_related_taxa(
    taxid="689",
    email="your.email@example.com",
    relationship="sibling",
    max_results=5,
    phylo_distance_weight=0.8,  # Prioritize phylogenetically closer taxa
)
exclusion_taxids = [taxon["taxid"] for taxon in related_taxa]

# Fetch sequences for exclusion taxids
exclusion_sequences = []
for taxid in exclusion_taxids:
    sequences = fetch_sequences_for_taxid(
        taxid=taxid,
        email="your.email@example.com",
        max_records=5,
    )
    exclusion_sequences.extend(sequences)

# Find conserved markers
marker_info = find_conserved_marker(
    inclusion_sequences=inclusion_sequences,
    exclusion_sequences=exclusion_sequences,
)

# Design primers
primers = design_primers(marker_info)
print(f"Designed primers: {primers}")

The automated gene selection feature evaluates candidate genes based on multiple criteria to identify the most suitable marker for your target taxon:
- Sequence Availability: Evaluates how many sequences are available in NCBI databases
  - Optimal: 20+ sequences available
  - Score decreases for sparse sequence data
- Copy Number Stability: Prefers single-copy genes to avoid PCR bias
  - Single-copy genes: Full score
  - Multi-copy genes (e.g., rRNA): Partial score
- HGT (Horizontal Gene Transfer) Resistance: Prioritizes genes with low HGT rates
  - Core housekeeping genes: Higher score
  - Mobile or frequently transferred genes: Lower score
- Sequence Length: Targets optimal length range for primer design
  - Ideal: 800-1500 bp
  - Penalties for sequences too short (<500 bp) or too long (>3000 bp)
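
As a rough illustration of how the four criteria above could combine into one score, here is a minimal sketch. It is not the package's implementation: the `GeneProfile` type, the weights, the partial-score values (0.5 and 0.3), and the linear length penalty are all assumptions chosen for the example. Only the thresholds (20+ sequences, 800-1500 bp ideal, 500 bp and 3000 bp limits) come from the list above.

```python
from dataclasses import dataclass


@dataclass
class GeneProfile:
    """Hypothetical summary of one candidate gene (not a package type)."""
    name: str
    sequence_count: int
    single_copy: bool
    hgt_resistant: bool
    length_bp: int


def availability_score(count, optimal=20):
    """Full score at `optimal` sequences or more, scaled down linearly below that."""
    return min(count / optimal, 1.0)


def length_score(length_bp, ideal=(800, 1500), limits=(500, 3000)):
    """1.0 inside the ideal range, 0.0 outside the limits, linear in between (assumed shape)."""
    ideal_lo, ideal_hi = ideal
    limit_lo, limit_hi = limits
    if ideal_lo <= length_bp <= ideal_hi:
        return 1.0
    if length_bp < limit_lo or length_bp > limit_hi:
        return 0.0
    if length_bp < ideal_lo:
        return (length_bp - limit_lo) / (ideal_lo - limit_lo)
    return (limit_hi - length_bp) / (limit_hi - ideal_hi)


def total_score(gene, weights=(0.3, 0.25, 0.25, 0.2)):
    """Weighted sum of the four criteria; the weights are illustrative only."""
    parts = (
        availability_score(gene.sequence_count),
        1.0 if gene.single_copy else 0.5,    # multi-copy genes get a partial score
        1.0 if gene.hgt_resistant else 0.3,  # frequently transferred genes score lower
        length_score(gene.length_bp),
    )
    return sum(w * p for w, p in zip(weights, parts))


candidates = [
    GeneProfile("rpoB", 40, True, True, 1200),
    GeneProfile("16S rRNA", 40, False, True, 1500),
]
for gene in sorted(candidates, key=total_score, reverse=True):
    print(f"{gene.name}: {total_score(gene):.3f}")
```

With these assumed weights, a single-copy gene outranks an otherwise identical multi-copy gene, which mirrors the copy-number preference described above.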
The package includes curated gene databases for different taxonomic domains:
Bacteria (30 genes):
- rpoB, rpoD, gyrB, recA, dnaK, groEL, atpD, fusA
- 16S rRNA, 23S rRNA, infB, tuf, rplB, rpsB
- And 16 additional housekeeping genes
Archaea (20 genes):
- rpoB, rpoA1, rpoA2, EF-2, SecY, RecA
- 16S rRNA, 23S rRNA, and 12 additional marker genes
Eukaryota (20 genes):
- 18S rRNA, 28S rRNA, ITS, COI, CytB
- actin, tubulin, EF-1α, and 12 additional markers
Viruses (15 genes):
- RdRp, capsid, polymerase, and 12 viral-specific markers
You can customize the gene selection criteria using the GeneSelectionCriteria class:
from assay_design.gene_selection import GeneSelectionCriteria
custom_criteria = GeneSelectionCriteria(
min_sequence_count=15, # Minimum sequences required
ideal_length_range=(1000, 1800), # Optimal length range (bp)
single_copy_preferred=True, # Prefer single-copy genes
hgt_resistant_preferred=True # Prefer HGT-resistant genes
)

The package provides multiple exclusion strategies to match your assay design goals:
- Intelligent (recommended): Multi-level phylogenetic-aware selection
  - Automatically selects exclusion taxa across multiple taxonomic levels
  - Balances close and distant relatives for robust specificity
  - Uses intelligent_exclusion_selection() with phylogenetic distance weighting
  - Control the number with --max-exclusion-taxa (default: 10)
- Siblings (genus-level): Excludes sibling species within the same genus
  - Best for species-specific assays
  - Example: Targeting Vibrio mediterranei, excludes other Vibrio species
- Family-level: Excludes related genera within the same family
  - Best for genus-specific assays
  - Example: Targeting Vibrio genus, excludes other Vibrionaceae genera
- Order-level: Excludes related families within the same order
  - Best for family-specific assays
  - Example: Targeting Vibrionaceae, excludes other Vibrionales families
- Manual: Use only user-specified exclusion taxa
  - Full control over which taxa to exclude
  - Specify with --exclusion taxid1,taxid2,taxid3
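
The rank-based strategies above amount to choosing which ancestor of the target to search for exclusion candidates. The sketch below illustrates that mapping, plus one possible reading of phylogenetic distance weighting. The helper names, the lineage dictionary, and the `weight ** rank_steps` formula are assumptions for illustration, not the package's API or its actual weighting scheme.

```python
# Rank searched for exclusion candidates under each rank-based strategy
# (taken from the strategy descriptions; the dict itself is hypothetical).
STRATEGY_TO_RANK = {"siblings": "genus", "family": "family", "order": "order"}


def exclusion_scope(lineage, strategy):
    """Return the ancestor taxon whose other members are exclusion candidates."""
    if strategy not in STRATEGY_TO_RANK:
        raise ValueError(f"no fixed rank for strategy {strategy!r}")
    rank = STRATEGY_TO_RANK[strategy]
    if rank not in lineage:
        raise ValueError(f"lineage has no {rank!r} entry")
    return lineage[rank]


def rank_candidates(candidates, phylo_distance_weight=0.8, max_exclusion_taxa=10):
    """Order candidates so that closer relatives come first.

    `candidates` is a list of (name, rank_steps) pairs, where rank_steps counts
    how many ranks above the target the shared ancestor sits (1 = same genus).
    The score weight ** rank_steps is an assumed formula for this example.
    """
    scored = [(name, phylo_distance_weight ** steps) for name, steps in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:max_exclusion_taxa]


lineage = {
    "species": "Vibrio mediterranei",
    "genus": "Vibrio",
    "family": "Vibrionaceae",
    "order": "Vibrionales",
}
print(exclusion_scope(lineage, "siblings"))
print(exclusion_scope(lineage, "order"))

# Placeholder candidate names, not real taxa.
candidates = [("same_order_taxon", 3), ("same_genus_taxon", 1), ("same_family_taxon", 2)]
for name, score in rank_candidates(candidates, max_exclusion_taxa=2):
    print(name, round(score, 2))
```

A pure closest-first ordering like this one would not by itself balance close and distant relatives, so the intelligent strategy presumably does more than this sketch shows.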
By default, the package uses Locality-Sensitive Hashing (LSH) for efficient sequence comparison when identifying specific marker regions. This significantly improves performance for large datasets.
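
To make the trade-off concrete, the sketch below compares an exact k-mer Jaccard similarity with a MinHash estimate of it. MinHash is a common LSH family for set similarity. Whether this package uses MinHash, and with what k-mer size or number of hashes, is not stated here, so treat every name and parameter in the sketch as an assumption.

```python
import hashlib


def kmers(seq, k=8):
    """Set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two k-mer sets (the traditional comparison)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _seeded_hash(kmer, seed):
    """64-bit hash of a k-mer; the seed selects one of many hash functions."""
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(kmer_set, num_hashes=64):
    """Fixed-size signature: the minimum hash value under each seeded hash."""
    if not kmer_set:
        raise ValueError("sequence is shorter than k; no k-mers to hash")
    return [min(_seeded_hash(km, seed) for km in kmer_set) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates the Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


seq_a = "ATGGCGTACGTTAGCCTAGGATCCGATTACGGATCGTAGCTAGCTAGGCT"
seq_b = "ATGGCGTACGTTAGCCTAGGATCCGATTACGGTTCGTAGCTAGCTAGGCT"  # one substitution

set_a, set_b = kmers(seq_a), kmers(seq_b)
print("exact:   ", round(jaccard(set_a, set_b), 3))
print("estimate:", round(estimate_jaccard(minhash_signature(set_a), minhash_signature(set_b)), 3))
```

The signatures have a fixed length regardless of sequence length, so comparing two of them costs the same for short genes and whole genomes. The price is that the result is an estimate whose accuracy depends on the number of hashes.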
If you need the traditional k-mer comparison method (e.g., for debugging or validation), disable LSH with the '--no-lsh' flag:
assay-design --inclusion 689 --email your.email@example.com --no-lsh

Requirements:
- Python 3.7+
- Biopython
- External tools (optional but recommended):
  - MAFFT, MUSCLE, or ClustalW for multiple sequence alignment
  - Primer3 for advanced primer design (the package includes a basic implementation)

This project is licensed under the MIT License - see the LICENSE file for details.