-
Notifications
You must be signed in to change notification settings - Fork 1
Technical algorithm
This page provides a detailed technical description of intronIC's algorithm, data flow, and machine learning architecture.
Input: Genome (FASTA) + Annotation (GFF3/GTF) + Species Name
Note: By default, intronIC uses streaming mode, which processes one chromosome at a time for ~85% memory reduction on large genomes.
-
Stage 1: Intron Extraction
- Parse annotation hierarchy (gene → transcript → CDS/exon)
- Infer intron coordinates from exon gaps
- Extract intron + flanking exon sequences from genome
- Filter duplicates, short introns, isoforms (configurable)
-
Stage 2: PWM Scoring
- Score 5' splice site, branch point, 3' splice site regions
- Calculate log-odds ratios:
$\log\left(\frac{P(\text{seq}|\text{U12})}{P(\text{seq}|\text{U2})}\right)$ - Select best branch point position from search window
-
Stage 3: Normalization
- Convert raw log-odds to z-scores via robust scaling
- Fit on reference data or adapt to experimental data
-
Stage 4: Classification
- Feature augmentation (optional composite features)
- Linear SVM with balanced class weights
- Probability calibration via Platt scaling
- Output:
$P(\text{U12-type})$ from 0–100%
Output: .meta.iic, .bed.iic, .introns.iic, .score_info.iic, plots
Introns are inferred from gaps between consecutive exons (or CDS features) within the same transcript:
Exon 1 Intron 1 Exon 2 Intron 2 Exon 3
[==]------------------------------[==]------------------------------[==]
1-100 101-1600 1601-1700 1701-3200 3201-3300
Priority: CDS features are preferred over exon features when available, as they enable phase calculation.
Mixed CDS/Exon Handling: For transcripts with both CDS and exon features, introns are first generated from CDS features, then exon-only introns (typically in UTR regions) are added if they don't overlap existing CDS-derived introns. All introns are sorted by genomic position and assigned sequential indices, ensuring proper ordering (e.g., 5' UTR introns are numbered before CDS introns in coding direction).
Touching Exons (Annotation Artifacts): Some annotations contain adjacent exon features with no gap between them (zero-length "introns"). These are silently skipped and not included in the intron count (family_size). The intron index sequence remains contiguous (1, 2, 3...) with no gaps for these annotation artifacts.
| Filter | Default | Description |
|---|---|---|
| Duplicates | Exclude | Same coordinates from multiple isoforms |
| Longest isoform | Keep only | Can include all with -i
|
| Minimum length | 30 bp | Adjustable via --min-intron-len
|
| Ambiguous bases | Exclude | 'N' in scoring regions |
| Non-canonical | Include | Exclude with --no-nc
|
Position weight matrices (PWMs) capture the probability of observing each nucleotide at each position in a motif. intronIC includes PWMs for:
- U12-type: AT-AC and GT-AG subtypes
- U2-type: GT-AG and GC-AG subtypes
Each subtype has PWMs for all three regions (5' splice site, branch point, 3' splice site).
Default scoring windows relative to splice sites:
| Region | Start | End | Length | Description |
|---|---|---|---|---|
| 5' SS | -3 | +9 | 12 bp | Includes last 3 bp of upstream exon |
| Branch point | -55 | -5 | 50 bp | Search window from 3'SS |
| 3' SS | -6 | +4 | 10 bp | Includes first 4 bp of downstream exon |
For each region, the raw score is a log-odds ratio:
Where:
-
$b_i$ is the nucleotide at position$i$ - Higher positive values favor U12-type
- Higher negative values favor U2-type
- Zero means equally likely under both models
The branch point region is searched for the highest-scoring 7-mer matching the U12 branch point consensus (TCCTTAAC or similar). Multiple adenosine position variants are tested, and the highest-scoring position is selected.
Raw log-odds scores have different ranges and distributions for each region:
- 5'SS scores might range from -50 to +10
- BP scores might range from -20 to +5
- 3'SS scores might range from -5 to +3
Normalization converts these to comparable z-scores for the SVM.
intronIC uses robust z-score normalization via sklearn's RobustScaler:
Where IQR is the interquartile range (
Key properties:
- Robust to outliers: Median and IQR are resistant to extreme values
- Centered: Distribution centered around 0 for each feature
- Comparable scales: All three regions have similar variance after scaling
- Minimal U12 contamination: Rare U12s (~0.5%) don't significantly affect robust statistics
| Mode | Description | Use Case |
|---|---|---|
human |
Use scales from training (human) data | U12-absent species, cross-species |
adaptive |
Refit scales on experimental data | Species with different GC content |
auto |
Use human if available in model | Default behavior |
Note: The pretrained model includes its scaler, so results are reproducible without additional steps when using --normalizer-mode human or auto.
Base features (3D):
-
five_z_score: 5' splice site z-score -
bp_z_score: Branch point z-score -
three_z_score: 3' splice site z-score
Augmented features (optional, 7D default):
-
min_all:$\min(z_{5'}, z_{\text{BP}}, z_{3'})$ — Requires ALL signals strong -
absdiff_5_bp:$|z_{5'} - z_{\text{BP}}|$ — Penalizes 5′/BP imbalance -
absdiff_5_3:$|z_{5'} - z_{3'}|$ — Penalizes 5′/3′ imbalance -
absdiff_bp_3:$|z_{\text{BP}} - z_{3'}|$ — Penalizes BP/3′ imbalance
The augmented features help the linear SVM reject "one-end-strong" false positives—U2-type introns with one unusually U12-like signal.
intronIC uses a linear support vector machine (sklearn's LinearSVC):
- Kernel: Linear (interpretable coefficients)
- Class weights: Balanced to handle ~0.5% U12 prevalence
- Regularization: L1 or L2 penalty (selected via cross-validation), C optimized via grid search
- Convergence: max_iter=50,000, tol=1e-4
The raw SVM outputs decision distances (signed distance from hyperplane). These are converted to probabilities using sklearn's CalibratedClassifierCV. During training, both sigmoid (Platt scaling) and isotonic calibration are evaluated via cross-validation, and the method with lower log-loss is selected.
- Sigmoid (Platt scaling): Fits a logistic function with 2 parameters—assumes sigmoid-shaped relationship between decision function and probability
- Isotonic: Non-parametric monotonically increasing step function—more flexible, typically selected when sufficient training data is available
The default pretrained model uses sigmoid calibration.
When --n-models > 1, multiple SVMs are trained with different U2 subsamples:
- Each model sees all U12 references but a different 80% of U2 references
- Predictions are averaged across models
- Reduces variance and improves robustness
The default pretrained model was trained on:
Training data:
- ~400 conserved human U12-type introns (multiple evidence sources)
- ~20,000 conserved human U2-type introns
Training process:
- Score all reference introns with PWMs
- Normalize using robust scaling (median/IQR)
- Optimize SVM hyperparameters via 5-fold cross-validation
- Train final model on all reference data
- Calibrate probabilities via cross-validation
Evaluation:
- Nested cross-validation for unbiased performance estimates
- Balanced accuracy, F1-score, and PR-AUC reported
- Validated on held-out human introns and cross-species data
Species with different GC content than human may show shifted score distributions. Options:
-
Adaptive normalization: Refit scaler on experimental data (
--normalizer-mode adaptive) -
Prior adjustment: Adjust base rate expectation (
--species-prior)
For species known to lack U12-type introns (C. elegans, many fungi):
intronIC -g genome.fa -a annotation.gff -n species \
--normalizer-mode human --species-prior 1e-6The reduced prior shifts probability thresholds to minimize false positives.
The default human-trained model generalizes well to other vertebrates and most eukaryotes with U12-type introns. Performance may degrade for:
- Very distant lineages (plants, protists)
- Lineages with unusual U12 motifs
- Species with extreme GC bias
Consider providing species-specific reference sequences for best results.
Memory scales with annotation density:
- Loads all intron sequences into memory
- Human genome (~250k introns): ~12 GB peak
Dramatically reduced memory:
- Writes sequences to temporary SQLite database
- Keeps only scoring motifs in memory
- Human genome: ~2-3 GB peak (~85% reduction)
- Trade-off: Slightly slower I/O
The -p N flag parallelizes PWM scoring:
- Scoring is CPU-bound and embarrassingly parallel
- Linear speedup up to ~8-16 cores
- Diminishing returns beyond that
Original intronIC paper:
Moyer DC, Larue GE, Hershberger CE, Roy SW, Padgett RA. (2020) Comprehensive database and evolutionary dynamics of U12-type introns. Nucleic Acids Research 48(13):7066–7078. doi:10.1093/nar/gkaa464
U12-type intron databases:
Alioto TS. (2007) U12DB: a database of orthologous U12-type spliceosomal introns. Nucleic Acids Research 35:D110-D115. doi:10.1093/nar/gkl842
Branch point mapping:
Mercer TR, et al. (2015) Genome-wide discovery of human splicing branchpoints. Genome Research 25:290-303. doi:10.1101/gr.182477.114
Pineda JMB, Bradley RK. (2018) Most human introns are recognized via multiple and tissue-specific branchpoints. Genes & Development 32(7-8):577-591. doi:10.1101/gad.312058.118
U12-type intron retention and functional studies:
Niemelä EH, et al. (2014) Global analysis of the nuclear processing of transcripts with unspliced U12-type introns by the exosome. Nucleic Acids Research 42(11):7358-7369. doi:10.1093/nar/gku391
Madan V, et al. (2015) Aberrant splicing of U12-type introns is the hallmark of ZRSR2 mutant myelodysplastic syndrome. Nature Communications 6:6042. doi:10.1038/ncomms7042
Cologne A, et al. (2019) New insights into minor splicing—a transcriptomic analysis of cells derived from TALS patients. RNA 25(9):1130-1149. doi:10.1261/rna.071423.119
SVM probability calibration:
Platt JC. (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers pp. 61-74.