Overview

Graham Larue edited this page Dec 6, 2025 · 27 revisions

intronIC is a modular tool for intron extraction and U12/U2 classification. It has two primary uses:

  1. Classification mode: Score all annotated introns against expectations for U12-type introns, and classify introns as either U2- or U12-type using a support-vector machine (SVM)-based approach.
  2. Extraction mode: Retrieve all annotated intron sequences and associated metadata (using intronIC extract).

intronIC supports multiple input modes:

  • Genome + annotation (GFF3/GTF) for comprehensive analysis
  • Genome + BED file for coordinate-based extraction
  • Pre-extracted sequences (.iic format) for classification-only runs

Scientific Background

Minor vs Major Spliceosomes

Eukaryotic pre-mRNA splicing is catalyzed by two distinct spliceosomes:

Major (U2-dependent) spliceosome — Splices ~99.5% of introns

  • Recognizes GT-AG (most common) or GC-AG terminal dinucleotides
  • Uses U1, U2, U4, U5, U6 snRNPs
  • Branch point consensus: loose, A within ~18-40 nt of 3'SS

Minor (U12-dependent) spliceosome — Splices ~0.5% of introns

  • Recognizes both AT-AC (~25%) and GT-AG (~75%) terminal dinucleotides
  • Uses U11, U12, U4atac, U6atac, U5 snRNPs
  • Branch point consensus: highly conserved TCCTTAAC motif, typically 11-13 nt from 3'SS
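As a small illustration of the terminal dinucleotides listed above, the following sketch reads the boundary pair from an intron sequence. The function names and the choice to treat exactly GT-AG, GC-AG and AT-AC as the "canonical" set are assumptions made for this example, not intronIC's API. Because GT-AG is used by both spliceosomes, the boundaries alone cannot assign an intron to a type.

```python
# Boundary pairs named in the text above; treating this set as "canonical"
# is an assumption for illustration.
CANONICAL_BOUNDARIES = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def terminal_dinucleotides(intron_seq: str) -> tuple:
    """Return the (5' dinucleotide, 3' dinucleotide) pair of an intron."""
    seq = intron_seq.upper().replace("U", "T")  # accept RNA or DNA input
    if len(seq) < 4:
        raise ValueError("intron sequence must be at least 4 nt long")
    return seq[:2], seq[-2:]


def is_canonical(intron_seq: str) -> bool:
    """True if the intron has one of the boundary pairs listed above."""
    return terminal_dinucleotides(intron_seq) in CANONICAL_BOUNDARIES
```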

Despite their rarity, U12-type introns are functionally important and evolutionarily conserved across most eukaryotic lineages. Their loss has been documented in several lineages including C. elegans, certain fungi, and some protists (Alioto 2007; Moyer et al. 2020).

Classification basics

By default, introns with probability scores >90% are classified as U12-type (equivalent to a relative score >0). This threshold is likely conservative for most species: it should yield relatively few false positives, at the cost of some false negatives.
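The thresholding rule can be sketched as follows. This assumes the relative score is simply the probability minus the threshold, which is consistent with the statement above but is not confirmed here as intronIC's exact definition. The function name and label strings are hypothetical.

```python
def classify(probability_pct: float, threshold_pct: float = 90.0):
    """Return (label, relative_score) for a U12 probability given in percent.

    Assumption: relative score = probability - threshold, so a positive
    relative score corresponds to a U12-type call.
    """
    relative_score = probability_pct - threshold_pct
    label = "U12" if relative_score > 0 else "U2"
    return label, relative_score
```

Under this assumption, lowering `threshold_pct` trades fewer false negatives for more false positives.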

A useful sanity check is to inspect the plot.scatter.iic.png and plot.hex.iic.png figures and confirm that the putative intron types separate clearly.

For example, here is the score scatter plot for introns in human (each point is an intron, and the x and y axes are the z-scores for the 5'SS and BPS motifs, respectively), with introns classified as U12-type at a probability >90% in green:

Here are the same kinds of plots for species with significant (D. melanogaster) and complete (C. elegans) minor intron loss:

[Figures: score scatter plots for D. melanogaster and C. elegans]

The C. elegans plot highlights the presence of a handful of spuriously U12-like introns (red)—previously described in the literature—whose separation from the main cluster of introns is much less clear than in the case of the true positives in D. melanogaster.

Classification method

intronIC uses a three-stage pipeline for intron classification:

  1. PWM Scoring — Apply position-weight matrices to three key regions to calculate log-odds ratios:

    • 5' splice site (donor): -3 to +9 relative to intron start
    • Branch point: Search window -55 to -5 from 3'SS
    • 3' splice site (acceptor): -6 to +4 relative to intron end
  2. Normalization — Convert raw scores to z-scores using robust scaling (median/IQR). See the Technical Details page.

  3. SVM Classification — Linear SVM with balanced class weights outputs probability scores (0-100%)
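The first two stages can be sketched with the standard library alone. Everything here is an assumption for illustration: the two-position PWMs are toy values rather than intronIC's matrices, the log-odds score is taken as a per-position log2 ratio of U12 to U2 frequencies, the branch point is chosen as the best-scoring window in the search region, and the function names are hypothetical. The SVM stage is omitted.

```python
import math
import statistics

# Toy 2-position PWMs (hypothetical values, not intronIC's matrices).
U12_PWM = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
           {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1}]
U2_PWM = [{"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}] * 2


def log_odds(seq, u12_pwm=U12_PWM, u2_pwm=U2_PWM, floor=1e-4):
    """Sum of per-position log2(P_U12 / P_U2) over a motif-length sequence."""
    total = 0.0
    for base, u12_col, u2_col in zip(seq.upper(), u12_pwm, u2_pwm):
        # Bases missing from a column (e.g. N) fall back to a small floor.
        total += math.log2(u12_col.get(base, floor) / u2_col.get(base, floor))
    return total


def best_window(region, u12_pwm=U12_PWM, u2_pwm=U2_PWM):
    """Slide the PWM across a search region; return (best score, offset)."""
    width = len(u12_pwm)
    if len(region) < width:
        raise ValueError("search region is shorter than the motif")
    return max(
        (log_odds(region[i:i + width], u12_pwm, u2_pwm), i)
        for i in range(len(region) - width + 1)
    )


def robust_z(values):
    """Scale raw scores as (x - median) / IQR."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; scores cannot be scaled")
    return [(v - median) / iqr for v in values]
```

Median/IQR scaling is less sensitive to outliers than mean/standard-deviation scaling, which matters when a small minority of introns score very differently from the rest.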

For more details on the algorithm, including normalization, feature augmentation, and the pretrained model architecture, see the Technical Details page.

Intron scoring applies to unique introns only

Importantly, by default intronIC only processes introns with unique coordinates from the longest annotated isoform for each gene (though this behavior is adjustable). Therefore, the same intron from multiple isoforms will only be included once, and named based upon the longest isoform. See Training data and PWMs and Data filtering notes for additional caveats.
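One way to picture the deduplication is sketched below: isoforms are visited from longest to shortest, and each set of intron coordinates is recorded once under the first (longest) isoform that contains it. The data layout and function name are hypothetical. This sketch only shows the naming rule and does not reproduce any further isoform filtering intronIC may apply.

```python
def unique_introns(transcripts):
    """Map each unique intron to the longest isoform that contains it.

    transcripts: dict of transcript_id -> (transcript_length, introns),
    where introns is a list of (chrom, strand, start, end) tuples.
    """
    owner = {}
    # Longest isoform first; ties are broken by transcript ID for stability.
    ordered = sorted(transcripts.items(), key=lambda kv: (-kv[1][0], kv[0]))
    for transcript_id, (_length, introns) in ordered:
        for coords in introns:
            owner.setdefault(coords, transcript_id)  # keep first (longest) owner
    return owner
```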

Brief method summary

intronIC should be able to process most annotations (even kinda crappy ones) provided they roughly adhere to GFF3/GTF formatting standards, and produce a set of output files described in detail on the Output files page. In order to work with intronIC, the annotation file must have parent information in the last column, such that all features (CDS or exon) from the same transcript/gene can be associated with one another. Beyond that requirement, the parser should be fairly flexible.

At a high level, intronIC works by aggregating all of the CDS/exon sequences under their parent transcripts and/or genes based on the parent-child relationships given by the last column in the annotation file. CDS features are used preferentially, as they allow intron phase to be determined, but exon features are also used in cases where they define unique introns (unless run with -f cds).
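The parent-based aggregation can be sketched for GFF3 input as follows. This is a simplified illustration, not intronIC's parser: it handles only GFF3-style `key=value` attributes (GTF uses a different attribute syntax), ignores strand and sequence ID, and the function names are hypothetical. Introns are taken as the gaps between consecutive features that share a parent, using GFF3's 1-based inclusive coordinates.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Parse a GFF3 attribute column ("key=value;key=value") into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def introns_from_gff3(lines, feature_type="CDS"):
    """Group features by Parent and return the gaps between them as introns."""
    by_parent = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        parents = parse_attributes(cols[8]).get("Parent")
        if parents is None:
            continue  # no parent information, cannot be grouped
        for parent in parents.split(","):
            by_parent[parent].append((int(cols[3]), int(cols[4])))

    introns = {}
    for parent, spans in by_parent.items():
        spans.sort()
        introns[parent] = [
            (left_end + 1, right_start - 1)
            for (_, left_end), (right_start, _) in zip(spans, spans[1:])
            if right_start - left_end > 1  # skip abutting or overlapping features
        ]
    return introns
```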

In classification mode, intronIC then scores the introns against configurable position-weight matrices and classifies them using a pretrained SVM model (trained on curated human U2/U12 reference data). The output includes intron sequences, metadata, and classification information. For specialized use cases, users can also train custom models using the train subcommand.

Basic usage

A typical default scoring run uses both CDS and exon features to define introns, includes introns with non-canonical splice boundaries, and uses the built-in pretrained model:

intronIC -g {genome} -a {annotation} -n {binomial_name}

For the sample data:

intronIC -g Homo_sapiens.Chr19.Ensembl_91.fa.gz -a Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens

Advanced features

intronIC v2 includes several advanced capabilities:

  • YAML configuration files for managing complex parameter sets (see Example usage)
  • Pretrained models with adaptive normalization for cross-species classification
  • Species-specific prior adjustment for U12-absent or U12-poor lineages (--species-prior)
  • Ensemble training with cross-validation for robust model building (--n-models)
  • Parallel processing for improved performance (-p)
  • Streaming mode (enabled by default) for ~85% memory reduction on large genomes

Streaming mode (default)

Streaming mode is now the default behavior, dramatically reducing memory usage for large genomes (e.g., vertebrates with >200,000 annotated introns):

# Human genome: ~2 GB instead of ~12 GB peak memory
intronIC -g GRCh38.fa.gz -a gencode.gff3.gz -n homo_sapiens -p 8

Streaming mode writes intron sequences to temporary storage during extraction, keeping only scoring motifs in memory. Progress is tracked per annotation rather than per contig. To disable streaming mode, use --no-streaming.
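The general idea behind this kind of streaming can be sketched as below. This is a conceptual illustration under stated assumptions, not intronIC's implementation: the function name, the tab-separated spool format, and the `motif_fn` callback are all hypothetical. Full sequences go to a temporary file as they are produced, and only the short regions needed for scoring stay in memory.

```python
import tempfile


def stream_introns(introns, motif_fn):
    """Spool full intron sequences to disk, keeping only scoring motifs in RAM.

    introns: iterable of (name, sequence) pairs, consumed one at a time.
    motif_fn: callable that extracts the short scoring regions from a sequence.
    Returns (motifs_by_name, spool_file); the caller closes the spool file.
    """
    motifs = {}
    spool = tempfile.TemporaryFile(mode="w+")
    for name, seq in introns:
        spool.write(f"{name}\t{seq}\n")  # full sequence goes to disk
        motifs[name] = motif_fn(seq)     # only the short motifs stay in memory
    spool.seek(0)  # rewind so sequences can be read back when writing output
    return motifs, spool
```

Memory use then scales with the number of introns times the motif length rather than with total intron sequence length.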

See Quick start for further instructions on getting set up and checking your installation on test data.
