Example usage

Graham Larue edited this page Dec 16, 2025 · 13 revisions

This page provides practical examples for common intronIC use cases. For full argument documentation, see the Usage info page.

Quick test

The easiest way to verify your installation is to run the bundled test:

# Run bundled test (Human Chr19, ~1 min with -p 4)
intronIC test -p 4

# Show test data location on your system
intronIC test --show-only
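
If the test fails to start at all, the executable may not be on your PATH. The sketch below wraps the bundled test in a check for that case. The function name verify_install and its optional argument (an alternative executable name) are illustrative only and are not part of intronIC.

```shell
# Run the bundled test only if the executable can be found on PATH.
# Usage: verify_install [EXECUTABLE]   (EXECUTABLE defaults to intronIC)
# The optional argument is a hypothetical convenience, not an intronIC feature.
verify_install() {
    exe=${1:-intronIC}
    if command -v "$exe" >/dev/null 2>&1; then
        "$exe" test -p 4
    else
        echo "$exe not found on PATH; check your installation" >&2
        return 127
    fi
}
```

Calling verify_install with no arguments runs the same command as the quick test above.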

Test data for manual runs

If you prefer to run classification manually with test data:

  • Test data is bundled with the package; run intronIC test --show-only to find its location
  • Alternatively, download the chromosome 19 test files (the Homo_sapiens.Chr19.Ensembl_91 genome and annotation files used in the examples below)

Basic usage

Classification (recommended for most users)

The default pretrained model is loaded automatically:

intronIC -g Homo_sapiens.Chr19.Ensembl_91.fa.gz \
         -a Homo_sapiens.Chr19.Ensembl_91.gff3.gz \
         -n homo_sapiens

The default model works for virtually all species. To use a different model, pass it with --model:

intronIC -g genome.fa -a annotation.gff -n species --model custom.model.pkl
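
In scripted runs it can help to confirm that the input files are readable before starting. The following is a minimal sketch of such a wrapper, using only the -g, -a, -n and --model options shown above. The function name classify and the INTRONIC environment variable (which swaps in another executable, for example echo for a dry run) are assumptions made for illustration, not part of intronIC.

```shell
# Classify introns after confirming the inputs are readable.
# Usage: classify GENOME ANNOTATION NAME [MODEL]
# INTRONIC is a hypothetical override for the executable (e.g. INTRONIC=echo).
classify() {
    genome=$1 annotation=$2 name=$3 model=${4:-}
    for f in "$genome" "$annotation" ${model:+"$model"}; do
        if [ ! -r "$f" ]; then
            echo "cannot read input file: $f" >&2
            return 1
        fi
    done
    if [ -n "$model" ]; then
        "${INTRONIC:-intronIC}" -g "$genome" -a "$annotation" -n "$name" --model "$model"
    else
        "${INTRONIC:-intronIC}" -g "$genome" -a "$annotation" -n "$name"
    fi
}
```

For example, classify genome.fa annotation.gff species runs the default model, while adding custom.model.pkl as a fourth argument passes it via --model.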

Training a new model

To train a model on reference sequences:

intronIC train -n homo_sapiens

This creates a .model.pkl file that can be used for classification. Depending on the selected options, training can take many hours, and the default model should serve most users well.

Extracting intron sequences only

To extract introns without classification:

intronIC extract -g genome.fa -a annotation.gff -n species

During a run, intronIC prints a summary to the screen; the same information (plus some additional details) is written to the log.iic file. The following example is from a classification run on the human genome:

================================================================================
intronIC v2.0.0
Started: 2025-12-08 12:44:39
================================================================================

Command and Configuration:
  Command: /home/glarue/code/intronIC/.pixi/envs/default/bin/intronIC -g GCF_000001405.40_GRCh38.p14_genomic.fna.gz -a GCF_000001405.40_GRCh38.p14_genomic.gff.gz -n homo_sapiens.cds -p 8 -f cds
  Working directory: /home/glarue/code/intronIC/run_tests/hsapiens
  Run name: homo_sapiens.cds
  Input mode: annotation
  Classification threshold: 90.0%
  Output directory: /home/glarue/code/intronIC/run_tests/hsapiens
  Genome: /home/glarue/code/intronIC/run_tests/hsapiens/GCF_000001405.40_GRCh38.p14_genomic.fna.gz
  Annotation: /home/glarue/code/intronIC/run_tests/hsapiens/GCF_000001405.40_GRCh38.p14_genomic.gff.gz
  Model: /home/glarue/code/intronIC/pretrained.sigmoid.model.pkl

ℹ Streaming mode: processing per-contig
ℹ Loading pretrained model from /home/glarue/code/intronIC/pretrained.sigmoid.model.pkl
Loaded ensemble with 16 models
Extracted frozen scaler from model normalizer
ℹ Loading PWM matrices
ℹ Indexing annotation: GCF_000001405.40_GRCh38.p14_genomic.gff.gz
Indexed 4,932,571 annotations across 705 contigs
ℹ Using indexed genome access: GCF_000001405.40_GRCh38.p14_genomic.fna.gz
ℹ Processing 705 contigs in parallel (8 processes)
Merging output: 202,594 (11.89%) scored + 45,650 (2.68%) omitted = 248,244 (14.56%) total introns for output files
ℹ Streaming classification complete: 202,594 introns classified
Total genes: 55,619, introns generated: 1,704,427

Intron Filtering Summary:
┌────────────────────────────┬────────────┬────────────┐
│ Category                   │ Included   │ Excluded   │
├────────────────────────────┼────────────┼────────────┤
│   Duplicates               │          0 │  1,457,363 │
│   Too short                │          0 │        240 │
│   Ambiguous bases          │          0 │          4 │
│   Non-canonical            │        525 │          0 │
│   Overlapping              │          0 │          0 │
│   Alternative isoform      │          0 │     45,211 │
├────────────────────────────┼────────────┼────────────┤
│ Total excluded             │            │  1,502,818 │
│ Retained for scoring       │            │    201,414 │
└────────────────────────────┴────────────┴────────────┘


Classification Results (threshold: 90.0%):
┌──────────────────────┬───────────┬────────────┐
│ Type                 │ Count     │ Percentage │
├──────────────────────┼───────────┼────────────┤
│ U12-type (total)     │       702 │      0.35% │
│ U12-type (AT-AC)     │       185 │      0.09% │
│ U2-type              │   201,892 │     99.65% │
├──────────────────────┼───────────┼────────────┤
│ Total                │   202,594 │    100.00% │
└──────────────────────┴───────────┴────────────┘
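
To pull a single number out of this summary in a script, the table rows can be parsed with awk. This is a rough sketch that assumes the table layout shown above (cells separated by "│", label in the first cell, count in the second); the function name intron_count is made up for this example, and the layout may differ in other intronIC versions.

```shell
# Print the count for one row of the classification table in an intronIC log.
# Usage: intron_count LABEL LOGFILE
# Assumes the layout shown above: cells separated by "│", with the
# row label in the first cell and the count in the second.
intron_count() {
    awk -F'│' -v label="$1" '
        index($2, label) {
            gsub(/[ ,]/, "", $3)   # drop padding and thousands separators
            print $3
            exit
        }' "$2"
}
```

With the log shown above, calling intron_count 'U12-type (total)' on the log file prints 702.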

Sequence extraction only

If only the intron sequences are desired, use the extract subcommand, which skips classification and produces only a subset of the output files:

intronIC extract -g genome.fa -a annotation.gff -n species

Using configuration files

For complex or reproducible runs, use a YAML configuration file:

# Generate a template configuration file
intronIC --generate-config > my_config.yaml

# Edit my_config.yaml, then run:
intronIC --config my_config.yaml -g genome.fa -a annotation.gff -n species

Example configuration:

scoring:
  threshold: 90.0
  exclude_noncanonical: false

extraction:
  flank_length: 100
  feature_type: both

training:
  n_models: 15
  eval_mode: nested_cv

performance:
  processes: 8

Advanced: Custom normalization (rarely needed)

For most species, the default settings work well. In rare cases where you need reproducible normalization across multiple runs on genome subsets, you can fit the normalizer once and reuse it:

# First run: fit and save normalizer on full genome
intronIC -g genome.fa -a annotation.gff -n species \
         --normalizer-mode adaptive --save-normalizer

# Subsequent runs: reuse the normalizer
intronIC -g subset.fa -a subset.gff -n species \
         --load-normalizer species.normalizer.pkl

Note: This is an advanced feature. For standard analyses, simply use default settings.
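
When there are several subsets, the reuse step can be looped. The sketch below assumes each subset is stored as PREFIX.fa and PREFIX.gff and names each run after its prefix; that naming scheme, the function name score_subsets, and the INTRONIC dry-run override are all assumptions for illustration. Only the --load-normalizer option itself comes from the example above.

```shell
# Score several genome subsets with one previously saved normalizer.
# Usage: score_subsets NORMALIZER PREFIX...
# Each PREFIX is assumed to have PREFIX.fa and PREFIX.gff (hypothetical naming).
# INTRONIC is a hypothetical override for the executable (e.g. INTRONIC=echo).
score_subsets() {
    normalizer=$1
    shift
    for prefix in "$@"; do
        "${INTRONIC:-intronIC}" -g "$prefix.fa" -a "$prefix.gff" \
            -n "$(basename "$prefix")" \
            --load-normalizer "$normalizer" || return 1
    done
}
```

For example, score_subsets species.normalizer.pkl chr1 chr2 runs intronIC once per subset and stops at the first failure.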

Parallel processing

Speed up analysis with parallel processes (streaming mode is default and scales efficiently):

intronIC -g genome.fa -a annotation.gff -n species -p 8

The -p flag parallelizes the entire extraction and scoring pipeline. With streaming mode (the default), setting -p to a value between 5 and 10 typically provides a 2-3× speedup with moderate memory usage.
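
One way to choose a value for -p on shared or varied hardware is to use the number of available CPUs, capped at the upper end of the range mentioned above. This is a small sketch of that idea; the helper names cpu_count and pick_procs and the cap of 10 are choices made for this example, not intronIC behavior.

```shell
# Print the number of online CPUs, falling back to 1 if it cannot be determined.
cpu_count() {
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=1
    case "$n" in
        ''|*[!0-9]*) n=1 ;;   # anything non-numeric also falls back to 1
    esac
    echo "$n"
}

# Print the smaller of CPUS and CAP.
# Usage: pick_procs [CPUS] [CAP]   (defaults: detected CPU count, 10)
pick_procs() {
    cpus=${1:-$(cpu_count)}
    cap=${2:-10}
    if [ "$cpus" -gt "$cap" ]; then
        echo "$cap"
    else
        echo "$cpus"
    fi
}
```

The result can then be passed along, for example: intronIC -g genome.fa -a annotation.gff -n species -p "$(pick_procs)"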

Memory modes

Streaming mode (default)

Streaming mode is now the default and provides the best balance of speed and memory efficiency:

# Streaming mode is automatic - no flag needed
intronIC -g genome.fa -a annotation.gff -n species

# Human genome example: ~2-6 GB peak memory (depending on -p)
intronIC -g GRCh38.fa.gz -a gencode.gff3.gz -n homo_sapiens -p 8

Benefits:

  • 85% memory savings vs. in-memory mode
  • 2-3× faster with parallel processing
  • Smooth progress bars with real-time updates

In-memory mode (legacy)

For very small genomes or debugging, you can opt into in-memory mode:

intronIC -g genome.fa -a annotation.gff -n species --no-streaming

This loads the full genome into memory. It is recommended only for genomes smaller than 100 MB: it is not appreciably faster than the default streaming mode and uses significantly more memory.
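
To check whether a genome falls under that size before opting in, the uncompressed FASTA size can serve as a rough proxy. The sketch below is only an approximation: it counts header and newline bytes as well as sequence, and it treats "100 MB" as 100,000,000 bytes, which is an assumption. The function names are invented for this example.

```shell
# Print the uncompressed size of a FASTA file in bytes (handles .gz input).
fasta_bytes() {
    case "$1" in
        *.gz) gzip -dc "$1" | wc -c | tr -d '[:space:]' ;;
        *)    wc -c < "$1" | tr -d '[:space:]' ;;
    esac
}

# Succeed if the file is smaller than LIMIT bytes.
# Usage: is_small_genome FASTA [LIMIT]
# LIMIT defaults to 100000000, one reading of the "<100 MB" guidance above.
is_small_genome() {
    [ "$(fasta_bytes "$1")" -lt "${2:-100000000}" ]
}
```

A script could add --no-streaming only when is_small_genome genome.fa succeeds, and otherwise leave the default streaming mode in place.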


Many additional options exist for a variety of use cases. Run intronIC --help for additional details and/or see the Full usage info page.
