Example usage
This page provides practical examples for common intronIC use cases. For full argument documentation, see the Usage info page.
The easiest way to verify your installation:
# Run bundled test (Human Chr19, ~1 min with -p 4)
intronIC test -p 4
# Show test data location on your system
intronIC test --show-only
If you prefer to run classification manually with test data:
- Test data is bundled with the package; use intronIC test --show-only to find its location
- Alternatively, download the chromosome 19 test files:
The default pretrained model is loaded automatically:
intronIC -g Homo_sapiens.Chr19.Ensembl_91.fa.gz \
  -a Homo_sapiens.Chr19.Ensembl_91.gff3.gz \
  -n homo_sapiens
This works for virtually all species. You can optionally specify a custom model:
intronIC -g genome.fa -a annotation.gff -n species --model custom.model.pkl
To train a model on reference sequences:
intronIC train -n homo_sapiens
This creates a .model.pkl file that can be used for classification. Depending on the selected options, model training can take many hours; the default model should serve most users well.
To extract introns without classification:
intronIC extract -g genome.fa -a annotation.gff -n species
Information about the run is printed to the screen; the same information, plus some additional details, is written to the log.iic file:
================================================================================
intronIC v2.0.0
Started: 2025-12-08 12:44:39
================================================================================
Command and Configuration:
Command: /home/glarue/code/intronIC/.pixi/envs/default/bin/intronIC -g GCF_000001405.40_GRCh38.p14_genomic.fna.gz -a
GCF_000001405.40_GRCh38.p14_genomic.gff.gz -n homo_sapiens.cds -p 8 -f cds
Working directory: /home/glarue/code/intronIC/run_tests/hsapiens
Run name: homo_sapiens.cds
Input mode: annotation
Classification threshold: 90.0%
Output directory: /home/glarue/code/intronIC/run_tests/hsapiens
Genome: /home/glarue/code/intronIC/run_tests/hsapiens/GCF_000001405.40_GRCh38.p14_genomic.fna.gz
Annotation: /home/glarue/code/intronIC/run_tests/hsapiens/GCF_000001405.40_GRCh38.p14_genomic.gff.gz
Model: /home/glarue/code/intronIC/pretrained.sigmoid.model.pkl
ℹ Streaming mode: processing per-contig
ℹ Loading pretrained model from /home/glarue/code/intronIC/pretrained.sigmoid.model.pkl
Loaded ensemble with 16 models
Extracted frozen scaler from model normalizer
ℹ Loading PWM matrices
ℹ Indexing annotation: GCF_000001405.40_GRCh38.p14_genomic.gff.gz
Indexed 4,932,571 annotations across 705 contigs
ℹ Using indexed genome access: GCF_000001405.40_GRCh38.p14_genomic.fna.gz
ℹ Processing 705 contigs in parallel (8 processes)
Merging output: 202,594 (11.89%) scored + 45,650 (2.68%) omitted = 248,244 (14.56%) total introns for output files
ℹ Streaming classification complete: 202,594 introns classified
Total genes: 55,619, introns generated: 1,704,427
Intron Filtering Summary:
┌────────────────────────────┬────────────┬────────────┐
│ Category │ Included │ Excluded │
├────────────────────────────┼────────────┼────────────┤
│ Duplicates │ 0 │ 1,457,363 │
│ Too short │ 0 │ 240 │
│ Ambiguous bases │ 0 │ 4 │
│ Non-canonical │ 525 │ 0 │
│ Overlapping │ 0 │ 0 │
│ Alternative isoform │ 0 │ 45,211 │
├────────────────────────────┼────────────┼────────────┤
│ Total excluded │ │ 1,502,818 │
│ Retained for scoring │ │ 201,414 │
└────────────────────────────┴────────────┴────────────┘
Classification Results (threshold: 90.0%):
┌──────────────────────┬───────────┬────────────┐
│ Type │ Count │ Percentage │
├──────────────────────┼───────────┼────────────┤
│ U12-type (total) │ 702 │ 0.35% │
│ U12-type (AT-AC) │ 185 │ 0.09% │
│ U2-type │ 201,892 │ 99.65% │
├──────────────────────┼───────────┼────────────┤
│ Total │ 202,594 │ 100.00% │
└──────────────────────┴───────────┴────────────┘
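The classification table in the log can be checked or reused programmatically. The following is a minimal Python sketch, assuming the table rows keep the box-drawing layout shown above; the function name parse_table_rows and the embedded excerpt are illustrative and not part of intronIC.

```python
import re

# Data rows copied from the classification table above.
LOG_EXCERPT = """\
│ U12-type (total) │ 702 │ 0.35% │
│ U12-type (AT-AC) │ 185 │ 0.09% │
│ U2-type │ 201,892 │ 99.65% │
│ Total │ 202,594 │ 100.00% │
"""

# One data row: label, count with thousands separators, percentage.
ROW = re.compile(
    r"^│\s*(?P<label>.+?)\s*│\s*(?P<count>[\d,]+)\s*│\s*(?P<pct>[\d.]+)%\s*│\s*$"
)


def parse_table_rows(text):
    """Return {label: (count, percentage)} for each data row in the table."""
    rows = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:  # header and border lines do not match and are skipped
            count = int(match["count"].replace(",", ""))
            rows[match["label"]] = (count, float(match["pct"]))
    return rows


rows = parse_table_rows(LOG_EXCERPT)
total = rows["Total"][0]

# U12-type (total) and U2-type should add up to the number of classified introns.
assert rows["U12-type (total)"][0] + rows["U2-type"][0] == total

# Recompute each percentage from the counts and show it next to the logged value.
for label, (count, logged_pct) in rows.items():
    recomputed = round(100 * count / total, 2)
    print(f"{label}: {count:,} introns, {recomputed:.2f}% (log: {logged_pct:.2f}%)")
```

The AT-AC row is a subset of the U12-type total, so only the U12-type (total) and U2-type rows are summed against the overall total.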
If only the intron sequences are desired, use the extract subcommand, which skips classification and produces only a subset of the output files:
intronIC extract -g genome.fa -a annotation.gff -n species
For complex or reproducible runs, use a YAML configuration file:
# Generate a template configuration file
intronIC --generate-config > my_config.yaml
# Edit my_config.yaml, then run:
intronIC --config my_config.yaml -g genome.fa -a annotation.gff -n species
Example configuration:
scoring:
  threshold: 90.0
  exclude_noncanonical: false
extraction:
  flank_length: 100
  feature_type: both
training:
  n_models: 15
  eval_mode: nested_cv
performance:
  processes: 8
For most species, the default settings work well. In rare cases where you need reproducible normalization across multiple runs on genome subsets:
# First run: fit and save normalizer on full genome
intronIC -g genome.fa -a annotation.gff -n species \
  --normalizer-mode adaptive --save-normalizer
# Subsequent runs: reuse the normalizer
intronIC -g subset.fa -a subset.gff -n species \
  --load-normalizer species.normalizer.pkl
Note: This is an advanced feature. For standard analyses, simply use default settings.
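The two-step normalizer workflow above can also be scripted. The following is a minimal Python sketch, assuming intronIC is on the PATH and using only the flags shown above. The helper names (fit_normalizer_cmd, reuse_normalizer_cmd) are illustrative, and the normalizer file name is assumed to follow the species.normalizer.pkl pattern from the example.

```python
import shlex


def fit_normalizer_cmd(genome, annotation, name):
    """Build the first run: fit the normalizer on the full genome and save it."""
    return [
        "intronIC", "-g", genome, "-a", annotation, "-n", name,
        "--normalizer-mode", "adaptive", "--save-normalizer",
    ]


def reuse_normalizer_cmd(genome, annotation, name, normalizer):
    """Build a later run on a genome subset that reuses the saved normalizer."""
    return [
        "intronIC", "-g", genome, "-a", annotation, "-n", name,
        "--load-normalizer", normalizer,
    ]


commands = [
    fit_normalizer_cmd("genome.fa", "annotation.gff", "species"),
    reuse_normalizer_cmd(
        "subset.fa", "subset.gff", "species", "species.normalizer.pkl"
    ),
]

for cmd in commands:
    # Dry run: print the command line. To execute it instead, use
    # subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Building each command as a list keeps file names with spaces intact if the list is later passed to subprocess.run.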
Speed up analysis with parallel processes (streaming mode is default and scales efficiently):
intronIC -g genome.fa -a annotation.gff -n species -p 8
The -p flag parallelizes the entire extraction and scoring pipeline. With streaming mode (the default), -p values of 5 to 10 typically provide a 2-3× speedup with moderate memory usage.
Streaming mode is now the default and provides the best balance of speed and memory efficiency:
# Streaming mode is automatic - no flag needed
intronIC -g genome.fa -a annotation.gff -n species
# Human genome example: ~2-6 GB peak memory (depending on -p)
intronIC -g GRCh38.fa.gz -a gencode.gff3.gz -n homo_sapiens -p 8
Benefits:
- 85% memory savings vs. in-memory mode
- 2-3× faster with parallel processing
- Smooth progress bars with real-time updates
For very small genomes or debugging, you can opt into in-memory mode:
intronIC -g genome.fa -a annotation.gff -n species --no-streaming
This loads the full genome into memory. It is only recommended for genomes <100 MB; it is not appreciably faster than the default streaming mode and uses significantly more memory.
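If you do want to opt into in-memory mode for small inputs, the choice can be made from the genome file size. The following is a minimal Python sketch, assuming the 100 MB guideline above refers to the uncompressed FASTA size (an assumption, so compressed inputs are left in streaming mode). The helper name mode_flags and the threshold constant are illustrative and not part of intronIC.

```python
import os
import tempfile

# Roughly the 100 MB guideline from the text above (assumed to mean
# uncompressed FASTA size).
SMALL_GENOME_BYTES = 100 * 1000 * 1000


def mode_flags(genome_path, limit=SMALL_GENOME_BYTES):
    """Return ["--no-streaming"] for small uncompressed genomes, else []."""
    if genome_path.endswith(".gz"):
        # The on-disk size understates the genome size, so keep the default.
        return []
    if os.path.getsize(genome_path) < limit:
        return ["--no-streaming"]
    return []


# Demonstration with a tiny temporary FASTA file.
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as handle:
    handle.write(">chr_demo\nACGTACGTACGT\n")
    demo_path = handle.name

flags = mode_flags(demo_path)
command = [
    "intronIC", "-g", demo_path, "-a", "annotation.gff", "-n", "species", *flags
]
print(" ".join(command))
os.remove(demo_path)
```

Since streaming mode is the default and in-memory mode is not appreciably faster, returning an empty list is the safe choice whenever the size is in doubt.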
Many additional options exist for a variety of use cases. Run intronIC --help for details, or see the Full usage info page.