
Quick start

Graham Larue edited this page Dec 15, 2025 · 40 revisions

Quick start/testing

Installation

Using pip (recommended)

Install the latest stable version from PyPI:

python -m pip install intronIC

Or install the latest version directly from GitHub:

python -m pip install git+https://github.com/glarue/intronIC

To upgrade to the latest version:

python -m pip install git+https://github.com/glarue/intronIC --upgrade

Using pixi (for development)

Pixi manages all dependencies automatically:

# Install pixi
curl -fsSL https://pixi.sh/install.sh | bash

# Clone and set up
git clone https://github.com/glarue/intronIC.git
cd intronIC
pixi install
pixi run intronIC --help

From source

Clone the repository and install in development mode:

git clone https://github.com/glarue/intronIC.git
cd intronIC
python -m pip install -e .

Verifying Installation

After installing, verify it works with the bundled test data:

# Quick installation test (~1 minute with -p 4)
intronIC test -p 4

# Show where test data is located
intronIC test --show-only

This runs a smoke test to ensure intronIC is working correctly.

Dependencies

intronIC requires Python 3.10+ and the following packages:

  • numpy >=1.19.0 — Numerical operations
  • scipy >=1.5.0 — Scientific computing
  • scikit-learn >=0.22 — SVM classifier
  • biogl >=3.0 — Bioinformatics utilities
  • matplotlib (optional) — Plotting
  • rich (optional) — Progress bars
  • pyyaml (optional) — Configuration files

All required dependencies are installed automatically by pip.

intronIC was developed on Linux and has only been minimally tested on macOS and Windows.

Useful arguments

Every classification run requires a name (-n; see note below), plus one of the following:

  1. Genome (-g) and annotation or BED (-a or -b) files, or
  2. Intron sequences file (-q), in the same format as the reference sequences (see Training data and PWMS for formatting information)
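
When intronIC is called from a pipeline script, the two input combinations above can be checked before the command is launched. The following is a minimal sketch, not part of intronIC: `build_intronic_command` is a hypothetical helper that only assembles an argument list from flags documented on this page (-n, -g, -a, -b, -q, -p), and it assumes -a and -b are alternatives to each other.

```python
from typing import List, Optional


def build_intronic_command(
    name: str,
    genome: Optional[str] = None,
    annotation: Optional[str] = None,
    bed: Optional[str] = None,
    sequences: Optional[str] = None,
    processes: Optional[int] = None,
) -> List[str]:
    """Assemble an intronIC argument list (hypothetical helper, not an intronIC API)."""
    if not name:
        raise ValueError("a name (-n) is required")
    cmd = ["intronIC", "-n", name]
    if sequences is not None:
        # Option 2: an intron sequences file
        cmd += ["-q", sequences]
    elif genome is not None and (annotation is not None) != (bed is not None):
        # Option 1: a genome plus exactly one of annotation or BED (assumed exclusive)
        cmd += ["-g", genome]
        cmd += ["-a", annotation] if annotation is not None else ["-b", bed]
    else:
        raise ValueError("supply either -q, or -g together with one of -a/-b")
    if processes is not None:
        cmd += ["-p", str(processes)]
    return cmd
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`.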

By default, intronIC includes non-canonical introns, considers only the longest isoform of each gene, and uses streaming mode for memory efficiency. Helpful arguments may include:

  • -p number of parallel processes, which can significantly reduce runtime

  • -f cds use only CDS features to identify introns (by default, uses both CDS and exon features)

  • --no-nc exclude introns with non-canonical (non-GT-AG/GC-AG/AT-AC) boundaries

  • -i include introns from multiple isoforms of the same gene (default: longest isoform only)

  • --no-streaming disable streaming mode (uses more memory but avoids temporary storage)

  • --config path to YAML configuration file for advanced settings
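
To make the --no-nc definition above concrete, the check below tests whether a sequence has one of the three canonical boundary pairs (GT-AG, GC-AG, AT-AC). It is an illustrative sketch of the definition only; `is_canonical` is a hypothetical name and this is not how intronIC itself implements the option.

```python
# Canonical terminal dinucleotide pairs: (5' end, 3' end)
CANONICAL_BOUNDARIES = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def is_canonical(intron_seq: str) -> bool:
    """Return True if the intron starts and ends with a canonical dinucleotide pair."""
    seq = intron_seq.upper()
    if len(seq) < 4:
        # Too short to hold two non-overlapping dinucleotides
        return False
    return (seq[:2], seq[-2:]) in CANONICAL_BOUNDARIES
```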

Configuration files

intronIC supports YAML configuration files for managing complex runs. Configuration files are searched in this order:

  1. Path specified by --config
  2. .intronIC.yaml in current directory
  3. ~/.config/intronIC/config.yaml
  4. ~/.intronIC.yaml in home directory
  5. Built-in defaults
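
The search order above can be sketched as a simple first-match lookup. This is an illustration under stated assumptions, not intronIC's code: `find_config` is a hypothetical function, and it assumes a missing --config path falls through to the next candidate, whereas the real program may report an error instead.

```python
from pathlib import Path
from typing import Optional


def find_config(
    explicit: Optional[str] = None,
    cwd: Optional[Path] = None,
    home: Optional[Path] = None,
) -> Optional[Path]:
    """Return the first existing config file in the documented order, else None."""
    cwd = cwd or Path.cwd()
    home = home or Path.home()
    candidates = []
    if explicit is not None:
        candidates.append(Path(explicit))  # 1. path given via --config
    candidates += [
        cwd / ".intronIC.yaml",                         # 2. current directory
        home / ".config" / "intronIC" / "config.yaml",  # 3. user config directory
        home / ".intronIC.yaml",                        # 4. home directory
    ]
    for path in candidates:
        if path.is_file():
            return path
    return None  # 5. caller falls back to built-in defaults
```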

CLI arguments always override config file values. To generate a template configuration file:

intronIC --generate-config > my_config.yaml

Example configuration:

scoring:
  threshold: 90.0
  exclude_noncanonical: false

extraction:
  flank_length: 100
  feature_type: both

performance:
  processes: 8

Use with:

intronIC --config my_config.yaml -g genome.fa -a annotation.gff -n species

Running on test dataset

The easiest way to test intronIC is with the bundled test data:

intronIC test -p 4

This automatically uses the included chromosome 19 test data and verifies your installation.

Manual test with custom data

If you prefer to manually test with specific files:

  • If you have installed via pip, the test data is bundled with the package. Use intronIC test --show-only to see the location, or download the chromosome 19 FASTA and GFF3 sample files into a directory of your choice.

  • If you have cloned the repo, first change to the src/intronIC/data/test_data subdirectory, which contains Ensembl annotations and sequence for chromosome 19 of the human genome. From the repo root, you can run python -m intronIC instead of intronIC in the following examples.

Classify annotated introns

intronIC -g Homo_sapiens.Chr19.Ensembl_91.fa.gz -a Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens

The various output files contain different information about each intron; information can be cross-referenced by using the intron label (usually the first column of the file). By default, U12-type introns are those with probability scores >90% or, equivalently, relative scores >0 (which score is reported depends on the output file). For example, here is a U12-type AT-AC intron from the meta.iic file:

HomSap-ENSG00000141837@ENST00000614285_1(47);[c:-1]     10.0    AT-AC   GCC|ATATCCTTTT...TTTTCCTTAATT/TTTTTCCTTAAT...AATAC|TCC  CACCTCCAACACCCTTCTTTTCTTTGAACAAGAT[TTTTCCTTAATT]CCCCAATAC     50719   ENST00000614285 ENSG00000141837 1       47      0.0     2       u12     cds     corrected

To retrieve all U12-type introns from this file, filter on the relative score (2nd column; U12-type introns have relative scores >0), e.g.

awk '($2!="NA" && $2>0)' homo_sapiens.meta.iic
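
The same filter can be written in Python when the results feed into further analysis. This is a minimal sketch that assumes the meta.iic file is tab-delimited with the relative score in the second column, as described above; `u12_rows` is a hypothetical helper name.

```python
from typing import Iterator, List


def u12_rows(path: str) -> Iterator[List[str]]:
    """Yield rows of a tab-delimited meta.iic file whose relative score (column 2) is > 0."""
    with open(path) as handle:
        for line in handle:
            row = line.rstrip("\n").split("\t")
            if len(row) < 2 or row[1] == "NA":
                continue  # unscored intron or short line
            try:
                score = float(row[1])
            except ValueError:
                continue  # skip any non-numeric line (e.g. a header)
            if score > 0:
                yield row
```

For example, `labels = [row[0] for row in u12_rows("homo_sapiens.meta.iic")]` collects the intron labels of all U12-type introns.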

Extract all annotated intron sequences

If you just want to retrieve all annotated intron sequences (without classification), use the extract subcommand:

intronIC extract -g Homo_sapiens.Chr19.Ensembl_91.fa.gz -a Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens

See the rest of the Wiki for more extensive details about output files, usage info, etc.

A note on the -n (name) argument

By default, intronIC expects names in binomial (genus, species) form separated by a non-alphanumeric character, e.g. 'homo_sapiens', 'homo.sapiens', etc. intronIC then formats that name internally into a tag that it uses to label all output intron IDs, ignoring anything past the second non-alphanumeric character.

Output files, on the other hand, are named using the full name supplied via -n. To have intronIC use the argument supplied to -n unmodified when labeling introns, use the --na flag.

If you are running multiple versions of the same species and would like to keep the same species abbreviations in the output intron data, simply add a tag to the end of the name, e.g. "homo_sapiens.v2"; the tags within files will be consistent ("HomSap"), but the file names across runs will be distinct.
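
The naming behavior described above can be approximated as follows. This is a sketch, not intronIC's actual implementation: `species_tag` is a hypothetical function, and the rule of taking the first three letters of each of the first two name parts is an assumption inferred from the single "HomSap" example on this page. The fallback for names with fewer than two parts is likewise an assumption.

```python
import re


def species_tag(name: str) -> str:
    """Abbreviate a binomial name into a tag, ignoring anything after the species part."""
    # Split on runs of non-alphanumeric characters (e.g. "_" or ".")
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", name) if p]
    if len(parts) < 2:
        return name  # not binomial: leave unchanged (assumed behavior)
    genus, species = parts[0], parts[1]
    return genus[:3].capitalize() + species[:3].capitalize()
```

Under these assumptions, 'homo_sapiens', 'homo.sapiens' and 'homo_sapiens.v2' all map to the same tag, while the output file names still differ because they use the full -n value.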

Resource usage

Memory

Memory usage scales with genome annotation density. Typical requirements:

  • Human genome, streaming mode (default): ~2-3 GB peak memory
  • Human genome (GRCh38), standard mode (--no-streaming): ~18 GB peak memory
  • Non-model genomes: 1-5 GB depending on annotation density

Memory usage is primarily driven by the number of annotated introns rather than genome size. For large genomes, leave streaming mode enabled (the default) to greatly reduce memory requirements.

Streaming mode (default)

Streaming mode is now enabled by default. It writes intron sequences to temporary storage during extraction, keeping only the scoring motifs in memory. This reduces peak memory usage by ~85% (e.g., 11 GB → 2 GB for human genome).
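
The general idea behind this trade-off can be sketched in a few lines: full sequences are spooled to a temporary file while only short motifs stay in memory. This is a conceptual illustration only. `spool_introns`, the `motif_len` parameter and the choice of terminal motifs are hypothetical and do not reflect intronIC's internal data structures or the motifs it actually scores.

```python
import tempfile
from typing import Dict, IO, Iterable, Tuple


def spool_introns(
    introns: Iterable[Tuple[str, str]], motif_len: int = 12
) -> Tuple[Dict[str, Tuple[str, str]], IO[str]]:
    """Write full sequences to a temp file; keep only short terminal motifs in memory."""
    motifs: Dict[str, Tuple[str, str]] = {}
    spool = tempfile.TemporaryFile(mode="w+")
    for label, seq in introns:
        spool.write(f"{label}\t{seq}\n")  # full sequence goes to disk
        motifs[label] = (seq[:motif_len], seq[-motif_len:])  # small in-memory record
    spool.seek(0)  # rewind so the caller can re-read sequences when needed
    return motifs, spool
```

The caller is responsible for closing the returned file object, which also deletes the temporary file.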

To disable streaming mode (uses more memory but avoids temporary storage):

intronIC -g genome.fa -a annotation.gff -n species --no-streaming

Runtime

Runtime scales approximately linearly with the number of annotated introns:

  • Human genome: ~6-10 minutes (-p 8, ~250k introns)
  • Non-model genomes: typically 2-5 minutes (-p 8)
  • Small test datasets: 1-2 minutes

Note: These estimates are for classification with the default model. Model training with intronIC train can take significantly longer (minutes to hours) depending on configuration options.

Using parallel processes (-p) can significantly reduce runtime for the scoring phase, though extraction is primarily I/O-bound.