Quick start
Install the latest stable version from PyPI:
python -m pip install intronIC
Or install the latest version directly from GitHub:
python -m pip install git+https://github.com/glarue/intronIC
To upgrade to the latest version:
python -m pip install git+https://github.com/glarue/intronIC --upgrade
Pixi manages all dependencies automatically:
# Install pixi
curl -fsSL https://pixi.sh/install.sh | bash
# Clone and set up
git clone https://github.com/glarue/intronIC.git
cd intronIC
pixi install
pixi run intronIC --help
Clone the repository and install in development mode:
git clone https://github.com/glarue/intronIC.git
cd intronIC
pip install -e .
After installing, verify it works with the bundled test data:
# Quick installation test (~1 minute with -p 4)
intronIC test -p 4
# Show where test data is located
intronIC test --show-only
This runs a smoke test to ensure intronIC is working correctly.
intronIC requires Python 3.10+ and the following packages:
- numpy >=1.19.0: numerical operations
- scipy >=1.5.0: scientific computing
- scikit-learn >=0.22: SVM classifier
- biogl >=3.0: bioinformatics utilities
- matplotlib (optional): plotting
- rich (optional): progress bars
- pyyaml (optional): configuration files
All required dependencies are installed automatically by pip.
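If you want to confirm that an existing environment satisfies these minimums before installing, a check along the following lines may help. This is only a sketch: the helper names (`version_tuple`, `meets_minimum`, `check_environment`) are made up for illustration and are not part of intronIC, and the version comparison is deliberately simplified (it looks only at the leading integer of each dotted component, so a pre-release such as 1.5.0rc1 is treated as 1.5.0).

```python
import re
import sys
from importlib import metadata

# Minimum versions copied from the requirements list above.
MINIMUMS = {
    "numpy": "1.19.0",
    "scipy": "1.5.0",
    "scikit-learn": "0.22",
    "biogl": "3.0",
}


def version_tuple(version: str) -> tuple:
    """Turn '1.19.0' or '1.5.0rc1' into a tuple of leading integers."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two version strings numerically, padding with zeros."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment() -> dict:
    """Report whether Python and each required package meet the minimum."""
    report = {"python": sys.version_info >= (3, 10)}
    for name, minimum in MINIMUMS.items():
        try:
            report[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            # Package is not installed in this environment.
            report[name] = False
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing or too old'}")
```

Since pip resolves these requirements on its own, a check like this is mainly useful when diagnosing a shared or pre-built environment.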
intronIC was developed on Linux and has only been minimally tested on macOS and Windows.
The required arguments for any classification run include a name (-n; see note below), along with:
- Genome (-g) and annotation/BED (-a, -b) files, or
- Intron sequences file (-q) (see Training data and PWMS for formatting information, which matches the reference sequence format)
By default, intronIC includes non-canonical introns, considers only the longest isoform of each gene, and uses streaming mode for memory efficiency. Helpful arguments may include:
- -p: parallel processes, which can significantly reduce runtime
- -f cds: use only CDS features to identify introns (by default, uses both CDS and exon features)
- --no-nc: exclude introns with non-canonical (non-GT-AG/GC-AG/AT-AC) boundaries
- -i: include introns from multiple isoforms of the same gene (default: longest isoform only)
- --no-streaming: disable streaming mode (uses more memory but avoids temporary storage)
- --config: path to YAML configuration file for advanced settings
intronIC supports YAML configuration files for managing complex runs. Configuration files are searched in this order:
- Path specified by --config
- .intronIC.yaml in current directory
- ~/.config/intronIC/config.yaml
- ~/.intronIC.yaml in home directory
- Built-in defaults
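The search order above can be sketched roughly as follows. This illustrates the precedence only and is not intronIC's actual implementation; the function name `find_config` and its parameters are hypothetical, and returning `None` stands in for "use the built-in defaults".

```python
from pathlib import Path
from typing import Optional


def find_config(cli_path: Optional[str] = None,
                cwd: Optional[Path] = None,
                home: Optional[Path] = None) -> Optional[Path]:
    """Return the first configuration file in the documented search order."""
    cwd = cwd or Path.cwd()
    home = home or Path.home()

    # 1. An explicit --config path wins outright.
    if cli_path is not None:
        return Path(cli_path)

    # 2-4. Otherwise, check the conventional locations in order.
    candidates = [
        cwd / ".intronIC.yaml",
        home / ".config" / "intronIC" / "config.yaml",
        home / ".intronIC.yaml",
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate

    # 5. Nothing found: the caller falls back to built-in defaults.
    return None
```

The `cwd` and `home` parameters exist only so the lookup can be exercised against temporary directories without touching a real home directory.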
CLI arguments always override config file values. To generate a template configuration file:
intronIC --generate-config > my_config.yaml
Example configuration:
scoring:
  threshold: 90.0
  exclude_noncanonical: false
extraction:
  flank_length: 100
  feature_type: both
performance:
  processes: 8
Use with:
intronIC --config my_config.yaml -g genome.fa -a annotation.gff -n species
The easiest way to test intronIC is with the bundled test data:
intronIC test -p 4
This automatically uses the included chromosome 19 test data and verifies your installation.
If you prefer to manually test with specific files:
- If you have installed via pip, the test data is bundled with the package. Use intronIC test --show-only to see the location, or download the chromosome 19 FASTA and GFF3 sample files into a directory of your choice.
- If you have cloned the repo, first change to the src/intronIC/data/test_data subdirectory, which contains Ensembl annotations and sequence for chromosome 19 of the human genome. From the repo root, you can run python -m intronIC instead of intronIC in the following examples.
intronIC -g Homo_sapiens.Chr19.Ensembl_91.fa.gz -a Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens
The various output files contain different information about each intron; information can be cross-referenced using the intron label (usually the first column of the file). By default, U12-type introns are those with probability scores >90% or, equivalently (depending on the output file), relative scores >0. For example, here is a U12-type AT-AC intron from the meta.iic file:
HomSap-ENSG00000141837@ENST00000614285_1(47);[c:-1] 10.0 AT-AC GCC|ATATCCTTTT...TTTTCCTTAATT/TTTTTCCTTAAT...AATAC|TCC CACCTCCAACACCCTTCTTTTCTTTGAACAAGAT[TTTTCCTTAATT]CCCCAATAC 50719 ENST00000614285 ENSG00000141837 1 47 0.0 2 u12 cds corrected
To retrieve all U12-type introns from this file, one can filter based on the relative score (2nd column; U12-type introns have relative scores >0), e.g.
awk '($2!="NA" && $2>0)' homo_sapiens.meta.iic
If you just want to retrieve all annotated intron sequences (without classification), use the extract subcommand:
intronIC extract -g Homo_sapiens.Chr19.Ensembl_91.fa.gz -a Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens
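For downstream scripting, the relative-score filter on meta.iic described earlier (second column numeric and greater than 0) can also be written in Python. The following is a sketch under the assumption that columns are separated by whitespace or tabs and that the intron label in the first column contains no spaces; the function name `u12_rows` is hypothetical and not part of intronIC.

```python
from typing import Iterable, Iterator


def u12_rows(lines: Iterable[str]) -> Iterator[str]:
    """Yield rows whose second column is a number greater than 0."""
    for line in lines:
        # Only the first two columns are needed; leave the rest unsplit.
        fields = line.split(None, 2)
        if len(fields) < 2:
            continue
        try:
            score = float(fields[1])
        except ValueError:
            # Non-numeric scores such as "NA" are skipped.
            continue
        if score > 0:
            yield line.rstrip("\n")


# Example usage (file name as produced by the classification run above):
# with open("homo_sapiens.meta.iic") as handle:
#     for row in u12_rows(handle):
#         print(row)
```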
See the rest of the Wiki for more extensive details about output files, usage info, etc.
By default, intronIC expects names in binomial (genus, species) form separated by a non-alphanumeric character, e.g. 'homo_sapiens', 'homo.sapiens', etc. intronIC then formats that name internally into a tag that it uses to label all output intron IDs, ignoring anything past the second non-alphanumeric character.
Output files, on the other hand, are named using the full name supplied via -n. If you'd prefer to have it leave whatever argument you supply to -n unmodified, use the --na flag.
If you are running multiple versions of the same species and would like to keep the same species abbreviations in the output intron data, simply add a tag to the end of the name, e.g. "homo_sapiens.v2"; the tags within files will be consistent ("HomSap"), but the file names across runs will be distinct.
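To make the naming behavior concrete, here is a rough sketch of how a name could be reduced to a tag. It is inferred only from the "homo_sapiens" → "HomSap" example above: the three-letter-per-part rule and the function name `species_tag` are assumptions for illustration, and intronIC's real formatting logic may differ.

```python
import re


def species_tag(name: str, letters: int = 3) -> str:
    """Build a tag from the first two alphanumeric parts of a name."""
    # Split on runs of non-alphanumeric characters and drop empty pieces.
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", name) if p]
    # Keep only the first two parts; anything after is ignored.
    return "".join(p[:letters].capitalize() for p in parts[:2])
```

Under these assumptions, "homo_sapiens", "homo.sapiens", and "homo_sapiens.v2" all map to the same tag, which matches the behavior described for versioned runs.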
Memory usage scales with genome annotation density. Typical requirements:
- Human genome with --streaming (default): ~2-3 GB peak memory
- Human genome (GRCh38): ~18 GB peak memory (standard mode)
- Non-model genomes: 1-5 GB depending on annotation density
Memory usage is primarily driven by the number of annotated introns rather than genome size. For large genomes, use --streaming mode (the default) to dramatically reduce memory requirements.
Streaming mode is now enabled by default. It writes intron sequences to temporary storage during extraction, keeping only the scoring motifs in memory. This reduces peak memory usage by ~85% (e.g., 11 GB → 2 GB for human genome).
To disable streaming mode (uses more memory but avoids temporary storage):
intronIC -g genome.fa -a annotation.gff -n species --no-streaming
Runtime scales approximately linearly with the number of annotated introns:
- Human genome: ~6-10 minutes (-p 8, ~250k introns)
- Non-model genomes: typically 2-5 minutes (-p 8)
- Small test datasets: 1-2 minutes
Note: These estimates are for classification with the default model. Model training with intronIC train can take significantly longer (minutes to hours) depending on configuration options.
Using parallel processes (-p) can significantly reduce runtime for the scoring phase, though extraction is primarily I/O-bound.