Output files

intronIC will automatically generate a set of output files. A brief description of the contents of each file follows (numbered lists represent the columns within each file).

Note: The meta.iic and score_info.iic files include column headers by default. Use --no-headers to omit headers if needed for compatibility with legacy pipelines.

For many use-cases, the most important files will be:

introns.iic - full intron sequences, with U12-type probability score in 2nd column
bed.iic - BED file of intron coordinates, with U12-type probability score in 5th column
meta.iic - metadata file with intron info such as parent transcript/gene, ordinal index, phase, position as % of coding sequence, etc.

`annotation.iic`

If putatively misannotated introns are found, intronIC will correct their coordinates by adjusting the features that define them in this file. These entries will contain a 'shift' tag that indicates the change made to either their start or stop coordinate (or both). This file is only made if misannotated introns are found (otherwise, it would be identical to the original annotation).

`demoted.iic`

NOTE: this option is disabled by default; see command line arguments for details

Scoring information for putative U12s whose scores fell below the threshold after their boundaries were switched to GT-AG (as a check to avoid scoring non-canonical introns as U12-type by way of superficial similarities to U12-type motifs):

1. label
2. initial score (with five and bp scores in parentheses), followed by reduced score after boundary switching

`dupe_map.iic`

A mapping table of unique, scored intron labels and their corresponding duplicate intron labels:

HomSap-gene-DIAPH2@rna-NM_006729.5_23(26)     HomSap-gene-DIAPH2@rna-NM_007309.4_23(26);[o:i];[i]

1. scored intron label
2. duplicate intron label
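
As a rough sketch of how this mapping might be used, the snippet below counts how many duplicate labels are recorded for each scored intron. The file name `sample.dupe_map.iic` and the labels are invented for illustration, and it assumes one duplicate label per row with whitespace-separated columns.

```shell
# Hypothetical sample data: scored label (column 1), duplicate label (column 2).
tmp=$(mktemp -d)
printf '%s\t%s\n' \
  'geneA@tx1_1(3)' 'geneA@tx2_1(3)' \
  'geneA@tx1_1(3)' 'geneA@tx3_1(2)' \
  'geneB@tx1_2(5)' 'geneB@tx2_2(4)' > "$tmp/sample.dupe_map.iic"

# Tally rows per scored intron label, then sort for stable output.
dupe_counts=$(awk '{n[$1]++} END {for (k in n) print k "\t" n[k]}' \
  "$tmp/sample.dupe_map.iic" | sort)
printf '%s\n' "$dupe_counts"

rm -rf "$tmp"
```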

`introns.iic`

All of the annotated intron sequences, including any introns not meeting scoring criteria (e.g. too short, including non-ATCG characters in scoring regions, etc.; includes duplicate introns if run with -d):

HomSap-gene-DIAPH2@rna-NM_006729.5_23(26)      100.0   TTTCAGCTCAAATTCTCAAGAGCAACCTTGCATCAATGGAACAACAAATTGTTCATCTGGAACGTGACATCAAGAAATTCCCCCAAGCAGAAAATCAACACGATAAGTTTGTGGAAAAGATGACC        ATATCCTTTATTTAT[...]TTAACAAAAAGCTAC       AGCTTTACAAAGACTGCCCGAGAACAGTATGAAAAACTCTCCACCATGCACAACAACATGATGAAGCTCTATGAGAATCTTGGAGAATACTTCATTTTTGACTCAAAGACAGTGAGCATAGAAGAGTTCTTTGGTGATCTCAACAACTTCCGAACTTTGTTTTTG

1. label (without score tag)
2. score (or '.' if run with -s or otherwise unscored)
3. upstream (5′ exon) sequence (100 nt by default, configurable via --flank-len)
4. intron sequence
5. downstream (3′ exon) sequence (100 nt by default, configurable via --flank-len)

`bed.iic`

A BED format file of intron coordinates with U12 probability scores and labels:

NC_000023.11      97247840        97348115        HomSap-gene-DIAPH2@rna-NM_006729.5_23(26);9.999% 99.99999999997891    +

1. genomic region (e.g. 'chr1')
2. start coordinate (0-indexed)
3. end coordinate (1-indexed)
4. label (including rounded score tag)
5. U12-type probability score (0-100)
6. strand

Note: The BED file is not created when using sequence-only input (-q flag), as BED format requires real genomic coordinates.
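
Because the label column here carries a score tag while the labels in the other files do not, it can help to strip the tag before matching labels across files. The sketch below assumes the tag is everything from the last `;` in the label onward, as in the example row above. The file name and rows are made up for illustration.

```shell
tmp=$(mktemp -d)
# Hypothetical BED rows following the six columns listed above.
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  chr1 100 250 'geneA@tx1_1(3);99.5%' 99.5 + \
  chr1 400 900 'geneA@tx1_2(3);0.1%' 0.1 + > "$tmp/sample.bed.iic"

# Remove the trailing ";<tag>" from column 4, then print bare label and score.
bare_labels=$(awk 'BEGIN {FS = OFS = "\t"} {sub(/;[^;]*$/, "", $4); print $4, $5}' \
  "$tmp/sample.bed.iic")
printf '%s\n' "$bare_labels"

rm -rf "$tmp"
```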

`log.iic`

A log of all of the information generated during operation, including total number of introns processed, excluded, etc., and total number of U12-type introns identified.

`pwms.iic`

A FASTA file of the PWMs used, including those built from the experimental dataset (not including pseudocount values).

`meta.iic`

A hodgepodge of other data about each intron (ordered by increasing U12 score):

HomSap-gene-DIAPH2@rna-NM_006729.5_23(26)        10.0    AT-AC   ACC|ATATCCTTTA...TGTTCCTTAACA/ATGTTCCTTAAC...GCTAC|AGC  TGATTGATTGCCTTTAAAAGGTACTGTTGAGCCA[TGTTCCTTAACA]AAAAGCTAC    100276  rna-NM_006729.5 gene-DIAPH2     23      26      86.025  0       u12     cds

1. label
2. relative score based upon score threshold (U2 <= 0 < U12)
3. terminal dinucleotides (e.g. GT-AG, AT-AC)
4. motif string (5', U12/U2 BPS, 3')
5. BPS location in context of 3' end
6. length (bp)
7. parent transcript
8. parent gene
9. ordinal position in transcript
10. total introns in transcript
11. fractional position in transcript as a percentage of the coding length, e.g. 50.0 for an intron that interrupts the coding sequence between codons 15 and 16 out of 30.
12. phase (0, between codons; 1, after the first base of a codon; 2, after the second base of a codon)
13. binary classification made by the classifier (u12 or u2), which may include introns of various probabilities within each class (i.e. introns labeled "u12" by the classifier may include introns with probabilities significantly lower than the specified threshold)
14. genomic feature used to define the intron (e.g. cds, exon)
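
To illustrate how the column positions above map onto awk fields, here is a small sketch that tallies the phase (column 12) of introns the classifier labeled u12 (column 13). The rows are placeholders invented for the example, and the columns are assumed to be tab-separated.

```shell
tmp=$(mktemp -d)
# Hypothetical rows with 14 columns; only phase (12) and class (13) matter here.
{
  printf 'i1\t10.0\tAT-AC\tm\tb\t500\ttx1\tgA\t1\t3\t20.0\t0\tu12\tcds\n'
  printf 'i2\t-80.0\tGT-AG\tm\tb\t900\ttx1\tgA\t2\t3\t55.0\t1\tu2\tcds\n'
  printf 'i3\t5.0\tGT-AG\tm\tb\t300\ttx2\tgB\t1\t2\t40.0\t0\tu12\tcds\n'
  printf 'i4\t2.5\tAT-AC\tm\tb\t700\ttx3\tgC\t4\t9\t70.0\t2\tu12\tcds\n'
} > "$tmp/sample.meta.iic"

# Count u12-labeled introns per phase; a header row is skipped because its
# column 13 is not literally "u12".
phase_counts=$(awk -F '\t' '$13 == "u12" {n[$12]++} END {for (p in n) print p "\t" n[p]}' \
  "$tmp/sample.meta.iic" | sort)
printf '%s\n' "$phase_counts"

rm -rf "$tmp"
```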

`score_info.iic`

Various scoring information (in order of increasing score).

HomSap-gene-DIAPH2@rna-NM_006729.5_23(26)        9.999999999978911       99.99999999997891       14.009663556553722      ACCATATCCTTT    32.68142988959584    5.7780213971590415      TGTTCCTTAACA    ATGTTCCTTAAC    12.230872477795534      3.1791686948829034      AGCTACAGCT      15.970632948856013      8.375420832149185

1. intron label
2. relative score (maximum precision); U2 <= 0 < U12
3. SVM-assigned U12 probability score (0-100) (averaged across N SVM classifiers if --n-models N)
4. 5′ sequence used for scoring
5. 5′ log-ratio score
6. 5′ z-score
7. U12-type branch point sequence used for scoring
8. U2-type branch point sequence used for scoring
9. branch point log-ratio score
10. branch point z-score
11. 3′ sequence used for scoring
12. 3′ log-ratio score
13. 3′ z-score
14. distance from hyperplane, i.e. the raw classifier output prior to scikit-learn's implementation of Platt scaling to convert distances to probabilities.

Understanding the scores

SVM Score (0-100)

The primary classification score representing the probability that an intron is U12-type:

| Score range | Interpretation |
|---|---|
| >90 | High confidence U12-type (default threshold) |
| 50-90 | Intermediate confidence |
| <50 | More likely U2-type |
| <10 | High confidence U2-type |
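
The ranges above can be turned into a quick tally. The sketch below bins a handful of made-up scores, such as those in column 5 of bed.iic. The bin names are invented for the example, and the overlapping "<50" and "<10" rows are treated as 10-50 and below 10 respectively, which is one possible reading.

```shell
# Hypothetical probability scores, one per line.
scores='99.9
72.0
35.5
4.2
91.0'

# Assign each score to the first matching bin, then report the four counts.
bins=$(printf '%s\n' "$scores" | awk '
  { s = $1 + 0 }
  s > 90  { n["high_u12"]++; next }
  s >= 50 { n["intermediate"]++; next }
  s >= 10 { n["likely_u2"]++; next }
          { n["high_u2"]++ }
  END { printf "%d %d %d %d\n", n["high_u12"], n["intermediate"], n["likely_u2"], n["high_u2"] }')
printf '%s\n' "$bins"
```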

Relative Score

relative_score = svm_score - threshold

This makes filtering easier:

Positive values (>0): Above threshold → classified as U12-type
Negative values (<0): Below threshold → classified as U2-type
Magnitude: Distance from the decision boundary
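
As a worked instance of the formula, the sketch below subtracts an assumed threshold of 90 (the default mentioned earlier) from two SVM scores. The first score is taken from the example rows on this page and the second is made up. Output is rounded to three decimals for readability.

```shell
threshold=90   # assumed default threshold; adjust if a different one was used

# relative_score = svm_score - threshold, one score per input line.
rel=$(printf '%s\n' 99.99999999997891 42.5 |
  awk -v t="$threshold" '{ printf "%.3f\n", $1 - t }')
printf '%s\n' "$rel"
```

A positive result falls on the U12-type side of the threshold and a negative one on the U2-type side.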

Raw vs Z-Scores

Raw scores (log-odds ratios): log₁₀(P(seq|U12) / P(seq|U2)) — Different ranges for each region
Z-scores: Normalized for comparison — Unit variance, centered around reference distribution

See Technical Details for information on the normalization approach.

Common operations

Finding U12-type introns

# From meta.iic (using relative score)
awk '($2!="." && $2>0)' species.meta.iic

# From bed.iic (using SVM score)
awk '$5 > 90' species.bed.iic

# Count total U12-type introns
# ($2+0 forces a numeric comparison, so a header line or '.' is not counted)
awk '($2+0 > 0)' species.meta.iic | wc -l

Extracting specific types

# AT-AC introns only
awk '($3 == "AT-AC")' species.meta.iic

# High-confidence U12-type AT-AC introns
awk '($2 > 5 && $3 == "AT-AC")' species.meta.iic

# U12-type introns in first half of transcript
awk '($2>0 && $11 < 50)' species.meta.iic

Converting to FASTA

# All U12-type intron sequences
awk '$2 > 90 {print ">"$1"\n"$4}' species.introns.iic > u12.fasta
