Data filtering notes

Graham Larue edited this page Dec 6, 2025 · 8 revisions

This page documents the filtering criteria applied during intron extraction and scoring, and explains the tagging system used in output files.

Duplicate and isoform filtering

  • When scoring introns, intronIC processes only introns with unique coordinates and, by default, only introns from the longest isoform of each gene. If run with -i (to include multiple isoforms), introns with duplicate coordinates are still excluded; among introns with identical coordinates, the one from the longest isoform (isoform length is computed as the sum of the component coding sequences) is included in preference to those from shorter isoforms. Duplicate introns may optionally be included in the sequences output file using -d.
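
The selection rule above can be sketched in Python as follows. This is a minimal illustration, not intronIC's actual code: the `Intron` record, its field names, and the `isoform_cds_length` mapping are hypothetical, and ties between equally long isoforms are resolved here by keeping the first intron seen, which is an assumption.

```python
from collections import namedtuple

# Hypothetical record type; the field names are illustrative only.
Intron = namedtuple("Intron", "name region strand start stop transcript")


def select_unique_introns(introns, isoform_cds_length):
    """Keep one intron per coordinate set, preferring the longest isoform.

    isoform_cds_length maps a transcript ID to the summed length of its
    coding sequences. Returns (kept, duplicates).
    """
    best = {}        # coordinate key -> intron currently kept
    duplicates = []  # introns displaced by, or losing to, a kept intron
    for intron in introns:
        key = (intron.region, intron.strand, intron.start, intron.stop)
        current = best.get(key)
        if current is None:
            best[key] = intron
        elif (isoform_cds_length[intron.transcript]
              > isoform_cds_length[current.transcript]):
            # The new intron comes from a longer isoform: it replaces the old one.
            duplicates.append(current)
            best[key] = intron
        else:
            duplicates.append(intron)
    return list(best.values()), duplicates


introns = [
    Intron("a", "chr1", "+", 100, 200, "t1"),
    Intron("b", "chr1", "+", 100, 200, "t2"),  # same coordinates as "a"
    Intron("c", "chr1", "+", 300, 400, "t1"),
]
lengths = {"t1": 900, "t2": 1500}
kept, dups = select_unique_introns(introns, lengths)
```

In this toy input, intron "b" is kept over "a" because transcript "t2" has the larger summed coding length; "a" ends up in the duplicates list, which is the set that -d would optionally write to the sequences output.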

Omission criteria

Introns may be omitted from the processed data under several criteria, depending on run options. Omitted introns are still included in the bed.iic and introns.iic files (and summarized in log.iic), tagged with [o:x] in the intron label, where x is one of the following:

| Tag | Code | Description |
|-----|------|-------------|
| [o:s] | short | Introns shorter than 30 nt (default) cannot be scored due to length requirements for the scored sub-sequences. Adjust with --min-intron-len. |
| [o:n] | non-canonical | Introns without terminal dinucleotides in the set {GT-AG, GC-AG, AT-AC} are excluded when run with --no-nc. |
| [o:a] | ambiguous | Introns with ambiguous characters (e.g., 'N') in scoring regions cannot be properly scored and are excluded. |
| [o:i] | isoform | If run without -i, introns not present in the longest isoform are excluded from scoring. |
| [o:v] | overlap | Introns with overlapping coordinates (when -v is used) are excluded. |
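
As a rough illustration of how these criteria combine, the sketch below assigns an omission tag to a single intron sequence. It rests on several assumptions that are not taken from intronIC itself: the function and parameter names are hypothetical (they only loosely mirror --min-intron-len, --no-nc, -i and -v), the order in which the checks run is arbitrary, and ambiguity is tested over the whole sequence rather than over the actual scoring regions.

```python
# Canonical terminal dinucleotide pairs listed in the table above.
CANONICAL = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def omission_tag(seq, *, min_len=30, exclude_noncanonical=False,
                 in_longest_isoform=True, include_isoforms=False,
                 overlaps_other=False, exclude_overlapping=False):
    """Return an [o:x] tag for an intron sequence, or None if it is kept.

    The check order is an assumption made for this sketch.
    """
    seq = seq.upper()
    if len(seq) < min_len:
        return "[o:s]"  # too short to score
    if set(seq) - set("ACGT"):
        # Simplification: the whole intron is checked, not just scoring regions.
        return "[o:a]"
    if exclude_noncanonical and (seq[:2], seq[-2:]) not in CANONICAL:
        return "[o:n]"
    if not include_isoforms and not in_longest_isoform:
        return "[o:i]"
    if exclude_overlapping and overlaps_other:
        return "[o:v]"
    return None


canonical = "GT" + "A" * 36 + "AG"      # 40 nt, GT-AG
noncanonical = "AA" + "C" * 36 + "AG"   # 40 nt, AA-AG
print(omission_tag("GTAAG"))                                   # short
print(omission_tag(noncanonical, exclude_noncanonical=True))   # non-canonical
print(omission_tag(canonical))                                 # kept
```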

Boundary correction

Non-canonical introns with very strong U12-like 5′ motifs near their annotated start will have their start and stop coordinates corrected (by equal amounts) to reflect the more U12-like splicing boundaries. These introns are tagged with [c:x], where x is the relative coordinate shift applied (e.g., [c:3] means shifted 3 bp downstream).

Why this matters: Some genome annotations have imprecise splice site coordinates, particularly for U12-type introns which are rarer and may have been annotated based on U2-type expectations. By searching a small window around the annotated boundary for strong U12 motifs, intronIC can recover the likely true splice site.

The total number of corrected introns is summarized in log.iic. Furthermore, in the annotation.iic output file, the features defining such introns (e.g., CDS or exon) will have their coordinates adjusted to reflect the new intron boundaries. This correction can be disabled with --no-nc-ss-adjustment.
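
The idea behind the correction can be sketched as a small window search followed by an equal shift of both boundaries. Everything specific here is an assumption made for illustration: the window size, the score threshold, the function names, and especially the scoring, which counts matches to a single toy motif as a stand-in for whatever motif scoring intronIC applies. Coordinates are treated as plus-strand, so a positive shift means downstream.

```python
TOY_MOTIF = "ATATCCTT"  # illustrative U12-like 5' motif; a stand-in only


def toy_score(seq, pos):
    """Count positions at which seq[pos:] agrees with the toy motif."""
    candidate = seq[pos:pos + len(TOY_MOTIF)]
    return sum(a == b for a, b in zip(candidate, TOY_MOTIF))


def best_shift(seq, annotated_start, score, window=5, min_score=0):
    """Return the offset in [-window, window] with the highest score.

    Only scores strictly above min_score count; 0 means "no correction".
    """
    best_offset, best_score = 0, min_score
    for offset in range(-window, window + 1):
        pos = annotated_start + offset
        if pos < 0 or pos >= len(seq):
            continue  # candidate start falls outside the available sequence
        s = score(seq, pos)
        if s > best_score:
            best_offset, best_score = offset, s
    return best_offset


def corrected_coords(start, stop, shift):
    """Shift both boundaries by the same amount (plus-strand coordinates)."""
    return start + shift, stop + shift


def correction_tag(shift):
    return f"[c:{shift}]"


seq = "CCCATATCCTTGGGGG"  # toy sequence with the motif 3 nt downstream
shift = best_shift(seq, 0, toy_score, window=5, min_score=7)
new_start, new_stop = corrected_coords(100, 250, shift)
tag = correction_tag(shift)
```

Passing the scoring function in as an argument keeps the search logic separate from the motif model, so the toy scorer could be swapped for a more realistic one without touching the rest.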

Other tags

| Tag | Description |
|-----|-------------|
| [n] | Non-canonical terminal dinucleotides (not in {GT-AG, GC-AG, AT-AC}) |
| [i] | Not from longest transcript isoform |
| [d] | Duplicate coordinates |
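
When post-processing output files, the tags described on this page can be pulled out of an intron label with a regular expression. The sketch below assumes only that tags appear in the label in the bracketed forms shown above, and that the [c:x] value is an integer; the example labels and the `parse_tags` helper are hypothetical and not part of intronIC.

```python
import re

# Matches [o:x] and [c:x] tags, or the single-letter tags [n], [i], [d].
TAG_PATTERN = re.compile(r"\[(o|c):([^\]]+)\]|\[([nid])\]")


def parse_tags(label):
    """Extract omission, correction and flag tags from an intron label."""
    tags = {"omitted": None, "correction": None, "flags": set()}
    for match in TAG_PATTERN.finditer(label):
        kind, value, flag = match.groups()
        if kind == "o":
            tags["omitted"] = value          # e.g. "s", "n", "a", "i", "v"
        elif kind == "c":
            tags["correction"] = int(value)  # relative coordinate shift
        else:
            tags["flags"].add(flag)          # "n", "i" or "d"
    return tags


# Hypothetical labels, made up for this example.
print(parse_tags("geneA-intron_2[o:s]"))
print(parse_tags("geneB-intron_5[c:3][n]"))
```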

Summary in log file

The log.iic file provides a summary of all filtering applied, which allows you to quickly assess data quality and understand why certain introns were not scored.
