Data filtering notes
This page documents the filtering criteria applied during intron extraction and scoring, and explains the tagging system used in output files.
When scoring introns, intronIC only processes introns with unique coordinates, and by default only includes introns from the longest isoform for a given gene. If run with -i (to include multiple isoforms), introns with duplicate coordinates are still excluded; in such cases, introns from the longest isoform (computed as the sum of the component coding sequences) will be preferentially included over introns with identical coordinates from shorter isoforms. Duplicate introns may optionally be included in the sequences output file using -d.
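The duplicate-handling rule above can be sketched as follows. This is only an illustration, not intronIC's actual code: the `Intron` record, the `isoform_length` helper, and the assumption of 1-based inclusive CDS coordinates are all hypothetical.

```python
from collections import namedtuple

# Hypothetical record type; intronIC's internal representation may differ.
Intron = namedtuple("Intron", "name transcript chrom strand start stop")


def isoform_length(cds_intervals):
    """Sum of CDS lengths, assuming 1-based inclusive (start, stop) pairs."""
    return sum(stop - start + 1 for start, stop in cds_intervals)


def select_unique_introns(introns, isoform_lengths):
    """Keep one intron per coordinate set, preferring the longest isoform."""
    best = {}
    for intron in introns:
        key = (intron.chrom, intron.strand, intron.start, intron.stop)
        current = best.get(key)
        # Replace the stored intron only if this one comes from a longer isoform.
        if current is None or (
            isoform_lengths[intron.transcript] > isoform_lengths[current.transcript]
        ):
            best[key] = intron
    return list(best.values())


lengths = {
    "t1": isoform_length([(1, 300), (501, 1100)]),  # 900 nt of CDS
    "t2": isoform_length([(1, 300)]),               # 300 nt of CDS
}
introns = [
    Intron("i1", "t2", "chr1", "+", 301, 500),
    Intron("i2", "t1", "chr1", "+", 301, 500),   # same coordinates as i1
    Intron("i3", "t1", "chr1", "+", 1101, 1300),
]
kept = select_unique_introns(introns, lengths)
```

Here `i2` is kept over `i1` because both share coordinates and `t1` has the longer summed coding sequence.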
There are a number of criteria by which introns may be omitted from the processed data, depending on run options. These introns will be included in the bed.iic and introns.iic files (and summarized in log.iic), tagged with [o:x] in the intron label where x is one of the following:
| Tag | Code | Description |
|---|---|---|
| [o:s] | short | Introns shorter than 30 nt (default) cannot be scored due to length requirements for the scored sub-sequences. Adjust with --min-intron-len. |
| [o:n] | non-canonical | Introns without terminal dinucleotides in the set {GT-AG, GC-AG, AT-AC} are excluded when run with --no-nc. |
| [o:a] | ambiguous | Introns with ambiguous characters (e.g., 'N') in scoring regions cannot be properly scored and are excluded. |
| [o:i] | isoform | If run without -i, introns not present in the longest isoform are excluded from scoring. |
| [o:v] | overlap | Introns with overlapping coordinates (when -v is used) are excluded. |
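The sequence-based criteria in the table (short, non-canonical, ambiguous) can be sketched as a small classifier. This is a simplified illustration, not intronIC's implementation: the function name, the order of the checks, and checking the whole intron for ambiguous characters (rather than only the scoring regions) are assumptions made here.

```python
# Canonical terminal dinucleotide pairs, as listed in the table above.
CANONICAL = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def omission_tag(seq, min_len=30, exclude_noncanonical=False):
    """Return an [o:x] tag for an intron sequence, or None if it is kept.

    Simplification: ambiguity is checked over the whole sequence, whereas
    the documented behavior only considers the scoring regions.
    """
    seq = seq.upper()
    if len(seq) < min_len:
        return "[o:s]"
    # Mirrors the effect of --no-nc when exclude_noncanonical is True.
    if exclude_noncanonical and (seq[:2], seq[-2:]) not in CANONICAL:
        return "[o:n]"
    if set(seq) - set("ACGT"):
        return "[o:a]"
    return None


short_tag = omission_tag("GT" + "A" * 10 + "AG")
kept_tag = omission_tag("GT" + "A" * 36 + "AG")
nc_tag = omission_tag("CT" + "A" * 36 + "AG", exclude_noncanonical=True)
ambiguous_tag = omission_tag("GT" + "A" * 18 + "N" + "A" * 17 + "AG")
```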
Non-canonical introns with very strong U12-like 5′ motifs near their annotated start will have their start and stop coordinates corrected (by equal amounts) to reflect the more U12-like splicing boundaries. These introns are tagged with [c:x], where x is the relative coordinate shift applied (e.g., [c:3] means shifted 3 bp downstream).
Why this matters: Some genome annotations have imprecise splice site coordinates, particularly for U12-type introns which are rarer and may have been annotated based on U2-type expectations. By searching a small window around the annotated boundary for strong U12 motifs, intronIC can recover the likely true splice site.
The total number of corrected introns is summarized in log.iic. In the annotation.iic output file, the features defining such introns (e.g., CDS or exon) also have their coordinates adjusted to match the new intron boundaries. This correction can be disabled with --no-nc-ss-adjustment.
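As a rough illustration of what "corrected by equal amounts" means, the sketch below applies a [c:x] shift to both intron boundaries. The function name is hypothetical, and the treatment of the minus strand (a positive, downstream shift decreases genomic coordinates) is an assumption rather than documented intronIC behavior.

```python
def apply_shift(start, stop, strand, shift):
    """Shift both boundaries by the same amount, in transcript orientation.

    Assumption: a positive shift means downstream relative to the transcript,
    so on the minus strand it lowers the genomic coordinates.
    """
    offset = shift if strand == "+" else -shift
    return start + offset, stop + offset


plus_strand = apply_shift(100, 200, "+", 3)
minus_strand = apply_shift(100, 200, "-", 3)
```

Because both ends move together, the intron length is unchanged by the correction.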
| Tag | Description |
|---|---|
| [n] | Non-canonical terminal dinucleotides (not in {GT-AG, GC-AG, AT-AC}) |
| [i] | Not from longest transcript isoform |
| [d] | Duplicate coordinates |
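When post-processing output files, the tags described on this page can be pulled out of an intron label with a regular expression. This is a minimal sketch: the label strings below are made up for illustration, and the exact label layout in bed.iic or introns.iic is not assumed beyond the bracketed tags themselves.

```python
import re

# Matches [o:x] and [c:x] tags, plus the single-letter [n], [i], [d] tags.
TAG_PATTERN = re.compile(r"\[(?:(o|c):([^\]\s]+)|([nid]))\]")


def parse_tags(label):
    """Collect omission, correction, and flag tags from an intron label."""
    tags = {"omitted": None, "corrected": None, "flags": set()}
    for kind, value, flag in TAG_PATTERN.findall(label):
        if kind == "o":
            tags["omitted"] = value          # e.g. "s", "n", "a", "i", "v"
        elif kind == "c":
            tags["corrected"] = int(value)   # relative coordinate shift
        elif flag:
            tags["flags"].add(flag)
    return tags


# Hypothetical labels, for illustration only.
omitted_example = parse_tags("geneA-intron_2[o:s][d]")
corrected_example = parse_tags("geneB-intron_5[c:3][n]")
```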
The log.iic file provides a summary of all filtering applied.
This allows you to quickly assess data quality and understand why certain introns were not scored.