Evolutionary constraint analysis of alternative translation initiation site (aTIS) regions using gnomAD variants.
This analysis compares constraint metrics between:
- aTIS regions: 5'UTR extensions and N-terminal truncations
- Canonical CDS: Standard protein-coding sequences
Following the approach from Whiffin et al. 2024 (Genome Biology), we use:
- Simple paired comparisons (aTIS vs canonical within same gene)
- LOEUF stratification to test correlation with gene-level constraint
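As a rough illustration of what LOEUF stratification could look like, the sketch below bins features into LOEUF tertiles and reports the median delta O/E per bin. The record keys (`loeuf`, `delta_oe`), the number of bins, and the sample values are assumptions made for this example; they are not taken from the pipeline or from Whiffin et al.

```python
import statistics
from bisect import bisect_right


def stratify_by_loeuf(records, n_bins=3):
    """Group delta O/E values into LOEUF quantile bins and return the median per bin.

    `records` is a list of dicts with hypothetical keys "loeuf" and "delta_oe".
    Bin 0 holds the lowest LOEUF values (most constrained genes).
    """
    loeufs = [r["loeuf"] for r in records]
    # statistics.quantiles returns n_bins - 1 cut points
    cuts = statistics.quantiles(loeufs, n=n_bins)
    bins = {i: [] for i in range(n_bins)}
    for r in records:
        bins[bisect_right(cuts, r["loeuf"])].append(r["delta_oe"])
    return {i: (statistics.median(v) if v else None) for i, v in bins.items()}


# Made-up records, for illustration only
records = [
    {"loeuf": 0.2, "delta_oe": 0.1},
    {"loeuf": 0.3, "delta_oe": 0.3},
    {"loeuf": 0.5, "delta_oe": 0.4},
    {"loeuf": 0.8, "delta_oe": 0.6},
    {"loeuf": 1.2, "delta_oe": 0.7},
    {"loeuf": 1.6, "delta_oe": 0.9},
]
medians = stratify_by_loeuf(records)
print(medians)
```

A trend in the per-bin medians would then be the thing to test formally; this sketch only summarizes.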
Source: SwissIsoform MANE isoform results
- 4,757 aTIS features (2,963 extensions + 1,794 truncations)
- gnomAD v4.1 variant counts and constraint metrics (GRCh38)
Only essential columns retained:
- Identifiers: gene_name, transcript_id, feature_id, feature_type
- Feature info: feature_start, feature_end, feature_length_aa
- gnomAD variant counts: missense, synonymous, nonsense, frameshift
See METHODS.md for detailed methodology.
Constraint metrics calculated for both aTIS and canonical regions:
- O/E Ratios (Observed/Expected)
  - Missense, Synonymous, LoF O/E
  - aTIS expected counts scaled from gnomAD canonical using length ratio:
    exp_aTIS = exp_canonical × (AA_aTIS / AA_canonical)
  - O/E < 1 = constrained (fewer variants than expected)
  - O/E > 1 = tolerant (more variants than expected)
- Variant Densities (variants per amino acid)
  - Missense, Synonymous, LoF densities
- Paired Comparisons (within-gene)
  - Delta: aTIS - canonical
  - Ratio: aTIS / canonical
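The metric definitions above can be sketched in plain Python as follows. This is a minimal illustration, not the code in src/02_calculate_metrics.py: the function names, argument names, and input numbers are all hypothetical.

```python
def scaled_expected(exp_canonical, aa_atis, aa_canonical):
    """exp_aTIS = exp_canonical * (AA_aTIS / AA_canonical)."""
    if aa_canonical <= 0:
        raise ValueError("canonical length must be positive")
    return exp_canonical * (aa_atis / aa_canonical)


def oe_ratio(observed, expected):
    """Observed/expected ratio; NaN when the expected count is not positive."""
    return observed / expected if expected > 0 else float("nan")


def density(observed, length_aa):
    """Variants per amino acid; NaN when the length is not positive."""
    return observed / length_aa if length_aa > 0 else float("nan")


def paired_metrics(obs_atis, aa_atis, obs_canonical, aa_canonical, exp_canonical):
    """Compute O/E, density, and paired delta/ratio for one aTIS feature."""
    exp_atis = scaled_expected(exp_canonical, aa_atis, aa_canonical)
    oe_atis = oe_ratio(obs_atis, exp_atis)
    oe_canonical = oe_ratio(obs_canonical, exp_canonical)
    return {
        "exp_atis": exp_atis,
        "oe_atis": oe_atis,
        "oe_canonical": oe_canonical,
        "density_atis": density(obs_atis, aa_atis),
        "density_canonical": density(obs_canonical, aa_canonical),
        "delta_oe": oe_atis - oe_canonical,  # Delta: aTIS - canonical
        "ratio_oe": oe_atis / oe_canonical if oe_canonical > 0 else float("nan"),
    }


# Made-up counts for one variant class (e.g. missense), for illustration only
metrics = paired_metrics(
    obs_atis=12, aa_atis=30,
    obs_canonical=120, aa_canonical=400,
    exp_canonical=150.0,
)
print(metrics)
```

In this made-up case the aTIS region has O/E above 1 while the canonical CDS has O/E below 1, so the delta is positive (aTIS less constrained).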
- Transcript-based merge: SwissIsoform features matched to gnomAD v4.1 constraint metrics via Ensembl transcript IDs (version-stripped)
- Quality filters: Requires gnomAD match, LOEUF score, valid lengths, complete constraint metrics
- Result: 4,757 features from 2,593 genes across 2,950 transcripts with paired aTIS/canonical metrics
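The merge and filter logic described above might look roughly like the sketch below. It assumes in-memory lists of dicts rather than the actual files; the gnomAD-side keys `transcript` and `loeuf`, the helper names, and the sample transcript IDs are hypothetical, while `transcript_id`, `feature_id`, and `feature_length_aa` follow the column list given earlier.

```python
def strip_version(transcript_id):
    """Drop the Ensembl version suffix: 'ENST00000000001.5' -> 'ENST00000000001'."""
    return transcript_id.split(".", 1)[0]


def merge_features(features, constraint_rows):
    """Attach gnomAD constraint rows to features by version-stripped transcript ID.

    Features are dropped when there is no gnomAD match, no LOEUF score,
    or a non-positive feature length.
    """
    by_transcript = {strip_version(r["transcript"]): r for r in constraint_rows}
    merged = []
    for feature in features:
        match = by_transcript.get(strip_version(feature["transcript_id"]))
        if match is None:
            continue  # no gnomAD match
        if match.get("loeuf") is None:
            continue  # missing LOEUF score
        if feature["feature_length_aa"] <= 0:
            continue  # invalid length
        merged.append({**feature, "loeuf": match["loeuf"]})
    return merged


# Made-up inputs, for illustration only
features = [
    {"transcript_id": "ENST00000000001.5", "feature_id": "f1", "feature_length_aa": 25},
    {"transcript_id": "ENST00000000002.1", "feature_id": "f2", "feature_length_aa": 10},
    {"transcript_id": "ENST00000000003.2", "feature_id": "f3", "feature_length_aa": 40},
]
constraint_rows = [
    {"transcript": "ENST00000000001", "loeuf": 0.35},
    {"transcript": "ENST00000000003", "loeuf": None},
]
merged = merge_features(features, constraint_rows)
print([m["feature_id"] for m in merged])
```

Here `f2` is dropped for lacking a gnomAD match and `f3` for lacking a LOEUF score, which leaves only `f1`.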
# 1. Prepare data (download gnomAD v4.1, merge with SwissIsoform features)
python src/01_prepare_data.py
# Output: data/merged_features.csv (4,757 features, 29 columns)
# 2. Calculate constraint metrics (O/E ratios, densities, paired comparisons)
python src/02_calculate_metrics.py
# Output: data/features_with_metrics.csv (4,757 features with all metrics)
# 3. Visualize distributions (exploratory plots)
python src/03_plot_metrics.py
# Output: results/figures/{all,extensions,truncations}/
Status: Exploratory analysis complete. Statistical testing (step 04) pending.
From visual inspection of distributions (results/figures/):
- aTIS regions show higher O/E ratios than canonical CDS
  - aTIS O/E distributions shifted right (higher O/E = less constrained)
  - Most aTIS regions have O/E > 1 (more variants than expected)
- Paired comparisons show consistent patterns
  - Delta O/E (aTIS - canonical) is positive for most features, so the median delta is above 0 (aTIS less constrained)
  - Small fraction (~10-15%) show negative delta (aTIS more constrained than canonical)
- Extensions vs Truncations
  - Both feature types show similar constraint patterns
  - Separate plots enable visual comparison
Note: These are descriptive observations. Formal statistical testing (paired tests, LOEUF stratification) pending in step 04.
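One simple option for a paired test is an exact two-sided sign test on the per-feature deltas, sketched below with only the standard library. This is an assumption about how step 04 could be approached, not the planned implementation; the function name and the delta values are made up for illustration.

```python
from math import comb


def sign_test(deltas):
    """Exact two-sided sign test of H0: positive and negative deltas are equally likely.

    Zero deltas (ties) are dropped. Returns (n_positive, n_negative, p_value).
    """
    pos = sum(d > 0 for d in deltas)
    neg = sum(d < 0 for d in deltas)
    n = pos + neg
    if n == 0:
        return pos, neg, 1.0
    k = min(pos, neg)
    # Binomial(n, 0.5) probability of k or fewer in the smaller group
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return pos, neg, min(1.0, 2 * tail)


# Made-up delta O/E values (aTIS - canonical), for illustration only
deltas = [0.3, 0.1, 0.5, -0.2, 0.4, 0.2, 0.6, 0.0, 0.7, 0.1]
n_pos, n_neg, p_value = sign_test(deltas)
print(n_pos, n_neg, p_value)
```

The sign test uses only the direction of each delta. A Wilcoxon signed-rank test would also use magnitudes, at the cost of a symmetry assumption on the deltas.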
data/
├── gnomad/
│   └── gnomad.v4.1.constraint_metrics.tsv  # Auto-downloaded (92MB)
├── merged_features.csv                     # SwissIsoform + gnomAD v4.1 merged (4,757 features, 29 columns)
└── features_with_metrics.csv               # All calculated constraint metrics (4,757 features)
results/figures/
├── all/                                    # Plots for all 4,757 features
│   ├── 01_oe_distributions.png
│   ├── 02_density_distributions.png
│   ├── 03_oe_scatter.png
│   └── 04_delta_distributions.png
├── extensions/                             # Plots for 2,963 extensions
│   └── [same 4 plots]
└── truncations/                            # Plots for 1,794 truncations
    └── [same 4 plots]
- Methodology: Inspired by Whiffin et al. 2024. "Differences in 5'untranslated regions highlight the importance of translational regulation of dosage sensitive genes." Genome Biology 25:111. DOI: 10.1186/s13059-024-03248-0
- Data sources:
  - Chen S, Francioli LC, Goodrich JK, et al. (2024) A genome-wide mutational constraint map quantified from variation in 76,156 human genomes. bioRxiv. [gnomAD v4.1, GRCh38]
  - SwissIsoform MANE alternative isoform database
- Detailed methods: See METHODS.md