MORFAL is a Python-based command-line pipeline designed to analyze genetic variants located in the 5′ untranslated region (5′UTR) and within 200 bp downstream of the canonical start codon (AUG).
MORFAL generates wild-type and mutated sequences for genes carrying a variant around the canonical start codon (AUG) using SAMtools, scores all AUG codons with NetStart, computes a Δ-score between the wild-type and mutated sequence scores, annotates variants with UTRAnnotator, and produces a final table summarizing all results. The pipeline is driven by a Python script and command-line tools, and uses Singularity containers to ensure reproducibility.
- Extraction of sequences (WT and mutated) for variants in the 5′UTR and near the start codon
- Automatic handling of strand orientation and reverse complement sequences
- Scoring of canonical and alternative ATGs using NetStart
- Δ-score computation for evaluating variant impact
- Annotation of UTR variants using UTRAnnotator
- Optional integration of MORFEE database
- Final tabulated output summarizing all variant-level information
- Python 3.8
- SAMtools 1.14
- Singularity 3.5.3
- MORFAL container with NetStart, UTRAnnotator and the MORFAL script
- UTRAnnotator directory with the UTRAnnotator annotation file (uORF_5UTR_GRChxx_PUBLIC.txt) and the VEP cache files
- (Optional) MORFEE database and MORFEE index
- Reference genome (FASTA)
- GENCODE GTF file: v19 for hg19, or v48 or later for hg38 (choose the "Comprehensive gene annotation" GTF)
The tools and the MORFAL script are contained within the quay.io container.
singularity pull docker://quay.io/crystal_renaud/morfal:latest
If you wish to use MORFEE, first create the index file with create_index_MORFEE.py, and create a folder to store the indexes. The MORFEEdb file and the index file must then be split using the following commands:
awk -F'\t' '{file=$1".txt"; print >> file }' path/to/MORFEEdb.txt
awk -F'\t' '{file=$1".txt.tai"; print >> file }' path/to/MORFEEdb_ind.txt.tai
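Where awk is not available, the same split can be approximated in Python. The following is a minimal sketch, not part of MORFAL: the function name `split_by_first_column` and its `out_dir` and `suffix` parameters are hypothetical, and it assumes the file is tab-delimited with the grouping key in the first column, as the awk commands imply. Unlike awk, which writes into the current directory, it writes into an explicit output folder.

```python
import os
from typing import Dict, TextIO


def split_by_first_column(src_path: str, out_dir: str, suffix: str = ".txt") -> Dict[str, int]:
    """Append each line of src_path to out_dir/<first column><suffix>.

    Returns a mapping {output file name: number of lines written}.
    """
    os.makedirs(out_dir, exist_ok=True)
    handles: Dict[str, TextIO] = {}
    counts: Dict[str, int] = {}
    try:
        with open(src_path, "r", encoding="utf-8") as src:
            for line in src:
                # The key is everything before the first tab.
                key = line.rstrip("\n").split("\t", 1)[0]
                name = key + suffix
                if name not in handles:
                    # Append mode mirrors awk's `print >> file`.
                    handles[name] = open(os.path.join(out_dir, name), "a", encoding="utf-8")
                    counts[name] = 0
                handles[name].write(line if line.endswith("\n") else line + "\n")
                counts[name] += 1
    finally:
        for handle in handles.values():
            handle.close()
    return counts
```

Calling it once with `suffix=".txt"` on the database file and once with `suffix=".txt.tai"` on the index file would reproduce the two awk commands. Because files are opened in append mode, rerunning it on the same folder duplicates lines, exactly as the awk version does.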
| Argument | Description |
|---|---|
| --gtf | Path to GTF annotation file |
| --vcf | Path to input VCF file |
| -gen_fa, --genome_fasta | Path to reference genome FASTA |
| -vep_cache, --vep_cache | Path to vep directory |
| -vep_file, --UTRannotator_file | Path to UTRAnnotator file |
| --genome | Genome assembly (GRCh37 or GRCh38) |
| -MORFEE, --MORFEEdb (optional) | Path to MORFEE database. Default: None |
| -MORFEE_ind, --MORFEEdb_index (optional) | Path to MORFEE database index. Default: None |
| -O, --outdir | Output directory path |
If the MORFEE database path and the MORFEE database index path are not provided, the database-related step is automatically skipped.
1. GTF Processing and Sequence Extraction
- The GTF is loaded into a Python dictionary.
- The VCF is parsed, and each variant located within the 5′UTR or within 200 bp downstream of the start codon is selected.
- For each selected variant:
  - The FASTA sequence is extracted using SAMtools.
  - The reverse complement is generated if needed.
  - Wild-type (WT) and mutated sequences are written into a single FASTA file: seq2netstart.fasta
- Initial variant metadata is stored in final_table.
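The selection and sequence-building logic of this step can be illustrated with a short Python sketch. This is not MORFAL's actual code: all function names are hypothetical, coordinates are assumed to be 1-based and inclusive, the 5′UTR is simplified to a single interval, `atg_pos` is assumed to be the genomic position of the A of the start codon, and the 200 bp window is assumed to be counted from that position.

```python
# Illustrative helpers only: names, signatures, and coordinate conventions
# are assumptions, not MORFAL's actual implementation.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_selected(pos: int, utr_start: int, utr_end: int, atg_pos: int,
                strand: str, downstream: int = 200) -> bool:
    """True if pos lies in the 5'UTR or within `downstream` bp after the ATG."""
    if strand not in ("+", "-"):
        raise ValueError(f"unknown strand: {strand!r}")
    if utr_start <= pos <= utr_end:
        return True
    if strand == "+":
        return atg_pos <= pos <= atg_pos + downstream
    # On the minus strand, "downstream" means lower genomic coordinates.
    return atg_pos - downstream <= pos <= atg_pos


def apply_variant(seq: str, seq_start: int, pos: int, ref: str, alt: str) -> str:
    """Replace ref by alt in a forward-strand sequence starting at seq_start."""
    offset = pos - seq_start
    if offset < 0 or offset + len(ref) > len(seq):
        raise ValueError("variant lies outside the extracted sequence")
    if seq[offset:offset + len(ref)].upper() != ref.upper():
        raise ValueError("REF allele does not match the reference sequence")
    return seq[:offset] + alt + seq[offset + len(ref):]


def wt_and_mutated(seq: str, seq_start: int, pos: int, ref: str, alt: str,
                   strand: str) -> tuple:
    """Return (WT, mutated) sequences oriented along the transcript."""
    mutated = apply_variant(seq, seq_start, pos, ref, alt)
    if strand == "-":
        return reverse_complement(seq), reverse_complement(mutated)
    return seq, mutated
```

Applying the variant on the forward strand first and reverse-complementing afterwards keeps the VCF alleles usable as written, since VCF REF and ALT are always given on the forward strand.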
2. NetStart Analysis and Δ-score Computation
- seq2netstart.fasta is processed by NetStart:
/usr/bin/tcsh /opt/netstart-1.0c/netstart -vert fasta/path
- NetStart assigns a score to each ATG codon.
- Δ-scores between WT and mutated sequences are computed.
- Minimum, maximum, and canonical ATG Δ-scores are stored in final_table.
- All ATGs with a NetStart score > 0.5 are written to positive_prediction.txt.
- All Δ-scores different from zero are written to delta.txt.
- The variants in the selected region are written to UTR_FirstExons.vcf.
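The Δ-score bookkeeping described in this step can be sketched as follows. This is an illustration under stated assumptions, not MORFAL's implementation: NetStart scores are assumed to be already parsed into dictionaries mapping an ATG position in the sequence to its score, Δ is assumed to be mutated minus WT, an ATG present in only one of the two sequences is assumed to score 0.0 in the other, and only substitutions are covered (indels shift positions and would need realignment). All names are hypothetical.

```python
from typing import Dict, NamedTuple, Optional


class DeltaSummary(NamedTuple):
    deltas: Dict[int, float]      # ATG position -> delta-score
    minimum: float
    maximum: float
    canonical: Optional[float]    # delta at the canonical ATG, if known


def delta_scores(wt: Dict[int, float], mut: Dict[int, float],
                 canonical_pos: Optional[int] = None) -> DeltaSummary:
    """Compute per-ATG delta-scores (mutated minus WT) and summary values."""
    positions = sorted(set(wt) | set(mut))
    if not positions:
        raise ValueError("no ATG scores to compare")
    # Rounding to 3 decimals avoids float noise such as 0.15000000000000002.
    deltas = {p: round(mut.get(p, 0.0) - wt.get(p, 0.0), 3) for p in positions}
    canonical = deltas.get(canonical_pos) if canonical_pos is not None else None
    return DeltaSummary(deltas, min(deltas.values()), max(deltas.values()), canonical)


def nonzero_deltas(deltas: Dict[int, float]) -> Dict[int, float]:
    """Keep only the delta-scores different from zero."""
    return {p: d for p, d in deltas.items() if d != 0}


def positive_predictions(scores: Dict[int, float], threshold: float = 0.5) -> Dict[int, float]:
    """Keep ATGs whose NetStart score is strictly above the threshold."""
    return {p: s for p, s in scores.items() if s > threshold}
```

For example, with `wt = {10: 0.40, 101: 0.80}` and `mut = {10: 0.55, 57: 0.30, 101: 0.80}`, the ATG at position 57 exists only in the mutated sequence, so under the assumption above its Δ-score equals its mutated score.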
3. UTR Variant Annotation (UTRAnnotator)
- UTR_FirstExons.vcf is used as the input file.
- UTRAnnotator is run via Singularity:
/opt/vep/src/ensembl-vep/vep --fork 4 --dir path/to/UTRAnnotator_directory --fasta genome_fasta --cache --offline \
  --merged --format vcf --tab --force_overwrite --af_gnomade --assembly genome --use_transcript_ref \
  --plugin UTRAnnotator,file=path/to/UTRAnnotator_file -i path/to/outdir/UTR_FirstExons.vcf -o path/to/outdir/UTR.annotated.vcf
- The annotated output (UTR.annotated.vcf) is parsed and relevant information is added to final_table.
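Because VEP is called with `--tab`, the annotated file is tab-delimited: metadata lines start with `##`, and a single header line starting with `#` names the columns. A minimal sketch of how such a file could be read is shown below. The function name is hypothetical, and which columns MORFAL actually keeps is not described here, so the sketch simply returns every column.

```python
from typing import Dict, Iterable, List


def parse_vep_tab(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse VEP --tab output into one dictionary per annotation row."""
    header = None
    rows: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and metadata lines
        if line.startswith("#"):
            # Column header line, e.g. "#Uploaded_variation<TAB>Location<TAB>..."
            header = line[1:].split("\t")
            continue
        if header is None:
            raise ValueError("data row found before the column header")
        rows.append(dict(zip(header, line.split("\t"))))
    return rows
```

A typical call would be `with open("path/to/outdir/UTR.annotated.vcf") as fh: rows = parse_vep_tab(fh)`, after which each row can be matched back to its variant.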
4. Final Output
All collected information is merged into a tab-delimited summary file containing:
- variant metadata
- NetStart scores and Δ-scores
- UTRAnnotator annotations
- optional MORFEE database hits
If both -MORFEE (--MORFEEdb) and -MORFEE_ind (--MORFEEdb_index) are provided:
- MORFAL queries the MORFEE database
- annotations are appended to the final output

If they are not provided, this step is skipped automatically.
Example invocation:
cd path/to/morfal_directory
singularity exec morfal_latest.sif \
morfal \
-gtf path/to/gencode.v19.annotation.gtf \
-vcf path/to/variants.vcf \
-gen_fa path/to/genome.fa \
-vep_cache path/to/vep_directory \
-vep_file path/to/UTRAnnotator_file \
-genome GRCh37 \
-MORFEE path/to/directory/morfee.db \
-MORFEE_ind path/to/directory/morfee.index \
-O path/to/results/
To run MORFAL without the MORFEE database:
cd path/to/morfal_directory
singularity exec morfal_latest.sif \
morfal \
-gtf path/to/gencodeV29.gtf \
-vcf path/to/variants.vcf \
-gen_fa path/to/genome.fa \
-vep_cache path/to/vep_directory \
-vep_file path/to/UTRAnnotator_file \
-genome GRCh38 \
-O path/to/results/
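When MORFAL is launched from another script, the two invocations above can be assembled programmatically. The sketch below only builds the argument list and does not run anything. The function `build_morfal_command` and its parameter names are hypothetical, the flags are copied from the example invocations, and the rule that the two MORFEE paths must be given together is an assumption that is stricter than the documented behavior.

```python
from typing import List, Optional


def build_morfal_command(sif: str, gtf: str, vcf: str, genome_fasta: str,
                         vep_cache: str, utrannotator_file: str, genome: str,
                         outdir: str, morfee_db: Optional[str] = None,
                         morfee_index: Optional[str] = None) -> List[str]:
    """Build the `singularity exec ... morfal ...` argument list."""
    if genome not in ("GRCh37", "GRCh38"):
        raise ValueError("genome must be GRCh37 or GRCh38")
    # Assumption: a database without its index (or the reverse) is a mistake.
    if (morfee_db is None) != (morfee_index is None):
        raise ValueError("provide both MORFEE paths or neither")
    cmd = [
        "singularity", "exec", sif, "morfal",
        "-gtf", gtf,
        "-vcf", vcf,
        "-gen_fa", genome_fasta,
        "-vep_cache", vep_cache,
        "-vep_file", utrannotator_file,
        "-genome", genome,
    ]
    if morfee_db is not None and morfee_index is not None:
        # Optional MORFEE step: omitted entirely when no paths are given.
        cmd += ["-MORFEE", morfee_db, "-MORFEE_ind", morfee_index]
    cmd += ["-O", outdir]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Passing a list rather than a shell string avoids quoting problems with paths that contain spaces.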
| File | Description |
|---|---|
| seq2netstart.fasta | WT and mutated sequences used for NetStart |
| prediction_positive.txt | NetStart scores > 0.5 |
| delta.txt | All Δ-scores different from zero |
| UTR_FirstExons.vcf | Extracted UTR and near-start-codon variants |
| UTR.annotated.vcf | Output VCF from UTRAnnotator |
| UTR.annotated.vcf_summary.html | Output HTML from UTRAnnotator |
| output_annotation.txt | Tab-delimited table with all variant annotations |
