LBGC-CFB/MORFAL

MORFAL - Molecular Open Reading Frame AnaLyser


MORFAL is a Python-based command-line pipeline designed to analyze genetic variants located in the 5′ untranslated region (5′UTR) and within 200 bp downstream of the canonical start codon (AUG).

📌 Overview

MORFAL uses SAMtools to generate wild-type and mutated sequences for genes carrying a variant around the canonical start codon (AUG), scores every AUG codon with NetStart, and computes a Δ-score between the wild-type and mutated sequence scores. It then annotates the variants with UTRAnnotator and produces a final table summarizing all results. The pipeline consists of a Python script and command-line tools, and runs in Singularity containers to ensure reproducibility.

✨ Features

  • Extraction of sequences (WT and mutated) for variants in the 5′UTR and near the start codon
  • Automatic handling of strand orientation and reverse-complement sequences
  • Scoring of canonical and alternative ATGs using NetStart
  • Δ-score computation for evaluating variant impact
  • Annotation of UTR variants using UTRAnnotator
  • Optional integration of MORFEE database
  • Final tabulated output summarizing all variant-level information

⚙️ Dependencies

  • Python 3.8
  • SAMtools 1.14
  • Singularity 3.5.3
  • MORFAL container with NetStart, UTRAnnotator and the MORFAL script
  • UTRAnnotator directory containing the UTRAnnotator annotation file (uORF_5UTR_GRChxx_PUBLIC.txt) and the VEP cache files
  • (Optional) MORFEE database and MORFEE index
  • Reference genome (FASTA)
  • GENCODE GTF file: v19 for hg19, or v48 or later for hg38 (choose the Comprehensive gene annotation file in GTF format)

🚀 Installation

The tools and the MORFAL script are packaged in a container hosted on quay.io:

singularity pull docker://quay.io/crystal_renaud/morfal:latest

To use MORFEE, first create the index file with create_index_MORFEE.py, then create a folder to store the indexes. Both the MORFEEdb file and the index file must be split by the value of their first column using the following commands:

awk -F'\t' '{file=$1".txt"; print >> file }' path/to/MORFEEdb.txt
awk -F'\t' '{file=$1".txt.tai"; print >> file }' path/to/MORFEEdb_ind.txt.tai
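
For readers who prefer to stay in Python, the split above can be approximated as follows. This is a minimal sketch, not part of MORFAL: the function name `split_by_first_column` and its parameters are illustrative, and it only assumes what the awk commands already imply (tab-separated lines, one output file per distinct first-column value).

```python
from pathlib import Path


def split_by_first_column(src, outdir, suffix=".txt"):
    """Append each line of `src` to `<outdir>/<first column><suffix>`.

    Hypothetical helper mirroring the awk one-liners; not MORFAL code.
    """
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    handles = {}
    try:
        with open(src) as fh:
            for line in fh:
                # The key is everything before the first tab.
                key = line.rstrip("\n").split("\t", 1)[0]
                if key not in handles:
                    # Append mode, like `>>` in awk.
                    handles[key] = open(outdir / f"{key}{suffix}", "a")
                handles[key].write(line)
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)
```

Calling it once with `suffix=".txt"` on the database and once with `suffix=".txt.tai"` on the index would mirror the two commands.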

📥 Required Input Arguments

Argument                          Description
--gtf                             Path to GTF annotation file
--vcf                             Path to input VCF file
-gen_fa, --genome_fasta           Path to reference genome FASTA
-vep_cache, --vep_cache           Path to VEP directory
-vep_file, --UTRannotator_file    Path to UTRAnnotator file
--genome                          Genome assembly (GRCh37 or GRCh38)
-MORFEE, --MORFEEdb               (Optional) Path to MORFEE database. Default: None
-MORFEE_ind, --MORFEEdb_index     (Optional) Path to MORFEE database index. Default: None
-O, --outdir                      Output directory path

If the MORFEE database path and the MORFEE database index path are not provided, the database-related step is skipped automatically.
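
As an illustration of how optional arguments can gate a step, here is a hedged `argparse` sketch built from the table above. It is not MORFAL's actual parser: the program name `morfal-sketch`, the helper `morfee_enabled`, and the choice to mark the other arguments as required are assumptions made for the example.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the argument table; not MORFAL's code."""
    p = argparse.ArgumentParser(prog="morfal-sketch")
    p.add_argument("--gtf", required=True, help="GTF annotation file")
    p.add_argument("--vcf", required=True, help="input VCF file")
    p.add_argument("-gen_fa", "--genome_fasta", required=True)
    p.add_argument("-vep_cache", "--vep_cache", required=True)
    p.add_argument("-vep_file", "--UTRannotator_file", required=True)
    p.add_argument("--genome", required=True, choices=["GRCh37", "GRCh38"])
    # Both MORFEE arguments default to None, as stated in the table.
    p.add_argument("-MORFEE", "--MORFEEdb", default=None)
    p.add_argument("-MORFEE_ind", "--MORFEEdb_index", default=None)
    p.add_argument("-O", "--outdir", required=True)
    return p


def morfee_enabled(args):
    """Run the MORFEE step only when database and index are both given."""
    return args.MORFEEdb is not None and args.MORFEEdb_index is not None
```

With this layout, a run that omits either MORFEE path makes `morfee_enabled(args)` return `False`, so the database step can be skipped without any extra flag.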

🧬 Workflow Summary

1. GTF Processing and Sequence Extraction

  • GTF is loaded into a Python dictionary.
  • VCF is parsed and each variant located within the 5′UTR or the 200 bp downstream of the start codon is selected.
  • For each selected variant:
    • FASTA sequence is extracted using SAMtools.
    • Reverse complement is generated if needed.
    • Wild-type (WT) and mutated sequences are written into a single FASTA file: seq2netstart.fasta
  • Initial variant metadata is stored in final_table.
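
The sequence-handling part of this step can be sketched as below. This is an illustrative outline only: the README does not show MORFAL's internal code, so the function names, the 1-based forward-strand coordinate convention, and the reference-allele check are all assumptions.

```python
# Translation table for complementing DNA bases (case preserved).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def apply_variant(seq, seq_start, pos, ref, alt):
    """Replace `ref` with `alt` at genomic position `pos`.

    `seq` is assumed to be the forward-strand sequence whose first base
    sits at genomic position `seq_start` (1-based, as in VCF).
    """
    offset = pos - seq_start
    if offset < 0 or offset + len(ref) > len(seq):
        raise ValueError("variant lies outside the extracted sequence")
    if seq[offset:offset + len(ref)].upper() != ref.upper():
        raise ValueError("REF allele does not match the reference sequence")
    return seq[:offset] + alt + seq[offset + len(ref):]


def wt_and_mutated(seq, seq_start, pos, ref, alt, strand):
    """Return (WT, mutated) sequences oriented along the transcript."""
    mutated = apply_variant(seq, seq_start, pos, ref, alt)
    if strand == "-":
        return reverse_complement(seq), reverse_complement(mutated)
    return seq, mutated
```

Applying the variant on the forward strand first and reverse-complementing afterwards keeps the VCF coordinates and alleles usable as they are, whatever the gene's strand.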

2. NetStart Analysis and Δ-score Computation

  • seq2netstart.fasta is processed by NetStart:
    /usr/bin/tcsh /opt/netstart-1.0c/netstart -vert fasta/path
  • NetStart assigns a score to each ATG codon.
  • Δ-scores between WT and mutated sequences are computed.
  • Minimum, maximum, and canonical ATG Δ-scores are stored in final_table.
  • All ATGs with a NetStart score > 0.5 are written to positive_prediction.txt.
  • All non-zero Δ-scores are written to delta.txt.
  • The variants in the selected region are written to UTR_FirstExons.vcf.
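
The Δ-score bookkeeping described above can be illustrated with a small sketch. Several points are assumptions, since the README does not specify them: Δ is taken as mutated minus WT, an ATG present in only one sequence is scored 0.0 in the other, and ATG positions are assumed to be directly comparable between the two sequences (which holds for substitutions but not for indels). All function names are hypothetical.

```python
def delta_scores(wt, mut):
    """Per-ATG delta between two {position: NetStart score} mappings.

    Assumption: delta = mutated - WT; a missing ATG counts as 0.0.
    """
    positions = sorted(set(wt) | set(mut))
    return {p: round(mut.get(p, 0.0) - wt.get(p, 0.0), 3) for p in positions}


def summarize(deltas, canonical_pos):
    """Minimum, maximum and canonical-ATG delta, as stored in final_table."""
    values = list(deltas.values())
    return {
        "min": min(values) if values else None,
        "max": max(values) if values else None,
        "canonical": deltas.get(canonical_pos),
    }


def positive_predictions(scores, threshold=0.5):
    """ATG positions whose score exceeds the threshold."""
    return {p: s for p, s in scores.items() if s > threshold}


def nonzero(deltas):
    """Deltas different from zero."""
    return {p: d for p, d in deltas.items() if d != 0}
```

For example, with WT scores `{10: 0.40, 55: 0.80}` and mutated scores `{10: 0.60, 55: 0.80, 70: 0.55}`, the ATG at position 70 exists only in the mutated sequence and therefore contributes its full score as a Δ under the stated assumption.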

3. UTR Variant Annotation (UTRAnnotator)

  • UTR_FirstExons.vcf is used as the input file.
  • UTRAnnotator is run via Singularity:
    /opt/vep/src/ensembl-vep/vep --fork 4 --dir path/to/UTRAnnotator_directory --fasta genome_fasta --cache --offline \
                    --merged --format vcf --tab --force_overwrite --af_gnomade --assembly genome --use_transcript_ref \
                    --plugin UTRAnnotator,file=path/to/UTRAnnotator_file -i path/to/outdir/UTR_FirstExons.vcf -o path/to/outdir/UTR.annotated.vcf
    
  • Annotated VCF output is parsed and relevant information is added to final_table.
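
Because the command uses `--tab`, the annotated file is a tab-delimited table rather than a true VCF. A minimal way to read it is sketched below, assuming the usual VEP tab layout in which metadata lines start with `##` and the column header is the first line starting with a single `#`. The function name `read_vep_tab` is illustrative, and the sketch makes no assumption about which UTRAnnotator columns are present.

```python
def read_vep_tab(path):
    """Read VEP --tab output into a list of {column: value} dicts.

    Assumes '##' metadata lines followed by a '#'-prefixed header line.
    """
    header = None
    rows = []
    with open(path) as fh:
        for line in fh:
            if line.startswith("##") or not line.strip():
                continue  # skip metadata and blank lines
            fields = line.rstrip("\n").split("\t")
            if header is None:
                # Header line, e.g. '#Uploaded_variation<TAB>Location...'
                header = [fields[0].lstrip("#")] + fields[1:]
                continue
            rows.append(dict(zip(header, fields)))
    return rows
```

Each returned row can then be matched back to its variant, for instance through the `Uploaded_variation` or `Location` column, before being merged into the final table.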

4. Final Output

All collected information is merged into a tab-delimited summary file containing:

  • variant metadata
  • NetStart scores and Δ-scores
  • UTRAnnotator annotations
  • optional MORFEE database hits

Optional: MORFEE Database Integration

If both -MORFEE and -MORFEE_ind are provided:

  • MORFAL queries the MORFEE database
  • annotations are appended to the final output

If they are not provided, this step is skipped automatically.

🚀 Basic Command Usage

Example invocation:

cd path/to/morfal_directory
singularity exec morfal_latest.sif \
   morfal \
   -gtf path/to/gencode.v19.annotation.gtf \
   -vcf path/to/variants.vcf \
   -gen_fa path/to/genome.fa \
   -vep_cache path/to/vep_directory \
   -vep_file path/to/UTRAnnotator_file \
   -genome GRCh37 \
   -MORFEE path/to/directory/morfee.db \
   -MORFEE_ind path/to/directory/morfee.index \
   -O path/to/results/

To run MORFAL without the MORFEE database:

cd path/to/morfal_directory
singularity exec morfal_latest.sif \
    morfal \
    -gtf path/to/gencodeV29.gtf \
    -vcf path/to/variants.vcf \
    -gen_fa path/to/genome.fa \
    -vep_cache path/to/vep_directory \
    -vep_file path/to/UTRAnnotatorfile \
    -genome GRCh38 \
    -O path/to/results/

📤 Output Files

File                              Description
seq2netstart.fasta                WT and mutated sequences used for NetStart
prediction_positive.txt           ATG codons with a NetStart score > 0.5
delta.txt                         All non-zero Δ-scores
UTR_FirstExons.vcf                Extracted UTR and near-start-codon variants
UTR.annotated.vcf                 Output file from UTRAnnotator
UTR.annotated.vcf_summary.html    HTML summary from UTRAnnotator
output_annotation.txt             Tab-delimited table with all variant annotations

📄 License
