Scripts to generate a consensus inserted sequence for insertions called by Savana.
The consensus is based on a multiple alignment of the FASTA file produced by Savana for insertion calls. The script preprocess_insert_sequence_fa.sh takes as input the .inserted_sequences_savana.fa file generated by Savana and splits it per variant; each per-variant file is then run through consensus_insert_sequence.py. The script run_consensus_test.array.sh runs the Python script over all variants of one patient on the Crick cluster.
- SeqKit grep for pre-processing of the inserted sequences file
- Mafft for multiple alignment
- Python libraries (the file sequence_ins_env.yml has all required libraries in a conda environment):
  - Bio
  - io
  - pathlib
  - argparse
The scripts are written assuming they will be run on the Crick HPC. First, run the pre-processing script to divide the Savana FASTA by SV ID. The script takes two inputs: the FASTA file with all inserted sequences from Savana, and an output directory:
./preprocess_insert_sequence_fa.sh SAMPLE_ID.inserted_sequences.fa tmp_dir
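For readers who want to see what the per-SV split amounts to, here is a minimal Python sketch of the same idea. It is not the code in preprocess_insert_sequence_fa.sh (which uses SeqKit grep). The function names are hypothetical, and the rule for extracting the SV ID from a FASTA header (the text before the first "|") is an assumption made for illustration; adjust `sv_id_from_header` to match the real Savana header layout.

```python
from collections import defaultdict
from pathlib import Path


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sv_id_from_header(header):
    # ASSUMPTION: the SV ID is the text before the first "|" in the header.
    # This is a placeholder rule, not the documented Savana format.
    return header.split("|", 1)[0]


def split_by_sv(fasta_path, out_dir):
    """Write one FASTA file per SV ID and return {sv_id: record_count}."""
    groups = defaultdict(list)
    with open(fasta_path) as handle:
        for header, seq in read_fasta(handle):
            groups[sv_id_from_header(header)].append((header, seq))

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for sv_id, records in groups.items():
        with open(out_dir / f"{sv_id}.fa", "w") as out:
            for header, seq in records:
                out.write(f">{header}\n{seq}\n")
    return {sv_id: len(records) for sv_id, records in groups.items()}
```

Grouping in memory keeps the sketch short; for very large inputs a streaming approach (or SeqKit, as the actual script uses) would be more appropriate.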
Then run the consensus script as an array job, in which each task processes multiple sequences in a loop. This script takes two inputs: the tmp dir containing the per-SV FASTA files generated in step 1, and an output directory where the resulting consensus insert sequences will be saved:
./run_consensus_test.array.sh tmp_dir out_dir
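To illustrate how a consensus can be derived from a multiple alignment, the sketch below takes already-aligned sequences (equal length, with "-" for gaps, as an aligner such as MAFFT would produce) and applies a simple per-column majority vote. This is an assumption about one reasonable approach, not a description of what consensus_insert_sequence.py actually does; the function name `majority_consensus` is hypothetical.

```python
from collections import Counter


def majority_consensus(aligned_seqs, gap="-"):
    """Return a majority-vote consensus of equal-length aligned sequences."""
    if not aligned_seqs:
        raise ValueError("no sequences given")
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("aligned sequences must have equal length")

    consensus = []
    for column in zip(*aligned_seqs):
        # Upper-case so mixed-case input is counted uniformly.
        counts = Counter(base.upper() for base in column)
        # On ties, Counter.most_common keeps the symbol seen first.
        base, _ = counts.most_common(1)[0]
        # Columns where the gap wins are dropped from the consensus.
        if base != gap:
            consensus.append(base)
    return "".join(consensus)


print(majority_consensus(["ACGT-A", "ACGTTA", "ACCT-A"]))  # ACGTA
```

In the example, the third column resolves to G (two votes against one) and the fifth column is dropped because the gap is the majority symbol. A production version might instead weight by read quality or emit ambiguity codes for low-agreement columns.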