5.2 RepeatMasker
RepeatMasker is a program designed to screen DNA sequences for interspersed repeats and low complexity DNA sequences. The output of the program provides a detailed annotation of the repeats present in the query sequence, along with a modified version of the query sequence in which all annotated repeats are masked (default: replaced by Ns). Sequence comparisons in RepeatMasker are performed using the cross_match program, an efficient implementation of the Smith-Waterman-Gotoh algorithm developed by Phil Green, or by WU-Blast developed by Warren Gish.
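The masking step described above can be illustrated with a minimal Python sketch. This is not RepeatMasker's own code: the function name `mask_repeats` and the representation of annotations as 1-based, inclusive `(start, end)` pairs are assumptions made for the example.

```python
def mask_repeats(sequence, intervals, mask_char="N"):
    """Return a copy of `sequence` with every annotated interval masked.

    `intervals` is an iterable of (start, end) pairs. They are treated as
    1-based and inclusive, which is an assumption of this sketch.
    """
    masked = list(sequence)
    for start, end in intervals:
        # Clamp the interval to the sequence bounds and convert to 0-based.
        lo = max(start, 1) - 1
        hi = min(end, len(sequence))
        for i in range(lo, hi):
            masked[i] = mask_char
    return "".join(masked)


# Mask positions 3 to 6 of a toy sequence.
masked = mask_repeats("ACGTACGTAC", [(3, 6)])
print(masked)
```

Passing a different `mask_char` (for example `"X"`) shows how an alternative masking character could be substituted in the same way.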
The first task is configuration: RepeatMasker verifies its own setup by loading and checking all binary paths found in the RepeatMaskerConfig.pm module. It initializes parameters such as the size of the fragments to be processed with the alignment tool, and loads the date used to create the output directory for each run.
The second task is to check the input sequence. The FASTADB.pm module loads and checks parameters such as the number of FASTA sequences, length, G and C nucleotide count, GC ratio, and more. It ensures the validity of the FASTA sequence, splitting sequences into short overlapping fragments with a length of 2000 bp.
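As a rough illustration of these input checks, the following Python sketch computes the length and GC statistics of a sequence and splits it into overlapping fragments. It is not a port of FASTADB.pm. The function names are hypothetical, and the 100 bp overlap is an assumed value chosen for the example, since the text above gives only the 2000 bp fragment length.

```python
def gc_stats(sequence):
    """Return length, G+C count, and GC ratio for one sequence."""
    seq = sequence.upper()
    gc_count = seq.count("G") + seq.count("C")
    gc_ratio = gc_count / len(seq) if seq else 0.0
    return {"length": len(seq), "gc_count": gc_count, "gc_ratio": gc_ratio}


def split_into_fragments(sequence, frag_len=2000, overlap=100):
    """Split `sequence` into overlapping fragments.

    Returns a list of (offset, fragment) pairs, where `offset` is the
    0-based start of the fragment in the original sequence.
    `overlap` is an assumed parameter, not a documented default.
    """
    if not 0 <= overlap < frag_len:
        raise ValueError("overlap must satisfy 0 <= overlap < frag_len")
    if not sequence:
        return []
    step = frag_len - overlap
    fragments = []
    start = 0
    while True:
        fragments.append((start, sequence[start:start + frag_len]))
        if start + frag_len >= len(sequence):
            break  # the last fragment reaches the end of the sequence
        start += step
    return fragments


stats = gc_stats("ACGTGGCC")
pieces = split_into_fragments("A" * 4500)
print(stats, [(offset, len(frag)) for offset, frag in pieces])
```

Keeping the offset alongside each fragment makes it possible to map hits found in a fragment back to coordinates in the full sequence.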
RepeatMasker checks the database file containing repeat elements (transposable elements and satellites). Users can either use the RepeatMasker library from the Genetic Information Research Institute or provide a custom library. The RepbaseEMBL.pm, RepbaseRecord.pm, and PubRef.pm modules extract information about repeats from the library.
RepeatMasker splits the sequence(s) into fragments and prepares a list of the fragments to be processed. It then launches the alignment tool on each fragment.
Once the search engine runs have finished, RepeatMasker converts each engine's output into a single standard format, namely the cross_match output format.
After RepeatMasker has run, ProcessRepeats organizes and processes the results by assembling and sorting fragmented repeats. The main algorithm reads the options, loads the repeat information, removes duplicated hits, assembles fragmented sequences, sorts the repeat hits, removes insignificant fragments, and writes the optional annotation output files.
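Three of these steps (removing duplicated hits, dropping insignificant fragments, and sorting) can be sketched in Python as follows. This is only an illustration of the idea, not the ProcessRepeats algorithm: the `Hit` record, the `tidy_hits` function, and the `min_length` threshold used to decide what counts as insignificant are all hypothetical, and the assembly of fragmented repeats is not attempted.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """A simplified repeat hit (hypothetical record layout)."""
    query: str
    start: int
    end: int
    repeat: str
    score: int


def tidy_hits(hits, min_length=10):
    """Deduplicate, filter, and sort hits.

    `min_length` is an assumed stand-in for whatever criterion marks a
    fragment as insignificant.
    """
    # Remove exact duplicates while preserving first-seen order.
    unique = list(dict.fromkeys(hits))
    # Drop hits shorter than the assumed threshold.
    kept = [h for h in unique if h.end - h.start + 1 >= min_length]
    # Sort by query sequence, then by position.
    return sorted(kept, key=lambda h: (h.query, h.start, h.end, h.repeat))


hits = [
    Hit("chr1", 500, 800, "AluY", 2000),
    Hit("chr1", 10, 300, "L1", 1500),
    Hit("chr1", 500, 800, "AluY", 2000),  # exact duplicate
    Hit("chr1", 900, 903, "MIR", 50),     # too short under the assumed threshold
]
for hit in tidy_hits(hits):
    print(hit)
```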
- DateRepeats: Analyzes whether a repeat present in one species is expected to be present in another, especially for mammalian species.
- DupMasker: Annotates segmental duplications in the query sequence, creating a '.duplicons' output file.
- RepeatMasker Library: Users can use the RepeatMasker library downloaded from the Genetic Information Research Institute or provide their own library, which has an EMBL-like format.
AB-BLAST is an alternative implementation of BLAST, acquired from Washington University by Warren R. Gish. It is free for academic use and uses the same core algorithm as WU-BLAST.
RMBlast is a RepeatMasker compatible version of the standard NCBI BLAST, supporting custom matrices and cross_match-like complexity-adjusted scoring.
Cross_Match, implemented by Phil Green, is part of the Phred/Phrap/Consed package. It is free for academic use, and its output format is the reference format into which RepeatMasker converts the results of the other search engines.
DeCypher (DeCypherSW) belongs to "TimeLogic biocomputing solutions" and, if found in RepeatMasker configuration, is used as a search engine tool.
RepeatMasker provides a comprehensive solution for annotating interspersed repeats and low complexity regions in DNA sequences. The process involves multiple steps: configuration verification, input sequence checks, database inspection, sequence splitting, search engine execution, output format conversion, and post-processing of the results by ProcessRepeats.
For more detailed information, refer to the official RepeatMasker documentation and resources.
- Official RepeatMasker Documentation: https://www.repeatmasker.org/
- Tempel S. (2012). Using and understanding RepeatMasker. Methods in molecular biology (Clifton, N.J.), 859, 29–51. https://doi.org/10.1007/978-1-61779-603-6_2