5.2 RepeatMasker

RepeatMasker Overview

RepeatMasker screens DNA sequences for interspersed repeats and low-complexity regions. Its output is a detailed annotation of the repeats found in the query sequence, together with a modified copy of the query in which all annotated repeats are masked (by default, replaced by Ns). Sequence comparisons are performed by a search engine: either cross_match, an efficient implementation of the Smith-Waterman-Gotoh algorithm developed by Phil Green, or WU-BLAST, developed by Warren Gish.
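To make the masking step concrete, here is a minimal Python sketch. It is not RepeatMasker's own code: the function name `mask_repeats`, the 1-based inclusive coordinates, and the example intervals are assumptions made for illustration. Hard masking writes Ns as described above; the `soft` switch lowercases the bases instead, which is the other common masking convention.

```python
def mask_repeats(sequence, repeats, soft=False):
    """Return `sequence` with every (start, end) interval in `repeats` masked.

    Intervals are 1-based and inclusive (an assumption of this sketch).
    Hard masking replaces bases with 'N'; soft masking lowercases them.
    """
    masked = list(sequence)
    for start, end in repeats:
        # Clamp the interval end so annotations past the sequence end are ignored.
        for i in range(start - 1, min(end, len(masked))):
            masked[i] = masked[i].lower() if soft else "N"
    return "".join(masked)


if __name__ == "__main__":
    query = "ACGTACGTACGTACGT"
    annotated = [(3, 6), (11, 12)]  # made-up repeat coordinates
    print(mask_repeats(query, annotated))             # ACNNNNGTACNNACGT
    print(mask_repeats(query, annotated, soft=True))  # ACgtacGTACgtACGT
```

Hard masking discards the masked bases, while soft masking keeps them recoverable for downstream tools that understand lowercase as "masked".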

Main Functions of RepeatMasker

1. Verification and Input Sequence Check

RepeatMasker first verifies its own configuration by loading and checking all binary paths listed in the RepeatMaskerConfig.pm module. It then initializes parameters such as the size of the fragments passed to the alignment tool, and reads the current date, which it uses when creating the output directory for each run.

The second task is to check the input sequence. The FASTADB.pm module loads the input and computes properties such as the number of FASTA sequences, their lengths, the G and C nucleotide counts, and the GC ratio. It also confirms that the FASTA input is valid and splits the sequences into short overlapping fragments with a length of 2000 bp.
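The kind of checks described above can be sketched in a few lines of Python. This is an illustration only and not a port of FASTADB.pm: the function names are hypothetical, and the fragment size and overlap are left as parameters rather than tied to any RepeatMasker default.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping each sequence name to its sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def gc_fraction(seq):
    """Fraction of G and C among all bases of `seq` (0.0 for an empty sequence)."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def fragments(seq, size, overlap):
    """Yield (offset, fragment) pairs that cover `seq`, sharing `overlap` bases."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    offset = 0
    while True:
        yield offset, seq[offset:offset + size]
        if offset + size >= len(seq):
            break
        offset += step


if __name__ == "__main__":
    seqs = read_fasta(">s1 example\nACGT\nGGCC\n>s2\nAATT\n")
    for name, seq in seqs.items():
        print(name, len(seq), round(gc_fraction(seq), 2))
    print(list(fragments("ABCDEFGHIJ", size=4, overlap=1)))
```

Keeping the offset alongside each fragment is what later allows hits found in a fragment to be mapped back to coordinates in the full sequence.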

2. Database Check and Preparation

RepeatMasker checks the database file containing repeat elements (transposable elements and satellites). Users can either use the RepeatMasker library from the Genetic Information Research Institute or provide a custom library. The RepbaseEMBL.pm, RepbaseRecord.pm, and PubRef.pm modules extract information about repeats from the library.
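As a rough illustration of reading such a library, the sketch below walks an EMBL-style flat file, in which each record starts with an `ID` line and ends with a `//` line. It is a simplification written for this page, not the logic of RepbaseEMBL.pm: the function names are hypothetical, the sample records are made up, and a real library carries more fields than this handles.

```python
def iter_embl_records(text):
    """Yield each record of an EMBL-style flat file as a list of its lines."""
    record = []
    for line in text.splitlines():
        if line.strip() == "//":  # record terminator
            if record:
                yield record
            record = []
        elif line.strip():
            record.append(line)
    if record:  # tolerate a missing final terminator
        yield record


def record_id(record):
    """Return the first token of the record's ID line, or None if it has none."""
    for line in record:
        if line.startswith("ID"):
            fields = line[2:].split()
            return fields[0].rstrip(";") if fields else None
    return None


if __name__ == "__main__":
    # Made-up records, for illustration only.
    library = (
        "ID   REPEAT_A   hypothetical; DNA; 12 BP.\n"
        "DE   Made-up record for illustration.\n"
        "//\n"
        "ID   REPEAT_B; DNA; 8 BP.\n"
        "//\n"
    )
    print([record_id(r) for r in iter_embl_records(library)])
```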

3. Sequence Splitting and Tool Execution

RepeatMasker splits the sequence(s) into fragments and prepares the list of searches to run on those fragments. It then launches the alignment tool on each of them. The search engine output is converted into RepeatMasker's standard output format.

4. Output Format Conversion

After the search engine has run, RepeatMasker converts the engine's native output into a single standard format, namely the cross_match output format.

ProcessRepeats Handling

After RepeatMasker has run, ProcessRepeats organizes and processes the results by assembling and sorting fragmented repeats. Its main algorithm reads the options, loads the information data, removes duplicated hits, assembles fragmented sequences, sorts the repeat hits, removes insignificant fragments, and writes the optional annotation output files.
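The core of these steps can be illustrated with a small Python sketch. It is a deliberate simplification, not the ProcessRepeats algorithm: the `Hit` record, the `max_gap` and `min_length` thresholds, and the rule of joining a hit only with the hit immediately before it are all assumptions, and the sketch sorts before assembling because that makes neighbouring fragments adjacent in the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    query: str   # name of the query sequence
    start: int   # 1-based start on the query
    end: int     # 1-based inclusive end on the query
    repeat: str  # name of the matched repeat
    score: int   # alignment score


def postprocess(hits, max_gap=10, min_length=10):
    """Deduplicate, sort, join neighbouring fragments, and drop short hits."""
    # Remove exact duplicates, then order hits by position on each query.
    ordered = sorted(set(hits), key=lambda h: (h.query, h.start, h.end))

    # Join a hit with the previous one when both belong to the same repeat
    # on the same query and lie at most `max_gap` bases apart.
    merged = []
    for hit in ordered:
        last = merged[-1] if merged else None
        if (last is not None and last.query == hit.query
                and last.repeat == hit.repeat
                and hit.start - last.end <= max_gap):
            merged[-1] = Hit(last.query, last.start, max(last.end, hit.end),
                             last.repeat, max(last.score, hit.score))
        else:
            merged.append(hit)

    # Discard fragments that are too short to be considered significant.
    return [h for h in merged if h.end - h.start + 1 >= min_length]


if __name__ == "__main__":
    # Coordinates and scores below are made up for illustration.
    raw = [
        Hit("chr1", 100, 150, "AluY", 300),
        Hit("chr1", 100, 150, "AluY", 300),  # duplicate
        Hit("chr1", 155, 220, "AluY", 250),  # fragment of the same element
        Hit("chr1", 500, 504, "L1", 40),     # too short to keep
        Hit("chr1", 10, 60, "MIR", 120),
    ]
    for hit in postprocess(raw):
        print(hit)
```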

DateRepeats, DupMasker, and RepeatMasker Library

DateRepeats: Determines whether a repeat present in one species is expected to be present in another, especially among mammalian species.
DupMasker: Annotates segmental duplications in the query sequence and writes a '.duplicons' output file.
RepeatMasker Library: Users can use the RepeatMasker library downloaded from Genetic Information Research Institute or provide their own library, which has an EMBL-like format.

Search Engine Software

1. AB-BLAST (formerly WU-BLAST)

AB-BLAST is an alternative implementation of BLAST and the successor to WU-BLAST; Warren R. Gish acquired the rights to it from Washington University. It is free for academic use and uses the same core algorithm as WU-BLAST.

2. RMBLAST

RMBlast is a RepeatMasker-compatible version of the standard NCBI BLAST. It adds support for custom scoring matrices and for cross_match-like complexity-adjusted scoring.

3. Cross_Match

Cross_Match, implemented by Phil Green, is part of the Phred/Phrap/Consed package. It is free for academic use, and its output format is the reference format into which the output of the other search engines is converted.

4. DeCypher

DeCypher (DeCypherSW) is a product of "TimeLogic biocomputing solutions". If it is found in the RepeatMasker configuration, RepeatMasker uses it as a search engine.
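The engine is selected at run time. As a hedged sketch, the Python helper below assembles a RepeatMasker command line using the `-engine`, `-pa`, `-species`, `-lib` and `-dir` options; the helper name `build_command` and the file names are hypothetical, and the engine names accepted by `-engine` depend on the installed RepeatMasker version and on which engines were configured, so check `RepeatMasker -help` before relying on them. The sketch only builds the argument list and does not run anything.

```python
import shlex


def build_command(fasta, engine="rmblast", species=None, library=None,
                  threads=1, outdir=None):
    """Assemble a RepeatMasker command line as a list of arguments."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    cmd = ["RepeatMasker", "-engine", engine, "-pa", str(threads)]
    if species:
        cmd += ["-species", species]  # use the library entries for a species
    if library:
        cmd += ["-lib", library]      # use a custom repeat library file
    if outdir:
        cmd += ["-dir", outdir]       # write output files to this directory
    cmd.append(fasta)
    return cmd


if __name__ == "__main__":
    cmd = build_command("genome.fa", engine="rmblast", species="human",
                        threads=4, outdir="rm_out")
    print(shlex.join(cmd))
    # To execute it, pass the list to subprocess.run(cmd, check=True).
```

Building the command as a list rather than a single string avoids quoting problems when file names contain spaces.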

Conclusion

RepeatMasker provides a comprehensive solution for annotating interspersed repeats and low-complexity regions in DNA sequences. The process involves several steps: configuration verification, input sequence checks, database inspection, sequence splitting, search engine execution, output format conversion, and post-processing of the results by ProcessRepeats.

For more detailed information, refer to the official RepeatMasker documentation and resources.

References

Official RepeatMasker Documentation: https://www.repeatmasker.org/
Tempel S. (2012). Using and understanding RepeatMasker. Methods in molecular biology (Clifton, N.J.), 859, 29–51. https://doi.org/10.1007/978-1-61779-603-6_2
