Redcarpet (Recombination Detection using Comparative Analysis of Regional Patterns of Exact Match Targets)
Redcarpet is an alignment-free recombination detection tool that uses the distribution of exact protein matches across a genomic database. Redcarpet builds on the WhatsGNU method, which uses exact matching to identify proteomic novelty.
Redcarpet takes in a single query genome, and for each encoded protein, determines the set of genomes in a database that contain an exact protein sequence match. It then computes the Jaccard similarity coefficient between genome sets for all pairwise protein comparisons in the genome.
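The pairwise comparison described above can be illustrated with a minimal Python sketch. It is not Redcarpet's actual implementation: the function names, the toy genome identifiers, and the choice to score two empty sets as 0.0 are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B|.

    Assumption: two empty sets score 0.0 (the coefficient is undefined there).
    """
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def pairwise_jaccard(genome_sets):
    """Return a symmetric n x n matrix (list of lists) of Jaccard scores.

    genome_sets[i] is the set of database genomes that contain an exact
    match to protein i of the query genome, in genome order.
    """
    n = len(genome_sets)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = jaccard(genome_sets[i], genome_sets[i])
    for i, j in combinations(range(n), 2):
        score = jaccard(genome_sets[i], genome_sets[j])
        matrix[i][j] = matrix[j][i] = score
    return matrix


# Toy example with hypothetical genome identifiers.
hits = [
    {"G1", "G2", "G3"},  # genomes sharing protein 0
    {"G2", "G3", "G4"},  # genomes sharing protein 1
    {"G5"},              # genomes sharing protein 2
]
matrix = pairwise_jaccard(hits)
for row in matrix:
    print(row)
```

In this toy case, proteins 0 and 1 share two of four genomes in total, and protein 2 shares none with either. Under the approach described above, a run of adjacent proteins whose genome sets differ sharply from those of their neighbors is the kind of regional pattern that may indicate recombination.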
WhatsGNU must be run first to produce the hits file that serves as the input for Redcarpet. To install WhatsGNU, see the WhatsGNU installation instructions.
Using a hashed database:
WhatsGNU_main_hashes.py -d $database_path -csv $file.csv -i --hash_values -o $output_directory query_faa/
whatsgnu_output/
├── GCF_000005845.2_prtn_id_hashes.csv
├── GCF_000005845.2_WhatsGNU_hits.txt
├── GCF_000005845.2_WhatsGNU_report.txt
└── WhatsGNU_20250108_115433.log
To run Redcarpet, use the following command:
python3 Redcarpet.py $WhatsGNU_hits.txt
usage: Redcarpet.py [-h] [-v] [-bk BOTTOM_K] [--hash_file HASH_FILE] ids_hits_file

Alignment-Free Recombination Detector

positional arguments:
  ids_hits_file         ids_hits_file from WhatsGNU -i option

options:
  -h, --help            show this help message and exit
  -v, --version         print version and exit
  -bk BOTTOM_K, --bottom_k BOTTOM_K
                        bottom-k cutoff for hits (default: all hits are used)
  --hash_file HASH_FILE
                        ids hash file for a WhatsGNU database
[Discuss what is meant by bottom_k option]
Once Redcarpet has generated its report, you can proceed to the changepoint analysis step.
To process a single Redcarpet report, use the following command:
python3 CarpetCleanChangepoints.py --mode single -i $input_report -o $output_directory
To process multiple Redcarpet reports in a folder simultaneously:
python3 CarpetCleanChangepoints.py --mode batch --input_folder $input_folder -o $output_directory
-i, --input_heatmap : Path to the heatmap file generated by Redcarpet (required for single mode)
--input_folder : Path to folder containing multiple Redcarpet reports (required for batch mode)
-o, --output_directory : Directory to store outputs (optional - defaults to input file/folder directory)
--similarity_threshold : P-value threshold for determining region similarity (default: 0.05, lower = more strict)
--k_neighbors : Number of nearest neighbors to compare for efficiency (default: 5, higher = more comparisons)
--num_chunks : Number of chunks to split data into (default: 10)
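When the changepoint step is scripted, it can help to assemble the command line programmatically. The sketch below only builds the argument list from the flags documented above and does not run anything. The helper name changepoint_command and the example paths are hypothetical, not part of Redcarpet.

```python
import shlex


def changepoint_command(mode, input_path, output_directory=None,
                        similarity_threshold=None, k_neighbors=None,
                        num_chunks=None):
    """Build the argv list for CarpetCleanChangepoints.py (hypothetical helper).

    Only flags listed in this README are used; options left as None are
    omitted so that the script's own defaults apply.
    """
    if mode not in ("single", "batch"):
        raise ValueError("mode must be 'single' or 'batch'")

    cmd = ["python3", "CarpetCleanChangepoints.py", "--mode", mode]
    # Single mode takes one report via -i; batch mode takes a folder.
    if mode == "single":
        cmd += ["-i", input_path]
    else:
        cmd += ["--input_folder", input_path]
    if output_directory is not None:
        cmd += ["-o", output_directory]

    optional = {
        "--similarity_threshold": similarity_threshold,
        "--k_neighbors": k_neighbors,
        "--num_chunks": num_chunks,
    }
    for flag, value in optional.items():
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


# Example paths are placeholders, not real Redcarpet output names.
single = changepoint_command("single", "my_report", "out",
                             similarity_threshold=0.01)
batch = changepoint_command("batch", "reports/", num_chunks=20)
print(shlex.join(single))
print(shlex.join(batch))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the directory that contains CarpetCleanChangepoints.py.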
The output lists the regions of the genome delimited by the identified changepoints, along with information about which regions are similar to one another.
Heatmaps and reports have already been completed for all S. aureus and K. pneumoniae.