Inquiry Regarding CLUES Software Usage and Pipeline Optimization #15

@aaannaw

Dear Authors,

I am using your CLUES software to estimate the time of selection for a set of SNPs in two species, A and B, which diverged approximately 3 million years ago (estimated with dadi from the inferred demographic history). Using XP-nSL, we identified 2038 SNPs under selection in species B relative to species A within a specific genomic region (chromosome 1, 160-193 Mb). Our goal is to estimate the time of selection for these 2038 SNPs.

We have generated the necessary input files for CLUES using Relate, but I have encountered a few questions and issues during the process. I would greatly appreciate your guidance on the following points:

VCF File Handling: Our VCF file was generated by aligning resequencing data from 12 individuals of species A and 14 individuals of species B to the genome of species A. Should we split the VCF file by species and only use the VCF file for species B when converting to haps and sample files using ConvertFromVcf? Currently, I have used the combined VCF file containing all individuals from both species, specifying the different species in the poplabel file. Is this approach correct?
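
For reference, the poplabel file for the combined VCF could be produced along the lines of the sketch below. It assumes the four-column layout `sample population group sex` with one line per diploid individual in the same order as the sample file; the sample IDs (`A01`, `B01`, ...) are hypothetical placeholders, and filling the sex column with `NA` is also an assumption.

```python
# Sketch: build the text of a Relate-style .poplabels file for the combined
# sample set (12 individuals of species A, 14 of species B).
# Sample IDs are hypothetical; real IDs must follow the .sample file order.

def build_poplabels(samples):
    """samples: list of (sample_id, population) tuples in .sample order."""
    lines = ["sample population group sex"]  # assumed header layout
    for sample_id, population in samples:
        # Group is set equal to population; sex is left as NA (assumption).
        lines.append(f"{sample_id} {population} {population} NA")
    return "\n".join(lines) + "\n"


samples = [(f"A{i:02d}", "A") for i in range(1, 13)]
samples += [(f"B{i:02d}", "B") for i in range(1, 15)]

poplabels_text = build_poplabels(samples)
print(poplabels_text)
```

The resulting text can be written to a file and passed wherever a poplabel file is expected; restricting `samples` to the species B entries would give the single-species variant asked about above.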

Relate All Mode: When using Relate's All mode to generate mut and anc files from the haps and sample files, I noticed that a coal file is required, which in turn requires an effective population size parameter. Is this effective population size referring to the current effective population size? Can I use the current effective population size estimated from dadi's demographic history simulation? Additionally, how critical is it to provide a coal file at this stage, as I currently do not have one?

EstimatePopulationSize.sh: I proceeded to use EstimatePopulationSize.sh with the mut and anc files as input to generate the coal files. This step produced two coal files: one treating all samples as a single population and another considering pairwise comparisons (pairwise.coal). Is this expected?
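
To compare the two coal files, a small parser such as the sketch below may help. It assumes the coal layout I understand Relate to use: a first line of group labels, a second line of epoch boundaries in generations, then one line per group pair consisting of two group indices followed by one coalescence rate per epoch. The conversion Ne = 0.5 / rate is likewise an assumption, and the toy content is made up purely for illustration.

```python
# Sketch: parse a Relate-style .coal file and convert coalescence rates to
# effective population sizes. File layout and the 0.5 / rate conversion are
# assumptions; the toy text below is not real output.

def parse_coal(text):
    rows = [line.split() for line in text.strip().splitlines()]
    groups = rows[0]                          # group labels
    epochs = [float(x) for x in rows[1]]      # epoch boundaries (generations)
    rates = {}
    for fields in rows[2:]:
        i, j = int(fields[0]), int(fields[1])
        rates[(groups[i], groups[j])] = [float(x) for x in fields[2:]]
    return groups, epochs, rates


def rate_to_ne(rate):
    # Undefined for zero or NaN rates, so return NaN in those cases.
    return 0.5 / rate if rate > 0 else float("nan")


toy_coal = """A B
0 1000 10000
0 0 0.0001 0.00005 0.00005
0 1 0 0.00001 0.00002
1 1 0.0002 0.0001 0.0001
"""

groups, epochs, rates = parse_coal(toy_coal)
for pair, values in rates.items():
    print(pair, [rate_to_ne(r) for r in values])
```

Under these assumptions, a single-population coal file would contain one group and one rate line, whereas a pairwise file would list a rate line for each pair of groups.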

SampleBranchLengths.sh: When attempting to use SampleBranchLengths.sh to generate the input files for CLUES, I encountered an issue. The script requires the coal file that treats all samples as a single population, but using this file resulted in an error when trying to generate the timb file (--format b). Interestingly, the pairwise.coal file successfully generated the timb file. I am using version v1.2.2. Is this a known issue, or am I missing something in the process?

CLUES Inference: For testing purposes, I continued with the pipeline using the timb file generated from the pairwise.coal file as input for CLUES. I ran the following command:

python3 /data/01/p1/user157/software/clues-master/clues-master/inference.py --times ./1sub --tCutoff 3000000 --out 3000000

I set the cutoff time to 3 million years (assuming one generation per year), but the process is running very slowly. Is this expected, or are there ways to optimize the speed of the analysis?
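
The cutoff used in the command above can be derived as in the sketch below, which converts a time in years to generations and assembles the same argument list. It assumes that `--tCutoff` is expressed in generations; only the three flags already shown in my command are used, the helper names are hypothetical, and the script path is shortened to `inference.py` for readability.

```python
import shlex

# Sketch: derive the --tCutoff value from a divergence time in years and
# rebuild the inference.py command. Helper names are hypothetical.

def years_to_generations(years, years_per_generation):
    """Convert a time in years to a whole number of generations."""
    return int(round(years / years_per_generation))


def build_clues_command(script, times_prefix, t_cutoff, out_prefix):
    # Only flags that appear in the original command are used here.
    return [
        "python3", script,
        "--times", times_prefix,
        "--tCutoff", str(t_cutoff),
        "--out", out_prefix,
    ]


# 3 million years at one generation per year, as assumed in the post.
t_cutoff = years_to_generations(3_000_000, 1.0)
cmd = build_clues_command("inference.py", "./1sub", t_cutoff, "3000000")
print(shlex.join(cmd))
```

With a longer generation time the cutoff shrinks proportionally, for example 1,500,000 generations at two years per generation.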

In summary, I would like to confirm if my pipeline is correct and if there are any recommendations to improve the efficiency of the CLUES analysis. Your insights would be invaluable in ensuring the accuracy and efficiency of our work.

Thank you very much for your time and assistance. I look forward to your response.

Best regards,
Na Wan