Nextflow DSL2 pipeline for assembly processing, variant discovery, TE analysis, gene/protein functional annotation, and group-specific variant interpretation.
- main.nf
- workflows/preprocess_assembly.nf
- workflows/alignment_variant_calling.nf
- Read QC and trimming: FastQC, fastp
- Assembly/scaffolding: megahit, ragtag
- Coverage estimation: bwa mem + samtools depth metrics
- Variant calling: Bowtie2, bcftools, SnpEff
- Callable genome workflow: mosdepth + samtools depth + consensus callable BED
- Group-specific variants: strict per-group variant extraction
- TE analysis: RepeatModeler, RepeatMasker (optional McClintock)
- Gene/protein analyses: Liftoff, antiSMASH, BLASTp (MEROPS), dbCAN, SIX BLASTp, TargetP, SignalP, WoLFPSort, optional DeepTMHMM
- Functional annotation for group-specific candidates: EggNOG + PHI-base BLASTp
CSV with header:
sample_id,read1,read2
34,/path/sample34_R1.fastq.gz,/path/sample34_R2.fastq.gz
35,/path/sample35_R1.fastq.gz,/path/sample35_R2.fastq.gz
Default: samplesheet.csv
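A quick structural check of the samplesheet can catch a wrong header or a truncated row before a run starts. The sketch below is not part of the pipeline; check_samplesheet is a hypothetical helper name, and it only tests the layout described above (exact header, three non-empty comma-separated fields per row), not whether the FASTQ files exist.

```shell
# check_samplesheet FILE  (hypothetical helper, not shipped with the pipeline)
# Returns 0 only if the header is exactly sample_id,read1,read2 and every
# non-blank data row has three non-empty comma-separated fields.
check_samplesheet() {
    awk -F',' '
        { sub(/\r$/, "") }                      # tolerate Windows line endings
        NR == 1 {
            if ($0 != "sample_id,read1,read2") {
                print "bad header: " $0
                bad = 1
            }
            next
        }
        /^[[:space:]]*$/ { next }               # ignore blank lines
        NF != 3 || $1 == "" || $2 == "" || $3 == "" {
            print "line " NR ": expected 3 non-empty fields"
            bad = 1
        }
        END {
            if (NR == 0) bad = 1                # an empty file is not valid
            exit (bad ? 1 : 0)
        }
    ' "$1"
}
```

For example, check_samplesheet samplesheet.csv prints nothing and returns 0 for a well-formed file, and lists the offending lines otherwise.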
TSV format (tab-separated columns; the line starting with # is a header comment):
# sample_id group
34 GroupB
35 GroupB
41 GroupA
42 GroupA
Default: samplesheets/group_map.tsv
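Because the group map and the samplesheet are maintained separately, a sample can easily end up without a group. The following sketch lists such samples; missing_groups is a hypothetical helper, and it assumes the two file layouts shown above (CSV with a header row, whitespace- or tab-separated group map with # comment lines).

```shell
# missing_groups SAMPLESHEET GROUP_MAP  (hypothetical helper)
# Prints every sample_id from the samplesheet that has no entry in the group map.
missing_groups() {
    awk '
        mode == "map" {                         # reading the group map
            if ($0 !~ /^#/ && NF >= 2) grouped[$1] = 1
            next
        }
        FNR == 1 { next }                       # skip the CSV header row
        {
            split($0, f, ",")                   # first CSV field is sample_id
            if (f[1] != "" && !(f[1] in grouped)) print f[1]
        }
    ' mode=map "$2" mode=sheet "$1"
}
```

Running missing_groups samplesheet.csv samplesheets/group_map.tsv should print nothing when every sample is assigned to a group.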
Configure paths in nextflow.config or via CLI:
- BLAST databases (nt/MEROPS/PHI-base)
- SIX query FASTA
- dbCAN database
- EggNOG data directory
- SnpEff DB/JAR
- Tool paths (Conda env/tool binaries)
source ~/miniconda3/etc/profile.d/conda.sh
conda activate /home/sercanozturk/miniconda3/envs/nf-env
nextflow run main.nf --samplesheet_path samplesheet.csv -with-report
Parameters:
- --output_dir
- --samplesheet_path
- --avc_group_map
- --avc_run_group_specific
- --avc_run_group_functional_annotation
- --enable_chr0_blast_append
- --run_mcclintock
- --skip_deeptmhmm
Under results/ (default):
- coverage/
- callable/
- bcftools_callable/
- group_specific_variants/
- eggnog/
- six_blastp/
- dbcan/
- repeatmodeler/, repeatmasker/
- liftoff/, antismash/, blastp/, snpeff/
- Optional analysis modules can be enabled/disabled with config flags.
- Validate group map and database paths before production runs.
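One way to do the path validation is a small preflight function that reports every configured path that does not exist. This is only a sketch: check_paths is a hypothetical helper, and the paths passed to it must be taken from your own nextflow.config. Note that a BLAST database is usually a set of files sharing a prefix, so it is safer to check its containing directory than the prefix itself.

```shell
# check_paths PATH...  (hypothetical preflight helper)
# Reports each missing path on stderr; returns 1 if any path is missing.
check_paths() {
    local missing=0 p
    for p in "$@"; do
        if [ ! -e "$p" ]; then
            echo "missing: $p" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A typical call would be check_paths samplesheet.csv samplesheets/group_map.tsv followed by the database and tool locations you have configured; a non-zero exit status means at least one of them was not found.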