This is a completely new GHRU Assembly Pipeline, designed for assembling genomic data based on a provided samplesheet. The pipeline supports different assemblers depending on the type of sequencing reads provided.
The pipeline is built as part of the GHRU project funded by NIHR.
- Julio Diaz Caballero @juliofdiaz julio.diaz@cgps.group
- Varun Shammana @varunshamanna varunshamanna4@gmail.com
- Angela Sofia Garcia @as-garciav agarciav@agrosavia.co
- Christopher Ocampo @arsp-dev christopher.ocampo@ritm.gov.ph
- Nabil-Fareed Alikhan @happykhan nabil.alikhan@cgps.group
- Short Reads Only: Assembled using Shovill.
- Long Reads Only: Assembled using Dragonflye.
- Both Long and Short Reads: Assembled using Unicycler.
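
The selection rules above can be sketched as a small function. This is an illustrative Python sketch only, not the pipeline's actual Nextflow logic: the helper name `choose_assembler` is hypothetical, and it assumes that a sample counts as having short reads only when both paired files are given and that an empty field means "not provided".

```python
from typing import Optional


def choose_assembler(short_reads1: Optional[str],
                     short_reads2: Optional[str],
                     long_reads: Optional[str]) -> str:
    """Return the assembler name for one samplesheet row.

    Hypothetical helper that mirrors the three documented cases.
    """
    # Assumption: short reads are paired, so both files must be present.
    has_short = bool(short_reads1) and bool(short_reads2)
    has_long = bool(long_reads)

    if has_short and has_long:
        return "unicycler"   # hybrid assembly
    if has_long:
        return "dragonflye"  # long reads only
    if has_short:
        return "shovill"     # short reads only
    raise ValueError("row has neither short nor long reads")
```

For example, `choose_assembler("s1_1.fastq.gz", "s1_2.fastq.gz", "")` returns `"shovill"`.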
Clone the repository
git clone https://github.com/cgps-discovery/GHRU-assembly.git
or
Download and unzip/extract the latest release.
Define where the pipeline should find input data and save output data.
| Parameter | Description | Type | Default | Required | Hidden |
|---|---|---|---|---|---|
| `samplesheet` | Input sample sheet, as a CSV file | string | `${launchDir}/samplesheet.csv` | | |
| `outdir` | The output directory where the results will be saved. You have to use absolute paths to storage on cloud infrastructure. | string | `${launchDir}/output` | | |
| Parameter | Description | Type | Default | Required | Hidden |
|---|---|---|---|---|---|
| `max_memory` | Maximum memory to use for each process | string | `16.GB` | True | |
| `max_cpus` | Maximum number of CPUs to use for each process | integer | `10` | True | |
| `max_time` | Maximum time to use for each process | string | `10.h` | True | |
| `adapter_file` | Adapter file for trimming | string | `${projectDir}/data/adapters.fasta` | True | |
| `min_contig_length` | Minimum contig length to keep | integer | `500` | True | |
| `medaka_model` | Medaka model to use | string | `r941_e81_fast_g514` | True | |
To run the pipeline with the samplesheet and output path, use:
nextflow run main.nf --samplesheet /path/to/samplesheet.csv --outdir /path/to/output
- Prepare the Samplesheet: Ensure your CSV samplesheet contains the necessary fields for your samples. The required format is as follows:
sample_id,short_reads1,short_reads2,long_reads,genome_size
An example sample sheet has been provided in the project directory.
- Run the Pipeline: Execute the Nextflow command with the appropriate arguments:
nextflow run main.nf --samplesheet test_input/samplesheet.csv -resume
- The pipeline is primarily built for Linux environments (e.g., native Linux or Windows with [WSL](https://en.wikipedia.org/wiki/Windows_Subsystem_for_Linux)).
- Currently, hybrid assembly and long-read assembly are not supported on macOS due to compatibility issues with Medaka on some macOS versions and MacBook models.
- Nextflow version 24 or higher is required.
- Docker must be running, with 10 cores and 16 GB of RAM allocated in Docker Desktop if using Windows WSL.
At least 10 cores, 16 GB of RAM, and 50 GB of free storage are recommended.
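
A quick way to compare a machine against these recommendations is a small check script. The sketch below is not part of the pipeline; the function name `resource_warnings` and the constants are hypothetical, and it covers only CPU count and free disk space because the Python standard library has no portable call for total RAM.

```python
import os
import shutil
from typing import List

MIN_CPUS = 10      # recommended core count from this README
MIN_FREE_GB = 50   # recommended free storage from this README


def resource_warnings(cpus: int, free_bytes: int) -> List[str]:
    """Return one warning per recommendation that is not met."""
    warnings = []
    if cpus < MIN_CPUS:
        warnings.append(f"only {cpus} CPU cores available; {MIN_CPUS} recommended")
    free_gb = free_bytes / 1e9
    if free_gb < MIN_FREE_GB:
        warnings.append(f"only {free_gb:.0f} GB free; {MIN_FREE_GB} GB recommended")
    return warnings


if __name__ == "__main__":
    # Check the current machine and the disk holding the working directory.
    detected_cpus = os.cpu_count() or 1
    detected_free = shutil.disk_usage(".").free
    for message in resource_warnings(detected_cpus, detected_free):
        print("WARNING:", message)
```

Run it from the directory where the pipeline will write its work and output files, since free space is measured on that disk.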
- File Not Found Errors: Ensure that all specified file paths (samplesheet, output directory, medaka model, adapter file) are correct and accessible.
- Permission Issues: Verify that you have the necessary permissions to read the input files and write to the output directory.
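
Both kinds of problem can be caught before launching Nextflow with a small pre-flight check. The following is a minimal sketch, not a tool shipped with the pipeline: the function name `preflight` is hypothetical, it assumes the samplesheet header matches the documented column order exactly, and it resolves relative read paths against the current working directory.

```python
import csv
import os
from typing import List

EXPECTED_COLUMNS = ["sample_id", "short_reads1", "short_reads2",
                    "long_reads", "genome_size"]
READ_COLUMNS = ["short_reads1", "short_reads2", "long_reads"]


def preflight(samplesheet: str, outdir: str) -> List[str]:
    """Return a list of human-readable problems; empty means no issue found."""
    if not os.access(samplesheet, os.R_OK):
        return [f"samplesheet not readable: {samplesheet}"]

    problems = []
    with open(samplesheet, newline="") as handle:
        reader = csv.DictReader(handle)
        # Assumption: the header must match the documented columns exactly.
        if reader.fieldnames != EXPECTED_COLUMNS:
            return [f"unexpected samplesheet header: {reader.fieldnames}"]
        for row in reader:
            for column in READ_COLUMNS:
                path = row[column]
                # Empty fields are allowed (e.g. no long reads for a sample).
                if path and not os.access(path, os.R_OK):
                    problems.append(
                        f"{row['sample_id']}: cannot read {column} file {path}")

    # The output directory may not exist yet, so test the nearest
    # existing ancestor for write permission instead.
    parent = os.path.abspath(outdir)
    while not os.path.exists(parent) and os.path.dirname(parent) != parent:
        parent = os.path.dirname(parent)
    if not os.access(parent, os.W_OK):
        problems.append(f"cannot write to output location: {parent}")
    return problems
```

Calling `preflight("samplesheet.csv", "output")` and printing each returned message gives a quick summary of missing files or unwritable locations.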