diff --git a/README.md b/README.md index ba00c599..9d1a1380 100644 --- a/README.md +++ b/README.md @@ -36,26 +36,26 @@ On release, automated continuous integration tests run the pipeline on a full-si > to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) > with `-profile test` before running the workflow on actual data. -First, prepare a samplesheet with your input data that looks as follows: +First, prepare an SDRF (Sample and Data Relationship Format) file with your input data. SDRF is a standardized format used in proteomics that contains rich metadata about samples. Here's a simplified example: -`samplesheet.tsv` +`sdrf_file.tsv` -```tsv title="samplesheet.tsv" -ID Sample Condition ReplicateFileName -1 tumor treated /path/to/msrun1.raw|mzML|d -2 tumor treated /path/to/msrun2.raw|mzML|d -3 tumor untreated /path/to/msrun3.raw|mzML|d -4 tumor untreated /path/to/msrun4.raw|mzML|d +```tsv title="sdrf_file.tsv" +source name characteristics[organism] characteristics[organism part] comment[data file] comment[technical replicate] factor value[organism part] +sample1_classI Homo sapiens Brain /path/to/sample1_classI_techRep1.mzML 1 Brain +sample1_classI Homo sapiens Brain /path/to/sample1_classI_techRep2.mzML 2 Brain +sample2_classI Homo sapiens Liver /path/to/sample2_classI_techRep1.mzML 1 Liver +sample2_classI Homo sapiens Liver /path/to/sample2_classI_techRep2.mzML 2 Liver ``` Each row represents a mass spectrometry run in one of the formats: raw, RAW, mzML, mzML.gz, d, d.tar.gz, d.zip -Now, you can run the pipeline using: +Now, you can run the pipeline using your SDRF file: ```bash nextflow run nf-core/mhcquant -profile \ - --input 'samplesheet.tsv' \ + --input 'sdrf_file.tsv' \ --fasta 'SWISSPROT_2020.fasta' \ --outdir ./results ``` @@ -71,6 +71,7 @@ For more details and further functionality, please refer to the [usage documenta By default the pipeline currently performs identification of MHC class I
peptides with HCD settings: +- Converting SDRF file to internal format (`SDRF_CONVERT`) - Preparing spectra dependent on the input format (`PrepareSpectra`) - Creation of reversed decoy database (`DecoyDatabase`) - Identification of peptides in the MS/MS spectra (`CometAdapter`) diff --git a/assets/example_sdrf.tsv b/assets/example_sdrf.tsv new file mode 100644 index 00000000..4cb3aed7 --- /dev/null +++ b/assets/example_sdrf.tsv @@ -0,0 +1,10 @@ +source name characteristics[organism] characteristics[organism part] characteristics[cell type] characteristics[cell line] characteristics[developmental stage] characteristics[disease] characteristics[ancestry category] characteristics[sex] characteristics[age] characteristics[biological replicate] characteristics[mhc type] material type assay name technology type comment[data file] comment[file uri] comment[technical replicate] comment[fraction identifier] comment[label] comment[cleavage agent details] comment[instrument] comment[modification parameters] comment[dissociation method] comment[precursor mass tolerance] comment[fragment mass tolerance] factor value[organism part] +sample1_classII Homo sapiens Brain not applicable not applicable adult none not available not available not available 1 A*11:01;A*68:01;B*15:01;B*35:03;C*03:03;C*04:01;DRB1*04:01;DRB4*01:01;DQB1*03:01;DQA1*03:01;DPB1*04:01;DPB1*04:02 tissue none proteomic profiling by mass spectrometry /path/to/sample1_classII_techRep1.mzML TODO 1 1 AC=MS:1002038;NT=label free sample NT=unspecific cleavage; AC=MS:1001956 NT=timsTOF Pro 2;AC=MS:1003230 NT=Oxidation;MT=variable;TA=M;AC=UNIMOD:35;PP=Anywhere AC=MS:1000133;NT=CID 20 ppm 0.02 Da Brain +sample1_classII Homo sapiens Brain not applicable not applicable adult none not available not available not available 1 A*11:01;A*68:01;B*15:01;B*35:03;C*03:03;C*04:01;DRB1*04:01;DRB4*01:01;DQB1*03:01;DQA1*03:01;DPB1*04:01;DPB1*04:02 tissue none proteomic profiling by mass spectrometry 
/path/to/sample1_classII_techRep2.mzML TODO 2 1 AC=MS:1002038;NT=label free sample NT=unspecific cleavage; AC=MS:1001956 NT=timsTOF Pro 2;AC=MS:1003230 NT=Oxidation;MT=variable;TA=M;AC=UNIMOD:35;PP=Anywhere AC=MS:1000133;NT=CID 20 ppm 0.02 Da Brain +sample1_classII Homo sapiens Brain not applicable not applicable adult none not available not available not available 1 A*11:01;A*68:01;B*15:01;B*35:03;C*03:03;C*04:01;DRB1*04:01;DRB4*01:01;DQB1*03:01;DQA1*03:01;DPB1*04:01;DPB1*04:02 tissue none proteomic profiling by mass spectrometry /path/to/sample1_classII_techRep3.mzML TODO 3 1 AC=MS:1002038;NT=label free sample NT=unspecific cleavage; AC=MS:1001956 NT=timsTOF Pro 2;AC=MS:1003230 NT=Oxidation;MT=variable;TA=M;AC=UNIMOD:35;PP=Anywhere AC=MS:1000133;NT=CID 20 ppm 0.02 Da Brain +sample1_classI Homo sapiens Brain not applicable not applicable adult none not available not available not available 1 A*11:01;A*68:01;B*15:01;B*35:03;C*03:03;C*04:01;DRB1*04:01;DRB4*01:01;DQB1*03:01;DQA1*03:01;DPB1*04:01;DPB1*04:02 tissue none proteomic profiling by mass spectrometry /path/to/sample1_classI_techRep1.mzML TODO 1 1 AC=MS:1002038;NT=label free sample NT=unspecific cleavage; AC=MS:1001956 NT=timsTOF Pro 2;AC=MS:1003230 NT=Oxidation;MT=variable;TA=M;AC=UNIMOD:35;PP=Anywhere AC=MS:1000133;NT=CID 20 ppm 0.02 Da Brain +sample1_classI Homo sapiens Brain not applicable not applicable adult none not available not available not available 1 A*11:01;A*68:01;B*15:01;B*35:03;C*03:03;C*04:01;DRB1*04:01;DRB4*01:01;DQB1*03:01;DQA1*03:01;DPB1*04:01;DPB1*04:02 tissue none proteomic profiling by mass spectrometry /path/to/sample1_classI_techRep2.mzML TODO 2 1 AC=MS:1002038;NT=label free sample NT=unspecific cleavage; AC=MS:1001956 NT=timsTOF Pro 2;AC=MS:1003230 NT=Oxidation;MT=variable;TA=M;AC=UNIMOD:35;PP=Anywhere AC=MS:1000133;NT=CID 20 ppm 0.02 Da Brain +sample2_classII Homo sapiens Liver not applicable not applicable adult none not available not available not available 2 
A*02:01;A*24:02;B*07:02;B*40:01;C*03:04;C*07:02;DRB1*15:01;DRB5*01:01;DQB1*06:02;DQA1*01:02;DPB1*04:01;DPB1*04:02 tissue none proteomic profiling by mass spectrometry /path/to/sample2_classII_techRep1.mzML TODO 1 1 AC=MS:1002038;NT=label free sample NT=unspecific cleavage; AC=MS:1001956 NT=timsTOF Pro 2;AC=MS:1003230 NT=Oxidation;MT=variable;TA=M;AC=UNIMOD:35;PP=Anywhere AC=MS:1000133;NT=CID 20 ppm 0.02 Da Liver +sample2_classII Homo sapiens Liver not applicable not applicable adult none not available not available not available 2 A*02:01;A*24:02;B*07:02;B*40:01;C*03:04;C*07:02;DRB1*15:01;DRB5*01:01;DQB1*06:02;DQA1*01:02;DPB1*04:01;DPB1*04:02 tissue none proteomic profiling by mass spectrometry /path/to/sample2_classII_techRep2.mzML TODO 2 1 AC=MS:1002038;NT=label free sample NT=unspecific cleavage; AC=MS:1001956 NT=timsTOF Pro 2;AC=MS:1003230 NT=Oxidation;MT=variable;TA=M;AC=UNIMOD:35;PP=Anywhere AC=MS:1000133;NT=CID 20 ppm 0.02 Da Liver +sample2_classI Homo sapiens Liver not applicable not applicable adult none not available not available not available 2 A*02:01;A*24:02;B*07:02;B*40:01;C*03:04;C*07:02;DRB1*15:01;DRB5*01:01;DQB1*06:02;DQA1*01:02;DPB1*04:01;DPB1*04:02 tissue none proteomic profiling by mass spectrometry /path/to/sample2_classI_techRep1.mzML TODO 1 1 AC=MS:1002038;NT=label free sample NT=unspecific cleavage; AC=MS:1001956 NT=timsTOF Pro 2;AC=MS:1003230 NT=Oxidation;MT=variable;TA=M;AC=UNIMOD:35;PP=Anywhere AC=MS:1000133;NT=CID 20 ppm 0.02 Da Liver +sample2_classI Homo sapiens Liver not applicable not applicable adult none not available not available not available 2 A*02:01;A*24:02;B*07:02;B*40:01;C*03:04;C*07:02;DRB1*15:01;DRB5*01:01;DQB1*06:02;DQA1*01:02;DPB1*04:01;DPB1*04:02 tissue none proteomic profiling by mass spectrometry /path/to/sample2_classI_techRep2.mzML TODO 2 1 AC=MS:1002038;NT=label free sample NT=unspecific cleavage; AC=MS:1001956 NT=timsTOF Pro 2;AC=MS:1003230 NT=Oxidation;MT=variable;TA=M;AC=UNIMOD:35;PP=Anywhere 
AC=MS:1000133;NT=CID 20 ppm 0.02 Da Liver \ No newline at end of file diff --git a/bin/sdrf_to_mhcquant.py b/bin/sdrf_to_mhcquant.py new file mode 100755 index 00000000..2e559e5f --- /dev/null +++ b/bin/sdrf_to_mhcquant.py @@ -0,0 +1,94 @@ +#!/usr/bin/env python + +""" +Script to convert SDRF files to MHCquant-compatible samplesheet format. +""" + +import argparse +import pandas as pd +import sys + +def parse_args(): + """Parse command line arguments.""" + parser = argparse.ArgumentParser(description='Convert SDRF files to MHCquant-compatible samplesheet format.') + parser.add_argument('-s', '--sdrf', required=True, help='Path to SDRF file') + parser.add_argument('-o', '--output', required=True, help='Path to output samplesheet file') + parser.add_argument('-v', '--version', action='version', version='%(prog)s 1.0') + return parser.parse_args() + +def convert_sdrf_to_mhcquant(sdrf_file, output_file): + """ + Convert SDRF file to MHCquant-compatible samplesheet format. + + Parameters: + ----------- + sdrf_file : str + Path to SDRF file + output_file : str + Path to output samplesheet file + """ + try: + # Parse the tab-separated SDRF file + sdrf_df = pd.read_csv(sdrf_file, sep='\t') + + # Collect output rows and build the dataframe once at the end + output_rows = [] + + # Extract necessary information from SDRF + for idx, row in sdrf_df.iterrows(): + # Get source name as sample + sample = row.get('source name', f'sample_{idx}') + + # Get condition from factor value or characteristics + condition = None + for col in row.index: + if col.startswith('factor value['): + condition = row[col] + break + + if condition is None: + # If no factor value is found, use organism part or other characteristic + for col in row.index: + if col.startswith('characteristics[organism part]'): + condition = row[col] + break + + if condition is None: + # Default condition if nothing else is found + condition = 'unknown' + + # Get file path + file_path = None + for col in row.index: + if col == 'comment[data file]': + file_path = row[col] + break + + if file_path is None: + continue + + # Add to output rows + output_rows.append({ + 'ID': idx + 1, + 'Sample': sample, + 'Condition': condition, + 'ReplicateFileName': file_path + }) + + # Write to output file + output_df = pd.DataFrame(output_rows, columns=['ID', 'Sample', 'Condition', 'ReplicateFileName']) + output_df.to_csv(output_file, sep='\t', index=False) + print(f"Successfully converted {sdrf_file} to {output_file}") + + except Exception as e: + print(f"Error converting SDRF file: {e}", file=sys.stderr) + sys.exit(1) + +def main(): + """Main function.""" + args = parse_args() + convert_sdrf_to_mhcquant(args.sdrf, args.output) + +if __name__ == '__main__': + main() \ No newline at end of file diff --git a/docs/usage.md b/docs/usage.md index 4d021fe4..3ed3af3b 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -4,41 +4,44 @@ > _Documentation of pipeline parameters is generated automatically from the pipeline schema and can no
longer be found in markdown files._ -## Samplesheet input +## SDRF input -You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location. It has to be a tab-separated file with 4 columns, and a header row as shown in the examples below. +You will need to create an SDRF (Sample and Data Relationship Format) file with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location. SDRF is a standardized format used in proteomics that contains rich metadata about samples. ```bash ---input '[path to samplesheet file]' +--input '[path to SDRF file]' ``` -### Samplesheet columns +### SDRF format -| Column | Description | -| ------------------- | ----------------------------------------------------------------------------------------------------- | -| `ID` | An incrementing value which acts as a unique number for the given sample | -| `Sample` | Custom sample name. This entry will be identical for multiple MS runs from the same sample. | -| `Condition` | Additional information of the sample can be defined here. | -| `ReplicateFileName` | Full path to the MS file. These files have the extentions .raw, .mzML, mzML.gz, .d, .d.tar.gz, .d.zip | +SDRF files contain detailed sample metadata in a tabular format with specific columns for sample characteristics, data file locations, and experimental factors. The pipeline uses sdrf-pipelines to process this file and extract the necessary information. -The pipeline will auto-detect whether a sample is either in mzML, raw or tdf file format using the information provided in the samplesheet. +Key columns in an SDRF file include: -An [example samplesheet](../assets/samplesheet.tsv) has been provided with the pipeline. 
+| Column | Description | +| --------------------------- | ----------------------------------------------------------------------------------------------------- | +| `source name` | Custom sample name. This entry will be identical for multiple MS runs from the same sample. | +| `characteristics[organism]` | The organism from which the sample was derived | +| `comment[data file]` | Full path to the MS file. These files have the extensions .raw, .mzML, mzML.gz, .d, .d.tar.gz, .d.zip | +| `factor value[...]` | Experimental factors that can be used for grouping samples | +| `characteristics[mhc type]` | MHC alleles present in the sample (important for MHC peptide analysis) | + +The pipeline will auto-detect whether a sample is in mzML, raw or tdf file format using the information provided in the SDRF file. + +An [example SDRF file](../assets/example_sdrf.tsv) has been provided with the pipeline. ### Multiple runs of the same sample -MS runs are merged on the `Sample` and `Condition` identifier combination before they are rescored with `Percolator`. Typically technical replicates of a sample are merged together to report one peptide list per sample. Below is an example of two runs from a treated and untreated tumor sample. - -```tsv title="samplesheet.tsv" -ID Sample Condition ReplicateFileName -1 tumor treated /path/to/msrun1.raw|mzML|d -2 tumor treated /path/to/msrun2.raw|mzML|d -3 tumor untreated /path/to/msrun3.raw|mzML|d -4 tumor untreated /path/to/msrun4.raw|mzML|d -5 control treated /path/to/msrun5.raw|mzML|d -6 control treated /path/to/msrun6.raw|mzML|d -7 control untreated /path/to/msrun7.raw|mzML|d -8 control untreated /path/to/msrun8.raw|mzML|d +MS runs are merged based on the `source name` and factor values before they are rescored with `Percolator`. Typically technical replicates of a sample are merged together to report one peptide list per sample.
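The merge key can be sketched in a few lines of plain Python. This is an illustrative sketch only — the pipeline itself implements the grouping in Nextflow, and the column names follow the simplified SDRF example rather than a full SDRF:

```python
import csv
from collections import defaultdict
from io import StringIO

# Toy SDRF excerpt: two samples, each with two technical replicates.
SDRF = """source name\tfactor value[organism part]\tcomment[data file]
sample1_classI\tBrain\t/path/to/sample1_classI_techRep1.mzML
sample1_classI\tBrain\t/path/to/sample1_classI_techRep2.mzML
sample2_classI\tLiver\t/path/to/sample2_classI_techRep1.mzML
sample2_classI\tLiver\t/path/to/sample2_classI_techRep2.mzML
"""

# Group MS runs on (source name, factor value): technical replicates of the
# same sample/condition collapse into one group before Percolator rescoring.
groups = defaultdict(list)
for row in csv.DictReader(StringIO(SDRF), delimiter="\t"):
    key = (row["source name"], row["factor value[organism part]"])
    groups[key].append(row["comment[data file]"])

for (sample, condition), files in groups.items():
    print(sample, condition, len(files))
# sample1_classI Brain 2
# sample2_classI Liver 2
```

Each resulting group corresponds to one merged peptide list reported per sample/condition.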
+ +For example, in the provided example SDRF file, there are multiple technical replicates for each sample (e.g., sample1_classI, sample1_classII), and these will be merged together during processing. + +### Creating SDRF files + +You can create SDRF files manually or use tools like the [SDRF Composer](https://github.com/bigbio/sdrf-composer) to generate them. The sdrf-pipelines tool also provides validation functionality to ensure your SDRF file is correctly formatted: + +```bash +parse_sdrf validate-sdrf --sdrf_file your_sdrf_file.tsv ``` ## Recommended search settings @@ -74,7 +77,7 @@ The typical command for running the pipeline is as follows: ```console nextflow run nf-core/mhcquant \ - --input 'samplesheet.tsv' \ + --input 'sdrf_file.tsv' \ --outdir \ --fasta 'SWISSPROT_2020.fasta' \ \ @@ -111,7 +114,7 @@ nextflow run nf-core/mhcquant -profile docker -params-file params.yaml with: ```yaml title="params.yaml" -input: './samplesheet.csv' +input: './sdrf_file.tsv' outdir: './results/' <...> ``` diff --git a/modules/local/sdrf_convert.nf b/modules/local/sdrf_convert.nf new file mode 100644 index 00000000..3d89aeb1 --- /dev/null +++ b/modules/local/sdrf_convert.nf @@ -0,0 +1,28 @@ +process SDRF_CONVERT { + tag "$sdrf" + label 'process_low' + + conda "bioconda::sdrf-pipelines=0.0.20" + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
+ 'https://depot.galaxyproject.org/singularity/sdrf-pipelines:0.0.20--pyhdfd78af_0' : + 'biocontainers/sdrf-pipelines:0.0.20--pyhdfd78af_0' }" + + input: + path sdrf + + output: + path "samplesheet.tsv", emit: samplesheet + path "versions.yml", emit: versions + + script: + """ + parse_sdrf.py convert-mhcquant \\ + -s $sdrf \\ + -o samplesheet.tsv + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + sdrf-pipelines: \$(parse_sdrf.py --version | sed 's/parse_sdrf.py version //g') + END_VERSIONS + """ +} \ No newline at end of file diff --git a/modules/local/sdrf_convert/environment.yml b/modules/local/sdrf_convert/environment.yml new file mode 100644 index 00000000..06c6a204 --- /dev/null +++ b/modules/local/sdrf_convert/environment.yml @@ -0,0 +1,9 @@ +name: sdrf_convert +channels: + - conda-forge + - bioconda + - defaults +dependencies: + - bioconda::sdrf-pipelines=0.0.20 + - conda-forge::pandas=1.3.5 + - python=3.8 \ No newline at end of file diff --git a/modules/local/sdrf_convert/main.nf b/modules/local/sdrf_convert/main.nf new file mode 100644 index 00000000..8ed3a16f --- /dev/null +++ b/modules/local/sdrf_convert/main.nf @@ -0,0 +1,29 @@ +process SDRF_CONVERT { + tag "$sdrf" + label 'process_low' + + conda "bioconda::sdrf-pipelines=0.0.20 conda-forge::pandas=1.3.5 python=3.8" + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? 
'https://depot.galaxyproject.org/singularity/sdrf-pipelines:0.0.20--pyhdfd78af_0' : + 'biocontainers/sdrf-pipelines:0.0.20--pyhdfd78af_0' }" + + input: + path sdrf + + output: + path "samplesheet.tsv", emit: samplesheet + path "versions.yml", emit: versions + + script: + """ + sdrf_to_mhcquant.py \\ + -s $sdrf \\ + -o samplesheet.tsv + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + sdrf-pipelines: \$(python -c "import pkg_resources; print(pkg_resources.get_distribution('sdrf-pipelines').version)") + python: \$(python --version | sed 's/Python //g') + END_VERSIONS + """ +} \ No newline at end of file diff --git a/modules/local/sdrf_convert/meta.yml b/modules/local/sdrf_convert/meta.yml new file mode 100644 index 00000000..3125dc92 --- /dev/null +++ b/modules/local/sdrf_convert/meta.yml @@ -0,0 +1,33 @@ +name: sdrf_convert +description: Converts SDRF files to MHCquant-compatible samplesheet format +keywords: + - sdrf + - convert + - samplesheet +tools: + - sdrf-pipelines: + description: A set of pipelines for extracting, processing and validating SDRF files + homepage: https://github.com/bigbio/sdrf-pipelines + documentation: https://github.com/bigbio/sdrf-pipelines + tool_dev_url: https://github.com/bigbio/sdrf-pipelines + doi: "" + licence: ["Apache-2.0"] + +input: + - sdrf: + type: file + description: SDRF file + pattern: "*.{tsv}" + +output: + - samplesheet: + type: file + description: Converted samplesheet in MHCquant-compatible format + pattern: "*.{tsv}" + - versions: + type: file + description: File containing software versions + pattern: "versions.yml" + +authors: + - "@nf-core/mhcquant" \ No newline at end of file diff --git a/nextflow_schema.json b/nextflow_schema.json index 26b89b41..3904e565 100644 --- a/nextflow_schema.json +++ b/nextflow_schema.json @@ -14,13 +14,13 @@ "properties": { "input": { "type": "string", - "description": "Input raw / mzML files listed in a tsv file (see help for details)", - "help_text":
"Use this to specify a sample sheet table including your input raw or mzml files as well as their meta information such as SampleID and Condition. For example:\n\n| ID | Sample | Condition | ReplicateFileName |\n| -----|:------------:| ----------:|------------------------------------------:|\n| 1 | MM15_Melanom | A | data/MM15_Melanom_W_1_A_standard.raw |\n| 2 | MM15_Melanom | B | data/MM15_Melanom_W_1_B_standard.raw |\n| 3 | MM17_Melanom | B | data/MM17_Melanom_W_1_B_standard.raw |\n\n```bash\n--input 'path/samples.tsv'\n```", + "description": "Input SDRF file containing sample metadata and file paths", + "help_text": "Use this to specify an SDRF (Sample and Data Relationship Format) file containing your sample metadata and file paths. SDRF is a standardized format used in proteomics that contains rich metadata about samples. Example columns include 'source name', 'characteristics[organism]', 'comment[data file]', and 'factor value[condition]'. The pipeline will use sdrf-pipelines to process this file and extract the necessary information for the workflow.\n\nFor more information about SDRF format, visit: https://github.com/bigbio/sdrf-pipelines", "format": "file-path", "exists": true, "schema": "assets/schema_input.json", "mimetype": "text/csv", - "pattern": "^\\S+\\.tsv$", + "pattern": "^\\S+\\.(tsv|sdrf)$", "fa_icon": "fas fa-file" }, "outdir": { diff --git a/subworkflows/local/utils_nfcore_mhcquant_pipeline/main.nf b/subworkflows/local/utils_nfcore_mhcquant_pipeline/main.nf index 22d001c6..63a792c1 100644 --- a/subworkflows/local/utils_nfcore_mhcquant_pipeline/main.nf +++ b/subworkflows/local/utils_nfcore_mhcquant_pipeline/main.nf @@ -16,6 +16,7 @@ include { completionSummary } from '../../nf-core/utils_nfcore_pipeline' include { imNotification } from '../../nf-core/utils_nfcore_pipeline' include { UTILS_NFCORE_PIPELINE } from '../../nf-core/utils_nfcore_pipeline' include { UTILS_NEXTFLOW_PIPELINE } from '../../nf-core/utils_nextflow_pipeline' +include { 
SDRF_CONVERT } from '../../../modules/local/sdrf_convert/main' /* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -31,7 +32,7 @@ workflow PIPELINE_INITIALISATION { monochrome_logs // boolean: Do not use coloured log outputs nextflow_cli_args // array: List of positional nextflow CLI args outdir // string: The output directory where the results will be saved - input // string: Path to input samplesheet + input // string: Path to input SDRF file main: @@ -64,18 +65,32 @@ ) // - // Create channel from input file provided through params.input + // Process SDRF file and convert to samplesheet format // + SDRF_CONVERT( file(input, checkIfExists: true) ) + ch_versions = Channel.empty() + ch_versions = ch_versions.mix(SDRF_CONVERT.out.versions) + // + // Create channel from converted samplesheet + // - Channel - .fromList(samplesheetToList(params.input, "${projectDir}/assets/schema_input.json")) - .map { meta, file -> [meta.subMap('sample','condition'), meta, file] } + SDRF_CONVERT.out.samplesheet + .splitCsv(header: true, sep: '\t') + .map { row -> + def meta = [ + id: row.ID, + sample: row.Sample, + condition: row.Condition + ] + [ meta.subMap('sample','condition'), meta, file(row.ReplicateFileName) ] + } .tap { ch_input } .groupTuple() // get number of files per sample-condition .map { group_meta, metas, files -> [ group_meta, files.size()] } .combine( ch_input, by:0 ) .map { group_meta, group_count, meta, file -> [meta + ['group_count':group_count, 'spectra':file.baseName.tokenize('.')[0], 'ext':getCustomExtension(file)], file] } .set { ch_samplesheet } // @@ -88,6 +102,7 @@ emit: samplesheet = ch_samplesheet fasta = ch_fasta + versions = ch_versions } /* @@ -182,6 +197,7 @@ def toolCitationText() { "MapAligner (Weisser et al.
2013)", "FeatureFinder (Weisser et al. 2017)", "MultiQC (Ewels et al. 2016)", + "SDRF-Pipelines (BigBio)", "." ].join(' ').trim() @@ -199,7 +215,8 @@ def toolBibliographyText() { "<li>Käll, L., Canterbury, J., Weston, J. et al. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat Methods 4, 923–925 (2007). doi: 10.1038/nmeth1113</li>", "<li>Hendrik Weisser, Sven Nahnsen, Jonas Grossmann et al. An Automated Pipeline for High-Throughput Label-Free Quantitative Proteomics. Journal of Proteome Research 2013 12 (4), 1628-1644. doi: 10.1021/pr300992u</li>", "<li>Hendrik Weisser and Jyoti S. Choudhary, Journal of Proteome Research 2017 16 (8), 2964-2974. doi: 10.1021/acs.jproteome.7b00248</li>", - "<li>Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 32(19), 3047–3048. doi: 10.1093/bioinformatics/btw354</li>" + "<li>Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 32(19), 3047–3048. doi: 10.1093/bioinformatics/btw354</li>", + "<li>SDRF-Pipelines: A set of pipelines for extracting, processing and validating SDRF files. https://github.com/bigbio/sdrf-pipelines</li>" ].join(' ').trim() return reference_text @@ -236,4 +253,3 @@ def methodsDescriptionText(mqc_methods_yaml) { return description_html.toString() } -
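Taken together, the conversion step introduced by this PR can be exercised end-to-end without Nextflow. The standard-library sketch below mirrors the column handling of `bin/sdrf_to_mhcquant.py` (the first `factor value[...]` column becomes the condition, `characteristics[organism part]` is the fallback, and rows without a `comment[data file]` are skipped); the sample names and paths are illustrative, and this is not the shipped implementation:

```python
import csv
from io import StringIO

def sdrf_to_samplesheet(sdrf_text):
    """Reduce an SDRF table to the four-column MHCquant samplesheet."""
    rows = []
    reader = csv.DictReader(StringIO(sdrf_text), delimiter="\t")
    for idx, row in enumerate(reader, start=1):
        # First factor value column wins; fall back to organism part.
        condition = next((v for k, v in row.items() if k.startswith("factor value[")), None)
        if condition is None:
            condition = row.get("characteristics[organism part]", "unknown")
        file_path = row.get("comment[data file]")
        if not file_path:
            continue  # rows without an MS file are skipped
        rows.append({
            "ID": idx,
            "Sample": row.get("source name", f"sample_{idx}"),
            "Condition": condition,
            "ReplicateFileName": file_path,
        })
    return rows

sdrf = (
    "source name\tcharacteristics[organism part]\tcomment[data file]\tfactor value[organism part]\n"
    "sample1_classI\tBrain\t/path/to/s1_techRep1.mzML\tBrain\n"
    "sample2_classI\tLiver\t/path/to/s2_techRep1.mzML\tLiver\n"
)
print(sdrf_to_samplesheet(sdrf)[0]["Condition"])  # Brain
```

The returned rows carry exactly the `ID`, `Sample`, `Condition`, and `ReplicateFileName` columns that the `SDRF_CONVERT` process emits for the downstream channel logic.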