Create fastq files for MCSMRT
rehrlich/ccs_smrt_pipe
ccs_smrt_pipe uses PacBio .bax.h5 files to create fastq input files for MCSMRT.

Required software:
Smrtanalysis version 2.3

Installation:
Modify the line:
    SMRT_ROOT="/opt/smrtanalysis"
in the file ccs_job_maker.sh so that, instead of /opt/smrtanalysis, it has the location of smrtanalysis on your computer. If you type ls followed by this location, you should see something like:
    admin common current install smrtcmds tmpdir userdata
If you can't find it, ask the person who installed smrtanalysis where it is.
Optional: Modify the header of ccs_job_maker.sh so you can submit it to SGE.

Creating the barcode fasta file:
The set of barcodes you purchased should have a fasta file with their sequences. Several can be found here:
https://github.com/PacificBiosciences/Bioinformatics-Training/tree/master/barcoding
The order of the barcodes in this file doesn't matter. Extra barcodes are also fine.

Batch input mode:
This runs many SMRT pipe ccs jobs. Jobs are defined by a tab-separated file. The first line of the file must include the following column headings, in any order:
    job_name barcode_file video_path forward_barcode reverse_barcode outdir sample_name full_passes pred_accuracy
Each line in the file is one sample and each column is a parameter for a ccs job. Each job must include at least two samples. The parameters are the same as the single sample parameters except for job name, sample name and the barcodes.

When multiple samples from the same SMRT cell can be demultiplexed using the same parameters, they should be given the same job name. Otherwise, they must be given different job names. Lines with the same job name must contain the same values for all parameters except for the sample name and barcodes.

All barcodes for a single job name must be unique. For example, if job1 has sample A with forward_barcode=x1 and reverse_barcode=x2, then sample B from job1 cannot have x1 or x2 for either of its barcodes. A different job can have a sample with barcodes x1 and x2.
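The batch file rules above can be checked before any job is submitted. The following is a minimal sketch in Python and is not part of ccs_smrt_pipe: the function name check_batch_text is hypothetical, as are the values in the example rows (barcodes.fasta, the barcode names x3 and x4, and the chosen full_passes and pred_accuracy). It checks only the rules stated so far: the required column headings, at least two samples per job, identical shared parameters within a job, and unique barcodes within a job.

```python
import csv
import io
from collections import defaultdict

# Column headings required by the batch input file (any order).
COLUMNS = ["job_name", "barcode_file", "video_path", "forward_barcode",
           "reverse_barcode", "outdir", "sample_name", "full_passes",
           "pred_accuracy"]
# Columns that may differ between samples of the same job.
PER_SAMPLE = {"sample_name", "forward_barcode", "reverse_barcode"}
SHARED = [c for c in COLUMNS if c not in PER_SAMPLE]


def check_batch_text(text):
    """Return a list of problems found in tab-separated batch file text."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    missing = set(COLUMNS) - set(reader.fieldnames or [])
    if missing:
        return ["missing columns: " + ", ".join(sorted(missing))]

    # Group the sample lines by job name.
    jobs = defaultdict(list)
    for row in reader:
        jobs[row["job_name"]].append(row)

    problems = []
    for name, rows in jobs.items():
        if len(rows) < 2:
            problems.append(f"{name}: needs at least two samples")
        # Every line of a job must agree on all shared parameters.
        if len({tuple(row[c] for c in SHARED) for row in rows}) > 1:
            problems.append(f"{name}: shared parameters differ between samples")
        # No barcode may appear twice within one job, in either column.
        barcodes = [row[c] for row in rows
                    for c in ("forward_barcode", "reverse_barcode")]
        if len(barcodes) != len(set(barcodes)):
            problems.append(f"{name}: a barcode is used more than once")
    return problems


# Hypothetical two-sample job; all values are illustrative only.
video = "/data/pacbio/rawdata/run_name/A01_1/Analysis_Results"
example = "\n".join("\t".join(fields) for fields in [
    COLUMNS,
    ["job1", "barcodes.fasta", video, "x1", "x2", "smrt_pipe_ccs", "A", "5", "90"],
    ["job1", "barcodes.fasta", video, "x3", "x4", "smrt_pipe_ccs", "B", "5", "90"],
])
print(check_batch_text(example))
```

An empty list means none of these checks failed. The sketch does not check that the paths exist or that the barcode names appear in the barcode fasta file.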
All jobs must have the same outdir. All sample names must be unique.

Batch input file:
Usage: ccs_job_maker.sh batch [-h] -i INPUT_FILE

Optional arguments:
    -h, --help    show this help message and exit

Required arguments:
    -i INPUT_FILE, --input_file INPUT_FILE
                  batch input file

Detailed explanation of arguments:
OUTDIR - This is a path to a directory that does not currently exist. The default option is to create a directory called smrt_pipe_ccs in your current directory. This argument cannot include whitespace.
JOB_NAME - A nickname for the analysis.
FULL_PASSES - PacBio describes this as "The minimum number of full-length passes over the insert DNA for the read to be included." It must be a whole number between zero and ten, inclusive.
PRED_ACCURACY - PacBio describes this as "The minimum predicted accuracy (in %) of the Reads of Insert emitted." It must be a whole number between 70 and 100, inclusive.
BARCODE_FILE - This is the path to a fasta file with the barcodes used for the experiment. This argument cannot include whitespace.
VIDEO_PATH - A directory containing the *.bax.h5 data files. In the directory with the raw data from a PacBio cell, this is the Analysis_Results folder. All *.bax.h5 files in the directory will be used. On my computer this path would be something like /data/pacbio/rawdata/run_name/A01_1/Analysis_Results. This argument cannot include whitespace.
SAMPLE_NAME - A name for the sample. All names in the file must be unique. Sample names cannot end with '_no_ccs_count', and 'all_samples' is not a valid sample name. This will be used to name the final fastq file.
FORWARD_BARCODE - The barcode name must be a valid fasta header from the barcode fasta file. All barcodes for a job must be unique.
REVERSE_BARCODE - The other barcode for the sample.

Single sample analysis:
This program only works with samples that have barcodes. If your samples do not have barcodes, you will need to use smrt portal or smrt pipe to do the ccs job.
Smrt pipe requires at least two pairs of barcodes. If you are only interested in one sample, you can add a second pair of barcodes to the barcode fasta file and use them to create a second sample. These sequences should be the same length as the original barcodes and should not be similar to them.

Detailed explanation of output directory:
OUTDIR/fastq - This has the fastq output files. The files are named after the samples. The header lines are the headers created by smrtpipe with ';ccs=X;barcodelabel=Y;' appended to the end. X is the number of ccs passes that were used to generate that entry in the file. Y is the sample name.
OUTDIR/job_name - These directories have all of the files created by smrt pipe.

Command line version:
This program can be run using command line arguments instead of an input file. This is generally less efficient. Instructions are in README_command_line.

All code is under a GPL license except the function _get_ccs_passes from the file count_passes.py, which was modified from https://github.com/PacificBiosciences/Bioinformatics-Training/raw/master/scripts/ccs_passes.py, and the settings_template.xml, which was modified from a settings.xml file created using PacBio's SMRT portal.
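
To illustrate the fastq header format described under "Detailed explanation of output directory", here is a minimal Python sketch that reads the ';ccs=X;barcodelabel=Y;' suffix back out of a header line. It is not part of ccs_smrt_pipe. The function names parse_header and passes_per_read are hypothetical, the read name in the usage line is made up, and the sketch assumes that X is written as a non-negative whole number, that sample names contain no semicolons, and that every fastq record spans exactly four lines.

```python
import re

# Suffix appended to each smrtpipe header: ';ccs=X;barcodelabel=Y;'
# Assumption: X is a non-negative whole number and Y contains no ';'.
SUFFIX = re.compile(r";ccs=(\d+);barcodelabel=([^;]+);$")


def parse_header(header):
    """Split a fastq header into (original_header, ccs_passes, sample_name)."""
    header = header.rstrip("\n")
    match = SUFFIX.search(header)
    if match is None:
        raise ValueError(f"no ccs/barcodelabel suffix in header: {header!r}")
    original = header[:match.start()]
    if original.startswith("@"):
        # Drop the fastq record marker, keep the rest of the header as-is.
        original = original[1:]
    return original, int(match.group(1)), match.group(2)


def passes_per_read(fastq_lines):
    """Yield (ccs_passes, sample_name) for each four-line fastq record."""
    for index, line in enumerate(fastq_lines):
        # Only the first line of each four-line record is a header;
        # quality lines may also start with '@', so position is used instead.
        if index % 4 == 0:
            _, passes, sample = parse_header(line)
            yield passes, sample


# Hypothetical header; real read names are produced by smrtpipe.
print(parse_header("@example_read;ccs=5;barcodelabel=sampleA;"))
```

Passing the lines of a file from OUTDIR/fastq to passes_per_read gives the number of ccs passes behind each read, which could be tallied to see how the passes are distributed within a sample.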