Skip to content

rehrlich/ccs_smrt_pipe

Repository files navigation

ccs_smrt_pipe uses PacBio .bax.h5 files to create fastq input files for MCSMRT.


Required software:
Smrtanalysis version 2.3


Installation:
Modify the line:
SMRT_ROOT="/opt/smrtanalysis"
in the file ccs_job_maker.sh so instead of /opt/smrtanalysis it has the location of smrtanalysis on your computer.  If you type ls and this location you should see something like:
admin  common  current  install  smrtcmds  tmpdir  userdata
If you can't find it, ask the person who installed smrtanalysis where it is.
Optional:  Modify the header on ccs_job_maker.sh so you can submit it to sge.


Creating the barcode fasta file:
The set of barcodes you purchased should have a fasta file with their sequences.  Several can be found here:  https://github.com/PacificBiosciences/Bioinformatics-Training/tree/master/barcoding
The order of the barcodes in this file doesn't matter.  Extra barcodes are also fine.


Batch input mode:
This runs many SMRT pipe ccs jobs.  Jobs are defined by a tab separated file.  The first line of the file must include the following column headings in any order:  job_name	barcode_file	video_path	forward_barcode	reverse_barcode	outdir	sample_name	full_passes	pred_accuracy
Each line in the file is one sample and each column is a parameter for a ccs job.  Each job must include at least two samples.  The parameters are the same as the single sample parameters except for job name, sample name and the barcodes.  When multiple samples from the same SMRT cell can be demultiplexed using the same parameters, they should be given the same job name.  Otherwise, they must be given different names.  Lines with the same job name must contain the same values for all parameters except for the sample name and barcodes.
All barcodes for a single job name should be unique.  For example, if job1 has sample A with forward_barcode=x1 and reverse_barcode=x2 then sample B from job1 cannot have x1 for either of its barcodes.  A different job can have a sample with barcodes x1 and x2.
All jobs must have the same outidr.
All sample names must be unique.


Batch input file:
Usage: ccs_job_maker.sh batch [-h] -i INPUT_FILE

Optional arguments:
  -h, --help            show this help message and exit

Required arguments:
  -i INPUT_FILE, --input_file INPUT_FILE
                        batch input file


Detailed explanation of arguments:
OUTDIR - This is a path to a directory that does not currently exist.  The default option is to create a directory called smrt_pipe_ccs in your current directory.  This argument cannot include whitespace.
JOB_NAME - A nickname for the analysis.
FULL_PASSES - PacBio describes this as "The minimum number of full-length passes over the insert DNA for the read to be included."  It must be a whole number between zero and ten, inclusive.
PRED_ACCURACY - PacBio describes this as "The minimum predicted accuracy (in %) of the Reads of Insert emitted."  It must be a whole number between 70 and 100, inclusive.
BARCODE_FILE - This is the path to a fasta file with the barcodes used for the experiment.  This argument cannot include whitespace.
VIDEO_PATH - A directory containing the *.bax.h5 data files.  In the directory with the raw data from a pacbio cell, this the Analysis_Results folder.  All *.bax.h5 files in the directory will be used.  On my computer this path would be something like /data/pacbio/rawdata/run_name/A01_1/Analysis_Results.  This argument cannot include whitespace.
SAMPLE_NAME - A name for the sample.  All names in the file must be unique.  Sample names cannot end with '_no_ccs_count', and 'all_samples' is not a valid sample name.  This will be use to name the final fastq file.
FORWARD_BARCODE - The barcode name must be valid fasta headers from the barcode fasta file.  All barcodes for a job must be unique.
REVERSE_BARCODE - The other barcode for the sample.


Single sample analysis:
This program only works with samples with barcodes.  If your samples do not have barcodes, you will need to use smrt portal or smrt pipe to do the ccs job.
Smrt pipe requires at least two pairs of barcodes.  If you are only interested in one sample, you can add a second pair of barcodes to the barcode fasta file and use them to create a second sample.  These sequences should be the same length as the original barcodes.  The sequences should not be similar to the original barcodes.


Detailed explanation of output directory:
OUTDIR/fastq - This has the fastq output files.  The files are names are the sample names.  The header lines are the headers created by smrtpipe with ';ccs=X;barcodelabel=Y;' appended to the end.  X is the number of ccs passes that were used to generate that entry in the file.  Y is the sample name.
OUTDIR/job_name - These directories have all of the files created by smrt pipe.


Command line version:
This program can be run using command line arguments instead of an input file.  This is generally less efficient.  Instructions are in README_command_line.

All code is under a GPL license except the function _get_ccs_passes from the file count_passes.py which was modified from https://github.com/PacificBiosciences/Bioinformatics-Training/raw/master/scripts/ccs_passes.py and the settings_template.xml which was modified from a settings.xml file created using PacBio's SMRT portal.

About

Create fastq files for MCSMRT

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published