Hardik Shah edited this page Dec 2, 2016 · 1 revision

Welcome to the pbBarcoding wiki!

This page covers how to de-convolute imported barcoded datasets on the new PacBio portal, a.k.a. SMRTLink. I wrote this tutorial/tip wiki because I realized that there is no clear way to de-multiplex samples and perform secondary analysis on each de-multiplexed sample from the UI. These tools may be available via the command line, but I personally favor giving our lab folks more control via the UI. In my struggle to find appropriate solutions I received a lot of help from some PacBio geniuses, namely Lance Helper, Roberto Lleras and Richard Hall. So kudos to them !!

Importing the barcodes themselves into SMRTLink

  • The full 384 PacBio barcodes are available here
  • But when you pool samples using an arbitrary subset of those 384 barcodes, you still have to import that subset as a BarcodeSet into SMRTLink. This is a very important step.
  • Once you have a fasta file of your pooled barcodes, run the following steps. (Note: please make sure that $SMRT_ROOT is in your path.)

STEP1 : create a fasta file containing only the barcodes used to pool your N samples

STEP2 : dataset create --type BarcodeSet --generateIndices --name name_your_barcode_pool name_your_barcode_pool.barcodeset.xml /path/to/your/pooled/barcodes/fasta

STEP3 : pbservice import-dataset --host smrtlink_url --port 9091 name_your_barcode_pool.barcodeset.xml
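As a concrete sketch of STEP1, here is a hypothetical 4-sample pool file. The barcode names and sequences below are dummies for illustration; take the real records from PacBio's full 384-barcode fasta.

```shell
# STEP1 sketch: a pool fasta holding only the barcodes actually used.
# Names and sequences are dummies -- copy the real records from the
# full 384-barcode fasta. Record order matters (see the next section).
cat > my_pool.barcodes.fasta <<'EOF'
>bc7
ACGTACGTACGTACGT
>bc11
TGCATGCATGCATGCA
>bc43
GATCGATCGATCGATC
>bc21
CTAGCTAGCTAGCTAG
EOF

# Sanity check: four records, one per pooled sample
grep -c '^>' my_pool.barcodes.fasta
```

This fasta is then the input to the `dataset create` and `pbservice import-dataset` commands in STEP2 and STEP3.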


De-convoluting your data based on their barcodes

  • In SMRTLink UI, go to Data Management, look for your pooled dataset and click on Analyze
  • Give the job a name and choose Barcoding in analysis and select name_your_barcode_pool from the list of barcodesets available and click Submit

STEP4 : run the barcoding job on SMRTLink UI

Once this job finishes, you'll have a barcode-de-convoluted ( if such a word exists :) ) dataset imported automatically. It will be called NAME (Barcoded), where NAME is the name of your original pooled dataset. There will also be a Barcode Report showing how many reads were assigned to each barcode.

IMPORTANT: I realized (and you will too) that in the barcode report, barcodes are simply listed as 0-indexed numbers. E.g. if you used, let's say, these 4 barcodes 7, 11, 43, 21 to pool 4 samples, the internal codebase (and the UI) treats them as a 0-indexed list: 0, 1, 2, 3 ! While doing any further work, keep in mind that barcode 0 is actually the first barcode in the fasta file, and so on ...
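To see the index-to-name mapping SMRTLink will use, you can simply number the fasta headers yourself. A sketch, using a hypothetical pool fasta with dummy names and sequences:

```shell
# A hypothetical 4-barcode pool (dummy names/sequences; order matters!)
cat > my_pool.barcodes.fasta <<'EOF'
>bc7
ACGTACGTACGTACGT
>bc11
TGCATGCATGCATGCA
>bc43
GATCGATCGATCGATC
>bc21
CTAGCTAGCTAGCTAG
EOF

# Print "index <TAB> barcode-name" in fasta order. Index 0 is the first
# record in the file -- exactly the numbering used by the barcode report
# and by the bc==[i,i] filters later on.
grep '^>' my_pool.barcodes.fasta | sed 's/^>//' | awk '{print NR-1 "\t" $0}'
```

So in this example, barcode 7 is index 0, barcode 11 is index 1, and so on.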

The reason for doing this exercise is that the resultant bam file from the Barcoding job is now barcoded !

One important variable to keep in mind here is the full path to the resultant subreadset.xml file, which you will find in the jobs directory as /path/to/jobs/000/000NNN/tasks/pbcoretools.tasks.bam2bam_barcode-0/subreads_barcoded.subreadset.xml. This is the actual subreadset xml file of the barcoded subreads. I'll call it pool_barcoded_subreadset for reference.


Importing each barcoded sample individually

In the world of PacBio, you are using either symmetric or asymmetric barcodes, i.e. the same barcode on both ends of the template or different ones. Here I am only playing with symmetric barcodes, which are usually passed as an array: [0,0], [1,1], ... , [bcN,bcN]

The best way to explain this portion is by way of examples:

Let's say your barcode pool is 7, 11, 43, 21; in SMRTLink they will be designated and addressed as 0, 1, 2, 3, as I explained in the previous section. So now the following list of commands has to be performed 4 times ( or N times, where N = no. of samples pooled ).

STEP5: (repeated N times , N= no. of samples )

(a) dataset create --name SampleName SampleName.subreadset.xml pool_barcoded_subreadset.xml : this changes the name of the dataset in the xml and writes a new subreadset.xml file

(b) dataset filter SampleName.subreadset.xml SampleName.filtered.subreadset.xml 'bc==[0,0]' : this filters for the 1st barcode (index 0), which in this example is barcode 7, and creates another subreadset.xml

Now things get more interesting. In the new schema under SMRTLink, everything is tracked by unique uuids. So when the original dataset got imported, it was assigned a unique uuid. The barcoded dataset produced by the barcoding job is also assigned a new uuid.

At this point the uuid in SampleName.filtered.subreadset.xml and pool_barcoded_subreadset.xml is the same, so it has to be changed:

(c) dataset newuuid SampleName.filtered.subreadset.xml

(d) pbservice import-dataset --host smrtlink_url --port 9091 SampleName.filtered.subreadset.xml

So basically, commands (a), (b), (c) and (d) have to be repeated for each barcode; make sure to change bc==[0,0] to reflect the 0-indexed position of each barcode in the original list.
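Since (a)-(d) differ only in the sample name and the barcode index, a small loop can generate all N command sets. This is a sketch with hypothetical sample names and a placeholder subreadset path; it writes the commands to a file for review rather than running them, and it assumes the rename in (a) is done via the dataset create subcommand:

```shell
# Hypothetical sample names, in the same order as their barcodes appear
# in the pool fasta, so sample i gets the filter bc==[i,i].
samples="SampleA SampleB SampleC SampleD"
pool=pool_barcoded_subreadset.xml   # placeholder path to the barcoded subreadset

i=0
: > demux_cmds.sh                   # start with an empty script
for s in $samples; do
    {
        echo "dataset create --name $s $s.subreadset.xml $pool"
        echo "dataset filter $s.subreadset.xml $s.filtered.subreadset.xml 'bc==[$i,$i]'"
        echo "dataset newuuid $s.filtered.subreadset.xml"
        echo "pbservice import-dataset --host smrtlink_url --port 9091 $s.filtered.subreadset.xml"
    } >> demux_cmds.sh
    i=$((i + 1))
done

cat demux_cmds.sh   # review, then run with: sh demux_cmds.sh
```

Writing the commands to a script first makes it easy to eyeball the bc==[i,i] mapping against your pool fasta before anything touches the server.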

At this point, each of the samples from the pooled subreadset has been imported into SMRTLink and is accessible via Data Management. You can now select individual samples and perform secondary analyses on each one !

CAVEATS: I still need to explore changing the BioSample tags in the xml files so they reflect each sample's name rather than the original pooled subreadset name.