
Some questions/problems concerning the program pipeline #1

@ulah

Description

Hi eyay,

There are some things in the posted pipeline that I do not understand or that do not work for me, and it would be great if you could tell me what I might be doing wrong.

  1. Prerequisite "GenomeFetch.py": Looking briefly at the script, it does not seem to support hg38, which you used for the mapping. What do you need this script for?

  2. Removing UMIs from reads: What is the difference between your script (remove_umi_from_fastq_v4.py) and the one from the umi_tools package (umi_tools extract)? The parallelization gives me no speed advantage on this data. Also, your output is an uncompressed .fq file; why not compress it, as umi_tools extract supports?

  3. It seems that you also use multi-mapped reads in later steps of the analysis, don't you? If so, why can you use them, and why didn't you change STAR's outFilterMultimapNmax parameter, given that everything with more than 20 multi-hits will be flagged as unmapped?
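Regarding question 2: as I understand it, both tools simply move the first few bases (the UMI) from the sequence into the read name, the way `umi_tools extract` does. A minimal sketch of that step (the 8 nt UMI length is my assumption, not necessarily what the pipeline uses, and I assume your script appends the UMI to the name in the same way):

```python
def extract_umi(header, seq, qual, umi_len=8):
    """Move the first umi_len bases of a FASTQ record into the read name.

    umi_len=8 is an assumed UMI length for illustration; the UMI is
    appended to the read name before the first whitespace, as
    `umi_tools extract` does.
    """
    umi = seq[:umi_len]
    name, _, rest = header.partition(" ")
    new_header = name + "_" + umi + ((" " + rest) if rest else "")
    return new_header, seq[umi_len:], qual[umi_len:]
```

Writing the resulting records through `gzip.open("out.fq.gz", "wt")` would give the compressed output I was asking about.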

3-4) STAR mapping and clipped-read removal: I'm missing a lot of file-name conversions here. The STAR command produces generic SAM output files, but the clipped-read-removal step requires BAM files split into unique and multi-mappings under a very rigid naming convention. Although I could create files that satisfy the naming requirements (I split them based on the NH field, then compress and sort them), I now get an error I do not understand:

```
python /usr/local/smallseq/src/remove_softclipped_reads_parallelized_v2.py -i 01_Mapping -o 01_Mapping_clipped -p 6 -n 0 -t 3
Traceback (most recent call last):
File "/usr/local/smallseq/src/remove_softclipped_reads_parallelized_v2.py", line 93, in
Parallel(n_jobs=o.numCPU)(delayed(remove_clipped_reads)(sample) for sample in samplenames_with_fullpath)
File "/usr/local/lib/python2.7/dist-packages/joblib/parallel.py", line 768, in call
self.retrieve()
File "/usr/local/lib/python2.7/dist-packages/joblib/parallel.py", line 719, in retrieve
raise exception
joblib.my_exceptions.JoblibSamtoolsError: JoblibSamtoolsError


Multiprocessing exception:
...........................................................................
/usr/local/smallseq/src/remove_softclipped_reads_parallelized_v2.py in ()
88 samplenames_with_fullpath.append(multi_bam)
89
90 path_outbam = os.path.join(o.outstardir, sample)
91 if not os.path.exists(path_outbam): safe_mkdir(path_outbam)
92
---> 93 Parallel(n_jobs=o.numCPU)(delayed(remove_clipped_reads)(sample) for sample in samplenames_with_fullpath)
94
95
96
97

...........................................................................
/usr/local/lib/python2.7/dist-packages/joblib/parallel.py in call(self=Parallel(n_jobs=6), iterable=<generator object >)
763 if pre_dispatch == "all" or n_jobs == 1:
764 # The iterable was consumed all at once by the above for loop.
765 # No need to wait for async callbacks to trigger to
766 # consumption.
767 self._iterating = False
--> 768 self.retrieve()
self.retrieve = <bound method Parallel.retrieve of Parallel(n_jobs=6)>
769 # Make sure that we get a last message telling us we are done
770 elapsed_time = time.time() - self._start_time
771 self._print('Done %3i out of %3i | elapsed: %s finished',
772 (len(self._output), len(self._output),


Sub-process traceback:

SamtoolsError Fri Nov 25 10:47:38 2016
PID: 12961 Python 2.7.6: /usr/bin/python
...........................................................................
/usr/local/lib/python2.7/dist-packages/joblib/parallel.py in call(self=<joblib.parallel.BatchedCalls object>)
126 def init(self, iterator_slice):
127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
130 def call(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
func =
args = ('01_Mapping/SRR3495785/SRR3495785_unique.bam',)
kwargs = {}
self.items = [(, ('01_Mapping/SRR3495785/SRR3495785_unique.bam',), {})]
132
133 def len(self):
134 return self._size
135

...........................................................................
/usr/local/smallseq/src/remove_softclipped_reads_parallelized_v2.py in remove_clipped_reads(inbam='01_Mapping/SRR3495785/SRR3495785_unique.bam')
56
57 outbam.write(read)
58 reads_after_clip_removal += 1
59 outbam.close()
60 #sort and index the final bam file
---> 61 pysam.sort(bam_out, bam_out_sorted)
bam_out = '01_Mapping_clipped/SRR3495785/SRR3495785_unique_tmp.bam'
bam_out_sorted = '01_Mapping_clipped/SRR3495785/SRR3495785_unique'
62 pysam.index(bam_out_sorted+".bam", template=inbamPysamObj)
63
64 print reads_after_clip_removal, total_reads
65 print "%s percentage of removed reads = %.2f" %(p[-1], 100*(1 -reads_after_clip_removal/total_reads))

...........................................................................
/usr/local/lib/python2.7/dist-packages/pysam/utils.py in call(self=<pysam.utils.PysamDispatcher object>, *args=('01_Mapping_clipped/SRR3495785/SRR3495785_unique_tmp.bam', '01_Mapping_clipped/SRR3495785/SRR3495785_unique'), **kwargs={})
70 "%s returned with error %i: "
71 "stdout=%s, stderr=%s" %
72 (self.collection,
73 retval,
74 stdout,
---> 75 stderr))
stderr = '[bam_sort] Use -T PREFIX / -o FILE to specify te... Reference sequence FASTA FILE [null]\n'
76
77 self.stderr = stderr
78
79 # call parser for stdout:

SamtoolsError: 'samtools returned with error 1: stdout=, stderr=[bam_sort] Use -T PREFIX / -o FILE to specify temporary and final output files\nUsage: samtools sort [options...] [in.bam]\nOptions:\n -l INT Set compression level, from 0 (uncompressed) to 9 (best)\n -m INT Set maximum memory per thread; suffix K/M/G recognized [768M]\n -n Sort by read name\n -o FILE Write final output to FILE rather than standard output\n -T PREFIX Write temporary files to PREFIX.nnnn.bam\n -@, --threads INT\n Set number of sorting and compression threads [1]\n --input-fmt-option OPT[=VAL]\n Specify a single input file format option in the form\n of OPTION or OPTION=VALUE\n -O, --output-fmt FORMAT[,OPT[=VAL]]...\n Specify output format (SAM, BAM, CRAM)\n --output-fmt-option OPT[=VAL]\n Specify a single output file format option in the form\n of OPTION or OPTION=VALUE\n --reference FILE\n Reference sequence FASTA FILE [null]\n'


```
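For reference, the unique/multi split from step 3-4) can be sketched like this over SAM text (the NH:i tag holds the number of reported alignments; the real files are BAM, so this is only an illustration of what I did):

```python
def split_by_nh(sam_lines):
    """Split SAM records into unique (NH:i:1) and multi-mapped (NH > 1)
    lists. Header lines (@...) are copied to both outputs so each file
    stays a valid SAM. Plain-text sketch of the BAM split I performed."""
    unique, multi = [], []
    for line in sam_lines:
        if line.startswith("@"):
            unique.append(line)
            multi.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        nh = 1  # default if the NH tag is absent
        for tag in fields[11:]:
            if tag.startswith("NH:i:"):
                nh = int(tag[5:])
                break
        (unique if nh == 1 else multi).append(line)
    return unique, multi
```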

Why is samtools sort called again, and why is it asking for a reference FASTA when a header is supplied?
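From the stderr it looks to me as if the installed samtools (>= 1.3) no longer accepts the old positional output prefix that pysam.sort(bam_out, bam_out_sorted) passes through, but this is my assumption about the cause, not a confirmed fix. Building the command with the newer -T/-o convention would look roughly like this:

```python
def samtools_sort_cmd(in_bam, out_bam, tmp_prefix, threads=1):
    """Build a samtools >= 1.3 sort command line (-T/-o syntax).

    The old convention, `samtools sort in.bam out_prefix`, is what the
    failing pysam.sort(bam_out, bam_out_sorted) call produces.
    """
    return [
        "samtools", "sort",
        "-T", tmp_prefix,    # prefix for temporary files
        "-@", str(threads),  # sorting/compression threads
        "-o", out_bam,       # final output file
        in_bam,
    ]
```

With a newer pysam, I believe the equivalent would be pysam.sort("-T", tmp_prefix, "-o", out_bam, in_bam), since pysam forwards its arguments to samtools.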

Any help is much appreciated.
Best regards
