SLAPCHOP

SLAPCHOP.py parses Illumina reads using patterns to extract barcodes.

By using fuzzy regular expressions we can chop robustly barcodes from indeterminate positions, filter the results based on sequence or match properties, and reassemble a fastq record from the results.

Available as a singularity containter! So you if you have Singularity installed you can just use it (without worrying about dependencies) with: singularity run shub://darachm/slapchop:latest -h (to download and show the argument help message for example). Then you use it like singularity run shub://darachm/slapchop:latest whatever.fastq arguments added

More completely, this tool is a python script. You give it a FASTQ(Z) file and some operations to do, and it'll do the following:

- read chunks of Illumina-format sequencing reads
- apply a series of operations:
    - match fuzzy regular expression to original sequence or previous
        capture groups
    - extract capture groups and start next operation
- apply pythonic filters (pass/fail) on sequence or quality properties 
    (like average quality or group length)
- apply pythonic constructors to construct new FASTQ read from the capture
    groups (so ID plus the last four bases of the UMI plus length of whatever)
- write out these reads to new files of pass and fail

For tuning/debugging/designing it has some verbosity modes to spill the gory details of each operation in stats files, and should still have some memory profiling functionality to debug memory leaks (fixed that one).

For a very very verbose example of a debugging run:

./slapchop.py \
    input.fastqz -z \
    output_basename \
    --bite-size 10 --processes 3  \
    --write-report --limit 10000 \
    -o "Sample:  input   > (?P<sample>[ATCG]{5})(?P<fixed1>GTCCACGAGGTC){e<=1}(?P<rest>TCT.*){e<=1}" \
    -o "Strain:  rest    > (?P<tag>TCT){e<=1}(?P<strain>[ATCG]{10,26})CGTACGCTGCAGGTCGAC"  \
    -o "UMITail: rest    > (?P<fixed2>CGTACGCTGCAGGTC)(?<UMItail>GAC[ATCG]G[ATCG]A[ATCG]G[ATCG]G[ATCG]G[ATCG]GAT){s<=2}"  \
    -o "UMI:     UMItail > (GAC(?P<umi1>[ATCG])G(?<umi2>[ATCG])A(?<umi3>[ATCG])G(?<umi4>[ATCG])G(?<umi5>[ATCG])G(?<umi6>[ATCG])G){e<=2}"  \
    --output-seq "strain" \
    --output-id "input.id+'_umi='+umi1.seq+umi2.seq+umi3.seq+ \
        umi4.seq+umi5.seq+umi6.seq+'_sample='+sample.seq" \
    --filter "sample_length == 5 and rest_start >= 16 and ( min(strain.letter_annotations['phred_quality']) >= 30 )"\
    --verbose --verbose --verbose

That invocation:

- Takes records from the `input.fastq`
- Starts three processes that each take bites of 10 records
- Applies the four operations to cut up the read
- Writes the full detailed report including json reports for 
    each read, so we limit it to the first 10,000 bytes
    of the file (about 50 records). This is for debugging.
- Filters the records on having a `sample` barcode of 5 bases 
    and having the `rest` sequence match starting at least past
    index 16 (so base 15 in english).
- Re-assembles the records that pass this filter, making the ID
    of the fastq record having the original ID plus a UMI 
    sequence and the sample barcode, then the sequence is just
    the match to the strain barcode context. This is suitable for
    feeding into `bwa` for example.
- We've got three levels of verbosity, so a per-record verbosity
    for debugging purposes.

Note that the output formatting is a bit funny. This is directly evaluated (because security is what?) on BioPython SequenceRecords, so you need to specify just the name of the capture group(s) for the outputs so it can access the .seq and qualities. For the ID, etc, you can access .seq or .id.

Then if we like our thresholds we'd re-run, and drop the --limit and --write-report flags. This will turn records like this:

@NB501157:100:H5J5LBGX2:1:11101:10000:6068 1:N:0:
CTACTGTCCACGAGGTCTCTGCAGATAATACACTGTCACCCGTACGCTGCAGGTCGACCGTAGGAGGGAGATGTG
+
AAAAAEEEE/AEE<EEEEEEEEAEEAEEAEEEEE/EEE/EEEEEEEEE/EEEEEEEEEEEEE/EEEEEEEEEEEE

into records like this:

@NB501157:100:H5J5LBGX2:1:11101:10000:6068_umi=CTGAGA_sample=CTACT
GCAGATAATACACTGTCACC
+
EEAEEAEEAEEEEE/EEE/E

The sample barcode is the first five, the strain barcode starts after the TCT, and the UMI is interspersed downstream. This is modified yeast BarSeq, btw.

This script depends strongly upon (uses) the work of regex and Biopython. Thanks! Check them out...

Name		Name	Last commit message	Last commit date
Latest commit History 99 Commits
example-data		example-data
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
Singularity.v0.3.0-alpha		Singularity.v0.3.0-alpha
Singularity.v0.5.0-alpha		Singularity.v0.5.0-alpha
Singularity.v0.5.1-alpha		Singularity.v0.5.1-alpha
slapchop.py		slapchop.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SLAPCHOP

About

Uh oh!

Releases 4

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

License

darachm/slapchop

Folders and files

Latest commit

History

Repository files navigation

SLAPCHOP

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 4

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages