BRB‑seq Workflow

1. Demultiplexing and Species-Level Reorganization of Multi-Species Libraries

1.1 Demultiplex BRBseq library coming from mutiple species

As multiple species were multiplexed into a single library, we first need to demultiplex the library per sample and then re-concatenate the samples coming from the same species.

Each sample is identified by a 14 bp inline barcode at the start of Read 1 (R1), followed by a 16bp Unique Molecular Identifier (UMI). Read 2 (R2) contains the actual RNA sequence.

To demutiplex the library, I am using cutadapt. Cutadapt requires a FASTA file to read barcode sequences, e.g:

>Pluc_33_new
CCTTGCCAGAGTGG
>Pluc_34_new
CATTCCTCTAGCCG
>Pluc_35_new
ACTTGAGCAACCGG

You can use this command to help convert the CSV barcode file into a FASTA file

awk -F',' '{print ">"$1"\n"$2}' barcodes.csv > barcodes.fasta

Then use the cutadapt command with the follwing paramters:

cutadapt -j 4 --no-indels -e 0.1 --action=none \
  -g ^file:barcodes.fasta \
  -o demux/{name}_R1.fastq.gz -p demux/{name}_R2.fastq.gz \
  --untrimmed-output demux/unknown_R1.fastq.gz \
  --untrimmed-paired-output demux/unknown_R2.fastq.gz \
  input_R1.fastq.gz input_R2.fastq.gz

Parameters:

^: anchors the barcode to the start of R1.
--action=none: Do not trim barcodes; just demultiplex
--no-indels: for stricter matching (no gaps)
e 0.1 allows up to 1 mismatch for 12 bp barcodes
Output goes to demux/{sample}_R1/R2.fastq.gz
Unassigned reads go to unknown_R1/R2.fastq.gz
j 2 Use 2 threads for faster processing

1.2 Concatenate Samples by Species

After demultiplexing, individual samples can be grouped by species using a CSV file which contain the species name in the first column and the sample name in the second column. E.g:

Pluchei,Pluc_33_new
Pluchei,Pluc_34_new
Pluchei,Pluc_35_new
Hogna,Hogna 54
Hogna,Hogna 55
Hogna,Hogna 58
...

You can use the provided script concat_by_species.sh to concatenate the demultiplexed FASTQ files by species as follow:

./concat_by_species.sh <CSV_FILE> <DEMUX_DIR> <OUTPUT_DIR

CSV_FILE: CSV file mapping species (1st column) to sample names (2nd column)
DEMUX_DIR: Path to directory containig the demultiplexed FASTQ files
OUTPUT_DIR: Output directory for concatenated species files

2. BRB-seq Snakemake Workflow

This Snakemake pipeline processes BRB-seq data through several stages, including quality control with FastQC, species-specific genome indexing, read alignment using STARsolo with integrated barcode error correction, BAM file concatenation and sorting, gene annotation extension with GeneExt, re-alignment against the extended annotation, and final count matrix generation.

2.1 Prepare Inputs

Input tables: Before running the workflow, you must prepare two files: the BRBseq Library Table and the Species Table.
- 1. BRBseq Library Table (config["library_table"]). A CSV table containing information for all your samples. Each row correspond to one sample, and the following fields are required:
    - barcode: Barcode of each sample.
    - library: Library name. ⚠️ WARNING: Library name must match the raw read filename prefix!
    - sample: Sample name (optional for the workflow, but essential for analysis).
    - anatId: Anatomical entity (UBERON ID) (optional).
    - stageId: Developmental stage (UBERON ID) (optional).
    - speciesId: NCBI Taxonomy ID of the species.
- 1. Species Table (config["species_table"]) A CSV table containing information for each species. Each row should correspond to one species and must include the following fields:
    - speciesId: Must match the speciesId in the library table.
    - genome_id: Genome filename (located in config["genome_dir"]).
    - annotation_id: Annotation filename (located in config["annotation_dir"]).

Genome Annotation for GeneExt: The GeneExt tool requires a specific GTF file format. BRAKER2 outputs are not directly compatible with GeneExt. A provided Python script (scripts/reformat_gtf_for_GeneExt.py) can be used to reformat BRAKER2 GTF files to meet GeneExt’s requirements.

Script usage:
```
./reformat_gtf_for_GeneExt.py <input.gtf> <output.gtf> <removed_genes.txt>
```
- The script will:
  - Ensures that gene features include a gene_id attribute.
  - Ensures that transcript features include both gene_id and transcript_id attributes.
  - Reorders attributes in exon, CDS, intron (and similar) lines so that gene_id appears before transcript_id.
  - It also removes genes without any exon, and print their id in a text file named <removed_gene.txt>.

2.2 Edit the Configuration File

All paths and parameters are defined in config/config.yaml. Key fields:

BRBseq_data: Directory containing raw FASTQ files.
annotation_dir: Directory with genome annotation files.
genome_dir: Directory with genome FASTA files.
library_table: Path to your BRB-seq library table.
species_table: Path to your species table.
whitelist: Path to the barcode whitelist .txt file.
snakemake_dir_path: Root directory of the workflow.

Important

The barcode whitelist (whitelist) is mandatory and is typically provided by the sequencing center.
Place all FASTQ files under config["BRBseq_data"]. Files must follow this naming pattern: - [libraryname]_R1.fastq.gz - [libraryname]_R2.fastq.gz

2.3 Run Snakemake

Launch the pipeline with:

snakemake --cores <num_cores> --use-conda

Pipeline Overview

Step	Description
FastQC	Quality control for R1 and R2 FASTQ files.
Genome Indexing	STAR builds species-specific genome indexes.
STARsolo Alignment	Align reads with STARsolo using genome indexes and perform barcode correction.
BAM Concatenation & Sorting	BAM files per species are concatenated, sorted, and indexed.
Gene Annotation Extension	GeneExt extends annotations based on mapped reads.
STARsolo Re-alignment	Reads are realigned using the updated annotation from GeneExt.
Count Matrix Generation	An R script converts STARsolo outputs into final count matrices.

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
config		config
data_files		data_files
demutiplex_per_species		demutiplex_per_species
reformat_gtf_geneExt		reformat_gtf_geneExt
workflow		workflow
README.md		README.md
snake_BRBseq.run		snake_BRBseq.run

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BRB‑seq Workflow

1. Demultiplexing and Species-Level Reorganization of Multi-Species Libraries

1.1 Demultiplex BRBseq library coming from mutiple species

1.2 Concatenate Samples by Species

2. BRB-seq Snakemake Workflow

2.1 Prepare Inputs

2.2 Edit the Configuration File

2.3 Run Snakemake

Pipeline Overview

About

Uh oh!

Releases

Packages

Languages

sdind/BRB-seq_workflow

Folders and files

Latest commit

History

Repository files navigation

BRB‑seq Workflow

1. Demultiplexing and Species-Level Reorganization of Multi-Species Libraries

1.1 Demultiplex BRBseq library coming from mutiple species

1.2 Concatenate Samples by Species

2. BRB-seq Snakemake Workflow

2.1 Prepare Inputs

2.2 Edit the Configuration File

2.3 Run Snakemake

Pipeline Overview

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages