As multiple species were multiplexed into a single library, we first need to demultiplex the library per sample and then re-concatenate the samples coming from the same species.
Each sample is identified by a 14 bp inline barcode at the start of Read 1 (R1), followed by a 16bp Unique Molecular Identifier (UMI). Read 2 (R2) contains the actual RNA sequence.
To demutiplex the library, I am using cutadapt. Cutadapt requires a FASTA file to read barcode sequences, e.g:
>Pluc_33_new
CCTTGCCAGAGTGG
>Pluc_34_new
CATTCCTCTAGCCG
>Pluc_35_new
ACTTGAGCAACCGG
You can use this command to help convert the CSV barcode file into a FASTA file
awk -F',' '{print ">"$1"\n"$2}' barcodes.csv > barcodes.fastaThen use the cutadapt command with the follwing paramters:
cutadapt -j 4 --no-indels -e 0.1 --action=none \
-g ^file:barcodes.fasta \
-o demux/{name}_R1.fastq.gz -p demux/{name}_R2.fastq.gz \
--untrimmed-output demux/unknown_R1.fastq.gz \
--untrimmed-paired-output demux/unknown_R2.fastq.gz \
input_R1.fastq.gz input_R2.fastq.gz
Parameters:
^: anchors the barcode to the start of R1.--action=none: Do not trim barcodes; just demultiplex--no-indels: for stricter matching (no gaps)e 0.1allows up to 1 mismatch for 12 bp barcodes- Output goes to demux/{sample}_R1/R2.fastq.gz
- Unassigned reads go to unknown_R1/R2.fastq.gz
j 2Use 2 threads for faster processing
After demultiplexing, individual samples can be grouped by species using a CSV file which contain the species name in the first column and the sample name in the second column. E.g:
Pluchei,Pluc_33_new
Pluchei,Pluc_34_new
Pluchei,Pluc_35_new
Hogna,Hogna 54
Hogna,Hogna 55
Hogna,Hogna 58
...You can use the provided script concat_by_species.sh to concatenate the demultiplexed FASTQ files by species as follow:
./concat_by_species.sh <CSV_FILE> <DEMUX_DIR> <OUTPUT_DIR
CSV_FILE: CSV file mapping species (1st column) to sample names (2nd column)DEMUX_DIR: Path to directory containig the demultiplexed FASTQ filesOUTPUT_DIR: Output directory for concatenated species files
This Snakemake pipeline processes BRB-seq data through several stages, including quality control with FastQC, species-specific genome indexing, read alignment using STARsolo with integrated barcode error correction, BAM file concatenation and sorting, gene annotation extension with GeneExt, re-alignment against the extended annotation, and final count matrix generation.
- Input tables:
Before running the workflow, you must prepare two files: the BRBseq Library Table and the Species Table.
-
- BRBseq Library Table (
config["library_table"]). A CSV table containing information for all your samples. Each row correspond to one sample, and the following fields are required:barcode: Barcode of each sample.library: Library name.⚠️ WARNING: Library name must match the raw read filename prefix!sample: Sample name (optional for the workflow, but essential for analysis).anatId: Anatomical entity (UBERON ID) (optional).stageId: Developmental stage (UBERON ID) (optional).speciesId: NCBI Taxonomy ID of the species.
- BRBseq Library Table (
-
- Species Table (
config["species_table"]) A CSV table containing information for each species. Each row should correspond to one species and must include the following fields:speciesId: Must match thespeciesIdin the library table.genome_id: Genome filename (located inconfig["genome_dir"]).annotation_id: Annotation filename (located inconfig["annotation_dir"]).
- Species Table (
-
-
Genome Annotation for GeneExt: The GeneExt tool requires a specific GTF file format. BRAKER2 outputs are not directly compatible with GeneExt. A provided Python script (
scripts/reformat_gtf_for_GeneExt.py) can be used to reformat BRAKER2 GTF files to meet GeneExt’s requirements.Script usage:
./reformat_gtf_for_GeneExt.py <input.gtf> <output.gtf> <removed_genes.txt>
- The script will:
- Ensures that gene features include a gene_id attribute.
- Ensures that transcript features include both gene_id and transcript_id attributes.
- Reorders attributes in exon, CDS, intron (and similar) lines so that gene_id appears before transcript_id.
- It also removes genes without any exon, and print their id in a text file named <removed_gene.txt>.
- The script will:
All paths and parameters are defined in config/config.yaml. Key fields:
BRBseq_data: Directory containing raw FASTQ files.annotation_dir: Directory with genome annotation files.genome_dir: Directory with genome FASTA files.library_table: Path to your BRB-seq library table.species_table: Path to your species table.whitelist: Path to the barcode whitelist.txtfile.snakemake_dir_path: Root directory of the workflow.
Important
- The barcode whitelist (
whitelist) is mandatory and is typically provided by the sequencing center. - Place all FASTQ files under
config["BRBseq_data"]. Files must follow this naming pattern: -[libraryname]_R1.fastq.gz-[libraryname]_R2.fastq.gz
Launch the pipeline with:
snakemake --cores <num_cores> --use-conda| Step | Description |
|---|---|
| FastQC | Quality control for R1 and R2 FASTQ files. |
| Genome Indexing | STAR builds species-specific genome indexes. |
| STARsolo Alignment | Align reads with STARsolo using genome indexes and perform barcode correction. |
| BAM Concatenation & Sorting | BAM files per species are concatenated, sorted, and indexed. |
| Gene Annotation Extension | GeneExt extends annotations based on mapped reads. |
| STARsolo Re-alignment | Reads are realigned using the updated annotation from GeneExt. |
| Count Matrix Generation | An R script converts STARsolo outputs into final count matrices. |