Read-counts-per-gene step error- likely stemming from seq2gene.tab file? 

Good evening all, 

I want to apologize in advance for the extremely long thread, but I am so new and uncertain with all of this that I wanted to make sure everything I've done is crystal clear in case I've made any errors along the way. My project seeks to understand differential gene expression in diseased Pisaster ochraceus sea stars. I am attempting to get a pipeline up and running for 90+ tagseq samples with a smaller subset of 3 samples. As I'm not sure where I've gone wrong, I am going to describe every step from the raw reads to where I've hit a roadblock that has completely stopped me in my tracks. 

For reference, I am working on the [Sapelo2 Linux cluster](https://wiki.gacrc.uga.edu/wiki/Systems) at the University of Georgia and mainly following along with your tutorial [tagSeq_processing_README.txt](https://github.com/z0on/tag-based_RNAseq/blob/master/tagSeq_processing_README.txt).

### Step 1. Prepare workspace and concatenate raw reads
Step 1A. Copy raw reads into working directory. Raw read pairs have been renamed to reflect the following format: `sample_*_L1.fastq.gz` and `sample_*_L2.fastq.gz`.
Step 1B. Unzip raw read files.
`$ gunzip sample_*.fastq.gz`
Step 1C. Make sure `ngs_concat.pl` script is in working directory. 
Step 1D. Load perl module, run script to concatenate paired reads
`$ ml Perl/5.32.1-GCCcore-10.3.0`
`$ perl ngs_concat.pl 'sample' 'sample_(.+)_L'`

### Step 2. Adaptor trimming
Note: _At the time I did not understand "screen" + piping to the file named `clean`, so I wrote each adaptor trimming line manually. (I figured out how to do this for the `maps` step, so I will do it the "right way" when I scale up to the whole genome._

Step 2A. Make sure `tagseq_clipper.pl` script is in working directory. 

Step 2B. Run code to trim reads. The three samples I'm working with are named `10.8.fq`, `1.15.fq`, and `11.4.fq` at this stage.
`$ ml cutadapt/3.4-GCCcore-8.3.0-Python-3.7.4`
`$ perl tagseq_clipper.pl 10.8.fq | cutadapt - -a AAAAAAAA -a AGATCGG -q 15 -m 25 -o 10.8.trim`
`$ perl tagseq_clipper.pl 1.15.fq | cutadapt - -a AAAAAAAA -a AGATCGG -q 15 -m 25 -o 1.15.trim`
`$ perl tagseq_clipper.pl 11.4.fq | cutadapt - -a AAAAAAAA -a AGATCGG -q 15 -m 25 -o 11.4.trim`

### Step 3. Create transcriptome from reference genome.
I am fortunate that I'm working with a species for which there is a published reference genome. After a lot of research on the internet, this is what I came up with to create my _in silico_ transcriptome. I do not know for sure whether or not I've done this correctly- as you'll see, I ran into a few snags. 

Step 3A. Download reference genome into working directory. 
`$ wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/010/994/315/GCA_010994315.2_ASM1099431v2/GCA_010994315.2_ASM1099431v2_genomic.fna.gz` 
`$ gunzip GCA_010994315.2_ASM1099431v2_genomic.fna.gz`
`$ mv GCA_010994315.2_ASM1099431v2_genomic.fna.gz pisaster_ref_genome.fa` # rename file

Step 3B. Get annotation file into working directory. 
Download annotation file from colleague's Dryad: https://datadryad.org/stash/dataset/doi:10.6071/M3ND50
Upload annotation file to Sapelo2 in working directory, rename `genome_annotation.gff3`.

Step 3C. Troubleshoot and troubleshoot some more.
I was getting the same error when I tried many different variations of code that used `gffread` to create transcriptome from the genome and annotation files. Got involved in a [Github issue](https://github.com/gpertea/gffread/issues/94) in the repository for `gffread` that matched the error I was getting… and it led to me getting my answer: there are spaces in my ref genome sequence names and there should not be. 

An example of what the first and last sequence names in my file look like: 
>`>CM021546.1 Pisaster ochraceus isolate M0D055189C chromosome XXI, whole genome shotgun sequence`
>`>JAAFGO010000999.1 Pisaster ochraceus isolate M0D055189C Sc28pcJ_999;HRSCAF=1079, whole genome shotgun sequence`

Spent an extraordinarily long time trying to figure out some code that could efficiently fix this for me and failed. Shamefully, I instead landed on creating a shell script that renamed each contig individually (eg `sed -i 's/CM021526.1/Sc28pcJ_649/g' pisaster_ref_genome_header.mod.fa` for contig 1). But hey, it looks like it worked.

Then, I reran my transcriptome building step (`gffread -F -w pisaster_transcriptome.fa -g pisaster_ref_genome_header.mod.fa ann_simple.gff`) and got a new error: "GffObj::getSpliced() error: improper genomic coordinate 17535992 on Sc28pcJ_1844 for g8980.t1".  I was frustrated so I just deleted that line, and 5 or 6 other lines that gave me similar issues. After that, it finally worked! As an example, the first two entries in my new `pisaster_transcriptome.fa` look like this:
>`>g1.t1 CDS=1-277 geneID=g1
atgctctctaccaataggaatagcgaaactgtctggggaGGAGGCAAGAAGTCCCCAGAGAACAATGTTT
TGAACAAGGTGTTGGAGGCGCTGGCACCAGTAGCAAAGGATGGCATCTTTTCCCAAACTGATGGAGAATC
ATACCCAATCCCACAAGCCGGGCTGAGTGATGATGTGGAAGTCATGGACTTGATAGCAAATGTAATAAGT
GAAGGGGATAAGAGTGCAGTGGATGCCCAATCTGATTCTGCATCCATGGAGGACATTTCAGAGGCAG`
>`>g2.t1 CDS=1-97 geneID=g2
ATGGCAGGCTGCGTTACAGGCGGGCTTATAGCATACAGAGTTGTGTCACTCACCGAAAGCATTTCTAGTA
GCAAGTTATCATCATGGAGCAGCCTGG`

### Step 4. Back to the tag-based_RNAseq tutorial for mapping: preparing reference transcriptome
Create bowtie2 index for transcriptome.
`$ ml Bowtie2/2.4.5-GCC-10.2.0`
`$ bowtie2-build pisaster_transcriptome.fa pisaster_transcriptome.fa`
Here is the text output I was given when I executed the above code:
[bowtie2_build_output_june.18.2022.txt](https://github.com/z0on/tag-based_RNAseq/files/9117442/bowtie2_build_output_june.18.2022.txt)

### Step 5. Map reads to transcriptome.
Move relevant perl script to working dir, create `maps` file as in tutorial (this is where I figured out how to use `screen`). 
`$ ml Perl/5.32.1-GCCcore-10.3.0`
`$ perl tagseq_bowtie2map.pl "trim$" pisaster_transcriptome.fa > maps`
`$ screen`
`$ ml Bowtie2/2.4.5-GCC-10.2.0`
`$ . maps`
Here is the text output I was given when I executed the above code:
[maps_screenOutputFile.txt](https://github.com/z0on/tag-based_RNAseq/files/9117474/maps_screenOutputFile.txt)

### Step 6. (_attempt to_) generate read counts per gene. 
This is where my big "roadblock" is. At first, I thought I needed to make a true `transcriptome_seq2gene.tab` table, and was really confused by the instructions, because I didn't use Trinity. So I relied on another [Github repo](https://github.com/sachitsaksena/nfkb-tag-seq) that is actually derived from this one to at least generate a populated `seq2iso.tab` file without any error messages. The version of that code that I attempted to modify to fit my needs was:
`$ grep '^>' pisaster_transcriptome.fa |\ 
 perl -pe 's/>(\S+)\s.*gene:(\S+).*/\1\t\2/g' >\
 pisaster_transcriptome_seq2iso.tab`

The first two lines of this .tab file are: 
>`>g1.t1 CDS=1-277 geneID=g1`
>`>g2.t1 CDS=1-97 geneID=g2`

As I'm sure you can guess, this did not work. After running the following: 
`$ perl ./samcount_launch_bt2.pl '\.sam' pisaster_transcriptome_seq2iso.tab > sc`
`$ screen`
`$ ml Bowtie2/2.4.5-GCC-10.2.0`
`$ ml Perl/5.32.1-GCCcore-10.3.0`
`$ . sc`
...I received a massive chain of error messages (eg `g19414.t1 has no isogroup designation`) and the `.sam.counts` files were empty. 

Finally, today, it dawned on me that this line in the instructions is (I believe) relevant to me: 
> if your transcriptome is made in silico from annotated genome data, just use a dummy table in the form: 
`# gene	gene`

I followed these instructions very literally and created a dummy .tab file named "dummy_seq2iso.tab" with one line of text "gene [tab] gene". However, when I reran the above code with this new .tab file, I received this error x3 (one for each sample, I presume):
> `disregarding reads mapping to multiple isogroups
cannot open dummy_seq2iso.tab`

I made sure restrictive file permissions were not the cause of this error and did a lot of sleuthing around on the internet, but I am at a loss on what to do next. 

Any insight on this last step specifically, or if you believe the error is in one of these previous steps I've described, would be IMMENSELY appreciated. 

Thank you so much in advance.

Paige Duffin


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Read-counts-per-gene step error- likely stemming from seq2gene.tab file? #4

Step 1. Prepare workspace and concatenate raw reads

Step 2. Adaptor trimming

Step 3. Create transcriptome from reference genome.

Step 4. Back to the tag-based_RNAseq tutorial for mapping: preparing reference transcriptome

Step 5. Map reads to transcriptome.

Step 6. (attempt to) generate read counts per gene.

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Read-counts-per-gene step error- likely stemming from seq2gene.tab file? #4

Description

Step 1. Prepare workspace and concatenate raw reads

Step 2. Adaptor trimming

Step 3. Create transcriptome from reference genome.

Step 4. Back to the tag-based_RNAseq tutorial for mapping: preparing reference transcriptome

Step 5. Map reads to transcriptome.

Step 6. (attempt to) generate read counts per gene.

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions