Gff3 file: differences between nextclade datasets and augur ancestral #1655

Hi! I am not sure if this is a question for nextclade or augur.

I am using `augur ancestral` as part of a pipeline to create a tree for use in a nextclade dataset: https://github.com/anna-parker/marburg-virus-tree/tree/main. I decided to simplify matters and use the same gff3 file I use in my nextclade dataset - with the goal of having CDS-regions named the same in the nextclade tree and alignment.

However, I realized that `nextclade` and `augur ancestral` appear to read the gff3 file differently. For example, this is my annotation for the `NP` CDS (full file here: https://github.com/GenSpectrum/nextclade-datasets/blob/add_marburg/data/marburg/unreleased/genome_annotation.gff3 - it is the same as the gff3 file from genbank, except that I have renamed the CDS by adding a `Name=NP` field to the start of its attributes):
```
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
##sequence-region NC_001608.3 1 19111
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=3052505
NC_001608.3	RefSeq	region	1	19111	.	+	.	ID=NC_001608.3:1..19111;Dbxref=taxon:3052505;country=Kenya;gbkey=Src;genome=genomic;isolate=Marburg virus/H.sapiens-tc/KEN/1980/Mt. Elgon-Musoke;mol_type=viral cRNA;old-name=Lake Victoria marburgvirus
NC_001608.3	RefSeq	gene	49	2844	.	+	.	ID=gene-MARV_gp1;Dbxref=GeneID:920944;Name=NP;gbkey=Gene;gene=NP;gene_biotype=protein_coding;locus_tag=MARV_gp1
NC_001608.3	RefSeq	mRNA	49	2844	.	+	.	ID=rna-MARV_gp1;Parent=gene-MARV_gp1;Dbxref=GeneID:920944;gbkey=mRNA;gene=NP;locus_tag=MARV_gp1;product=nucleoprotein
NC_001608.3	RefSeq	exon	49	2844	.	+	.	ID=exon-MARV_gp1-1;Parent=rna-MARV_gp1;Dbxref=GeneID:920944;gbkey=mRNA;gene=NP;locus_tag=MARV_gp1;product=nucleoprotein
NC_001608.3	RefSeq	CDS	104	2191	.	+	0	Name=NP;ID=cds-YP_001531153.1;Parent=rna-MARV_gp1;Dbxref=GenBank:YP_001531153.1,GeneID:920944;Name=YP_001531153.1;Note=encapsidates RNA genome;gbkey=CDS;gene=NP;locus_tag=MARV_gp1;product=nucleoprotein;protein_id=YP_001531153.1
...
``` 
When creating a nextclade dataset, only the CDS field is used and it is given the name NP. This makes sense according to the [nextclade docs](https://docs.nextstrain.org/projects/nextclade/en/latest/user/input-files/03-genome-annotation.html), which say:
`When a linked gene and CDS are present (CDSs specify their parents by listing the gene’s ID in the Parent attribute), the gene is effectively ignored for all purposes but display in the web UI. CDS segments are joined if they have the same ID, otherwise they are treated as independent.`
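
To make the quoted rule concrete, here is a minimal Python sketch that collects CDS rows from a gff3 file and groups their coordinates by `ID`. It is only an illustration of the sentence above, not nextclade's actual parser. The function names are hypothetical, and letting the first value win when an attribute such as `Name` appears twice (as in the CDS row above) is an assumption of this sketch.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Parse a GFF3 attribute column into a dict.

    Assumption of this sketch: if a key is repeated, the first value wins.
    """
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs.setdefault(key, value)
    return attrs


def cds_segments_by_id(gff3_text):
    """Return {cds_id: [(start, end), ...]} for all CDS rows.

    Rows sharing an ID end up in the same list (joined segments);
    rows without an ID get their own key and stay independent.
    """
    segments = defaultdict(list)
    for line_number, line in enumerate(gff3_text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue  # gene, mRNA, exon and region rows are ignored here
        attrs = parse_attributes(cols[8])
        cds_id = attrs.get("ID", f"line-{line_number}")
        segments[cds_id].append((int(cols[3]), int(cols[4])))
    return dict(segments)
```

Run on the excerpt above, this would report a single CDS, `cds-YP_001531153.1`, spanning 104 to 2191, and would not use the gene coordinates 49 to 2844.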

However, when using this same file in `augur ancestral`, the gene and not the CDS region is used (I can tell because the gene is longer and I get a lot more mutations). When I then removed the gene and left only the CDS field, `augur ancestral` did no ancestral reconstruction. Only when I renamed the CDS field to gene (see https://github.com/anna-parker/marburg-virus-tree/blob/main/config/reference.gff3) did `augur ancestral` reconstruct the CDS the same way as in nextclade.
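
The workaround described above can be scripted. The following Python sketch rewrites a gff3 file so that every CDS row becomes a `gene` row and the original gene, mRNA and exon rows are dropped. The function name is hypothetical. Removing the `Parent` attribute and resetting the phase column to `.` are assumptions of this sketch, made because the parent rows no longer exist and phase only applies to CDS rows. It makes no claim about which attributes `augur ancestral` reads to name the feature.

```python
def cds_rows_as_genes(gff3_text):
    """Return gff3 text in which each CDS row is relabelled as a gene row.

    Directives, comments and other feature types (e.g. region) are kept;
    the original gene, mRNA and exon rows are removed.
    """
    out = []
    for line in gff3_text.splitlines():
        if not line.strip() or line.startswith("#"):
            out.append(line)  # keep directives and comments untouched
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            out.append(line)  # not a feature row; leave as-is
            continue
        feature_type = cols[2]
        if feature_type in ("gene", "mRNA", "exon"):
            continue  # drop the hierarchy above the CDS
        if feature_type == "CDS":
            cols[2] = "gene"
            cols[7] = "."  # assumption: phase is only meaningful for CDS rows
            # assumption: Parent is dropped because the parent rows are gone
            cols[8] = ";".join(
                attr for attr in cols[8].split(";")
                if not attr.startswith("Parent=")
            )
            out.append("\t".join(cols))
        else:
            out.append(line)
    return "\n".join(out) + "\n"
```

A possible usage, with placeholder file names: read `genome_annotation.gff3` as text, pass it through `cds_rows_as_genes`, and write the result to the file given to `augur ancestral`. The nextclade dataset can keep using the unmodified annotation.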

Is this expected behavior? Does `augur ancestral` only perform ancestral reconstruction on genes? I couldn't find any docs on the way `augur ancestral` expects gff3 files to be formatted. 

Side-note: I was previously using a genbank file and not a gff3 file, and there `augur ancestral` used the CDS and not the gene. My main reason for changing to a gff3 file was to rename the CDS.

