Genome annotations for segmented CDSs etc #1280

Auspice will soon be able to parse an extended version of the [genome_annotation](https://github.com/nextstrain/augur/blob/master/augur/data/schema-export-v2.json#L53-L92) which will allow segmented CDSs, wrapping CDSs and extra metadata. We need a way to export this information from Augur.

### Schema changes

**general**

The `nuc` strand cannot be `"-"` (negative-strand genomes are represented as their reverse complement). This doesn't change Auspice's behavior, but it is now enforced. The strand is optional for `nuc`.

Each key/object pair in `genome_annotations` now corresponds to a CDS rather than a gene, and our wording (help text, schema descriptions, etc.) should be updated accordingly.

Auspice now verifies that each CDS length is a multiple of 3; if it isn't, the CDS is not displayed.
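To make the rules above concrete, here is a minimal sketch of the same checks on the Augur side. The function name `check_annotations` is hypothetical (it is not an existing Augur function), and it assumes the annotations block is a plain dict keyed by `nuc` and CDS names, with 1-based inclusive coordinates.

```python
def check_annotations(annotations):
    """Return a list of problems found in an annotations dict (hypothetical helper)."""
    problems = []

    # The strand is optional for 'nuc', but if present it must not be '-'.
    nuc = annotations.get("nuc", {})
    if nuc.get("strand") == "-":
        problems.append("'nuc' strand cannot be '-'")

    for name, cds in annotations.items():
        if name == "nuc":
            continue
        # A CDS either has start/end directly or a list of segments.
        parts = cds.get("segments", [cds])
        # Coordinates are 1-based and inclusive, hence the +1.
        length = sum(part["end"] - part["start"] + 1 for part in parts)
        if length % 3 != 0:
            problems.append(f"CDS {name!r} has length {length}, not a multiple of 3")

    return problems
```

A check like this could run before export so that workflows learn about a CDS that Auspice would silently drop.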

**segmented CDS**

The `start` and `end` properties may be omitted and replaced with an array of segments, each with 1-based (GFF) coordinates:

```js
segments?: {start: number, end: number, name?: string}[]
```

The order of segments is important: it corresponds to the order in which the respective translations appear in the protein sequence. Note that `start < end` always holds, even if the CDS is on the negative strand.

_(The `name` for an individual segment may or may not be part of the schema, I will update this once we've finalised it in Auspice.)_
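As an illustration of the segment ordering rule, the sketch below joins the segments of a CDS into its coding sequence. The helper names `reverse_complement` and `extract_cds` are hypothetical, and the handling of the negative strand (reverse-complement each segment, then join in the listed order) is my reading of the rule above rather than confirmed Auspice behaviour.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse-complement a nucleotide string (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_cds(genome, cds):
    """Join a CDS's segments, in listed order, into one coding sequence."""
    # Fall back to a single segment when the CDS uses plain start/end.
    segments = cds.get("segments") or [{"start": cds["start"], "end": cds["end"]}]
    parts = []
    for seg in segments:
        # 1-based inclusive coordinates -> Python slice.
        piece = genome[seg["start"] - 1 : seg["end"]]
        if cds.get("strand") == "-":
            # start < end even on the negative strand, so flip each piece.
            piece = reverse_complement(piece)
        parts.append(piece)
    return "".join(parts)


# A positive-strand CDS that skips positions 7..9 of the genome.
genome = "ATGAAACCCGGGTTTTAG"
spliced = {"strand": "+", "segments": [{"start": 1, "end": 6}, {"start": 10, "end": 18}]}
print(extract_cds(genome, spliced))  # ATGAAAGGGTTTTAG
```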

**wrapping CDS**

A wrapping CDS may be expressed via segments (as above) or by specifying an `end` coordinate beyond the length of the genome, following GFF format.
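The two representations can be related with a small conversion. This is a sketch only: `wrapping_to_segments` is a hypothetical helper, and the segment order chosen for the negative strand (the part after the origin first) is my assumption, derived from the translation-order rule for segments.

```python
def wrapping_to_segments(start, end, genome_length, strand="+"):
    """Turn a GFF-style wrapping CDS (end > genome length) into segments."""
    if end <= genome_length:
        # Not wrapping: a single segment is enough.
        return [{"start": start, "end": end}]
    before_origin = {"start": start, "end": genome_length}
    after_origin = {"start": 1, "end": end - genome_length}
    if strand == "-":
        # Translation runs from high to low coordinates on the negative strand.
        return [after_origin, before_origin]
    return [before_origin, after_origin]


# A 21 nt CDS starting at position 90 of a 100 nt circular genome.
print(wrapping_to_segments(90, 110, 100))
# [{'start': 90, 'end': 100}, {'start': 1, 'end': 10}]
```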

**(optional) metadata**

`gene?: string`: Displayed in the on-hover tooltip in Auspice. If multiple CDSs have the same gene then they will be given the same colour by Auspice.

`color?: string`: User specified colour for the CDS. The value must be a CSS colour string or a colour hex. 

`display_name?: string`: A more verbose name for the CDS, shown in the on-hover info box.

`description?: string`: Shown in the on-hover info box.

**(future) within CDS annotations**

_This isn't yet in Auspice, but will be, so it is important context to consider when implementing the Augur side of things. The syntax may change slightly._

```ts
features?: {
  name: string;
  description: string;
  segments: {start: number, end: number}[];
}[]
```

### Current augur workflow

(Remember that augur currently can't handle complex CDSs)

1. `augur ancestral` infers nucleotide sequences and [adds the "nuc" annotation to the resulting node-data JSON](https://github.com/nextstrain/augur/blob/master/augur/ancestral.py#L289-L290)
2. `augur translate` uses a GenBank / GFF file to translate simple CDSs and [creates per-CDS (per-gene) annotations in the resulting node-data JSON](https://github.com/nextstrain/augur/blob/master/augur/translate.py#L395-L400)
3. `augur export` simply [passes this annotations block through](https://github.com/nextstrain/augur/blob/master/augur/export_v2.py#L503-L505) to the final dataset JSON
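The pass-through in step 3 amounts to roughly the following. This is a simplified sketch, not the actual Augur code: the function name `pass_through_annotations` is hypothetical, and the key names (`annotations` in the node-data JSON, `meta.genome_annotations` in the dataset JSON) are my reading of the linked code and schema.

```python
def pass_through_annotations(node_data, dataset):
    """Copy the annotations block, unchanged, into the dataset JSON's meta."""
    if "annotations" in node_data:
        dataset.setdefault("meta", {})["genome_annotations"] = node_data["annotations"]
    return dataset


# Example node-data content for a genome with one simple CDS.
node_data = {
    "annotations": {
        "nuc": {"start": 1, "end": 100, "strand": "+"},
        "geneA": {"start": 10, "end": 39, "strand": "+"},
    }
}
dataset = pass_through_annotations(node_data, {"version": "v2", "meta": {}})
```

Because the block is copied as-is, any segmented or wrapping CDS written by an earlier step would reach Auspice without changes to this step.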

### Future augur workflow

I believe the typical augur workflow, especially for complex CDSs, will be:

1. Generate per-CDS translations for each sample via Nextclade, which uses (exclusively?) GFF annotations
2. Use `augur ancestral` to infer the translated sequences, per CDS, for internal nodes (implemented via [this PR](https://github.com/nextstrain/augur/pull/1258)). **This is probably the place to also generate the annotations block, which means that `augur ancestral` will need to parse the GFF file initially used by Nextclade.**
3. `augur export` simply passes this annotations block through to the final JSON, as it currently does.
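As a starting point for discussion, step 2's annotation generation could look something like the sketch below, which reads GFF3 text and groups `CDS` rows into the new format. Everything here is an assumption rather than a proposal for the final implementation: the function names are hypothetical, the CDS name is taken from `Name`, then `gene`, then `ID`, the `nuc` entry comes from the `##sequence-region` pragma, and segments are ordered by coordinate (ascending for `+`, descending for `-`), which would be wrong for a CDS that wraps the origin as separate rows.

```python
from urllib.parse import unquote


def parse_attributes(column):
    """Parse the 9th GFF3 column ('key=value;key=value') into a dict."""
    attrs = {}
    for field in column.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = unquote(value.strip())
    return attrs


def gff_to_annotations(gff_text):
    """Build an annotations dict from GFF3 text (sketch, see assumptions above)."""
    annotations = {}
    rows_by_cds = {}
    for line in gff_text.splitlines():
        if line.startswith("##sequence-region"):
            # Pragma format: ##sequence-region seqid start end
            _, _seqid, start, end = line.split()[:4]
            annotations["nuc"] = {"start": int(start), "end": int(end), "strand": "+"}
            continue
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = parse_attributes(cols[8])
        name = attrs.get("Name") or attrs.get("gene") or attrs.get("ID")
        if name is None:
            continue
        entry = rows_by_cds.setdefault(name, {"strand": cols[6], "segments": []})
        entry["segments"].append({"start": int(cols[3]), "end": int(cols[4])})

    for name, entry in rows_by_cds.items():
        # Order segments in the assumed direction of translation.
        segments = sorted(
            entry["segments"],
            key=lambda seg: seg["start"],
            reverse=entry["strand"] == "-",
        )
        if len(segments) == 1:
            annotations[name] = {**segments[0], "strand": entry["strand"]}
        else:
            annotations[name] = {"segments": segments, "strand": entry["strand"]}
    return annotations
```

Keeping the parser separate from `augur ancestral` itself would make it straightforward to reuse elsewhere.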

I'm not sure what this means for the future of `augur translate`, but whatever approach we use within `augur ancestral` to parse the GFF (or GenBank?) file and create the resulting JSON annotations could be reused by `augur translate` if we wish to do so in the future.

There is still the question of how to export optional metadata for each CDS, such as "display name", "color" etc. For the time being I think it's ok to leave this as a script-based "optional extra" for workflows, but perhaps others see a nice way to implement this.
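For reference, the script-based "optional extra" could be as small as the sketch below, which merges user-supplied metadata into an existing annotations block. The function name `add_cds_metadata` and the shape of the `extra` mapping are hypothetical; the allowed keys are the optional metadata fields listed under "Schema changes".

```python
ALLOWED_KEYS = {"gene", "color", "display_name", "description"}


def add_cds_metadata(annotations, extra):
    """Return a copy of `annotations` with optional per-CDS metadata merged in."""
    # Copy each entry so the caller's annotations are left untouched.
    updated = {name: dict(entry) for name, entry in annotations.items()}
    for name, fields in extra.items():
        if name not in updated:
            raise KeyError(f"no annotation named {name!r}")
        unknown = set(fields) - ALLOWED_KEYS
        if unknown:
            raise ValueError(f"unsupported metadata keys for {name!r}: {sorted(unknown)}")
        updated[name].update(fields)
    return updated


annotations = {
    "nuc": {"start": 1, "end": 100, "strand": "+"},
    "pol": {"start": 10, "end": 39, "strand": "+"},
}
extra = {"pol": {"display_name": "Polymerase", "color": "#1f77b4"}}
annotations = add_cds_metadata(annotations, extra)
```

A workflow could run this on the node-data JSON between `augur ancestral` and `augur export`, or directly on the final dataset JSON.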

### Related issues / discussions:

https://github.com/nextstrain/augur/issues/953 covers GFF parsing in augur

We've discussed parsing GFFs within Augur recently in [this slack thread](https://bedfordlab.slack.com/archives/C01LCTT7JNN/p1690294452250119)

Auspice PR related to this work is [here](https://github.com/nextstrain/auspice/pull/1684)

### Which pathogens does this affect?

For the time being, not many.
* nCoV - if we maintain the separate ORF1a and ORF1b for backwards compatibility, the only CDS that needs to utilise the new schema is RdRP (NSP12). We may also want to choose our own colours etc. This can easily be done by a small script in the pipeline.
* HepB - a prototype repo not yet in live production
* Ebola - not in current development or seeing new genomes, but we should add the slip site at some point.
* HIV - unsure of Richard's plans here

### Path forward

This can be implemented in steps / multiple PRs

- [x] Update the schema, which at the least would allow workflows to introduce the new annotations via scripts. ~I will try to do this shortly.~ This is done in https://github.com/nextstrain/augur/pull/1281
- [ ] Parse GFF/GenBank (?) files and export them in this format
- [ ] Implement the within-CDS features later on once Auspice supports them?



