Skip to content

merge_strategy='create_unique' not always creating unique id's #214

@novigit

Description

@novigit

The software liftoff compares two assemblies of very closely related genomes, one of which has gene annotations, and the other does not. It then tries to find which regions in the "bare" genomes align well with the "annotated" genome, and then uses the alignment and the gene coordinates of the "annotated" genome to predict the gene coordinates of the equivalent gene in the "bare" genome.

When a template gene is found once in the "annotated" genome but multiple times in the "bare" genome, the attribute ID of the template gene is retained for the first copy in the "bare" genome, but for the second copy a _1 is appended to the attribute ID.

See for example an excerpt from a resultant liftoff gff3 file (toy.edit.gff3):

Seq_18  Liftoff gene    653259  653653  .       +       .       ID=QZVMAR_2781_gene_1
Seq_18  Liftoff CDS     653259  653334  .       +       .       ID=Seq_8_Seq_8_Manual.BlastoST7C_030513:00:54_1;Parent=QZVMAR_2781_gene_1
Seq_18  Liftoff CDS     653370  653458  .       +       .       ID=Seq_8_Seq_8_Manual.BlastoST7C_030513:00:54_1;Parent=QZVMAR_2781_gene_1
Seq_18  Liftoff CDS     653489  653543  .       +       .       ID=Seq_8_Seq_8_Manual.BlastoST7C_030513:00:54_1;Parent=QZVMAR_2781_gene_1
Seq_18  Liftoff CDS     653577  653653  .       +       .       ID=Seq_8_Seq_8_Manual.BlastoST7C_030513:00:54_1;Parent=QZVMAR_2781_gene_1

Seq_28  Liftoff gene    494693  495087  .       -       .       ID=QZVMAR_2781_gene
Seq_28  Liftoff CDS     494693  494769  .       -       .       ID=Seq_8_Seq_8_Manual.BlastoST7C_030513:00:54;Parent=QZVMAR_2781_gene
Seq_28  Liftoff CDS     494803  494857  .       -       .       ID=Seq_8_Seq_8_Manual.BlastoST7C_030513:00:54;Parent=QZVMAR_2781_gene
Seq_28  Liftoff CDS     494888  494976  .       -       .       ID=Seq_8_Seq_8_Manual.BlastoST7C_030513:00:54;Parent=QZVMAR_2781_gene
Seq_28  Liftoff CDS     495012  495087  .       -       .       ID=Seq_8_Seq_8_Manual.BlastoST7C_030513:00:54;Parent=QZVMAR_2781_gene

The QZVMAR_2781_gene of the original genome has been found twice in the target genome, and the gene and CDS features of the second copy (here on top of the file) have been appended with a _1

As you can imagine, since the merge_strategy='create_unique' uses a similar strategy, this leads to perhaps some unexpected behavior. Applying (trials.py)

db = gffutils.create_db('toy.edit.gff3', ':memory:', merge_strategy='create_unique')

to this GFF3 file leads to the following error:

Traceback (most recent call last):
  File "/Users/joran/miniconda3/envs/gffutils/lib/python3.6/site-packages/gffutils/create.py", line 622, in _populate_from_lines
    self._insert(f, c)
  File "/Users/joran/miniconda3/envs/gffutils/lib/python3.6/site-packages/gffutils/create.py", line 566, in _insert
    cursor.execute(constants._INSERT, feature.astuple())
sqlite3.IntegrityError: UNIQUE constraint failed: features.id

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./trials.py", line 5, in <module>
    db = gffutils.create_db('toy.edit.gff3', ':memory:', merge_strategy='create_unique')
  File "/Users/joran/miniconda3/envs/gffutils/lib/python3.6/site-packages/gffutils/create.py", line 1401, in create_db
    c.create()
  File "/Users/joran/miniconda3/envs/gffutils/lib/python3.6/site-packages/gffutils/create.py", line 543, in create
    self._populate_from_lines(self.iterator)
  File "/Users/joran/miniconda3/envs/gffutils/lib/python3.6/site-packages/gffutils/create.py", line 656, in _populate_from_lines
    self._insert(f, c)
  File "/Users/joran/miniconda3/envs/gffutils/lib/python3.6/site-packages/gffutils/create.py", line 566, in _insert
    cursor.execute(constants._INSERT, feature.astuple())
sqlite3.IntegrityError: UNIQUE constraint failed: features.id

My guess of what is going on here is that once the script gets to the second CDS of the second gene (the 9th line), it will try to append ID=Seq_8_Seq_8_Manual.BlastoST7C_030513:00:54 with a _1. However, its result, ID=Seq_8_Seq_8_Manual.BlastoST7C_030513:00:54_1 is already an ID of a CDS of the first gene! Hence, the unique requirement fails and the script crashes.

I've come up with following workaround:

def id_func(x):
    new_id = ''
    if x.featuretype == 'gene':
        new_id = x.attributes['ID'][0]
    elif x.featuretype == 'CDS':
        new_id = '-'.join( [x.attributes['ID'][0], x.seqid, str(x.start), str(x.end)] )
    return new_id

db = gffutils.create_db('toy.edit.gff3', id_spec=id_func, dbfn=':memory:')

which basically creates a custom ID for each CDS feature, using the ID, the start and end coordinates to make sure it is unique.

In any case, I just wanted to point out this unfortunate coincidence between Liftoff and gffutils which leads to errors.

Perhaps it would be possible to update the create_unique function to take into account such cases?

Metadata

Metadata

Assignees

No one assigned

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions