-
Notifications
You must be signed in to change notification settings - Fork 81
Description
The software liftoff compares two assemblies of very closely related genomes, one of which has gene annotations, and the other does not. It then tries to find which regions in the "bare" genomes align well with the "annotated" genome, and then uses the alignment and the gene coordinates of the "annotated" genome to predict the gene coordinates of the equivalent gene in the "bare" genome.
When a template gene is found once in the "annotated" genome but multiple times in the "bare" genome, the attribute ID of the template gene is retained for the first copy in the "bare" genome, but for the second copy a _1 is appended to the attribute ID.
See for example an excerpt from a resultant liftoff gff3 file (toy.edit.gff3):
Seq_18 Liftoff gene 653259 653653 . + . ID=QZVMAR_2781_gene_1
Seq_18 Liftoff CDS 653259 653334 . + . ID=Seq_8_Seq_8_Manual.BlastoST7C_030513:00:54_1;Parent=QZVMAR_2781_gene_1
Seq_18 Liftoff CDS 653370 653458 . + . ID=Seq_8_Seq_8_Manual.BlastoST7C_030513:00:54_1;Parent=QZVMAR_2781_gene_1
Seq_18 Liftoff CDS 653489 653543 . + . ID=Seq_8_Seq_8_Manual.BlastoST7C_030513:00:54_1;Parent=QZVMAR_2781_gene_1
Seq_18 Liftoff CDS 653577 653653 . + . ID=Seq_8_Seq_8_Manual.BlastoST7C_030513:00:54_1;Parent=QZVMAR_2781_gene_1
Seq_28 Liftoff gene 494693 495087 . - . ID=QZVMAR_2781_gene
Seq_28 Liftoff CDS 494693 494769 . - . ID=Seq_8_Seq_8_Manual.BlastoST7C_030513:00:54;Parent=QZVMAR_2781_gene
Seq_28 Liftoff CDS 494803 494857 . - . ID=Seq_8_Seq_8_Manual.BlastoST7C_030513:00:54;Parent=QZVMAR_2781_gene
Seq_28 Liftoff CDS 494888 494976 . - . ID=Seq_8_Seq_8_Manual.BlastoST7C_030513:00:54;Parent=QZVMAR_2781_gene
Seq_28 Liftoff CDS 495012 495087 . - . ID=Seq_8_Seq_8_Manual.BlastoST7C_030513:00:54;Parent=QZVMAR_2781_gene
The QZVMAR_2781_gene of the original genome has been found twice in the target genome, and the gene and CDS features of the second copy (here on top of the file) have been appended with a _1
As you can imagine, since the merge_strategy='create_unique' uses a similar strategy, this leads to perhaps some unexpected behavior. Applying (trials.py)
db = gffutils.create_db('toy.edit.gff3', ':memory:', merge_strategy='create_unique')
to this GFF3 file leads to the following error:
Traceback (most recent call last):
File "/Users/joran/miniconda3/envs/gffutils/lib/python3.6/site-packages/gffutils/create.py", line 622, in _populate_from_lines
self._insert(f, c)
File "/Users/joran/miniconda3/envs/gffutils/lib/python3.6/site-packages/gffutils/create.py", line 566, in _insert
cursor.execute(constants._INSERT, feature.astuple())
sqlite3.IntegrityError: UNIQUE constraint failed: features.id
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "./trials.py", line 5, in <module>
db = gffutils.create_db('toy.edit.gff3', ':memory:', merge_strategy='create_unique')
File "/Users/joran/miniconda3/envs/gffutils/lib/python3.6/site-packages/gffutils/create.py", line 1401, in create_db
c.create()
File "/Users/joran/miniconda3/envs/gffutils/lib/python3.6/site-packages/gffutils/create.py", line 543, in create
self._populate_from_lines(self.iterator)
File "/Users/joran/miniconda3/envs/gffutils/lib/python3.6/site-packages/gffutils/create.py", line 656, in _populate_from_lines
self._insert(f, c)
File "/Users/joran/miniconda3/envs/gffutils/lib/python3.6/site-packages/gffutils/create.py", line 566, in _insert
cursor.execute(constants._INSERT, feature.astuple())
sqlite3.IntegrityError: UNIQUE constraint failed: features.id
My guess of what is going on here is that once the script gets to the second CDS of the second gene (the 9th line), it will try to append ID=Seq_8_Seq_8_Manual.BlastoST7C_030513:00:54 with a _1. However, its result, ID=Seq_8_Seq_8_Manual.BlastoST7C_030513:00:54_1 is already an ID of a CDS of the first gene! Hence, the unique requirement fails and the script crashes.
I've come up with following workaround:
def id_func(x):
new_id = ''
if x.featuretype == 'gene':
new_id = x.attributes['ID'][0]
elif x.featuretype == 'CDS':
new_id = '-'.join( [x.attributes['ID'][0], x.seqid, str(x.start), str(x.end)] )
return new_id
db = gffutils.create_db('toy.edit.gff3', id_spec=id_func, dbfn=':memory:')
which basically creates a custom ID for each CDS feature, using the ID, the start and end coordinates to make sure it is unique.
In any case, I just wanted to point out this unfortunate coincidence between Liftoff and gffutils which leads to errors.
Perhaps it would be possible to update the create_unique function to take into account such cases?