-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathOverlapGraph.py
More file actions
358 lines (280 loc) · 13 KB
/
OverlapGraph.py
File metadata and controls
358 lines (280 loc) · 13 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
from igraph import Graph as IGraph
class OverlapGraph(object):
"""
@class OverlapGraph:
Class for a bidirected overlap graph G = (V,E) where V = \{0,...,self.n-1\}
is the set of reads/vertices and E is the set of read overlaps.
Below are examples/descriptions of each kind of overlap and their associated
semantic representation in the graph.
012*45678*
############ u ---------> beg(u) = 3, end(u) = 9, len(u) = 9
# Regular # ||||||
# Dovetail # v ----------> beg(v) = 0, end(v) = 6, len(v) = 10
############ *12345*789
In the above case, a suffix of u aligns to a prefix of v. This is called
a Regular Dovetail. In the overlap graph, we represent this with a directed
bidirected edge (u >--> v). We label this edge (semantically) with the overhanging
suffix of v which does not align with u, namely v[end(v)..).
Technically, this is all the information required for this scenario and this
is a valid bidirected edge. However, since we will be using the graph to perform
walks/traversals to build contigs, we can make the job easier right now to save
time later. Specifically, imagine that in our traversal we passed from v to u,
which is the opposite direction of the edge's orientation. In such a case we don't
care about the suffix of v; we need the prefix of u, and more specifically, the
reverse complement of the prefix of u. It is possible to calculate this from
the information given so far. Namely, if l(u >--> v) denotes the length of the
sequence labeling the edge (u >--> v), so that v[|v|-l..) is the label, then we
can calculate the length of the overhanging prefix of u to be m = |u| + l - |v|,
and then the prefix of interest is comp(u[..m)). However, it is easier to just label
this immediately. We accomplish this by creating a shadow edge for the prefix direction,
which can be looked up as if we were looking up the "directed edge" (v --> u), and in
that location we store comp(u[..m)). A similar procedure of storing "shadow" edges for
quickly looking up forward and backward traversals of a bidirected edge will be taken
for the other kinds of edges as well.
012*45678*
############ u ---------> beg(u) = 3, end(u) = 9, len(u) = 9
# Prefix # ||||||
# Dovetail # v <---------- beg(v) = 9, end(v) = 3, len(v) = 10
############ *87654*210
In this case, a suffix of u aligns to a prefix of comp(v), which is the reverse
complement of a suffix of v. In the overlap graph, we represent this with an
introverted bidirected edge (u >--< v). We label the (u->v) direction with the
sequence comp(v[..end(v)]) and the (v->u) direction with the sequence u[..beg(u)).
TODO: finish
@var self.n: Number of reads/vertices in graph.
@var self.g: Adjacency list of graph. self.g[u][v] = (dir(u,v), suflen(u,v))
# @var self.g: Adjacency list of graph. For a given u \in V,
# if u -> v is an edge in the graph then self.g[u][v]
# exists and is equal to the
@var self.seqs: Tuple of raw sequencing reads.
"""
def __init__(self, seqs : tuple):
self.n = len(seqs)
self.g = {u : dict() for u in range(self.n)}
self.seqs = seqs
def get_pruned(self):
"""
@func get_pruned: TODO
"""
contained = set()
for u in self.g:
for v in self.g[u]:
if self.g[u][v][0] == -1:
contained.add(v)
pruned = OverlapGraph(self.seqs)
for u in self.g:
if not u in contained:
for v in self.g[u]:
if not v in contained:
pruned.add_overlap(u, v, self.g[u][v][0], self.g[u][v][1])
return pruned
def get_naive_tr(self, fuzz):
"""
@func get_naive_tr: TODO
"""
g = self.g
reduce = {u : {} for u in g}
for u in g:
for v in g[u]:
reduce[u][v] = False
def edir(u, v): return g[u][v][0]
def suflen(u, v): return g[u][v][1]
def arrows(u, v): return ((g[u][v][0]>>1)&1, g[u][v][0]&1)
for u in g:
for v in g[u]:
uv_head, uv_tail = arrows(u,v)
for w in set(g[v]).intersection(g[u]):
vw_head, vw_tail = arrows(v,w)
uw_head, uw_tail = arrows(u,w)
if uv_head == uw_head and vw_tail == uw_tail and uv_tail != vw_head:
if suflen(u,w) + suflen(w,v) >= suflen(u,w) + fuzz:
reduce[u][w] = True
reduce[w][u] = True
tr = OverlapGraph(self.seqs)
for u in g:
for w in g[u]:
if not reduce[u][w]:
assert not reduce[w][u]
tr.add_overlap(u, w, edir(u, w), suflen(u, w))
tr.add_overlap(w, u, edir(w, u), suflen(w, u))
return tr
def add_overlap(self, u : int, v : int, d : int, l : int):
"""
@instance_method: add_overlap
Adds an overlap edge to the graph.
@param u: The source read index.
@param v: The target read index.
@param d: The direction going from u to v
@param l: The overhanging suffix length going from u to v
@raise IndexError: If given invalid read indices.
"""
if u < 0 or u >= self.n or v < 0 or v >= self.n:
raise IndexError("Trying to add overlap with out of bounds indices")
if v in self.g[u] and self.g[u][v][1] <= l:
return
self.g[u][v] = (d, l)
def num_edges(self) -> int:
"""
@instance_method num_edges:
Calculates the number of overlaps/edges in the graph.
This operation takes O(|V|) time each call because the edge count
is not stored internally and must be recalculated each time.
"""
return sum(len(adj) for v,adj in self.g.items())
def get_igraph(self) -> IGraph:
"""
@instance_method get_igraph:
Constructs and returns a new igraph object created from the contents
of the self graph.
@returns G: igraph object
"""
G = IGraph(self.n, directed = True)
edges = []
dirs = []
suflens = []
for u, adj in self.g.items():
for v, e in adj.items():
edges.append((u,v))
dirs.append(e[0])
suflens.append(e[1])
G.add_edges(edges)
G.es['dir'] = dirs
G.es['len'] = suflens
G.vs['seq'] = self.seqs
return G
# @classmethod
# def generate_overlapped(cls, seqs : tuple, overlap_seeds : list): # -> cls:
# n = len(seqs)
# g = cls(seqs)
@classmethod
def generate_seed_based(cls, seqs : tuple, overlap_seeds : list, k : int): # -> cls:
"""
@classmethod generate_seed_based:
Constructs a new OverlapGraph object using k-mer based overlap_seeds
generated from @FindOverlaps.get_overlap_seeds. Each seeded overlap is
extended as if the remaining non-seeded parts are assumed to be perfect
matches. Given multiple seeds between two given reads, The first seed
parsed is used and the rest are discarded.
@param seqs: Tuple of reads ordered by read index.
@param overlap_seeds: List of overlap seeds. See @FindOverlaps.get_overlap_seeds.__doc__
for details.
@param k: The seed length. This is needed for reverse complement overlaps, otherwise
we can't find the correct overlaps.
@return instantiated OverlapGraph object.
"""
n = len(seqs)
g = cls(seqs)
for u, v, upos, vpos, rc in overlap_seeds:
useq = seqs[u]
vseq = seqs[v]
ulen = len(useq)
vlen = len(vseq)
if rc: vpos = vlen - vpos - k - 1
if upos <= vpos and (ulen - upos) <= (vlen - vpos):
g.add_overlap(v, u, -1, 0)
elif upos >= vpos and (ulen - upos) >= (vlen - vpos):
g.add_overlap(u, v, -1, 0)
else:
if upos > vpos:
suflen = (vlen - vpos) - (ulen - upos)
prelen = upos - vpos
if not rc:
g.add_overlap(u, v, 1, suflen)
g.add_overlap(v, u, 2, prelen)
else:
g.add_overlap(u, v, 0, suflen)
g.add_overlap(v, u, 0, prelen)
else:
suflen = vpos - upos
prelen = (ulen - upos) - (vlen - vpos)
if not rc:
g.add_overlap(u, v, 2, suflen)
g.add_overlap(v, u, 1, prelen)
else:
g.add_overlap(u, v, 3, suflen)
g.add_overlap(v, u, 3, prelen)
return g
@classmethod
def generate_gold_standard(cls, seqs : tuple, records : list, genome_length : int): # -> cls: (annotations don't work for this use case apparently)
"""
@classmethod generate_gold_standard:
Constructs a new OverlapGraph object usingt input reads and
their gold standard mapping back to original reference.
@param seqs: Tuple of reads ordered be read index.
@param records: Read triplets. Look at AssemblyProblem.create_reads.__doc__ function for definition.
@param genome_length: Reference genome length; used to calculate overlaps
that cross the circular barrier.
@raise ValueError: If len(seqs) != len(records).
@raise AssertionError: If records are incorrectly sorted (should never happen).
@return instantiated OverlapGraph object.
"""
n = len(seqs)
if len(records) != n:
raise ValueError("len(seqs) != len(records)")
g = cls(seqs)
records = sorted(records, key=lambda t: t[1]) * 2
for ufind in range(n):
u, upos, urev = records[ufind]
useq = seqs[u]
ulen = len(useq)
for vfind in range(ufind+1, 2*n):
if vfind < n:
v, vpos, vrev = records[vfind]
else:
v, vpos, vrev = records[vfind - n]
vpos += genome_length
assert upos <= vpos
vseq = seqs[v]
vlen = len(vseq)
if vpos >= upos + ulen:
"""
If vpos >= upos + ulen, then the closest read that starts
to the right of u is out of range, so we can cutoff early.
"""
break
if vpos + vlen <= upos + ulen:
"""
If vpos + vlen <= upos + ulen, then v must be contained inside
u.
"""
g.add_overlap(u, v, -1, 0)
else:
"""
If we reach here, then v extends to the right of u by >= 1 nucleotides.
If upos == vpos, then u must be contained within v.
"""
if upos == vpos:
g.add_overlap(v, u, -1, 0)
else:
"""
Reaching here means (upos < vpos) and (upos + ulen < vpos + vlen)
"""
suflen = vpos + vlen - upos - ulen
prelen = vpos - upos
if not urev:
if not vrev:
g.add_overlap(u, v, 1, suflen)
g.add_overlap(v, u, 2, prelen)
else:
g.add_overlap(u, v, 0, suflen)
g.add_overlap(v, u, 0, prelen)
else:
if not vrev:
g.add_overlap(u, v, 3, suflen)
g.add_overlap(v, u, 3, prelen)
else:
g.add_overlap(u, v, 2, suflen)
g.add_overlap(v, u, 1, prelen)
return g
def print(self):
"""
@func print: TODO
"""
for u in self.g:
for v in self.g[u]:
arrow = [' ', '--', ' ']
d, l = self.g[u][v]
t, h = ((d>>1)&1, d&1)
arrow[0] = '>' if t==0 else '<'
arrow[-1] = '>' if h==1 else '<'
print("({}) {} ({}) :: {}".format(u, "".join(arrow), v, l))