-
Notifications
You must be signed in to change notification settings - Fork 3
Open
Description
Erick pointed me at alnvu because he was under the impression that it would emit IUPAC ambiguity codes when calculating a consensus sequence. I learned that alnvu does try to calculate ambiguity codes, but this code is normally turned off. In trying to adapt this code for my own use, I noticed that the calculation of ambiguous consensus sequences has a couple of problems.
commenting on alnvu/util.py::consensus():
rdict = dict([(v,k) for k,v in tabdict.items()])
tries to reverse the sense of values and keys in a dictionary but does not verify that the new keys are unique. The net effect is to silently drop all-but-the-last nucleotides that have identical frequency counts. A better way to do this is to maintain a separate list of keys, ordered by the dictionary values, e. g.
sortedkeys = sorted(tabdict.keys(), key=lambda k: tabdict[k], reverse=True)- IUPAC_rev is indexed by all nucleotide values, even those with extremely low abundance. If 'A' and 'C' have a million hits, and 'G' & 'T' have only one each, the base should probably be called as 'M', not as 'N'. I don't know what "real" algorithms do, but I have chosen to base selection of ambiguity codes on the top nucleotides whose frequencies are at least half the next higher frequency. That means {A:90%, C:3%, G:4%, T:3%} would be called as 'A' and {A:40%, C:30%, G:11%, T:14%} would be called as {A,C}=M. The right answer probably depends on the application.
- The key tuples in the IUPAC_Rev dictionary are expected to be sorted alphabetically, -
IUPAC_rev.get(tuple(sorted(tabdict.keys())), errorchar)
but the one for 'N' is incorrect.
('G', 'A', 'T', 'C'): 'N',should be('A', 'C', 'G', 'T'): 'N'. See https://github.com/nhoffman/Seq/blob/master/Dictionaries.py
Metadata
Metadata
Assignees
Labels
No labels