computing ambiguity codes in consensus sequence

Erick pointed me at alnvu because he was under the impression that it would emit IUPAC [ambiguity codes](http://www.bioinformatics.org/sms/iupac.html)  when calculating a consensus sequence.  I learned that alnvu does try to calculate ambiguity codes, but this code is normally turned off.  In trying to adapt this code for my own use, I noticed that the calculation of ambiguous consensus sequences has a couple of problems.

commenting on alnvu/util.py::consensus():
- `rdict = dict([(v,k) for k,v in tabdict.items()])`
  tries to reverse the sense of values and keys in a dictionary but does not verify that the new keys are unique.   The net effect is to silently drop all-but-the-last nucleotides that have identical frequency counts.  A better way to do this is to maintain a separate list of keys, ordered by the dictionary values, e. g.
  `sortedkeys = sorted(tabdict.keys(), key=lambda k: tabdict[k], reverse=True)`
- IUPAC_rev is indexed by all nucleotide values, even those with extremely low abundance.   If 'A' and 'C' have a million hits, and 'G' & 'T' have only one each, the base should probably be called as 'M', not as 'N'.   I don't know what "real" algorithms do, but I have chosen to base selection of ambiguity codes on the top nucleotides whose frequencies are at least half the next higher frequency.  That means {A:90%, C:3%, G:4%, T:3%} would be called as 'A' and {A:40%, C:30%, G:11%, T:14%}  would be called as {A,C}=M.   The right answer probably depends on the application.
- The key tuples in the IUPAC_Rev dictionary are expected to be sorted alphabetically, - 
  `IUPAC_rev.get(tuple(sorted(tabdict.keys())), errorchar)`
  but the one for 'N' is incorrect.
  `('G', 'A', 'T', 'C'): 'N',` should be  `('A', 'C', 'G', 'T'): 'N'`.  See https://github.com/nhoffman/Seq/blob/master/Dictionaries.py


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

computing ambiguity codes in consensus sequence #3

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

computing ambiguity codes in consensus sequence #3

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions