Combinatorial (search-engine-like) proteoform generation in ProForma #194

@levitsky


cc @mobiusklein

I have a use case for generating all versions of a peptidoform with and without variable modifications applied, the same way parser.peptidoforms does for modX. We touched on this briefly in #190. I was looking into the ProForma machinery to see if I can make use of the classes there. It appears that we would need to add a switch in ProteoformCombinator.generate() to do this: instead of iterating over itertools.product(*position_choices), we would iterate over the product of all possible subsets of each item in position_choices. Does that sound right?
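To make the proposed switch concrete, here is a minimal sketch using only itertools. It assumes that position_choices is a list with one entry per modification rule, each entry being the list of candidate site indices for that rule; that is my reading of generate(), not a confirmed description of its internals. The helper names (powerset, one_site_per_rule, any_subset_per_rule) are hypothetical.

```python
from itertools import chain, combinations, product


def powerset(items):
    """Return an iterator over every subset of items, from empty to full."""
    items = list(items)
    return chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )


def one_site_per_rule(position_choices):
    """Current-style iteration: pick exactly one site for each rule."""
    return list(product(*position_choices))


def any_subset_per_rule(position_choices):
    """Proposed iteration: pick any subset of sites (including none) per rule."""
    return list(product(*(powerset(choices) for choices in position_choices)))


# Hypothetical input: one rule (oxidation of M) on 'PEMPTIMDE',
# where M occurs at 0-based indices 2 and 6.
position_choices = [[2, 6]]

for combo in any_subset_per_rule(position_choices):
    print(combo)
```

With one rule and two candidate sites, the first function yields two placements, while the second yields four: unmodified, either site alone, and both sites.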

Another question is the user-facing API for this. As far as I understand, the user can encode the rules into the sequence itself with something like '[Oxidation|Position:M]?PEMPTIMDE', and that would work for ProteoformCombinator (assuming we implement the new iteration logic). However, no clean way seems to exist to pass external variable_rules to the constructor. So if I have a list of unmodified sequences and a set of modification settings, I need to either slap an unlocalized_modifications list onto each ProForma object, or manually create the rules for ProteoformCombinator like this:

from pyteomics.proforma import (
    UnimodModification, ModificationRule, ModificationTarget,
    GeneratorModificationRuleDirective, ProteoformCombinator, ProForma
)

oxidation = UnimodModification('35')
rule = ModificationRule(oxidation, ModificationTarget('M'))
directive = GeneratorModificationRuleDirective(rule)
sequence = ProForma.parse('PEMPTIMDE')
combinator = ProteoformCombinator(sequence)
combinator.variable_rules = [directive]

for pf in combinator.generate(all_combinations=True):  # not implemented yet
    print(pf)

Is it the case that there is no friendlier way to set up ProteoformCombinator? Do you think we can create some reasonable shortcut?

Potential ideas would be:

  • document a way to instantiate a rule from a string, like '[Oxidation|Position:M]', or just accept a mapping like parser.peptidoforms, creating the ModificationTarget and ModificationRule objects from keys and values;
  • a way to pass a list[GeneratorModificationRuleDirective] to the ProteoformCombinator constructor, or perhaps implicitly create the directives from ModificationRule objects;
  • a way to reuse a ProteoformCombinator instance for multiple sequences, or maybe an alternative class that only applies external modification rules/directives. Alternatively, we can just have a user-facing function encapsulating the code above. Then it doesn't really matter if it's slightly longer or shorter, although it would be nice to avoid any repeated parsing or cache its results.
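As a rough illustration of the behaviour such a user-facing function could have, here is a standalone sketch that works on plain strings and does not use any pyteomics classes. The function name variable_peptidoforms and the shape of the variable_mods mapping (modification name to target residues) are assumptions made for this example, not an existing API. It also assumes that each residue carries at most one variable modification.

```python
from itertools import product


def variable_peptidoforms(sequence, variable_mods):
    """Yield ProForma-like strings for every combination of variable mods.

    sequence: unmodified peptide sequence, e.g. 'PEMPTIMDE'.
    variable_mods: hypothetical mapping of modification name to the
        residues it may occupy, e.g. {'Oxidation': ['M']}.
    """
    # For each residue, list the choices: None (unmodified) or any
    # modification whose targets include this residue.
    per_residue = [
        [None] + [name for name, targets in variable_mods.items() if aa in targets]
        for aa in sequence
    ]
    # The Cartesian product over residues enumerates every assignment.
    for assignment in product(*per_residue):
        yield ''.join(
            aa if mod is None else f'{aa}[{mod}]'
            for aa, mod in zip(sequence, assignment)
        )


for peptidoform in variable_peptidoforms('PEMPTIMDE', {'Oxidation': ['M']}):
    print(peptidoform)
```

The same mapping can be applied to many sequences without re-parsing any rules, which is the property I would like a real implementation built on ProteoformCombinator to keep.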

I can try to put together an implementation but I wanted to hear your thoughts.

On a related note, if you currently specify Position for an unlocalized modification, ProteoformCombinator places the modification on each possible site, but the placed modification still carries the same Position modifier. Is this desirable?
