Parse Molecular Formula #1

@ctapobep

The goal is to parse molecular formulas (MF) such as H2O. The result is a structure that tells us which atoms are present in the MF and how many of each. The resulting algorithm should be as fast as possible, but elegant solutions are interesting to look at as well.

Use cases to handle (Java Tests):

| Input | Result | Comment |
|---|---|---|
| H2O | H2O | |
| CH3CH3 | C2H6 | |
| Na.Cl | NaCl | Multiple molecules or counterions are combined |
| C(CH3CH3)2 | C5H12 | Group coefficient 2 applied to (xxx) |
| C(CH3CH3) | C3H6 | If there is no group coefficient, it is 1 |
| (C(OH)2)2 | C2H4O4 | Nested parentheses |
| 2NH3 | N2H6 | Group coefficient at the start |
| 2NH3.4CH3 | N2C4H18 | Each component can have its own coefficient |
| [HSO4]- | HSO4 | The charge is simply ignored |
| [HSO4]2- | HSO4 | A double charge is ignored too |
| [(2H2O.NaCl)3S.N]2- | H12O6Na3Cl3SN | A hairy example |
A dataset of 1.3 million entries exported from Meve*

Not needed:

  • The order of elements in the resulting MF doesn't have to be conventional; any order is fine. Here we concentrate on parsing only.
  • Isotopes can be ignored entirely.

*This dataset doesn't contain any formulas with parentheses, so for the benchmarks we generate modified entries that include parentheses.
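
One possible way to cover the use cases listed above is a small recursive-descent parser. The sketch below is only an illustration, not a reference solution: the class name `FormulaParser`, its grammar, and the rule that a number followed by `+` or `-` after a closing bracket is a charge (and is ignored) are all assumptions made for this example. It favors readability over speed, so it is closer to the "elegant" end than to the "fastest possible" end.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of any existing library.
final class FormulaParser {
    private final String s;
    private int pos;

    private FormulaParser(String s) { this.s = s; }

    /** Returns element symbol -> atom count, in order of first appearance. */
    static Map<String, Integer> parse(String formula) {
        FormulaParser p = new FormulaParser(formula);
        Map<String, Integer> counts = p.formula();
        if (p.pos != formula.length()) {
            throw new IllegalArgumentException(
                "Unexpected '" + formula.charAt(p.pos) + "' at index " + p.pos);
        }
        return counts;
    }

    // formula := component ('.' component)*
    private Map<String, Integer> formula() {
        Map<String, Integer> total = component();
        while (peek() == '.') {
            pos++;
            addAll(total, component(), 1);
        }
        return total;
    }

    // component := [count] (element [count] | group [count | charge])*
    private Map<String, Integer> component() {
        int factor = number(); // leading coefficient, e.g. the 2 in "2NH3"
        Map<String, Integer> atoms = new LinkedHashMap<>();
        while (true) {
            char c = peek();
            if (c == '(' || c == '[') {
                pos++;
                Map<String, Integer> inner = formula();
                char close = (c == '(') ? ')' : ']';
                if (peek() != close) {
                    throw new IllegalArgumentException(
                        "Expected '" + close + "' at index " + pos);
                }
                pos++;
                int n = number();
                // Assumption: digits followed by a sign are a charge, so ignore them.
                if (peek() == '+' || peek() == '-') {
                    pos++;
                    n = 1;
                }
                addAll(atoms, inner, n);
            } else if (Character.isUpperCase(c)) {
                // Element symbol: one uppercase letter plus optional lowercase letters.
                int start = pos++;
                while (Character.isLowerCase(peek())) pos++;
                atoms.merge(s.substring(start, pos), number(), Integer::sum);
            } else {
                break; // '.', a closing bracket, or end of input
            }
        }
        atoms.replaceAll((element, n) -> n * factor);
        return atoms;
    }

    // Reads an optional run of digits; a missing number means 1.
    private int number() {
        int start = pos;
        while (Character.isDigit(peek())) pos++;
        return start == pos ? 1 : Integer.parseInt(s.substring(start, pos));
    }

    private char peek() {
        return pos < s.length() ? s.charAt(pos) : '\0';
    }

    private static void addAll(Map<String, Integer> into, Map<String, Integer> from, int times) {
        from.forEach((element, n) -> into.merge(element, n * times, Integer::sum));
    }
}
```

With this sketch, `FormulaParser.parse("[(2H2O.NaCl)3S.N]2-")` returns a map holding H=12, O=6, Na=3, Cl=3, S=1, N=1. A faster variant would likely avoid allocating a map per group, for instance by accumulating into a fixed array indexed by element and keeping a stack of multipliers.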
