The goal is to parse molecular formulas (MF) such as H2O. The result is a structure that tells us which atoms are present in the MF and how many of each. The algorithm should be as fast as possible, but elegant solutions are interesting to look at as well.
Use cases to handle (Java Tests):
| Input | Result | Comment |
|---|---|---|
| H2O | H2O | |
| CH3CH3 | C2H6 | |
| Na.Cl | NaCl | Multiple molecules or counterions are combined |
| C(CH3CH3)2 | C5H12 | Group coefficient 2 applied to (xxx) |
| C(CH3CH3) | C3H6 | If there is no group coefficient, it is 1 |
| (C(OH)2)2 | C2H4O4 | Nested parentheses |
| 2NH3 | N2H6 | Group coefficient at the start |
| 2NH3.4CH3 | N2C4H18 | Each component can have its own coefficient |
| [HSO4]- | HSO4 | The charge is simply ignored |
| [HSO4]2- | HSO4 | A double charge is ignored too |
| [(2H2O.NaCl)3S.N]2- | H12O6Na3Cl3SN | A hairy example |
| | | 1.3 million dataset exported from Meve* |
Not needed:
- The order of elements in the resulting MF doesn't have to be conventional; any order is fine. Here we concentrate on parsing only.
- The whole topic of isotopes can be ignored.
*This dataset doesn't contain anything with parentheses, so in the benchmarks we generate modified formulas that include parentheses.
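To make the requirements above concrete, here is a minimal recursive-descent sketch in Java. It is one possible approach, not necessarily the fastest or the one used in the benchmarks. The class name `FormulaParser` and the method `parse` are hypothetical names chosen for illustration. It also relies on one assumption that the use cases imply but do not state: a number immediately followed by `+` or `-` is a charge magnitude and is ignored, rather than being treated as a coefficient.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration: a small recursive-descent parser for the use cases listed above.
final class FormulaParser {
    private final String s;
    private int pos = 0;

    private FormulaParser(String s) { this.s = s; }

    /** Returns element symbol -> atom count; the order of the elements is not significant. */
    public static Map<String, Integer> parse(String formula) {
        FormulaParser p = new FormulaParser(formula);
        Map<String, Integer> atoms = p.parseFormula();
        if (p.pos != formula.length()) {
            throw new IllegalArgumentException(
                    "Unexpected '" + formula.charAt(p.pos) + "' at position " + p.pos);
        }
        return atoms;
    }

    // formula := component ('.' component)*
    private Map<String, Integer> parseFormula() {
        Map<String, Integer> total = new LinkedHashMap<>();
        addAll(total, parseComponent(), 1);
        while (peek() == '.') {
            pos++;
            addAll(total, parseComponent(), 1);
        }
        return total;
    }

    // component := [count] (element [count] | group [count] | charge sign)*
    private Map<String, Integer> parseComponent() {
        int coefficient = parseCount(); // leading coefficient, e.g. the 2 in 2NH3
        Map<String, Integer> atoms = new LinkedHashMap<>();
        while (true) {
            char c = peek();
            if (c == '(' || c == '[') {
                char close = (c == '(') ? ')' : ']';
                pos++;
                Map<String, Integer> inner = parseFormula();
                if (peek() != close) {
                    throw new IllegalArgumentException("Expected '" + close + "' at position " + pos);
                }
                pos++;
                addAll(atoms, inner, parseCount()); // group coefficient, 1 if absent
            } else if (Character.isUpperCase(c)) {
                int start = pos++;
                while (Character.isLowerCase(peek())) pos++; // e.g. Na, Cl
                atoms.merge(s.substring(start, pos), parseCount(), Integer::sum);
            } else if (c == '+' || c == '-') {
                pos++; // charge sign: ignored
            } else {
                break; // '.', a closing bracket, or end of input
            }
        }
        Map<String, Integer> scaled = new LinkedHashMap<>();
        addAll(scaled, atoms, coefficient);
        return scaled;
    }

    // Reads an optional integer. Assumption: digits directly followed by '+' or '-'
    // are a charge magnitude, so they are consumed and reported as 1.
    private int parseCount() {
        int start = pos;
        int n = 0;
        while (Character.isDigit(peek())) n = n * 10 + (s.charAt(pos++) - '0');
        if (pos == start) return 1;
        char next = peek();
        return (next == '+' || next == '-') ? 1 : n;
    }

    private char peek() { return pos < s.length() ? s.charAt(pos) : '\0'; }

    private static void addAll(Map<String, Integer> target, Map<String, Integer> source, int factor) {
        source.forEach((element, n) -> target.merge(element, n * factor, Integer::sum));
    }
}
```

With this sketch, `FormulaParser.parse("[(2H2O.NaCl)3S.N]2-")` yields H=12, O=6, Na=3, Cl=3, S=1, N=1. Each nesting level allocates its own map, which keeps the code short but is unlikely to be the fastest option; a speed-oriented variant would probably accumulate counts in place.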