Parse Molecular Formula

The goal is to parse molecular formulas (MF) such as `H2O`. The result is a structure that tells us which atoms are present in the MF and how many of each. The resulting algorithm should be as fast as possible, but elegant solutions are interesting to look at as well.

Use cases to handle ([Java Tests](https://github.com/elsci-io/chemikaze/blob/master/src/test/java/io/elsci/chemikaze/MfParserTestTest.java)):

| Input        | Result    | Comment                                        |
|--------------|-----------|------------------------------------------------|
| `H2O`        | `H2O`     |                                                |
| `CH3CH3`     | `C2H6`    |                                                |
| `Na.Cl`      | `NaCl`    | Multiple molecules or counterions are combined |
| `C(CH3CH3)2` | `C5H12`   | Group coefficient 2 applied to `(xxx)`         |
| `C(CH3CH3)`  | `C3H6`    | If no group coefficient, then it's 1           |
| `(C(OH)2)2`  | `C2H4O4`  | Nested parentheses                             |
| `2NH3`       | `N2H6`    | Group coefficient at the start                 |
| `2NH3.4CH3`  | `N2C4H18` | Each component can have its own coefficient    |
| `[HSO4]-`    | `HSO4`    | The charge is simply ignored                   |
| `[HSO4]2-`   | `HSO4`    | Double charge is ignored too                   |
| `[(2H2O.NaCl)3S.N]2-` | `H12O6Na3Cl3SN` | A hairy example |

The parser should also handle the [1.3 million dataset](https://github.com/elsci-io/chemikaze/blob/master/src/test/resources/MFs.csv) exported from [Meve](https://meve.elsci.io/)*.


Not needed:

* The order of elements in the resulting MF doesn't have to be conventional; any order is fine. Here we focus on parsing only.
* Isotopes can be ignored entirely.

_*This dataset doesn't contain any formulas with parentheses, so in the [benchmarks](https://github.com/elsci-io/chemikaze/blob/master/src/test/java/io/elsci/chemikaze/MfParserBenchmark.java) we generate modified formulas that include parentheses._
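
As a starting point for discussion, the use cases in the table can be covered by a small recursive-descent parser. The sketch below is only an illustration and is not the chemikaze implementation: the class name `MfParserSketch` and the method `parse` are hypothetical, and it favors readability over speed. It assumes that `[...]` groups like `(...)`, that a number followed by `+` or `-` after `]` is a charge (ignored), that a bare `+` or `-` after an element or `)` is also ignored, and that any capital letter followed by lowercase letters is an element symbol (no periodic-table check).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch, not the chemikaze parser: recursive descent over the MF string. */
final class MfParserSketch {
    private final String s;
    private int pos;

    private MfParserSketch(String s) { this.s = s; }

    /** Returns element symbol -> atom count, in order of first appearance. */
    static Map<String, Integer> parse(String mf) {
        MfParserSketch p = new MfParserSketch(mf.trim());
        Map<String, Integer> counts = p.formula();
        if (p.pos != p.s.length())
            throw new IllegalArgumentException("Unexpected '" + p.peek() + "' at index " + p.pos);
        return counts;
    }

    // formula := component ('.' component)*
    private Map<String, Integer> formula() {
        Map<String, Integer> total = new LinkedHashMap<>();
        addAll(total, component(), 1);
        while (peek() == '.') {
            pos++;
            addAll(total, component(), 1);
        }
        return total;
    }

    // component := [coefficient] (element [count] | '(' formula ')' [count] | '[' formula ']' [charge|count])*
    private Map<String, Integer> component() {
        int coefficient = number(); // leading coefficient as in 2NH3; 1 if absent
        Map<String, Integer> counts = new LinkedHashMap<>();
        while (true) {
            char c = peek();
            if (c == '(' || c == '[') {
                pos++;
                Map<String, Integer> group = formula();
                if (peek() != (c == '(' ? ')' : ']'))
                    throw new IllegalArgumentException("Unclosed '" + c + "' before index " + pos);
                pos++;
                addAll(counts, group, count(c == '['));
            } else if (c >= 'A' && c <= 'Z') {
                int start = pos++;
                while (peek() >= 'a' && peek() <= 'z') pos++;
                String element = s.substring(start, pos);
                counts.merge(element, count(false), Integer::sum);
            } else {
                break; // '.', a closing bracket, end of input, or an unexpected character
            }
        }
        counts.replaceAll((element, n) -> n * coefficient);
        return counts;
    }

    // Reads an optional count and skips one optional charge sign.
    // Assumption: after ']' a signed number is a charge magnitude, so the multiplier is 1.
    private int count(boolean afterSquareBracket) {
        int n = number();
        boolean signed = peek() == '+' || peek() == '-';
        if (signed) pos++;
        return (signed && afterSquareBracket) ? 1 : n;
    }

    // Reads a run of digits; returns 1 when there are none.
    private int number() {
        int start = pos;
        while (peek() >= '0' && peek() <= '9') pos++;
        return start == pos ? 1 : Integer.parseInt(s.substring(start, pos));
    }

    private char peek() {
        return pos < s.length() ? s.charAt(pos) : '\0';
    }

    private static void addAll(Map<String, Integer> target, Map<String, Integer> source, int factor) {
        source.forEach((element, n) -> target.merge(element, n * factor, Integer::sum));
    }
}
```

For example, `MfParserSketch.parse("C(CH3CH3)2")` yields a map with `C` mapped to 5 and `H` mapped to 12. A map of strings is convenient for a sketch, but a fast implementation would more likely avoid per-group allocations, for instance by accumulating into a fixed-size array indexed by element.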
