Improve Monomer Input Schema: Preserve Counts and Support Named Monomers #32

@janitha-mahanthe

Summary

In the current input format (v0.1-beta.0), monomers and their counts are defined separately:

{
    "simulation_name": "MySimulation",
    "temperature": [300, 400, 500],
    "density": 0.8,
    "monomers": {
        "1": "O=C(O)CCCCO",
        "2": "O=C(O)c1ccc(O)cc1",
        "3": "CCO"
    },
    "number_of_monomers": {
        "1": 500,
        "2": 400,
        "3": 300
    }
}

However:

  1. The workflow currently drops or does not consistently preserve the number_of_monomers information.
  2. The schema only supports numeric keys ("1", "2", etc.).
  3. There is no direct support for user-defined monomer names.

Problems

  1. Monomer count information may be ignored or lost in later stages of the workflow.
  2. Monomer identity relies on numeric indexing instead of semantic identifiers.
  3. No support for descriptive monomer names (e.g., "adipic_acid", "terephthalic_acid").
  4. Numeric indices say nothing about which monomer they refer to, so an input file is hard to read and hard to verify by eye.
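
To illustrate how the split layout can lose information, here is a minimal sketch of pairing the two dictionaries from the v0.1-beta.0 format and rejecting inputs whose keys do not line up. The function name `merge_legacy_monomers` is hypothetical and not part of the existing codebase; the sketch assumes keys are numeric strings, as in the example above.

```python
def merge_legacy_monomers(config):
    """Pair each SMILES string with its count from the split v0.1-beta.0 layout.

    Hypothetical helper: assumes "monomers" and "number_of_monomers" are
    dictionaries keyed by numeric strings ("1", "2", ...).
    """
    smiles_by_key = config.get("monomers", {})
    counts_by_key = config.get("number_of_monomers", {})

    # A key present in only one dictionary means a SMILES has no count
    # (or a count has no SMILES), which is exactly the silent loss to avoid.
    unmatched = set(smiles_by_key) ^ set(counts_by_key)
    if unmatched:
        raise ValueError(f"monomer keys without a matching entry: {sorted(unmatched)}")

    # Sort numerically so that "10" is ordered after "2".
    return [
        {"smiles": smiles_by_key[key], "count": counts_by_key[key]}
        for key in sorted(smiles_by_key, key=int)
    ]
```

Failing loudly on a mismatch is one possible design choice; a parser could also warn and skip the entry, but that would again drop monomer information.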

Expected Behavior

  • Monomer SMILES and counts must remain tightly coupled.
  • Monomer counts must persist throughout the workflow.
  • The system should optionally support named monomers.
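
One way to keep SMILES and count coupled through later workflow stages is to carry them in a single immutable record. The sketch below is only an illustration: the `Monomer` class is hypothetical, and the rule that counts must be positive integers is an assumption, not something this issue specifies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Monomer:
    """Hypothetical record that keeps a monomer's SMILES and count together."""

    smiles: str
    count: int
    name: Optional[str] = None  # names are optional, per the expected behavior

    def __post_init__(self):
        if not self.smiles:
            raise ValueError("smiles must be a non-empty string")
        # Assumption: a count is a positive integer (bool is excluded explicitly
        # because it is a subclass of int in Python).
        if isinstance(self.count, bool) or not isinstance(self.count, int) or self.count <= 0:
            raise ValueError("count must be a positive integer")
```

Because the record is frozen, a downstream stage cannot accidentally clear or overwrite the count after parsing.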

Proposed Schema Improvement

Replace the split dictionaries with a unified monomer structure.

The system should accept either a list of named monomer entries:

{
    "monomers": [
        {
            "name": "adipic_acid",
            "smiles": "O=C(O)CCCCO",
            "count": 500
        },
        {
            "name": "terephthalic_acid",
            "smiles": "O=C(O)c1ccc(O)cc1",
            "count": 400
        },
        {
            "name": "ethanol",
            "smiles": "CCO",
            "count": 300
        }
    ]
}

or a minimal format that omits the names:

{
    "monomers": [
        {
            "smiles": "O=C(O)CCCCO",
            "count": 500
        },
        {
            "smiles": "O=C(O)c1ccc(O)cc1",
            "count": 400
        },
        {
            "smiles": "CCO",
            "count": 300
        }
    ]
}

Both formats should be normalized by the input parser into a canonical internal representation:

[
    {
        "id": 1,
        "name": "adipic_acid",
        "smiles": "O=C(O)CCCCO",
        "count": 500
    },
    {
        "id": 2,
        "name": "terephthalic_acid",
        "smiles": "O=C(O)c1ccc(O)cc1",
        "count": 400
    },
    {
        "id": 3,
        "name": None,
        "smiles": "CCO",
        "count": 300
    }
]
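
The normalization step could look roughly like the following sketch. The function name `normalize_monomers` is hypothetical; the sketch assumes ids are assigned from list order starting at 1 and that a missing name is stored as `None`, matching the representation above.

```python
def normalize_monomers(monomers):
    """Convert named or unnamed monomer entries into canonical records.

    Hypothetical helper: takes the value of the "monomers" key (a list of
    dictionaries) and returns a list of {id, name, smiles, count} records.
    """
    canonical = []
    for index, entry in enumerate(monomers, start=1):
        # Both fields are required so that a SMILES never travels without its count.
        if "smiles" not in entry or "count" not in entry:
            raise ValueError(f"monomer #{index} needs both 'smiles' and 'count'")
        canonical.append(
            {
                "id": index,                # assumed: position in the input list
                "name": entry.get("name"),  # None when the user gave no name
                "smiles": entry["smiles"],
                "count": entry["count"],
            }
        )
    return canonical
```

With this approach the named and minimal formats share one code path, and entries with and without names can be mixed in the same list.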

Labels: bug, enhancement
