Optional layer for non-InChI identifiers? #1

@Artoria2e5

The MInChI Demo page includes some interesting mixfiles (well, if you "copy branch" it's basically a JSON mixfile without mixfileVersion) with unknown InChI structures such as:

  • No structures at all: BSA blocking buffer + PBS; bechamel sauce
  • Partial lack: Dodecacarbonyltriiron

Right now the produced MInChI is not very informative in these cases. I propose adding an optional layer /x (external identifiers) to handle this problem.

/x layer

The /x layer consists of the following parts:

  • A main part, consisting of percent-encoded strings separated by the character &. Characters that MUST be encoded are /, &, unprintable characters, and whitespace characters. (I chose this style because percent-encoding originates in an environment that also uses & and /.)
    • The use of + in place of %20 for encoding a space is permitted. (Purely aesthetic reasons.)
  • A mandatory /n sublayer which is very similar to the main /n layer, but with the ability to associate multiple strings with a substance as well as the ability to name a group. (This will cause some duplication of information in the nesting structure. We already do that with /g.)
  • An optional /t sublayer specifying the type of each identifier in the main part. This sublayer contains a string, each character describing the type of the identifier at the corresponding position in the &-separated main part. Acceptable types include (each of these has a Mixfile counterpart):
    • f: formula (likely used when: unknown connectivity so unable to make InChI, has numbers in a range so unable to make InChI)
    • s: SMILES
    • n: Human-readable name
    • k: InChIKey
    • (I could specify one for Molfile here but the size would be comical. A URL-safe base64 encoding of gzipped Molfile? Nah sounds too complicated.)
    • (There are some additional database references that can be added, though these will NOT have a Mixfile counterpart. It could make sense to just write another "name" for now.)
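A minimal sketch of how the main part could be encoded and decoded, using Python's standard urllib.parse. The helper names (encode_identifier, build_main_part, split_main_part) and the SAFE character set are my assumptions, not part of the proposal. The sketch encodes more than the proposal strictly requires (for example % and a literal +, which have to be escaped anyway for decoding to be unambiguous, and all non-ASCII characters).

```python
from urllib.parse import quote, unquote_plus

# Punctuation left unescaped in this sketch (an assumption, chosen so that
# formulas and database references stay readable). "/" and "&" are
# deliberately absent, so they are always percent-encoded.
SAFE = "()[]:,.;=*@#-_~'"


def encode_identifier(text: str, plus_for_space: bool = True) -> str:
    """Percent-encode one identifier for the /x main part."""
    encoded = quote(text, safe=SAFE)
    if plus_for_space:
        # The proposal permits "+" in place of "%20". A literal "+" has
        # already been turned into "%2B" by quote(), so this is reversible.
        encoded = encoded.replace("%20", "+")
    return encoded


def build_main_part(identifiers: list) -> str:
    """Join encoded identifiers with "&" to form the /x main part."""
    return "&".join(encode_identifier(i) for i in identifiers)


def split_main_part(main_part: str) -> list:
    """Inverse of build_main_part: split on "&" and decode each field."""
    return [unquote_plus(field) for field in main_part.split("&")]


if __name__ == "__main__":
    names = ["butter", "flour", "flour dispersed in butter", "milk", "bechamel sauce"]
    print("/x" + build_main_part(names))
```

Because / and & never appear unescaped inside a field, a reader can split the layer on those two characters before decoding anything.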

The /x layer shall only appear in non-"standard" MInChI strings, i.e. "MInChI=0.00.1" without the "S". There is too much variability for anything to be reproducible here. Luckily, we don't have a MInChIKey...

Basic example (with whitespace added)

MInChI=0.00.1//n{{&}&}/g{{466wf-3&534wf-3}91wf-3&909wf-3}
 /xbutter&flour&flour+dispersed+in+butter&milk&bechamel+sauce
  /n{{1&2}3&4}5
  /tnnnnn

Example of three identifiers for the same component:

MInChI=0.00.1/C6H14/c1-3-5-6-4-2/h3-6H2,1-2H3/n{&1}/g{1:5pp0&}
 /xOctacarbonyldicobalt&Co2(CO)8&PubChem_CID:25049
  /n{1,2,3&}
  /tnfn
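To make the layout of the two examples concrete, here is a rough sketch of a reader for the /x layer. It assumes that all literal whitespace is decorative (as in the examples above), that sublayers are introduced by / followed by a single letter, and that identifier numbers are 1-based. The function name parse_x_layer and the TYPE_NAMES table are hypothetical.

```python
from urllib.parse import unquote_plus

# Type characters listed in the proposal for the /t sublayer.
TYPE_NAMES = {"f": "formula", "s": "SMILES", "n": "name", "k": "InChIKey"}


def parse_x_layer(layer: str) -> dict:
    """Split an /x layer into decoded identifiers, their types, and the raw /n sublayer."""
    # Whitespace must be percent-encoded inside identifiers, so any literal
    # whitespace can be dropped without losing information.
    compact = "".join(layer.split())
    if not compact.startswith("/x"):
        raise ValueError("not an /x layer")

    # "/" must also be encoded inside identifiers, so it only separates sublayers.
    main_part, *sublayer_texts = compact[2:].split("/")
    sublayers = {text[0]: text[1:] for text in sublayer_texts if text}

    if "n" not in sublayers:
        raise ValueError("the /n sublayer is mandatory")

    identifiers = [unquote_plus(field) for field in main_part.split("&")]
    types = sublayers.get("t")  # optional sublayer
    if types is not None and len(types) != len(identifiers):
        raise ValueError("/t must have one character per identifier")

    entries = []
    for position, text in enumerate(identifiers):
        kind = None if types is None else TYPE_NAMES.get(types[position], "unknown")
        # Numbering starts at 1 here, which is an assumption read off the examples.
        entries.append({"index": position + 1, "text": text, "type": kind})
    return {"identifiers": entries, "n": sublayers["n"]}


if __name__ == "__main__":
    example = """/xOctacarbonyldicobalt&Co2(CO)8&PubChem_CID:25049
      /n{1,2,3&}
      /tnfn"""
    for entry in parse_x_layer(example)["identifiers"]:
        print(entry)
```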

On /n

The /n sublayer should have the same "shape-of-braces" as the main /n layer. The format is the same as the main /n layer, with two exceptions:

  • Each structure can have multiple identifiers from the main part. This is handled by allowing a comma , between the numbers that describe the same structure.
  • Each brace-grouping may have its own label. This is handled by permitting number-lists after the closing brace, before the &. (This resembles Newick format.)
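The two rules above can be sketched as a small recursive-descent parser. The grammar is my reading of the examples, not a formal definition: an item is either a brace group (items separated by &, optionally followed by a number-list that labels the group) or a bare, possibly empty number-list. Node, parse_n_sublayer, and brace_shape are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    # Identifier numbers attached to this component or group.
    labels: List[int] = field(default_factory=list)
    # None for a single component, a list of child nodes for a brace group.
    children: Optional[List["Node"]] = None


def parse_n_sublayer(text: str) -> Node:
    """Parse an /x/n sublayer such as "{{1&2}3&4}5" into a tree of Node objects."""
    pos = 0

    def parse_labels() -> List[int]:
        nonlocal pos
        start = pos
        while pos < len(text) and (text[pos].isdigit() or text[pos] == ","):
            pos += 1
        chunk = text[start:pos]
        return [int(number) for number in chunk.split(",")] if chunk else []

    def parse_item() -> Node:
        nonlocal pos
        if pos < len(text) and text[pos] == "{":
            pos += 1
            children = [parse_item()]
            while pos < len(text) and text[pos] == "&":
                pos += 1
                children.append(parse_item())
            if pos >= len(text) or text[pos] != "}":
                raise ValueError(f"expected closing brace at position {pos}")
            pos += 1
            # A number-list right after the closing brace labels the whole group.
            return Node(labels=parse_labels(), children=children)
        return Node(labels=parse_labels())

    root = parse_item()
    if pos != len(text):
        raise ValueError(f"unexpected character at position {pos}")
    return root


def brace_shape(layer: str) -> str:
    """Keep only the characters that define the nesting, for comparing two /n layers."""
    return "".join(ch for ch in layer if ch in "{}&")


if __name__ == "__main__":
    tree = parse_n_sublayer("{{1&2}3&4}5")
    print(tree.labels, [child.labels for child in tree.children])
    print(brace_shape("{{1&2}3&4}5") == brace_shape("{{&}&}"))
```

Comparing brace_shape of the main /n layer with that of the /x/n sublayer is one cheap way to check the "same shape-of-braces" requirement before trusting the labels.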

About names

/x is currently unused, and the letter is a good phonetic match for "external". I think it's an acceptable use of a letter, unless someone has some other use in mind (e.g. using /x like the x- prefix of MIME types, for experimental features or extensions in general).
