Need an automated curation method for existing builds #126

@Lachele

Is your feature request related to a problem? Please describe.
See, for example, #119

Describe the solution you'd like
We need a tool that can automatically scrape the existing builds and find any that seem wrong. Once we are confident, this tool can do something about them automatically. At first, though, it should probably just alert humans about any problems.

The tool needs design, but it might look something like:

  1. First ensure that all Builds have valid links to Sequences and vice versa (the reverse links are known to have occasionally been broken). We will need to decide what to do with missing symlinks, missing targets, etc. Before deciding, we need to find out whether such problems exist, how many there are, and why they occurred.
  2. For all sequences that pass, run sugar ID on the min-gas.pdb file to ensure that the sugar-ID sequence matches the directory sequence. If any fail, these should be moved and investigated to try to figure out what went wrong. I choose min-gas.pdb rather than structure.off because if the minimization changes the sequence we need to know about it. If min-gas.pdb fails, we should check structure.off to see if it was wrong before or if it became wrong.
  3. For multi-conformer sequences, ensure that each conformer id is valid (do an evaluation in gmml or such).
  4. For multi-conformer sequences, ensure that each conformer is the conformer it should be. This comparison should be made on both the structure.off file and the min-gas.pdb file. If structure.off has a mismatch, we have a bug. If min-gas.pdb has a mismatch, the minimization distorted the structure significantly. We should look at these and decide what to do. This test will require submitting the structure to gmml and getting back an apparent conformer id. I'm not sure if gmml has this capability. @gitoliver should have an opinion here.
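
To make steps 1, 2 and 4 a bit more concrete, here is a minimal, report-only sketch in Python. Everything named here is an assumption for illustration: the real directory layout, the link name passed to `audit_links`, and the status strings are all hypothetical, and the sugar-ID and gmml calls are deliberately left out (only their results are compared). Nothing is moved or deleted.

```python
import os
from pathlib import Path


def classify_link(link: Path) -> str:
    """Classify one expected symlink without modifying anything on disk."""
    if not os.path.lexists(link):
        return "missing_link"      # nothing at the expected location
    if not link.is_symlink():
        return "not_a_symlink"     # a real file or directory is there instead
    if not link.exists():          # exists() follows the link
        return "missing_target"    # dangling symlink
    return "ok"


def audit_links(root: Path, link_name: str) -> dict:
    """Map each subdirectory of root to the status of its expected link.

    root and link_name are hypothetical: pass the Builds directory to check
    Build -> Sequence links, then the Sequences directory for the reverse.
    """
    return {
        entry.name: classify_link(entry / link_name)
        for entry in sorted(root.iterdir())
        if entry.is_dir()
    }


def diagnose(expected: str, from_off: str, from_min_gas: str) -> str:
    """Compare an expected sequence (or conformer id) with what was detected.

    from_off and from_min_gas are whatever the external tool reported for
    structure.off and min-gas.pdb respectively.
    """
    if from_off != expected:
        return "wrong_before_minimization"   # structure.off already wrong: a bug
    if from_min_gas != expected:
        return "changed_by_minimization"     # minimization altered the structure
    return "ok"
```

Counting the values returned by `audit_links` answers the "how many are there" question from step 1, and the two non-`ok` results of `diagnose` correspond to the two failure cases described in steps 2 and 4.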

I think this should be run periodically, depending on how long it takes, maybe weekly or once a month. We could also keep a list of "probably good" structures (ones that have passed several times before) that only get checked once every couple of months or so. We can hash out exact timings later.
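
As a rough illustration of the two-tier schedule, the decision could be a small function like the one below. The intervals (7 and 60 days) and the threshold of 3 consecutive passes are placeholder assumptions, not agreed values, and the function and parameter names are hypothetical.

```python
from datetime import date, timedelta
from typing import Optional


def is_due(
    last_checked: Optional[date],
    consecutive_passes: int,
    today: date,
    regular_interval: timedelta = timedelta(days=7),    # placeholder: "weekly"
    relaxed_interval: timedelta = timedelta(days=60),   # placeholder: "couple of months"
    passes_for_relaxed: int = 3,                        # placeholder: "several times"
) -> bool:
    """Decide whether a structure should be re-checked on this run."""
    if last_checked is None:
        return True  # never checked before
    if consecutive_passes >= passes_for_relaxed:
        interval = relaxed_interval   # "probably good": check less often
    else:
        interval = regular_interval
    return today - last_checked >= interval
```

A failed check would reset `consecutive_passes` to zero, which puts the structure back on the regular schedule.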

Describe alternatives you've considered
Some poor human has to go look at them all... 8000 sequences and 18000 builds. Yeah. Any volunteers?

Labels: GEMS, feature (Add something that doesn't exist yet)
