Skip to content

RO-4895 add WF operations for ligand quality and residue rscc statistical reference calculations#46

Open
shaochenghua wants to merge 20 commits intomasterfrom
RO-4895
Open

RO-4895 add WF operations for ligand quality and residue rscc statistical reference calculations#46
shaochenghua wants to merge 20 commits intomasterfrom
RO-4895

Conversation

@shaochenghua
Copy link

Added two modules under py-rcsb_workflow/rcsb/workflow/stats, and their execution operations in exdb_wf_cli.
Expected outputs are two files in CACHE:
"ligand_score_reference.csv" for ligand quality reference from --op "ligand_quality_ref_gen";
"rscc-thresholds.json" for residue RSCC reference from --op "residue_rscc_ref_gen".

Local testing command example:
exdb_wf_cli --op ligand_quality_ref_gen --config_path /Users/chenghua/Projects/RCSB_quality_reference/test_run/py-rcsb_workflow/rcsb/mock-data/config/dbload-setup-example.yml --config_name site_info_configuration --cache_path .

To make the above command work, I had to tweak the current mock-data/config/dbload-setup-example.yml to add MONGO_DB_URI.

Also, the two test files of "testLigandQualityReferenceGenerator.py" and "testResidueRsccReferenceGenerator.py" may need your update in their setUp() method to make them work with the mock data, i.e. please pay attention to configPath and defaultSectionName setting. Once the setup is proper, no further update is needed to make the unit tests work.

Copy link
Contributor

@piehld piehld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @shaochenghua! This looks fantastic. I still need to look more closely at the actual data processing tasks, but I have a handful of more high-level comments that I thought I'd leave now.

In addition, what you think about maybe calling the directory refstats instead of just stats? Would that be appropriate? I think I'd like that more.

Last, Azure is showing these linting errors:

rcsb/workflow/stats/ResidueRsccReferenceGenerator.py:149:12: W0622: Redefining built-in 'bin' (redefined-builtin)
rcsb/workflow/stats/ResidueRsccReferenceGenerator.py:571:12: W0622: Redefining built-in 'id' (redefined-builtin)

Comment on lines 234 to 241
# suiteSelect.addTest(ResidueRsccReferenceGeneratorTests("testFetchEntry"))
# suiteSelect.addTest(ResidueRsccReferenceGeneratorTests("testFetchEntity"))
# suiteSelect.addTest(ResidueRsccReferenceGeneratorTests("testProcessEntity"))
# suiteSelect.addTest(ResidueRsccReferenceGeneratorTests("testFetchInstance"))
# suiteSelect.addTest(ResidueRsccReferenceGeneratorTests("testProcessInstance"))
# suiteSelect.addTest(ResidueRsccReferenceGeneratorTests("testProcessResidue"))
# suiteSelect.addTest(ResidueRsccReferenceGeneratorTests("testCalculatePercentiles"))
# suiteSelect.addTest(ResidueRsccReferenceGeneratorTests("testGenerateBin"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these commented out intentionally or by accident? Are they particularly intensive tests?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all tests commented out are step test for code development only. The last uncommented test run ensures all above test have been tested. No need to run them redundantly and they are kept for future development only.

Comment on lines 217 to 241
try:
self.verifyResolution(resolution_bin)
except InvalidParametersError as e:
logger.error("invalid resolution bin of %s: %s", resolution_bin, e)
return False
# Construct MongDB query
logger.info("to fetch entry ID for resolution bin %s", resolution_bin)
self.bin["resolution"] = resolution_bin
[high, low] = resolution_bin
collectionName = self.__collections["entry"] # use core_entry collection
collection = db[collectionName]
d_condition = {"rcsb_entry_info.experimental_method": "X-ray",
"rcsb_entry_info.resolution_combined": {
"$gte": high, "$lt": low
}
} # high <= bin < low
# Run find
try:
cursor = collection.find(d_condition, {"_id": 0, "rcsb_id": 1})
self.bin["entry_ids"] = [doc["rcsb_id"] for doc in cursor] # collect IDs in a list only
logger.info("%s PDB X-ray entries found within the resolution bin %s", len(self.bin["entry_ids"]), resolution_bin)
return True
except Exception as e:
logger.error("failed to fetch entry data from MongoDB for resolution bin %s, %s", resolution_bin, e)
return False
Copy link
Contributor

@piehld piehld Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate your close paralleling of how most other pre-existing methods generally return True or False, but that's actually something we're trying to undo now since it can lead to silent failures. Instead, it would be better to raise the exception so it gets propagated upwards. This should generally be followed if you've applied this behavior elsewhere too.

Also, make sure that whatever is calling this method doesn't catch the Exception and fail silently. (I know that the top-caller in rcsb/workflow/wuw/ExDbWorkflow.py expects a return of True or False (set to ok), and that is OK [for now] since eventually a False will bubble up to an Exception. I don't want you to worry about modifying all the existing code; just the new code that you're introducing.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done as suggested on all low level methods that now raise featured exceptions. The top level generate method still output True/False as an execution indicator.

Comment on lines +185 to +190
self.cRRRG.fetchEntry(db, [0, 0.6])
self.cRRRG.fetchEntity(db)
self.cRRRG.processEntity()
self.cRRRG.fetchInstance(db)
self.cRRRG.processInstance()
self.cRRRG.processResidue()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that you're not checking the returned value here from each of these methods either is all the more reason to make sure that the individual method raises an exception when it fails; else, I don't think the test would fail if one of them returns False.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now they don't return True/False any more.

Comment on lines 40 to 42
self.__cfgOb = ConfigUtil(configPath=configPath,
defaultSectionName="site_info_configuration",
mockTopPath=self.__mockTopPath)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor formatting request (here and elsewhere)—the standard we generally follow for multi-line parentheses in our py-rcsb_* codebases is this:

Suggested change
self.__cfgOb = ConfigUtil(configPath=configPath,
defaultSectionName="site_info_configuration",
mockTopPath=self.__mockTopPath)
self.__cfgOb = ConfigUtil(
configPath=configPath,
defaultSectionName="site_info_configuration",
mockTopPath=self.__mockTopPath
)

I.e., the closing parenthesis should be at the same indent as the indent at which it was first opened, with everything inside the parentheses on its own line underneath the opener (with an additional level of indent).

Can you follow that standard where applicable?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done as suggested

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants