RO-4895 add WF operations for ligand quality and residue rscc statistical reference calculations#46
RO-4895 add WF operations for ligand quality and residue rscc statistical reference calculations#46shaochenghua wants to merge 20 commits intomasterfrom
Conversation
…rence computation
…tion into workflow
piehld
left a comment
There was a problem hiding this comment.
Thanks @shaochenghua! This looks fantastic. I still need to look more closely at the actual data processing tasks, but I have a handful of more high-level comments that I thought I'd leave now.
In addition, what you think about maybe calling the directory refstats instead of just stats? Would that be appropriate? I think I'd like that more.
Last, Azure is showing these linting errors:
rcsb/workflow/stats/ResidueRsccReferenceGenerator.py:149:12: W0622: Redefining built-in 'bin' (redefined-builtin)
rcsb/workflow/stats/ResidueRsccReferenceGenerator.py:571:12: W0622: Redefining built-in 'id' (redefined-builtin)
| # suiteSelect.addTest(ResidueRsccReferenceGeneratorTests("testFetchEntry")) | ||
| # suiteSelect.addTest(ResidueRsccReferenceGeneratorTests("testFetchEntity")) | ||
| # suiteSelect.addTest(ResidueRsccReferenceGeneratorTests("testProcessEntity")) | ||
| # suiteSelect.addTest(ResidueRsccReferenceGeneratorTests("testFetchInstance")) | ||
| # suiteSelect.addTest(ResidueRsccReferenceGeneratorTests("testProcessInstance")) | ||
| # suiteSelect.addTest(ResidueRsccReferenceGeneratorTests("testProcessResidue")) | ||
| # suiteSelect.addTest(ResidueRsccReferenceGeneratorTests("testCalculatePercentiles")) | ||
| # suiteSelect.addTest(ResidueRsccReferenceGeneratorTests("testGenerateBin")) |
There was a problem hiding this comment.
Are these commented out intentionally or by accident? Are they particularly intensive tests?
There was a problem hiding this comment.
all tests commented out are step test for code development only. The last uncommented test run ensures all above test have been tested. No need to run them redundantly and they are kept for future development only.
| try: | ||
| self.verifyResolution(resolution_bin) | ||
| except InvalidParametersError as e: | ||
| logger.error("invalid resolution bin of %s: %s", resolution_bin, e) | ||
| return False | ||
| # Construct MongDB query | ||
| logger.info("to fetch entry ID for resolution bin %s", resolution_bin) | ||
| self.bin["resolution"] = resolution_bin | ||
| [high, low] = resolution_bin | ||
| collectionName = self.__collections["entry"] # use core_entry collection | ||
| collection = db[collectionName] | ||
| d_condition = {"rcsb_entry_info.experimental_method": "X-ray", | ||
| "rcsb_entry_info.resolution_combined": { | ||
| "$gte": high, "$lt": low | ||
| } | ||
| } # high <= bin < low | ||
| # Run find | ||
| try: | ||
| cursor = collection.find(d_condition, {"_id": 0, "rcsb_id": 1}) | ||
| self.bin["entry_ids"] = [doc["rcsb_id"] for doc in cursor] # collect IDs in a list only | ||
| logger.info("%s PDB X-ray entries found within the resolution bin %s", len(self.bin["entry_ids"]), resolution_bin) | ||
| return True | ||
| except Exception as e: | ||
| logger.error("failed to fetch entry data from MongoDB for resolution bin %s, %s", resolution_bin, e) | ||
| return False |
There was a problem hiding this comment.
I appreciate your close paralleling of how most other pre-existing methods generally return True or False, but that's actually something we're trying to undo now since it can lead to silent failures. Instead, it would be better to raise the exception so it gets propagated upwards. This should generally be followed if you've applied this behavior elsewhere too.
Also, make sure that whatever is calling this method doesn't catch the Exception and fail silently. (I know that the top-caller in rcsb/workflow/wuw/ExDbWorkflow.py expects a return of True or False (set to ok), and that is OK [for now] since eventually a False will bubble up to an Exception. I don't want you to worry about modifying all the existing code; just the new code that you're introducing.
There was a problem hiding this comment.
done as suggested on all low level methods that now raise featured exceptions. The top level generate method still output True/False as an execution indicator.
| self.cRRRG.fetchEntry(db, [0, 0.6]) | ||
| self.cRRRG.fetchEntity(db) | ||
| self.cRRRG.processEntity() | ||
| self.cRRRG.fetchInstance(db) | ||
| self.cRRRG.processInstance() | ||
| self.cRRRG.processResidue() |
There was a problem hiding this comment.
The fact that you're not checking the returned value here from each of these methods either is all the more reason to make sure that the individual method raises an exception when it fails; else, I don't think the test would fail if one of them returns False.
There was a problem hiding this comment.
Now they don't return True/False any more.
| self.__cfgOb = ConfigUtil(configPath=configPath, | ||
| defaultSectionName="site_info_configuration", | ||
| mockTopPath=self.__mockTopPath) |
There was a problem hiding this comment.
Minor formatting request (here and elsewhere)—the standard we generally follow for multi-line parentheses in our py-rcsb_* codebases is this:
| self.__cfgOb = ConfigUtil(configPath=configPath, | |
| defaultSectionName="site_info_configuration", | |
| mockTopPath=self.__mockTopPath) | |
| self.__cfgOb = ConfigUtil( | |
| configPath=configPath, | |
| defaultSectionName="site_info_configuration", | |
| mockTopPath=self.__mockTopPath | |
| ) |
I.e., the closing parenthesis should be at the same indent as the indent at which it was first opened, with everything inside the parentheses on its own line underneath the opener (with an additional level of indent).
Can you follow that standard where applicable?
…g, update version, format update
… works with mock data
… path instead of redefining it here
Added two modules under py-rcsb_workflow/rcsb/workflow/stats, and their execution operations in exdb_wf_cli.
Expected outputs are two files in CACHE:
"ligand_score_reference.csv" for ligand quality reference from --op "ligand_quality_ref_gen";
"rscc-thresholds.json" for residue RSCC reference from --op "residue_rscc_ref_gen".
Local testing command example:
exdb_wf_cli --op ligand_quality_ref_gen --config_path /Users/chenghua/Projects/RCSB_quality_reference/test_run/py-rcsb_workflow/rcsb/mock-data/config/dbload-setup-example.yml --config_name site_info_configuration --cache_path .
To make the above command work, I had to tweak the current mock-data/config/dbload-setup-example.yml to add MONGO_DB_URI.
Also, the two test files of "testLigandQualityReferenceGenerator.py" and "testResidueRsccReferenceGenerator.py" may need your update in their setUp() method to make them work with the mock data, i.e. please pay attention to configPath and defaultSectionName setting. Once the setup is proper, no further update is needed to make the unit tests work.