RO-4895 add WF operations for ligand quality and residue rscc statistical reference calculations by shaochenghua · Pull Request #46 · rcsb/py-rcsb_workflow

shaochenghua · 2026-01-12T00:20:03Z

Added two modules under py-rcsb_workflow/rcsb/workflow/stats, along with their execution operations in exdb_wf_cli.
The expected outputs are two files in CACHE:
"ligand_score_reference.csv" for ligand quality reference from --op "ligand_quality_ref_gen";
"rscc-thresholds.json" for residue RSCC reference from --op "residue_rscc_ref_gen".

Local testing command example:
exdb_wf_cli --op ligand_quality_ref_gen --config_path /Users/chenghua/Projects/RCSB_quality_reference/test_run/py-rcsb_workflow/rcsb/mock-data/config/dbload-setup-example.yml --config_name site_info_configuration --cache_path .

To make the above command work, I had to tweak the current mock-data/config/dbload-setup-example.yml to add MONGO_DB_URI.

Also, the setUp() methods of the two test files, "testLigandQualityReferenceGenerator.py" and "testResidueRsccReferenceGenerator.py", may need your update to work with the mock data; in particular, please check the configPath and defaultSectionName settings. Once the setup is correct, no further changes are needed for the unit tests to work.
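As a quick sanity check after a local run, the two expected output files named in the description can be looked up under the cache path. The sketch below is illustrative only: the helper name `missing_outputs` is hypothetical, and it assumes the files sit directly under the cache path, which the description does not confirm.

```python
from pathlib import Path

# File names come from the PR description. Their location directly under
# the cache path is an assumption made for this illustration.
EXPECTED_OUTPUTS = {
    "ligand_quality_ref_gen": "ligand_score_reference.csv",
    "residue_rscc_ref_gen": "rscc-thresholds.json",
}


def missing_outputs(cache_path, ops=None):
    """Return the names of expected output files that are absent or empty."""
    cache = Path(cache_path)
    ops = list(EXPECTED_OUTPUTS) if ops is None else ops
    missing = []
    for op in ops:
        target = cache / EXPECTED_OUTPUTS[op]
        if not target.is_file() or target.stat().st_size == 0:
            missing.append(target.name)
    return missing
```

Calling `missing_outputs(".")` after the command above would list whichever of the two files was not produced.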

piehld

Thanks @shaochenghua! This looks fantastic. I still need to look more closely at the actual data processing tasks, but I have a handful of more high-level comments that I thought I'd leave now.

In addition, what do you think about calling the directory refstats instead of just stats? Would that be appropriate? I think I'd prefer that.

Last, Azure is showing these linting errors:

rcsb/workflow/stats/ResidueRsccReferenceGenerator.py:149:12: W0622: Redefining built-in 'bin' (redefined-builtin)
rcsb/workflow/stats/ResidueRsccReferenceGenerator.py:571:12: W0622: Redefining built-in 'id' (redefined-builtin)
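Pylint raises W0622 when a local name shadows a Python built-in such as `bin` or `id`; renaming the variable is the usual fix. The code at lines 149 and 571 is not shown in this thread, so the function and variable names below are hypothetical and only illustrate the pattern.

```python
def summarize_bins(bins):
    """Return (identifier, width) for each [high, low] resolution bin."""
    summary = []
    # 'bin_id' and 'res_bin' stand in for the names 'id' and 'bin', which
    # would shadow Python built-ins and trigger pylint W0622.
    for bin_id, res_bin in enumerate(bins):
        high, low = res_bin
        summary.append((bin_id, round(low - high, 3)))
    return summary
```

With this naming, the built-ins `bin()` and `id()` remain usable inside the function.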

pyproject.toml

piehld · 2026-01-21T18:08:03Z

rcsb/workflow/tests/testResidueRsccReferenceGenerator.py

+    # suiteSelect.addTest(ResidueRsccReferenceGeneratorTests("testFetchEntry"))
+    # suiteSelect.addTest(ResidueRsccReferenceGeneratorTests("testFetchEntity"))
+    # suiteSelect.addTest(ResidueRsccReferenceGeneratorTests("testProcessEntity"))
+    # suiteSelect.addTest(ResidueRsccReferenceGeneratorTests("testFetchInstance"))
+    # suiteSelect.addTest(ResidueRsccReferenceGeneratorTests("testProcessInstance"))
+    # suiteSelect.addTest(ResidueRsccReferenceGeneratorTests("testProcessResidue"))
+    # suiteSelect.addTest(ResidueRsccReferenceGeneratorTests("testCalculatePercentiles"))
+    # suiteSelect.addTest(ResidueRsccReferenceGeneratorTests("testGenerateBin"))


Are these commented out intentionally or by accident? Are they particularly intensive tests?

All of the commented-out tests are step tests used for code development only. The last, uncommented test runs all of the steps above, so there is no need to run them redundantly; they are kept for future development only.
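The suite-selection pattern being discussed can be sketched as follows. The test class and method names here are hypothetical stand-ins, not the ones in the PR; the point is only that a suite built this way runs just the tests that are explicitly added.

```python
import unittest


class StepwiseTests(unittest.TestCase):
    """Hypothetical stand-in for the real test class."""

    def testStepOne(self):
        self.assertEqual(1 + 1, 2)

    def testStepTwo(self):
        self.assertTrue("abc".startswith("a"))

    def testFullPipeline(self):
        # An end-to-end test that exercises the same steps in sequence.
        self.assertEqual(sum([1, 1]), 2)
        self.assertTrue("abc".startswith("a"))


def selectedSuite():
    suiteSelect = unittest.TestSuite()
    # suiteSelect.addTest(StepwiseTests("testStepOne"))  # development only
    # suiteSelect.addTest(StepwiseTests("testStepTwo"))  # development only
    suiteSelect.addTest(StepwiseTests("testFullPipeline"))
    return suiteSelect


if __name__ == "__main__":
    unittest.TextTestRunner(verbosity=2).run(selectedSuite())
```

Note that discovery-based runs (for example `python -m unittest discover`) collect every `test*` method regardless of this selection, so the commented-out lines only affect runs that go through the suite function.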

piehld · 2026-01-21T18:22:06Z

rcsb/workflow/stats/ResidueRsccReferenceGenerator.py

+        try:
+            self.verifyResolution(resolution_bin)
+        except InvalidParametersError as e:
+            logger.error("invalid resolution bin of %s: %s", resolution_bin, e)
+            return False
+        # Construct MongoDB query
+        logger.info("to fetch entry ID for resolution bin %s", resolution_bin)
+        self.bin["resolution"] = resolution_bin
+        [high, low] = resolution_bin
+        collectionName = self.__collections["entry"]  # use core_entry collection
+        collection = db[collectionName]
+        d_condition = {"rcsb_entry_info.experimental_method": "X-ray",
+                       "rcsb_entry_info.resolution_combined": {
+                           "$gte": high, "$lt": low
+                       }
+                       }  # high <= bin < low
+        # Run find
+        try:
+            cursor = collection.find(d_condition, {"_id": 0, "rcsb_id": 1})
+            self.bin["entry_ids"] = [doc["rcsb_id"] for doc in cursor]  # collect IDs in a list only
+            logger.info("%s PDB X-ray entries found within the resolution bin %s", len(self.bin["entry_ids"]), resolution_bin)
+            return True
+        except Exception as e:
+            logger.error("failed to fetch entry data from MongoDB for resolution bin %s, %s", resolution_bin, e)
+            return False


I appreciate your close paralleling of how most other pre-existing methods generally return True or False, but that's actually something we're trying to undo now since it can lead to silent failures. Instead, it would be better to raise the exception so it gets propagated upwards. This should generally be followed if you've applied this behavior elsewhere too.

Also, make sure that whatever calls this method doesn't catch the Exception and fail silently. (I know that the top-level caller in rcsb/workflow/wuw/ExDbWorkflow.py expects a return of True or False (set to ok), and that is OK [for now], since eventually a False will bubble up to an Exception.) I don't want you to worry about modifying all the existing code; just the new code that you're introducing.

Done as suggested: all low-level methods now raise specific exceptions. The top-level generate method still returns True/False as an execution indicator.
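A minimal sketch of the arrangement described in this exchange, with low-level steps raising and only the top-level entry point translating failures into True/False. This is not the PR's code: `FetchError`, `fetch_entry_ids`, `generate`, and the `find` callable are hypothetical, and `ValueError` stands in for the `InvalidParametersError` seen in the diff.

```python
import logging

logger = logging.getLogger(__name__)


class FetchError(RuntimeError):
    """Hypothetical exception for a failed data fetch."""


def fetch_entry_ids(find, resolution_bin):
    """Low-level step: raise on failure instead of returning False.

    'find' is any callable that takes a MongoDB-style filter dict and
    returns an iterable of documents (a stand-in for collection.find).
    """
    high, low = resolution_bin
    if not 0 <= high < low:
        raise ValueError(f"invalid resolution bin {resolution_bin}")
    condition = {
        "rcsb_entry_info.experimental_method": "X-ray",
        # high <= resolution < low
        "rcsb_entry_info.resolution_combined": {"$gte": high, "$lt": low},
    }
    try:
        return [doc["rcsb_id"] for doc in find(condition)]
    except Exception as e:
        # Chain the original error so the traceback is preserved.
        raise FetchError(f"fetch failed for bin {resolution_bin}") from e


def generate(find, resolution_bin):
    """Top-level step: still reports True/False, but logs the failure."""
    try:
        ids = fetch_entry_ids(find, resolution_bin)
    except (ValueError, FetchError):
        logger.exception("reference generation failed")
        return False
    logger.info("%d entries found", len(ids))
    return True
```

Because the low-level function raises, any caller that forgets to handle the failure sees a traceback rather than a quietly ignored False; only `generate` decides to convert the error into a boolean.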

piehld · 2026-01-21T18:23:23Z

rcsb/workflow/tests/testResidueRsccReferenceGenerator.py

+            self.cRRRG.fetchEntry(db, [0, 0.6])
+            self.cRRRG.fetchEntity(db)
+            self.cRRRG.processEntity()
+            self.cRRRG.fetchInstance(db)
+            self.cRRRG.processInstance()
+            self.cRRRG.processResidue()


The fact that you're not checking the returned value here from each of these methods either is all the more reason to make sure that the individual method raises an exception when it fails; else, I don't think the test would fail if one of them returns False.

Now they don't return True/False any more.
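The reviewer's point can be demonstrated with a small, self-contained sketch: a test that calls a step without inspecting its return value passes even when the step reports False, whereas a step that raises makes the test fail. All names below are hypothetical.

```python
import io
import unittest


def stepReturningFalse():
    """Old style: signals failure through the return value."""
    return False


def stepRaising():
    """New style: signals failure by raising."""
    raise RuntimeError("step failed")


class PipelineDemo(unittest.TestCase):
    # Methods use a 'demo' prefix so test discovery does not pick them up;
    # they are run explicitly through runOne() below.

    def demoIgnoresReturnValue(self):
        # The failure is lost because nothing inspects the result.
        stepReturningFalse()

    def demoSeesException(self):
        # The uncaught exception is reported as a test error.
        stepRaising()


def runOne(name):
    """Run a single demo method and report whether it passed."""
    suite = unittest.TestSuite([PipelineDemo(name)])
    runner = unittest.TextTestRunner(stream=io.StringIO())
    return runner.run(suite).wasSuccessful()
```

Here `runOne("demoIgnoresReturnValue")` reports success despite the failed step, while `runOne("demoSeesException")` reports failure.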

piehld · 2026-01-21T18:27:50Z

rcsb/workflow/tests/testResidueRsccReferenceGenerator.py

+            self.__cfgOb = ConfigUtil(configPath=configPath,
+                                      defaultSectionName="site_info_configuration",
+                                      mockTopPath=self.__mockTopPath)


Minor formatting request (here and elsewhere)—the standard we generally follow for multi-line parentheses in our py-rcsb_* codebases is this:

Suggested change

-            self.__cfgOb = ConfigUtil(configPath=configPath,
-                                      defaultSectionName="site_info_configuration",
-                                      mockTopPath=self.__mockTopPath)
+            self.__cfgOb = ConfigUtil(
+                configPath=configPath,
+                defaultSectionName="site_info_configuration",
+                mockTopPath=self.__mockTopPath
+            )

I.e., the closing parenthesis should be at the same indent as the indent at which it was first opened, with everything inside the parentheses on its own line underneath the opener (with an additional level of indent).

Can you follow that standard where applicable?

Done as suggested

rcsb/workflow/tests/testLigandQualityReferenceGenerator.py

shaochenghua added 7 commits December 17, 2025 16:54

9a80ead  create new directory stats for ligand quality reference and RSCC reference computation
4066cf7  RO-4895, Automation of ligand quality and RSCC statistics reference files calculation
88fb81b  update resolution_index from float to int
b06b8ff  use context manager for MongoDB, add ligand and rscc reference generation into workflow
54c6016  minor update to pass flake8 check
cd499b9  minor update for CLI and config file loading
7d5687f  minor update to unit test to reduce run time

piehld requested changes Jan 21, 2026

shaochenghua and others added 12 commits February 11, 2026 22:56

164bcb4  Merge branch 'master' into RO-4895
c89d59d  use folder refstats for quality ref gen, update all Exception handling, update version, format update
981b450  update CLI workflow
8f7ddfc  fix minor issues revealed by Azure check
9cea30c  minor adjustments
a21f186  update from python 3.12 to 3.13
9472342  load more PDB IDs for testing, un-skip all unit tests
277c24c  add test instances for ligand quality reference test
1b7c298  update testing PDB ID up limit from 38 to 40
d5e9d4b  enable all tests in testResidueRsccReferenceGenerator to make sure it works with mock data
fb95600  add PDB IDs for RSCC test
23d66b3  add support for stashing

This was referenced Feb 24, 2026

V0.98 Adjust RcsbLigandScoreProvider backup and restore strategy rcsb/py-rcsb_utils_chemref#21

Open

V1.37 Update RcsbLigandScoreProvider configuration in DictMethodResourceProvider rcsb/py-rcsb_utils_dictionary#94

Open

0203481  use RcsbLigandScoreProvider.getLigandScoreDataPath() to get data file path instead of redefining it here

shaochenghua wants to merge 20 commits into master from RO-4895
