⚠️ Warning: Rate limit exceeded

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 35 minutes and 24 seconds.

⌛ How to resolve this issue? After the wait time has elapsed, a review can be triggered using the command. We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work? CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source, and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information.

ℹ️ Review info — ⚙️ Run configuration — Configuration used: defaults · Review profile: CHILL · Plan: Pro · Run ID: — 📒 Files selected for processing (3)
📝 Walkthrough

This PR replaces the ratio-based dataset split configuration (

Changes

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested reviewers

Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 9
🧹 Nitpick comments (6)
simba/core/data/loaders/loaders.py (1)
581-583: Prefer the `SpectrumExt.smiles` attribute over `params["smiles"]`.

`SpectrumExt` already exposes `.smiles` as a first-class attribute (used throughout `train_utils.py`, e.g. `Chem.CanonSmiles(s.smiles)`). Reading `s.params["smiles"]` here couples this log to the raw pyteomics params dict and requires the `"smiles" in s.params` guard; `s.smiles` is already normalized to `""` when missing.

♻️ Suggested simplification

```diff
-    logger.info(f"Loaded {len(spectra)} spectra from MGF file (with filtering applied)")
-    unique_molecules = len(set(s.params["smiles"] for s in spectra if "smiles" in s.params))
-    logger.info(f"Unique molecules in loaded spectra: {unique_molecules}")
+    logger.info(f"Loaded {len(spectra)} spectra from MGF file (with filtering applied)")
+    unique_molecules = len({s.smiles for s in spectra if s.smiles})
+    logger.info(f"Unique molecules in loaded spectra: {unique_molecules}")
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@simba/core/data/loaders/loaders.py` around lines 581 - 583, Replace the use of the raw params dict when counting unique molecules: in the block that logs loaded spectra (variable spectra), compute unique_molecules from the SpectrumExt.smiles attribute rather than s.params["smiles"] and remove the "smiles" in s.params guard because SpectrumExt.smiles is normalized (empty string when missing); update the logger.info call that reports Unique molecules to use this new unique_molecules value (references: SpectrumExt.smiles, spectra, unique_molecules, logger.info).

simba/core/training/train_utils.py (1)
44-54: Consider module-level `hashlib` import and guarding `n_buckets`.

Two small refinements:

- `import hashlib` can be hoisted to the module top to avoid repeated local import lookups on every call (this is invoked once per scaffold in a hot loop).
- `n_buckets <= 0` will raise `ZeroDivisionError` at `% n_buckets`; a brief validation (or an `assert`) would produce a clearer error than the default `train_buckets = list(range(n_buckets - 2))` path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@simba/core/training/train_utils.py` around lines 44 - 54, Hoist the import of hashlib to the module top and add validation in _scaffold_bucket (function name: _scaffold_bucket) to guard against non-positive bucket counts: check that n_buckets is an int > 0 (raise ValueError or assert with a clear message) before computing the digest, then compute the SHA-256 hex digest and return int(digest, 16) % n_buckets as before; this avoids repeated local imports in hot loops and prevents a cryptic ZeroDivisionError when n_buckets <= 0.

simba/workflows/utils.py (1)
68-113: Minor duplication and unique-molecule counting caveat.

- `unique_molecules_original` (Lines 103-105) recomputes the same set already logged at Line 70. Compute it once and reuse.
- `s.params.get("smiles", "N/A")` conflates every spectrum missing a SMILES into a single "N/A" bucket, so the reported unique molecule count can be understated for datasets where some spectra lack SMILES. Consider filtering out the fallback before taking `set(...)`, or at least note in the log that missing-SMILES spectra are counted as one.
- Line 99 logs `"{len(filtered_spectra)} spectra remaining after filtering."` and Lines 110-113 log a more detailed summary covering the same information — consider dropping Line 99 to reduce log noise.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@simba/workflows/utils.py` around lines 68 - 113, The logs recompute unique molecule counts and conflate missing SMILES into a single "N/A" bucket; update the preprocessing logging to compute unique_molecules_original once from all_spectra (store and reuse instead of recomputing), compute unique_molecules_filtered similarly but exclude the fallback value (filter out params.get("smiles") values that are None or equal to the fallback "N/A" before taking set) so missing-SMILES spectra aren't merged into one counted molecule, and remove the redundant simple log of len(filtered_spectra) (the detailed Filtering summary already reports remaining counts); look for usages of all_spectra, filtered_spectra, unique_molecules_original, unique_molecules_filtered, and Preprocessor to apply these changes.

tests/integration/test_training_pipeline.py (1)
49-74: Test is deterministic but brittle to scaffold string changes.

With only 5 distinct scaffolds (`scaffold_0`..`scaffold_4`) and bucket selections of `[2]`, `[4]`, `[9]` out of 10, the non-emptiness of the val and test splits relies on the SHA-256 of those specific strings falling into exactly those buckets. Any rename of the scaffold literals (or change to `n_buckets`) could silently break the `> 0` assertions. Consider either picking bucket sets that cover contiguous ranges (so the pigeonhole guarantees non-emptiness for 5 items) or asserting on exact expected split sizes that you pre-computed — this makes the intent explicit.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integration/test_training_pipeline.py` around lines 49 - 74, The test TrainUtils.train_val_test_split_bms usage is brittle because specific bucket lists ([2], [4], [9]) rely on SHA256 of scaffold strings; update the test to make it deterministic and robust by either (A) selecting bucket ranges that guarantee non-empty picks for 5 distinct scaffolds (e.g., use contiguous bucket ranges like train_buckets=range(0,4), val_buckets=range(4,7), test_buckets=range(7,10) or equivalent lists) so pigeonhole ensures non-empty splits given n_buckets=10, or (B) replace the loose >0 checks with exact expected sizes for train/val/test computed from the current scaffold set (calculate expected lengths once and assert equality against train, val, test). Modify the call site in the test (TrainUtils.train_val_test_split_bms invocation and the subsequent assertions on train, val, test) accordingly.

simba/workflows/preprocessing.py (2)
175-181: Duplication with the block at Lines 148-151 — consider a small helper.

The `len({s.params.get("smiles", "N/A") for s in ...})` idiom is repeated four times in this function. Extracting a tiny helper would reduce duplication and make it trivial to fix the missing-SMILES handling in one place.

♻️ Optional helper

```diff
+def _count_unique_smiles(spectra) -> int:
+    return len({s.params["smiles"] for s in spectra if s.params.get("smiles")})
```

Then replace each callsite, e.g.:

```diff
-    train_unique = len({s.params.get("smiles", "N/A") for s in all_spectra_train})
-    val_unique = len({s.params.get("smiles", "N/A") for s in all_spectra_val})
-    test_unique = len({s.params.get("smiles", "N/A") for s in all_spectra_test})
+    train_unique = _count_unique_smiles(all_spectra_train)
+    val_unique = _count_unique_smiles(all_spectra_val)
+    test_unique = _count_unique_smiles(all_spectra_test)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@simba/workflows/preprocessing.py` around lines 175 - 181, Extract the repeated set-comprehension into a small helper, e.g. add a function like count_unique_smiles(spectra) that returns len({s.params.get("smiles", "N/A") for s in spectra}), then replace the three callsites that assign train_unique, val_unique, and test_unique (which currently use len({s.params.get("smiles", "N/A") for s in ...})) with calls to that helper using all_spectra_train, all_spectra_val, and all_spectra_test so missing-SMILES handling is centralized and duplication removed.
197-219: Move `rdkit` import to module top and tighten the except clause.

Minor hygiene issues in this newly added diagnostic block:

- `from rdkit import Chem` at Line 198 runs on every loop iteration. It's cached by Python, but the import belongs at module scope alongside the other imports.
- At Lines 206-208, the `except Exception` is broad and the trailing `pass` is redundant after `logger.warning(...)`. Note that `Chem.MolFromSmiles` typically returns `None` rather than raising for invalid SMILES, so this path will rarely execute — but when it does, swallowing every `Exception` hides real bugs (e.g., `AttributeError` on an unexpected `smiles` type).

♻️ Proposed refactor

```diff
@@ top of file
 from omegaconf import DictConfig
+from rdkit import Chem
@@ inside loop
-    # Log statistics about molecule sizes
-    from rdkit import Chem
-
     large_mols = []
     for smiles in df_smiles["canon_smiles"]:
-        try:
-            mol = Chem.MolFromSmiles(smiles)
-            if mol and mol.GetNumAtoms() > 40:
-                large_mols.append((smiles, mol.GetNumAtoms()))
-        except Exception as e:
-            logger.warning(f"Error processing SMILES '{smiles}': {e}")
-            pass
+        mol = Chem.MolFromSmiles(smiles)
+        if mol is None:
+            logger.warning(f"Could not parse SMILES '{smiles}'")
+            continue
+        if mol.GetNumAtoms() > 40:
+            large_mols.append((smiles, mol.GetNumAtoms()))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@simba/workflows/preprocessing.py` around lines 197 - 219, Move the rdkit import out of the loop to module scope (replace the in-loop "from rdkit import Chem" with a top-of-file import) and tighten the exception handling around Chem.MolFromSmiles: instead of "except Exception", validate the smiles type (e.g., ensure it's a str) and catch only expected errors like TypeError or ValueError when calling Chem.MolFromSmiles; remove the redundant "pass" after logger.warning and keep the logger.warning(...) to report the exact error (include the exception variable) so real bugs aren’t silently swallowed — changes affect the block using Chem.MolFromSmiles, large_mols, and logger.warning.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@simba/commands/preprocess.py`:
- Around line 106-123: Validation currently assumes bucket sets are non-empty
and n_buckets > 0; when train_buckets/val_buckets/test_buckets are all empty
max(all_buckets) raises ValueError, negative indices slip through, and n_buckets
<= 0 isn't checked leading to failures in _scaffold_bucket later. Fix by (1)
validating n_buckets is an int > 0 (check n_buckets <= 0 and abort with a clear
message), (2) require that at least one of train_b/val_b/test_b is non-empty
before calling max(...) (or guard max(...) with a conditional and provide a
clear error if all are empty), and (3) ensure all bucket indices in train_b,
val_b, test_b are within the inclusive range [0, n_buckets-1] (reject negatives
and >= n_buckets), keeping the existing overlap check; update error messages to
reference n_buckets and offending values so users can correct configs.
In `@simba/core/data/loaders/loaders.py`:
- Around line 118-163: The summary/logging block after the for-loop in
get_spectra is skipped when the caller (e.g., get_all_spectra_mgf) stops
iterating early; wrap the generator body in a try: ... finally: and move the
"MGF Loading Summary" and condition_fails aggregation into the finally so it
always runs on normal exhaustion or early close (GeneratorExit). Ensure the
finally reads the same counters/total_results variables used inside get_spectra
and preserves their values, and do not rely on code after the loop to emit the
logs. This guarantees the summary is emitted whether the generator is fully
consumed or garbage-collected by get_all_spectra_mgf.
In `@simba/core/training/train_utils.py`:
- Around line 212-214: The logger.info call that computes compression using
len(all_spectra) / len(unique_smiles) can raise ZeroDivisionError when
all_spectra (and thus unique_smiles) is empty; update the code around the
logger.info (the block using variables all_spectra and unique_smiles) to
short-circuit when not all_spectra (e.g., if not all_spectra: logger.info("No
spectra found.") or log counts with compression "N/A") so you never perform the
division, otherwise keep the existing formatted message when there is at least
one spectrum.
- Around line 113-119: The logging line in train_utils.py can raise
ZeroDivisionError when spectra is empty because total = len(spectra) may be 0;
update the block around logger.info so it checks if total == 0 and uses safe
values (e.g., 0 or "N/A") for the percentage calculations instead of performing
100 * len(...) / total; adjust the message construction for logger.info
(referencing total, spectrums_train, spectrums_val, spectrums_test,
train_buckets, val_buckets, test_buckets) to use the guarded percentages so no
division occurs when total is zero.
In `@simba/workflows/preprocessing.py`:
- Around line 148-151: The current unique_molecules_total uses a sentinel "N/A"
which collapses all missing SMILES into one fake molecule; change the set
comprehension to only include truthy SMILES (e.g., set(s.params.get("smiles")
for s in all_spectra if s.params.get("smiles"))) and compute a missing_smiles
count (e.g., sum(1 for s in all_spectra if not s.params.get("smiles"))) then
update the logger.info call to report both Loaded {len(all_spectra)} spectra,
{unique_molecules_total} unique molecules, and {missing_smiles} spectra with
missing SMILES; apply the same filtering and missing count logic to the
per-split unique-molecule calculations that mirror this pattern.
- Around line 273-289: The call to _Chem.CanonSmiles(s) in the preprocessing
loop is unguarded and will raise an ArgumentError on invalid SMILES (making the
else branch unreachable); wrap the call in a try/except that catches the RDKit
error, falls back to adding the original s to canon_smiles, and emits a warning
log (use the module logger or processLogger). After building canon_smiles,
ensure filter_cache_by_smiles will match precomputed_cache keys by either
passing both canonical and raw SMILES (e.g., combine canon_smiles and original
all_smiles) or detect the cache key type and normalize the list accordingly
before calling filter_cache_by_smiles so cache hits are preserved.
In `@tools/compare_preprocessing_runs.py`:
- Around line 93-101: The current use of frozenset([s0, s1]) as the dict key
collapses self-pairs and hides duplicate/conflicting entries; change the key to
a stable ordered identifier (e.g., (s0, s1) or include indices like (i, j, s0,
s1)) in the loop that builds pairs so self-pairs stay distinct and order is
preserved, and add a duplicate-detection check: when the key already exists in
pairs compare the existing (ed, mces) tuple to the new (float(row[2]),
float(row[3])) and if they differ, log or raise an error/report the conflict
(use the variables pairs, row, i, j, s0, s1, and the stored value) so upstream
chunk-overlap or inconsistent data is surfaced rather than silently overwritten.
- Around line 31-41: The load_pickle function currently picks the first match
from p.glob("*.pkl") which is non-deterministic; change it to use a
deterministic selection and fail loudly when ambiguous: after collecting
candidates from Path.glob, sort the candidate paths (e.g., sorted(candidates))
and if more than one .pkl candidate exists raise a clear FileExistsError (or
similar) asking the caller to disambiguate rather than silently using
candidates[0]; keep the existing behavior of preferring
"mapping_unique_smiles.pkl" when present and only apply deterministic sorting /
ambiguity check to the fallback branch.
- Around line 153-170: The current check only compares keys
(only_in_r1/only_in_r2) and misses cases where a pair exists in both runs but
with different ED/MCES values; update the assertion to also compute common_keys
= set(pairs1.keys()) & set(pairs2.keys()) and iterate those keys to compare
(pairs1[key] vs pairs2[key]), and if any ED or MCES differ set
any_assertion_failed = True and print a clear example (reusing the existing
print formatting for ED/MCES); keep the existing key-presence checks
(only_in_r1/only_in_r2) but ensure mismatched values trigger the failure path so
the script exits non-zero when distances differ.
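The value-level comparison that the last prompt describes can be sketched as follows. The names `pairs1`/`pairs2` and the `(ed, mces)` tuple layout follow the prompt's description; the function itself is a hypothetical illustration, not code from `tools/compare_preprocessing_runs.py`:

```python
def compare_pairs(pairs1: dict, pairs2: dict) -> bool:
    """Return True if the two runs disagree (missing keys or differing values)."""
    any_assertion_failed = False
    # Existing key-presence checks: pairs seen in only one run.
    only_in_r1 = set(pairs1) - set(pairs2)
    only_in_r2 = set(pairs2) - set(pairs1)
    if only_in_r1 or only_in_r2:
        any_assertion_failed = True
    # New: also compare ED/MCES values on the intersection of keys.
    common_keys = set(pairs1) & set(pairs2)
    for key in sorted(common_keys):
        ed1, mces1 = pairs1[key]
        ed2, mces2 = pairs2[key]
        if ed1 != ed2 or mces1 != mces2:
            any_assertion_failed = True
            print(f"Mismatch for {key}: ED {ed1} vs {ed2}, MCES {mces1} vs {mces2}")
    return any_assertion_failed
```

The caller would exit non-zero when this returns True, so identical key sets with differing distances no longer pass silently.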
---
Nitpick comments:
In `@simba/core/data/loaders/loaders.py`:
- Around line 581-583: Replace the use of the raw params dict when counting
unique molecules: in the block that logs loaded spectra (variable spectra),
compute unique_molecules from the SpectrumExt.smiles attribute rather than
s.params["smiles"] and remove the "smiles" in s.params guard because
SpectrumExt.smiles is normalized (empty string when missing); update the
logger.info call that reports Unique molecules to use this new unique_molecules
value (references: SpectrumExt.smiles, spectra, unique_molecules, logger.info).
In `@simba/core/training/train_utils.py`:
- Around line 44-54: Hoist the import of hashlib to the module top and add
validation in _scaffold_bucket (function name: _scaffold_bucket) to guard
against non-positive bucket counts: check that n_buckets is an int > 0 (raise
ValueError or assert with a clear message) before computing the digest, then
compute the SHA-256 hex digest and return int(digest, 16) % n_buckets as before;
this avoids repeated local imports in hot loops and prevents a cryptic
ZeroDivisionError when n_buckets <= 0.
In `@simba/workflows/preprocessing.py`:
- Around line 175-181: Extract the repeated set-comprehension into a small
helper, e.g. add a function like count_unique_smiles(spectra) that returns
len({s.params.get("smiles", "N/A") for s in spectra}), then replace the three
callsites that assign train_unique, val_unique, and test_unique (which currently
use len({s.params.get("smiles", "N/A") for s in ...})) with calls to that helper
using all_spectra_train, all_spectra_val, and all_spectra_test so missing-SMILES
handling is centralized and duplication removed.
- Around line 197-219: Move the rdkit import out of the loop to module scope
(replace the in-loop "from rdkit import Chem" with a top-of-file import) and
tighten the exception handling around Chem.MolFromSmiles: instead of "except
Exception", validate the smiles type (e.g., ensure it's a str) and catch only
expected errors like TypeError or ValueError when calling Chem.MolFromSmiles;
remove the redundant "pass" after logger.warning and keep the
logger.warning(...) to report the exact error (include the exception variable)
so real bugs aren’t silently swallowed — changes affect the block using
Chem.MolFromSmiles, large_mols, and logger.warning.
In `@simba/workflows/utils.py`:
- Around line 68-113: The logs recompute unique molecule counts and conflate
missing SMILES into a single "N/A" bucket; update the preprocessing logging to
compute unique_molecules_original once from all_spectra (store and reuse instead
of recomputing), compute unique_molecules_filtered similarly but exclude the
fallback value (filter out params.get("smiles") values that are None or equal to
the fallback "N/A" before taking set) so missing-SMILES spectra aren't merged
into one counted molecule, and remove the redundant simple log of
len(filtered_spectra) (the detailed Filtering summary already reports remaining
counts); look for usages of all_spectra, filtered_spectra,
unique_molecules_original, unique_molecules_filtered, and Preprocessor to apply
these changes.
In `@tests/integration/test_training_pipeline.py`:
- Around line 49-74: The test TrainUtils.train_val_test_split_bms usage is
brittle because specific bucket lists ([2], [4], [9]) rely on SHA256 of scaffold
strings; update the test to make it deterministic and robust by either (A)
selecting bucket ranges that guarantee non-empty picks for 5 distinct scaffolds
(e.g., use contiguous bucket ranges like train_buckets=range(0,4),
val_buckets=range(4,7), test_buckets=range(7,10) or equivalent lists) so
pigeonhole ensures non-empty splits given n_buckets=10, or (B) replace the loose
>0 checks with exact expected sizes for train/val/test computed from the current
scaffold set (calculate expected lengths once and assert equality against train,
val, test). Modify the call site in the test
(TrainUtils.train_val_test_split_bms invocation and the subsequent assertions on
train, val, test) accordingly.
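Option (B) from the last prompt can be sketched like this. The `bucket` helper mirrors the SHA-256-mod-`n_buckets` scheme described in the comments, and the `scaffold_0`..`scaffold_4` literals are the ones the test is said to use; both are assumptions for illustration, not the repository's code:

```python
import hashlib

def bucket(scaffold: str, n_buckets: int = 10) -> int:
    # Deterministic across interpreter runs, unlike the built-in hash()
    # with randomized seeds.
    return int(hashlib.sha256(scaffold.encode("utf-8")).hexdigest(), 16) % n_buckets

# Compute the bucket assignment once for the five scaffold literals;
# printing it lets you pin exact expected split sizes in the test.
scaffolds = [f"scaffold_{i}" for i in range(5)]
assignment = {s: bucket(s) for s in scaffolds}
```

Hard-coding the resulting split sizes in the assertions makes a later rename of the scaffold literals (or a change to `n_buckets`) fail loudly instead of silently emptying a split.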
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 06efa8e6-93ac-453a-9f92-0239bce622d3
📒 Files selected for processing (11)
- README.md
- simba/commands/preprocess.py
- simba/configs/preprocessing/default.yaml
- simba/configs/preprocessing/fast_dev.yaml
- simba/core/data/loaders/loaders.py
- simba/core/training/train_utils.py
- simba/workflows/preprocessing.py
- simba/workflows/utils.py
- tests/integration/test_training_pipeline.py
- tests/unit/test_train_utils.py
- tools/compare_preprocessing_runs.py
💤 Files with no reviewable changes (1)
- simba/configs/preprocessing/fast_dev.yaml
```diff
     n_buckets = cfg.preprocessing.n_buckets
     train_b = set(cfg.preprocessing.train_buckets)
     val_b = set(cfg.preprocessing.val_buckets)
     test_b = set(cfg.preprocessing.test_buckets)
     all_buckets = train_b | val_b | test_b
     if len(train_b & val_b) or len(train_b & test_b) or len(val_b & test_b):
         click.echo(
-            "❌ Error: val_split + test_split must be less than 1.0",
+            "Error: train_buckets, val_buckets and test_buckets must not overlap",
             err=True,
         )
         raise click.Abort()
     if max(all_buckets) >= n_buckets:
         click.echo(
             f"Error: bucket numbers must be in [0, {n_buckets}), "
             f"got max={max(all_buckets)}",
             err=True,
         )
         raise click.Abort()
```
Validation crashes when all bucket lists are empty, and misses negative-index checks.
- If a user (by mistake) passes empty lists for all three bucket configs, `all_buckets = set()` and `max(all_buckets)` raises `ValueError: max() arg is an empty sequence`, producing an unhelpful traceback instead of a clear CLI error.
- The validation only checks the upper bound. Negative bucket indices pass through silently: they won't match any scaffold's computed bucket (which is always in `[0, n_buckets)`), so negative entries in `val_buckets`/`test_buckets` silently route those spectra to train.
- `n_buckets <= 0` is similarly not validated; it would blow up later in `_scaffold_bucket` with `ZeroDivisionError`.
🛡️ Suggested tightened validation
```diff
     n_buckets = cfg.preprocessing.n_buckets
+    if n_buckets <= 0:
+        click.echo(f"Error: n_buckets must be positive, got {n_buckets}", err=True)
+        raise click.Abort()
     train_b = set(cfg.preprocessing.train_buckets)
     val_b = set(cfg.preprocessing.val_buckets)
     test_b = set(cfg.preprocessing.test_buckets)
     all_buckets = train_b | val_b | test_b
+    if not all_buckets:
+        click.echo(
+            "Error: at least one of train_buckets/val_buckets/test_buckets must be non-empty",
+            err=True,
+        )
+        raise click.Abort()
     if len(train_b & val_b) or len(train_b & test_b) or len(val_b & test_b):
         click.echo(
             "Error: train_buckets, val_buckets and test_buckets must not overlap",
             err=True,
         )
         raise click.Abort()
-    if max(all_buckets) >= n_buckets:
+    if max(all_buckets) >= n_buckets or min(all_buckets) < 0:
         click.echo(
             f"Error: bucket numbers must be in [0, {n_buckets}), "
-            f"got max={max(all_buckets)}",
+            f"got min={min(all_buckets)}, max={max(all_buckets)}",
             err=True,
         )
         raise click.Abort()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@simba/commands/preprocess.py` around lines 106 - 123, Validation currently
assumes bucket sets are non-empty and n_buckets > 0; when
train_buckets/val_buckets/test_buckets are all empty max(all_buckets) raises
ValueError, negative indices slip through, and n_buckets <= 0 isn't checked
leading to failures in _scaffold_bucket later. Fix by (1) validating n_buckets
is an int > 0 (check n_buckets <= 0 and abort with a clear message), (2) require
that at least one of train_b/val_b/test_b is non-empty before calling max(...)
(or guard max(...) with a conditional and provide a clear error if all are
empty), and (3) ensure all bucket indices in train_b, val_b, test_b are within
the inclusive range [0, n_buckets-1] (reject negatives and >= n_buckets),
keeping the existing overlap check; update error messages to reference n_buckets
and offending values so users can correct configs.
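For context, the bucket assignment that this validation feeds into can be sketched as follows. This is a hypothetical mirror of the `_scaffold_bucket` helper described above (SHA-256 hex digest mod `n_buckets`, with the guard the comments ask for), not the project's actual implementation:

```python
import hashlib

def scaffold_bucket(scaffold: str, n_buckets: int) -> int:
    """Map a scaffold SMILES string to a stable bucket in [0, n_buckets)."""
    if not isinstance(n_buckets, int) or n_buckets <= 0:
        # Guarding here turns a cryptic ZeroDivisionError into a clear message.
        raise ValueError(f"n_buckets must be a positive int, got {n_buckets!r}")
    digest = hashlib.sha256(scaffold.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets
```

Because the result is always in `[0, n_buckets)`, any negative bucket index in a config can never match and would silently route spectra to train — which is why the CLI-level range check matters.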
```python
    logger.info(
        f"MGF Loading Summary: Processed {spectra_processed} spectra - "
        f"Passed validation: {spectra_yielded + spectra_failed_parsing + spectra_failed_nan_check}, "
        f"Failed validation: {spectra_failed_validation}, "
        f"Failed parsing: {spectra_failed_parsing}, "
        f"Failed NaN check: {spectra_failed_nan_check}, "
        f"Final yielded: {spectra_yielded}"
    )

    # Aggregate condition breakdown from all processed spectra
    if total_results and len(total_results) > 0:
        condition_fails = {
            "library_quality": 0,
            "charge": 0,
            "pepmass": 0,
            "mz_array": 0,
            "ion_mode": 0,
            "name_format": 0,
            "centroid": 0,
            "inchi_smiles": 0,
        }

        for res in total_results:
            if not res.get("cond_library", True):
                condition_fails["library_quality"] += 1
            if not res.get("cond_charge", True):
                condition_fails["charge"] += 1
            if not res.get("cond_pepmass", True):
                condition_fails["pepmass"] += 1
            if not res.get("cond_mz_array", True):
                condition_fails["mz_array"] += 1
            if not res.get("cond_ion_mode", True):
                condition_fails["ion_mode"] += 1
            if not res.get("cond_name", True):
                condition_fails["name_format"] += 1
            if not res.get("cond_centroid", True):
                condition_fails["centroid"] += 1
            if not res.get("cond_inchi_smiles", True):
                condition_fails["inchi_smiles"] += 1

        # Log conditions causing most failures
        logger.info("Validation condition failures (spectra rejected for each condition):")
        for condition, count in sorted(condition_fails.items(), key=lambda x: x[1], reverse=True):
            if count > 0:
                pct = 100.0 * count / spectra_processed
                logger.info(f"  {condition}: {count} spectra ({pct:.1f}%)")
```
Summary log is skipped when caller stops iterating early.
`get_spectra` is a generator, and the MGF Loading Summary / condition-breakdown logs sit after the `for mgf_index, spectrum in spectrum_it():` loop. They only execute if the generator is exhausted. In `get_all_spectra_mgf` (Lines 571-578) the caller iterates up to `num_samples` and then simply drops the generator reference; if the cap is reached before EOF, the underlying generator never resumes and the summary is never emitted — which defeats the point of the added observability for any capped run.

Options:

- Move the accounting into `get_all_spectra_mgf` (pass counters by reference, or collect them after a `try/finally` that runs on `GeneratorExit`).
- Use a `try: ... finally:` inside `get_spectra` so the summary is logged on both exhaustion and early close.
🛡️ Sketch using try/finally
```diff
-    for mgf_index, spectrum in spectrum_it():
-        spectra_processed += 1
-        ...
-
-    logger.info(
-        f"MGF Loading Summary: ..."
-    )
+    try:
+        for mgf_index, spectrum in spectrum_it():
+            spectra_processed += 1
+            ...
+    finally:
+        logger.info(
+            f"MGF Loading Summary: ..."
+        )
+        # condition breakdown block ...
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@simba/core/data/loaders/loaders.py` around lines 118 - 163, The
summary/logging block after the for-loop in get_spectra is skipped when the
caller (e.g., get_all_spectra_mgf) stops iterating early; wrap the generator
body in a try: ... finally: and move the "MGF Loading Summary" and
condition_fails aggregation into the finally so it always runs on normal
exhaustion or early close (GeneratorExit). Ensure the finally reads the same
counters/total_results variables used inside get_spectra and preserves their
values, and do not rely on code after the loop to emit the logs. This guarantees
the summary is emitted whether the generator is fully consumed or
garbage-collected by get_all_spectra_mgf.
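The try/finally behaviour is easy to verify in isolation. This standalone sketch (illustrative names, not the loader's real code) shows the summary firing even when a capped caller closes the generator early:

```python
import logging

logger = logging.getLogger(__name__)
summary = {}  # module-level so the effect of `finally` is observable

def get_spectra(items):
    processed = yielded = 0
    try:
        for item in items:
            processed += 1
            if item is not None:  # stand-in for spectrum validation
                yielded += 1
                yield item
    finally:
        # Runs on normal exhaustion AND on early close (GeneratorExit),
        # so the summary is always emitted.
        summary.update(processed=processed, yielded=yielded)
        logger.info("Loading summary: processed=%d yielded=%d", processed, yielded)

# Capped consumption, as in get_all_spectra_mgf: take 2 of 4, then close.
gen = get_spectra([1, None, 2, 3])
first_two = [next(gen) for _ in range(2)]
gen.close()  # raises GeneratorExit inside the generator; finally still runs
```

Note that merely dropping the reference also triggers the `finally` via garbage collection, but calling `close()` makes the timing deterministic.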
```python
    total = len(spectra)
    logger.info(
        f"Split sizes – Train: {len(spectrums_train)} ({100 * len(spectrums_train) / total:.1f}%), "
        f"Val: {len(spectrums_val)} ({100 * len(spectrums_val) / total:.1f}%), "
        f"Test: {len(spectrums_test)} ({100 * len(spectrums_test) / total:.1f}%) "
        f"[train buckets: {sorted(train_buckets)}, val: {sorted(val_buckets)}, test: {sorted(test_buckets)}]"
    )
```
ZeroDivisionError when spectra is empty.
If `spectra` is an empty list, `total = 0` and the percentage computations in the log line (`100 * len(...) / total`) will raise `ZeroDivisionError`. Guard the logging when `total == 0`.
🛡️ Suggested guard
```diff
-    total = len(spectra)
-    logger.info(
-        f"Split sizes – Train: {len(spectrums_train)} ({100 * len(spectrums_train) / total:.1f}%), "
-        f"Val: {len(spectrums_val)} ({100 * len(spectrums_val) / total:.1f}%), "
-        f"Test: {len(spectrums_test)} ({100 * len(spectrums_test) / total:.1f}%) "
-        f"[train buckets: {sorted(train_buckets)}, val: {sorted(val_buckets)}, test: {sorted(test_buckets)}]"
-    )
+    total = len(spectra)
+    if total:
+        logger.info(
+            f"Split sizes – Train: {len(spectrums_train)} ({100 * len(spectrums_train) / total:.1f}%), "
+            f"Val: {len(spectrums_val)} ({100 * len(spectrums_val) / total:.1f}%), "
+            f"Test: {len(spectrums_test)} ({100 * len(spectrums_test) / total:.1f}%) "
+            f"[train buckets: {sorted(train_buckets)}, val: {sorted(val_buckets)}, test: {sorted(test_buckets)}]"
+        )
+    else:
+        logger.info("Split called on empty spectra list")
```
Verify each finding against the current code and only fix it if needed.
In `@simba/core/training/train_utils.py` around lines 113 - 119, The logging line
in train_utils.py can raise ZeroDivisionError when spectra is empty because
total = len(spectra) may be 0; update the block around logger.info so it checks
if total == 0 and uses safe values (e.g., 0 or "N/A") for the percentage
calculations instead of performing 100 * len(...) / total; adjust the message
construction for logger.info (referencing total, spectrums_train, spectrums_val,
spectrums_test, train_buckets, val_buckets, test_buckets) to use the guarded
percentages so no division occurs when total is zero.
```python
    logger.info(
        f"Found {len(unique_smiles)} unique SMILES from {len(all_spectra)} spectra (compression: {len(all_spectra) / len(unique_smiles):.2f}x)"
    )
```
ZeroDivisionError on empty input.
`len(all_spectra) / len(unique_smiles)` will divide by zero when `all_spectra` is empty (so `unique_smiles` is also empty). Worth short-circuiting the log when there are no spectra.
🛡️ Suggested guard
-    logger.info(
-        f"Found {len(unique_smiles)} unique SMILES from {len(all_spectra)} spectra (compression: {len(all_spectra) / len(unique_smiles):.2f}x)"
-    )
+    if len(unique_smiles):
+        logger.info(
+            f"Found {len(unique_smiles)} unique SMILES from {len(all_spectra)} spectra "
+            f"(compression: {len(all_spectra) / len(unique_smiles):.2f}x)"
+        )
+    else:
+        logger.info(f"Found 0 unique SMILES from {len(all_spectra)} spectra")

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    if len(unique_smiles):
        logger.info(
            f"Found {len(unique_smiles)} unique SMILES from {len(all_spectra)} spectra "
            f"(compression: {len(all_spectra) / len(unique_smiles):.2f}x)"
        )
    else:
        logger.info(f"Found 0 unique SMILES from {len(all_spectra)} spectra")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@simba/core/training/train_utils.py` around lines 212 - 214, The logger.info
call that computes compression using len(all_spectra) / len(unique_smiles) can
raise ZeroDivisionError when all_spectra (and thus unique_smiles) is empty;
update the code around the logger.info (the block using variables all_spectra
and unique_smiles) to short-circuit when not all_spectra (e.g., if not
all_spectra: logger.info("No spectra found.") or log counts with compression
"N/A") so you never perform the division, otherwise keep the existing formatted
message when there is at least one spectrum.
    unique_molecules_total = len({s.params.get("smiles", "N/A") for s in all_spectra})
    logger.info(
        f"Loaded {len(all_spectra)} spectra with {unique_molecules_total} unique molecules"
    )
Unique-molecule count may be inflated by missing SMILES.
Using "N/A" as the default for spectra lacking a smiles param means every such spectrum collapses into a single synthetic "unique molecule", silently inflating unique_molecules_total by up to 1 and masking data-quality issues. Consider filtering out falsy values instead and/or logging the count of spectra without SMILES. The same pattern is repeated at Lines 176-178 for the per-split counts.
🔧 Proposed fix
-    unique_molecules_total = len({s.params.get("smiles", "N/A") for s in all_spectra})
+    smiles_set = {s.params.get("smiles") for s in all_spectra}
+    n_missing = sum(1 for s in all_spectra if not s.params.get("smiles"))
+    smiles_set.discard(None)
+    smiles_set.discard("")
+    unique_molecules_total = len(smiles_set)
+    if n_missing:
+        logger.warning(f"{n_missing} spectra missing SMILES were excluded from unique count")
     logger.info(
         f"Loaded {len(all_spectra)} spectra with {unique_molecules_total} unique molecules"
     )
Verify each finding against the current code and only fix it if needed.
In `@simba/workflows/preprocessing.py` around lines 148 - 151, The current
unique_molecules_total uses a sentinel "N/A" which collapses all missing SMILES
into one fake molecule; change the set comprehension to only include truthy
SMILES (e.g., set(s.params.get("smiles") for s in all_spectra if
s.params.get("smiles"))) and compute a missing_smiles count (e.g., sum(1 for s
in all_spectra if not s.params.get("smiles"))) then update the logger.info call
to report both Loaded {len(all_spectra)} spectra, {unique_molecules_total}
unique molecules, and {missing_smiles} spectra with missing SMILES; apply the
same filtering and missing count logic to the per-split unique-molecule
calculations that mirror this pattern.
 from rdkit import Chem as _Chem

 from simba.core.chemistry.edit_distance.edit_distance import (
     filter_cache_by_smiles,
 )

 canon_smiles = set()
 for s in all_smiles:
     canon = _Chem.CanonSmiles(s)
     if canon:
         canon_smiles.add(canon)
     else:
         canon_smiles.add(s)

 precomputed_cache = filter_cache_by_smiles(
-    precomputed_cache, list(all_smiles)
+    precomputed_cache, list(canon_smiles)
 )
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Confirm how filter_cache_by_smiles interprets its keys and how keys were produced.
fd -t f edit_distance.py
rg -nP -C4 '\bdef\s+filter_cache_by_smiles\b'
rg -nP -C3 '\bload_precomputed_distances_cache\b'
rg -nP -C3 'CanonSmiles|canon_smiles' --type=py
Repository: bittremieuxlab/simba
Length of output: 31215
🏁 Script executed:
cd simba && head -60 core/chemistry/edit_distance/edit_distance.py
Repository: bittremieuxlab/simba
Length of output: 1748
🏁 Script executed:
rg -nP -A15 '\bdef\s+filter_cache_by_smiles\b' simba/core/chemistry/edit_distance/edit_distance.py
Repository: bittremieuxlab/simba
Length of output: 609
🏁 Script executed:
rg -nP -A30 '\bdef\s+load_precomputed_distances_cache\b' simba/core/chemistry/edit_distance/edit_distance.py
Repository: bittremieuxlab/simba
Length of output: 1156
🏁 Script executed:
rg -nP 'cache\[|cache\.get|cache\.' simba/core/chemistry/edit_distance/edit_distance.py | head -20
Repository: bittremieuxlab/simba
Length of output: 303
🏁 Script executed:
rg -nP -A100 '\bdef\s+load_precomputed_distances_cache\b' simba/core/chemistry/edit_distance/edit_distance.py | head -120
Repository: bittremieuxlab/simba
Length of output: 3980
🏁 Script executed:
rg -nP -A150 '\bdef\s+load_precomputed_distances_cache\b' simba/core/chemistry/edit_distance/edit_distance.py | tail -80
Repository: bittremieuxlab/simba
Length of output: 3367
🏁 Script executed:
rg -nP -A50 '\bdef\s+_load_distances_from_files\b' simba/core/chemistry/edit_distance/edit_distance.py
Repository: bittremieuxlab/simba
Length of output: 1732
🌐 Web query:
RDKit Chem.CanonSmiles invalid SMILES behavior raises exception
💡 Result:
Yes, RDKit's Chem.CanonSmiles raises an exception when given an invalid SMILES string. It internally calls Chem.MolFromSmiles(smi), which returns None for invalid SMILES. Then, Chem.MolToSmiles(None) triggers an ArgumentError because no C++ Mol object is provided. This is shown in the source code and Python wrapper: CanonSmiles(smi) does m = MolFromSmiles(smi); return MolToSmiles(m, ...). If parsing fails, m is None, causing the exception on MolToSmiles. Examples from GitHub issues confirm this: invalid inputs lead to ArgumentError (e.g., issues #8670, #5443). MolFromSmiles itself returns None silently for parse failures (issue #22), but CanonSmiles propagates the error via exception. To handle safely, check if mol = Chem.MolFromSmiles(smi) is not None before calling Chem.MolToSmiles(mol). Chem.CanonSmiles does not return a string or empty for invalid input; it raises.
Citations:
- 1: https://www.rdkit.org/docs/source/rdkit.Chem.html
- 2: https://github.com/rdkit/rdkit/blob/master/Code/GraphMol/SmilesParse/SmilesWrite.cpp
- 3: Canonical smiles can't be kekulized by MolFromSmiles rdkit/rdkit#8670
- 4: Canonicalize valid, canonicalized SMILES raises ArgumenError exception rdkit/rdkit#5443
- 5: Can't parse SMILES strings rdkit/rdkit#3722
- 6: Smiles Parse Error after RDKit Canonicalization rdkit/rdkit#7696
- 7: cannot read SMILES which has been canonized rdkit/rdkit#2967
Unguarded CanonSmiles call will crash preprocessing on invalid SMILES.
Chem.CanonSmiles(s) raises an ArgumentError (it internally calls MolFromSmiles followed by MolToSmiles on the result) on invalid SMILES rather than returning an empty or falsy value. The current code has no exception handling, so a single invalid SMILES in the dataset will abort the entire preprocessing run. Additionally, the else: canon_smiles.add(s) branch at lines 284–285 is unreachable because the exception is raised before evaluation can continue.
Wrap the call in a try-except block to fall back to the original SMILES on canonicalization failure and log a warning. Also verify that filter_cache_by_smiles will match: the cache keys are built from whichever SMILES column exists in the preprocessing pickle (preferring canon_smiles if available, otherwise smiles). If the cache was created with raw SMILES keys but filtering now passes canonicalized SMILES, all entries will silently fail to match, causing zero cache hits without warning.
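The fallback pattern itself does not depend on RDKit. A sketch with the canonicalizer injected as a callable so the snippet runs without RDKit installed; in the real code the callable would be Chem.CanonSmiles, and stub_canon below is purely illustrative:

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger(__name__)


def canonicalize_all(
    smiles: Iterable[str], canonicalize: Callable[[str], str]
) -> set[str]:
    """Canonicalize each SMILES, falling back to the raw string on failure."""
    out = set()
    for s in smiles:
        try:
            out.add(canonicalize(s))
        except Exception:  # RDKit raises ArgumentError on invalid SMILES
            logger.warning("Could not canonicalize %r; keeping raw SMILES", s)
            out.add(s)
    return out


# Stub canonicalizer: uppercases alphanumeric strings, raises otherwise.
def stub_canon(s: str) -> str:
    if not s.isalnum():
        raise ValueError(f"invalid SMILES: {s}")
    return s.upper()


result = canonicalize_all(["cco", "c1cc???"], stub_canon)
assert result == {"CCO", "c1cc???"}
```

With this shape, a single malformed SMILES degrades to a warning plus a raw-SMILES cache key instead of aborting the whole preprocessing run.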
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@simba/workflows/preprocessing.py` around lines 273 - 289, The call to
_Chem.CanonSmiles(s) in the preprocessing loop is unguarded and will raise an
ArgumentError on invalid SMILES (making the else branch unreachable); wrap the
call in a try/except that catches the RDKit error, falls back to adding the
original s to canon_smiles, and emits a warning log (use the module logger or
processLogger). After building canon_smiles, ensure filter_cache_by_smiles will
match precomputed_cache keys by either passing both canonical and raw SMILES
(e.g., combine canon_smiles and original all_smiles) or detect the cache key
type and normalize the list accordingly before calling filter_cache_by_smiles so
cache hits are preserved.
def load_pickle(preprocessing_dir: str) -> dict:
    p = Path(preprocessing_dir)
    candidates = list(p.glob("mapping_unique_smiles.pkl"))
    if not candidates:
        candidates = list(p.glob("*.pkl"))
    if not candidates:
        raise FileNotFoundError(f"No pickle found in {preprocessing_dir}")
    path = candidates[0]
    print(f"  Loading {path}")
    with open(path, "rb") as f:
        return dill.load(f)
Non-deterministic fallback pickle selection.
When mapping_unique_smiles.pkl is absent, the fallback p.glob("*.pkl") returns results in arbitrary filesystem order and candidates[0] is used directly. On runs that produce multiple pickles (e.g., both "lightweight" and "full" format artifacts), the two directories being compared could silently load different representative files, producing misleading comparisons. Consider sorted(...) for determinism and/or failing loudly when multiple candidates exist.
♻️ Suggested change
-    candidates = list(p.glob("mapping_unique_smiles.pkl"))
-    if not candidates:
-        candidates = list(p.glob("*.pkl"))
+    candidates = sorted(p.glob("mapping_unique_smiles.pkl"))
+    if not candidates:
+        candidates = sorted(p.glob("*.pkl"))
     if not candidates:
         raise FileNotFoundError(f"No pickle found in {preprocessing_dir}")
+    if len(candidates) > 1:
+        print(f"  Warning: multiple pickles found, using {candidates[0].name}")
     path = candidates[0]
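The stricter fail-loud variant can be exercised against a throwaway directory using only the standard library; pick_pickle is a hypothetical helper name for illustration:

```python
import tempfile
from pathlib import Path


def pick_pickle(preprocessing_dir: str) -> Path:
    """Pick a pickle deterministically; fail loudly when ambiguous."""
    p = Path(preprocessing_dir)
    candidates = sorted(p.glob("mapping_unique_smiles.pkl"))
    if not candidates:
        candidates = sorted(p.glob("*.pkl"))
    if not candidates:
        raise FileNotFoundError(f"No pickle found in {preprocessing_dir}")
    if len(candidates) > 1:
        raise FileExistsError(
            f"Ambiguous pickles {[c.name for c in candidates]} in {preprocessing_dir}"
        )
    return candidates[0]


with tempfile.TemporaryDirectory() as d:
    (Path(d) / "b_full.pkl").touch()
    (Path(d) / "a_light.pkl").touch()
    try:
        pick_pickle(d)  # two .pkl files, no preferred name: must refuse
    except FileExistsError:
        print("ambiguous fallback rejected")
    (Path(d) / "mapping_unique_smiles.pkl").touch()
    assert pick_pickle(d).name == "mapping_unique_smiles.pkl"
```

Raising on ambiguity (rather than warning) guarantees the two compared directories can never silently load different representative files.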
Verify each finding against the current code and only fix it if needed.
In `@tools/compare_preprocessing_runs.py` around lines 31 - 41, The load_pickle
function currently picks the first match from p.glob("*.pkl") which is
non-deterministic; change it to use a deterministic selection and fail loudly
when ambiguous: after collecting candidates from Path.glob, sort the candidate
paths (e.g., sorted(candidates)) and if more than one .pkl candidate exists
raise a clear FileExistsError (or similar) asking the caller to disambiguate
rather than silently using candidates[0]; keep the existing behavior of
preferring "mapping_unique_smiles.pkl" when present and only apply deterministic
sorting / ambiguity check to the fallback branch.
    pairs = {}
    for row in all_rows:
        i, j = int(row[0]), int(row[1])
        s0 = idx_to_smiles.get(i)
        s1 = idx_to_smiles.get(j)
        if s0 is None or s1 is None:
            continue
        pairs[frozenset([s0, s1])] = (float(row[2]), float(row[3]))
    return pairs
frozenset pair key silently collapses self-pairs and loses multiplicity.
Using frozenset([s0, s1]) as the key has two footguns worth flagging:
- A self-pair (i, i) (same SMILES on both sides) becomes a size-1 frozenset, which will still compare equal to any other self-pair of the same molecule but cannot be distinguished from an ordered pair involving only that one molecule.
- If the same {s0, s1} pair appears in multiple .npy chunks (or with both orderings), later rows silently overwrite earlier ones in pairs, so duplicate-detection and chunk-overlap bugs in the upstream pipeline won't surface here.
At minimum, consider detecting duplicate keys with conflicting (ed, mces) values during the build and reporting them, since that is exactly the kind of inconsistency this tool is meant to catch.
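One way to keep self-pairs distinct and surface conflicts is an order-normalized tuple key plus an explicit conflict list; a toy sketch with invented rows and SMILES:

```python
rows = [
    (0, 1, 2.0, 5.0),
    (1, 0, 2.0, 5.0),  # same pair, reversed order: harmless overwrite
    (0, 1, 3.0, 5.0),  # conflicting edit distance: should be flagged
    (2, 2, 0.0, 0.0),  # self-pair stays a distinct 2-tuple key
]
idx_to_smiles = {0: "CCO", 1: "CCN", 2: "c1ccccc1"}

pairs: dict[tuple[str, str], tuple[float, float]] = {}
conflicts = []
for i, j, ed, mces in rows:
    s0, s1 = idx_to_smiles[i], idx_to_smiles[j]
    key = tuple(sorted((s0, s1)))  # canonical order, but always length 2
    value = (ed, mces)
    if key in pairs and pairs[key] != value:
        conflicts.append((key, pairs[key], value))
    pairs[key] = value

assert ("c1ccccc1", "c1ccccc1") in pairs  # self-pair not collapsed
assert len(conflicts) == 1                # the 3.0-vs-2.0 mismatch
```

Reporting (or raising on) a non-empty conflicts list makes exactly the chunk-overlap inconsistencies this tool targets visible during the build instead of being overwritten.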
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tools/compare_preprocessing_runs.py` around lines 93 - 101, The current use
of frozenset([s0, s1]) as the dict key collapses self-pairs and hides
duplicate/conflicting entries; change the key to a stable ordered identifier
(e.g., (s0, s1) or include indices like (i, j, s0, s1)) in the loop that builds
pairs so self-pairs stay distinct and order is preserved, and add a
duplicate-detection check: when the key already exists in pairs compare the
existing (ed, mces) tuple to the new (float(row[2]), float(row[3])) and if they
differ, log or raise an error/report the conflict (use the variables pairs, row,
i, j, s0, s1, and the stored value) so upstream chunk-overlap or inconsistent
data is surfaced rather than silently overwritten.
    # Assertion: a pair of shared molecules must be in both or neither
    if only_in_r1 or only_in_r2:
        any_assertion_failed = True
        print("\n  ❌ ASSERTION FAILED: inconsistent pairs for shared molecules!")
        if only_in_r1:
            example = next(iter(only_in_r1))
            ed, mc = pairs1[example]
            print(
                f"    Example only in run1: {sorted(example)} ED={ed:.1f} MCES={mc:.1f}"
            )
        if only_in_r2:
            example = next(iter(only_in_r2))
            ed, mc = pairs2[example]
            print(
                f"    Example only in run2: {sorted(example)} ED={ed:.1f} MCES={mc:.1f}"
            )
    else:
        print("\n  ✅ Assertion passed: all shared-molecule pairs are consistent.")
Assertion does not verify ED/MCES equality as documented.
The module docstring (lines 15–16) states the assertion guarantees that shared-molecule pairs are present in both runs "with identical ED and MCES values". However, only_in_r1/only_in_r2 are computed purely via set membership on frozenset keys (lines 142–146), so two runs with the same pair keys but differing ED/MCES values will pass this check. The value mismatch is only surfaced visually via the ⚠️ marker in the sample loop (lines 179–181) and never flips any_assertion_failed, so the script will exit 0 despite inconsistent distances — defeating the primary purpose of the tool.
🛡️ Proposed fix: also assert value equality on common pairs
     common_pairs = cross_pairs1 & cross_pairs2
     only_in_r1 = cross_pairs1 - cross_pairs2
     only_in_r2 = cross_pairs2 - cross_pairs1
+
+    # Value mismatches on pairs present in both runs
+    value_mismatches = [
+        (p, pairs1[p], pairs2[p])
+        for p in common_pairs
+        if abs(pairs1[p][0] - pairs2[p][0]) > 1e-6
+        or abs(pairs1[p][1] - pairs2[p][1]) > 1e-6
+    ]
@@
-    if only_in_r1 or only_in_r2:
+    if only_in_r1 or only_in_r2 or value_mismatches:
         any_assertion_failed = True
         print("\n  ❌ ASSERTION FAILED: inconsistent pairs for shared molecules!")
+        if value_mismatches:
+            p, (ed1, mc1), (ed2, mc2) = value_mismatches[0]
+            print(
+                f"    Example value mismatch: {sorted(p)} "
+                f"ED={ed1:.3f}/{ed2:.3f} MCES={mc1:.3f}/{mc2:.3f} "
+                f"(total {len(value_mismatches)})"
+            )
+ )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tools/compare_preprocessing_runs.py` around lines 153 - 170, The current
check only compares keys (only_in_r1/only_in_r2) and misses cases where a pair
exists in both runs but with different ED/MCES values; update the assertion to
also compute common_keys = set(pairs1.keys()) & set(pairs2.keys()) and iterate
those keys to compare (pairs1[key] vs pairs2[key]), and if any ED or MCES differ
set any_assertion_failed = True and print a clear example (reusing the existing
print formatting for ED/MCES); keep the existing key-presence checks
(only_in_r1/only_in_r2) but ensure mismatched values trigger the failure path so
the script exits non-zero when distances differ.
Summary by CodeRabbit
Release Notes
New Features
Documentation
Tests