⚠️ Warning: Rate limit exceeded

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 35 minutes and 24 seconds.

⌛ How to resolve this issue? After the wait time has elapsed, a review can be triggered using the command. We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work? CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source, and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information.

ℹ️ Review info — ⚙️ Run configuration — Configuration used: defaults · Review profile: CHILL · Plan: Pro · Run ID: — 📒 Files selected for processing (3)
📝 Walkthrough

This PR replaces the ratio-based dataset split configuration (

Changes

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested reviewers

Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 9
🧹 Nitpick comments (6)
simba/core/data/loaders/loaders.py (1)
581-583: Prefer the `SpectrumExt.smiles` attribute over `params["smiles"]`.

`SpectrumExt` already exposes `.smiles` as a first-class attribute (used throughout `train_utils.py`, e.g. `Chem.CanonSmiles(s.smiles)`). Reading `s.params["smiles"]` here couples this log to the raw pyteomics params dict and requires the `"smiles" in s.params` guard; `s.smiles` is already normalized to `""` when missing.

♻️ Suggested simplification

```diff
-    logger.info(f"Loaded {len(spectra)} spectra from MGF file (with filtering applied)")
-    unique_molecules = len(set(s.params["smiles"] for s in spectra if "smiles" in s.params))
-    logger.info(f"Unique molecules in loaded spectra: {unique_molecules}")
+    logger.info(f"Loaded {len(spectra)} spectra from MGF file (with filtering applied)")
+    unique_molecules = len({s.smiles for s in spectra if s.smiles})
+    logger.info(f"Unique molecules in loaded spectra: {unique_molecules}")
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@simba/core/data/loaders/loaders.py` around lines 581 - 583, Replace the use of the raw params dict when counting unique molecules: in the block that logs loaded spectra (variable spectra), compute unique_molecules from the SpectrumExt.smiles attribute rather than s.params["smiles"] and remove the "smiles" in s.params guard because SpectrumExt.smiles is normalized (empty string when missing); update the logger.info call that reports Unique molecules to use this new unique_molecules value (references: SpectrumExt.smiles, spectra, unique_molecules, logger.info).

simba/core/training/train_utils.py (1)
44-54: Consider module-level `hashlib` import and guarding `n_buckets`.

Two small refinements:

- `import hashlib` can be hoisted to the module top to avoid repeated local import lookups on every call (this is invoked once per scaffold in a hot loop).
- `n_buckets <= 0` will raise `ZeroDivisionError` at `% n_buckets`; a brief validation (or an `assert`) would produce a clearer error than the default `train_buckets = list(range(n_buckets - 2))` path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@simba/core/training/train_utils.py` around lines 44 - 54, Hoist the import of hashlib to the module top and add validation in _scaffold_bucket (function name: _scaffold_bucket) to guard against non-positive bucket counts: check that n_buckets is an int > 0 (raise ValueError or assert with a clear message) before computing the digest, then compute the SHA-256 hex digest and return int(digest, 16) % n_buckets as before; this avoids repeated local imports in hot loops and prevents a cryptic ZeroDivisionError when n_buckets <= 0.

simba/workflows/utils.py (1)
68-113: Minor duplication and unique-molecule counting caveat.

- `unique_molecules_original` (Lines 103-105) recomputes the same set already logged at Line 70. Compute it once and reuse.
- `s.params.get("smiles", "N/A")` conflates every spectrum missing a SMILES into a single "N/A" bucket, so the reported unique molecule count can be understated for datasets where some spectra lack SMILES. Consider filtering out the fallback before taking `set(...)`, or at least note in the log that missing-SMILES spectra are counted as one.
- Line 99 logs `"{len(filtered_spectra)} spectra remaining after filtering."` and Lines 110-113 log a more detailed summary covering the same information — consider dropping Line 99 to reduce log noise.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@simba/workflows/utils.py` around lines 68 - 113, The logs recompute unique molecule counts and conflate missing SMILES into a single "N/A" bucket; update the preprocessing logging to compute unique_molecules_original once from all_spectra (store and reuse instead of recomputing), compute unique_molecules_filtered similarly but exclude the fallback value (filter out params.get("smiles") values that are None or equal to the fallback "N/A" before taking set) so missing-SMILES spectra aren't merged into one counted molecule, and remove the redundant simple log of len(filtered_spectra) (the detailed Filtering summary already reports remaining counts); look for usages of all_spectra, filtered_spectra, unique_molecules_original, unique_molecules_filtered, and Preprocessor to apply these changes.

tests/integration/test_training_pipeline.py (1)
49-74: Test is deterministic but brittle to scaffold string changes.

With only 5 distinct scaffolds (`scaffold_0`..`scaffold_4`) and bucket selections of `[2]`, `[4]`, `[9]` out of 10, the non-emptiness of the val and test splits relies on the SHA-256 of those specific strings falling into exactly those buckets. Any rename of the scaffold literals (or change to `n_buckets`) could silently break the `> 0` assertions. Consider either picking bucket sets that cover contiguous ranges (so the pigeonhole guarantees non-emptiness for 5 items) or asserting on exact expected split sizes that you pre-computed — this makes the intent explicit.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integration/test_training_pipeline.py` around lines 49 - 74, The test TrainUtils.train_val_test_split_bms usage is brittle because specific bucket lists ([2], [4], [9]) rely on SHA256 of scaffold strings; update the test to make it deterministic and robust by either (A) selecting bucket ranges that guarantee non-empty picks for 5 distinct scaffolds (e.g., use contiguous bucket ranges like train_buckets=range(0,4), val_buckets=range(4,7), test_buckets=range(7,10) or equivalent lists) so pigeonhole ensures non-empty splits given n_buckets=10, or (B) replace the loose >0 checks with exact expected sizes for train/val/test computed from the current scaffold set (calculate expected lengths once and assert equality against train, val, test). Modify the call site in the test (TrainUtils.train_val_test_split_bms invocation and the subsequent assertions on train, val, test) accordingly.

simba/workflows/preprocessing.py (2)
175-181: Duplication with the block at Lines 148-151 — consider a small helper.

The `len({s.params.get("smiles", "N/A") for s in ...})` idiom is repeated four times in this function. Extracting a tiny helper would reduce duplication and make it trivial to fix the missing-SMILES handling in one place.

♻️ Optional helper

```diff
+def _count_unique_smiles(spectra) -> int:
+    return len({s.params["smiles"] for s in spectra if s.params.get("smiles")})
```

Then replace each callsite, e.g.:

```diff
-    train_unique = len({s.params.get("smiles", "N/A") for s in all_spectra_train})
-    val_unique = len({s.params.get("smiles", "N/A") for s in all_spectra_val})
-    test_unique = len({s.params.get("smiles", "N/A") for s in all_spectra_test})
+    train_unique = _count_unique_smiles(all_spectra_train)
+    val_unique = _count_unique_smiles(all_spectra_val)
+    test_unique = _count_unique_smiles(all_spectra_test)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@simba/workflows/preprocessing.py` around lines 175 - 181, Extract the repeated set-comprehension into a small helper, e.g. add a function like count_unique_smiles(spectra) that returns len({s.params.get("smiles", "N/A") for s in spectra}), then replace the three callsites that assign train_unique, val_unique, and test_unique (which currently use len({s.params.get("smiles", "N/A") for s in ...})) with calls to that helper using all_spectra_train, all_spectra_val, and all_spectra_test so missing-SMILES handling is centralized and duplication removed.
197-219: Move `rdkit` import to module top and tighten the except clause.

Minor hygiene issues in this newly added diagnostic block:

- `from rdkit import Chem` at Line 198 runs on every loop iteration. It's cached by Python, but the import belongs at module scope alongside the other imports.
- At Lines 206-208, the `except Exception` is broad and the trailing `pass` is redundant after `logger.warning(...)`. Note that `Chem.MolFromSmiles` typically returns `None` rather than raising for invalid SMILES, so this path will rarely execute — but when it does, swallowing every `Exception` hides real bugs (e.g., `AttributeError` on an unexpected `smiles` type).

♻️ Proposed refactor

```diff
@@ top of file
 from omegaconf import DictConfig
+from rdkit import Chem
@@ inside loop
-    # Log statistics about molecule sizes
-    from rdkit import Chem
-
     large_mols = []
     for smiles in df_smiles["canon_smiles"]:
-        try:
-            mol = Chem.MolFromSmiles(smiles)
-            if mol and mol.GetNumAtoms() > 40:
-                large_mols.append((smiles, mol.GetNumAtoms()))
-        except Exception as e:
-            logger.warning(f"Error processing SMILES '{smiles}': {e}")
-            pass
+        mol = Chem.MolFromSmiles(smiles)
+        if mol is None:
+            logger.warning(f"Could not parse SMILES '{smiles}'")
+            continue
+        if mol.GetNumAtoms() > 40:
+            large_mols.append((smiles, mol.GetNumAtoms()))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@simba/workflows/preprocessing.py` around lines 197 - 219, Move the rdkit import out of the loop to module scope (replace the in-loop "from rdkit import Chem" with a top-of-file import) and tighten the exception handling around Chem.MolFromSmiles: instead of "except Exception", validate the smiles type (e.g., ensure it's a str) and catch only expected errors like TypeError or ValueError when calling Chem.MolFromSmiles; remove the redundant "pass" after logger.warning and keep the logger.warning(...) to report the exact error (include the exception variable) so real bugs aren’t silently swallowed — changes affect the block using Chem.MolFromSmiles, large_mols, and logger.warning.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@simba/commands/preprocess.py`:
- Around line 106-123: Validation currently assumes bucket sets are non-empty
and n_buckets > 0; when train_buckets/val_buckets/test_buckets are all empty
max(all_buckets) raises ValueError, negative indices slip through, and n_buckets
<= 0 isn't checked leading to failures in _scaffold_bucket later. Fix by (1)
validating n_buckets is an int > 0 (check n_buckets <= 0 and abort with a clear
message), (2) require that at least one of train_b/val_b/test_b is non-empty
before calling max(...) (or guard max(...) with a conditional and provide a
clear error if all are empty), and (3) ensure all bucket indices in train_b,
val_b, test_b are within the inclusive range [0, n_buckets-1] (reject negatives
and >= n_buckets), keeping the existing overlap check; update error messages to
reference n_buckets and offending values so users can correct configs.
In `@simba/core/data/loaders/loaders.py`:
- Around line 118-163: The summary/logging block after the for-loop in
get_spectra is skipped when the caller (e.g., get_all_spectra_mgf) stops
iterating early; wrap the generator body in a try: ... finally: and move the
"MGF Loading Summary" and condition_fails aggregation into the finally so it
always runs on normal exhaustion or early close (GeneratorExit). Ensure the
finally reads the same counters/total_results variables used inside get_spectra
and preserves their values, and do not rely on code after the loop to emit the
logs. This guarantees the summary is emitted whether the generator is fully
consumed or garbage-collected by get_all_spectra_mgf.
In `@simba/core/training/train_utils.py`:
- Around line 212-214: The logger.info call that computes compression using
len(all_spectra) / len(unique_smiles) can raise ZeroDivisionError when
all_spectra (and thus unique_smiles) is empty; update the code around the
logger.info (the block using variables all_spectra and unique_smiles) to
short-circuit when not all_spectra (e.g., if not all_spectra: logger.info("No
spectra found.") or log counts with compression "N/A") so you never perform the
division, otherwise keep the existing formatted message when there is at least
one spectrum.
- Around line 113-119: The logging line in train_utils.py can raise
ZeroDivisionError when spectra is empty because total = len(spectra) may be 0;
update the block around logger.info so it checks if total == 0 and uses safe
values (e.g., 0 or "N/A") for the percentage calculations instead of performing
100 * len(...) / total; adjust the message construction for logger.info
(referencing total, spectrums_train, spectrums_val, spectrums_test,
train_buckets, val_buckets, test_buckets) to use the guarded percentages so no
division occurs when total is zero.
In `@simba/workflows/preprocessing.py`:
- Around line 148-151: The current unique_molecules_total uses a sentinel "N/A"
which collapses all missing SMILES into one fake molecule; change the set
comprehension to only include truthy SMILES (e.g., set(s.params.get("smiles")
for s in all_spectra if s.params.get("smiles"))) and compute a missing_smiles
count (e.g., sum(1 for s in all_spectra if not s.params.get("smiles"))) then
update the logger.info call to report both Loaded {len(all_spectra)} spectra,
{unique_molecules_total} unique molecules, and {missing_smiles} spectra with
missing SMILES; apply the same filtering and missing count logic to the
per-split unique-molecule calculations that mirror this pattern.
- Around line 273-289: The call to _Chem.CanonSmiles(s) in the preprocessing
loop is unguarded and will raise an ArgumentError on invalid SMILES (making the
else branch unreachable); wrap the call in a try/except that catches the RDKit
error, falls back to adding the original s to canon_smiles, and emits a warning
log (use the module logger or processLogger). After building canon_smiles,
ensure filter_cache_by_smiles will match precomputed_cache keys by either
passing both canonical and raw SMILES (e.g., combine canon_smiles and original
all_smiles) or detect the cache key type and normalize the list accordingly
before calling filter_cache_by_smiles so cache hits are preserved.
In `@tools/compare_preprocessing_runs.py`:
- Around line 93-101: The current use of frozenset([s0, s1]) as the dict key
collapses self-pairs and hides duplicate/conflicting entries; change the key to
a stable ordered identifier (e.g., (s0, s1) or include indices like (i, j, s0,
s1)) in the loop that builds pairs so self-pairs stay distinct and order is
preserved, and add a duplicate-detection check: when the key already exists in
pairs compare the existing (ed, mces) tuple to the new (float(row[2]),
float(row[3])) and if they differ, log or raise an error/report the conflict
(use the variables pairs, row, i, j, s0, s1, and the stored value) so upstream
chunk-overlap or inconsistent data is surfaced rather than silently overwritten.
- Around line 31-41: The load_pickle function currently picks the first match
from p.glob("*.pkl") which is non-deterministic; change it to use a
deterministic selection and fail loudly when ambiguous: after collecting
candidates from Path.glob, sort the candidate paths (e.g., sorted(candidates))
and if more than one .pkl candidate exists raise a clear FileExistsError (or
similar) asking the caller to disambiguate rather than silently using
candidates[0]; keep the existing behavior of preferring
"mapping_unique_smiles.pkl" when present and only apply deterministic sorting /
ambiguity check to the fallback branch.
- Around line 153-170: The current check only compares keys
(only_in_r1/only_in_r2) and misses cases where a pair exists in both runs but
with different ED/MCES values; update the assertion to also compute common_keys
= set(pairs1.keys()) & set(pairs2.keys()) and iterate those keys to compare
(pairs1[key] vs pairs2[key]), and if any ED or MCES differ set
any_assertion_failed = True and print a clear example (reusing the existing
print formatting for ED/MCES); keep the existing key-presence checks
(only_in_r1/only_in_r2) but ensure mismatched values trigger the failure path so
the script exits non-zero when distances differ.
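The value-level comparison that the last prompt describes can be sketched as follows. The names `pairs1`/`pairs2` and the `(ed, mces)` tuple layout follow the prompt's description; the function itself is a hypothetical illustration, not code from `tools/compare_preprocessing_runs.py`:

```python
def compare_pairs(pairs1: dict, pairs2: dict) -> bool:
    """Return True if the two runs disagree (missing keys or differing values)."""
    any_assertion_failed = False
    # Existing key-presence checks: pairs seen in only one run.
    only_in_r1 = set(pairs1) - set(pairs2)
    only_in_r2 = set(pairs2) - set(pairs1)
    if only_in_r1 or only_in_r2:
        any_assertion_failed = True
    # New: also compare ED/MCES values on the intersection of keys.
    common_keys = set(pairs1) & set(pairs2)
    for key in sorted(common_keys):
        ed1, mces1 = pairs1[key]
        ed2, mces2 = pairs2[key]
        if ed1 != ed2 or mces1 != mces2:
            any_assertion_failed = True
            print(f"Mismatch for {key}: ED {ed1} vs {ed2}, MCES {mces1} vs {mces2}")
    return any_assertion_failed
```

The caller would exit non-zero when this returns True, so identical key sets with differing distances no longer pass silently.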
---
Nitpick comments:
In `@simba/core/data/loaders/loaders.py`:
- Around line 581-583: Replace the use of the raw params dict when counting
unique molecules: in the block that logs loaded spectra (variable spectra),
compute unique_molecules from the SpectrumExt.smiles attribute rather than
s.params["smiles"] and remove the "smiles" in s.params guard because
SpectrumExt.smiles is normalized (empty string when missing); update the
logger.info call that reports Unique molecules to use this new unique_molecules
value (references: SpectrumExt.smiles, spectra, unique_molecules, logger.info).
In `@simba/core/training/train_utils.py`:
- Around line 44-54: Hoist the import of hashlib to the module top and add
validation in _scaffold_bucket (function name: _scaffold_bucket) to guard
against non-positive bucket counts: check that n_buckets is an int > 0 (raise
ValueError or assert with a clear message) before computing the digest, then
compute the SHA-256 hex digest and return int(digest, 16) % n_buckets as before;
this avoids repeated local imports in hot loops and prevents a cryptic
ZeroDivisionError when n_buckets <= 0.
In `@simba/workflows/preprocessing.py`:
- Around line 175-181: Extract the repeated set-comprehension into a small
helper, e.g. add a function like count_unique_smiles(spectra) that returns
len({s.params.get("smiles", "N/A") for s in spectra}), then replace the three
callsites that assign train_unique, val_unique, and test_unique (which currently
use len({s.params.get("smiles", "N/A") for s in ...})) with calls to that helper
using all_spectra_train, all_spectra_val, and all_spectra_test so missing-SMILES
handling is centralized and duplication removed.
- Around line 197-219: Move the rdkit import out of the loop to module scope
(replace the in-loop "from rdkit import Chem" with a top-of-file import) and
tighten the exception handling around Chem.MolFromSmiles: instead of "except
Exception", validate the smiles type (e.g., ensure it's a str) and catch only
expected errors like TypeError or ValueError when calling Chem.MolFromSmiles;
remove the redundant "pass" after logger.warning and keep the
logger.warning(...) to report the exact error (include the exception variable)
so real bugs aren’t silently swallowed — changes affect the block using
Chem.MolFromSmiles, large_mols, and logger.warning.
In `@simba/workflows/utils.py`:
- Around line 68-113: The logs recompute unique molecule counts and conflate
missing SMILES into a single "N/A" bucket; update the preprocessing logging to
compute unique_molecules_original once from all_spectra (store and reuse instead
of recomputing), compute unique_molecules_filtered similarly but exclude the
fallback value (filter out params.get("smiles") values that are None or equal to
the fallback "N/A" before taking set) so missing-SMILES spectra aren't merged
into one counted molecule, and remove the redundant simple log of
len(filtered_spectra) (the detailed Filtering summary already reports remaining
counts); look for usages of all_spectra, filtered_spectra,
unique_molecules_original, unique_molecules_filtered, and Preprocessor to apply
these changes.
In `@tests/integration/test_training_pipeline.py`:
- Around line 49-74: The test TrainUtils.train_val_test_split_bms usage is
brittle because specific bucket lists ([2], [4], [9]) rely on SHA256 of scaffold
strings; update the test to make it deterministic and robust by either (A)
selecting bucket ranges that guarantee non-empty picks for 5 distinct scaffolds
(e.g., use contiguous bucket ranges like train_buckets=range(0,4),
val_buckets=range(4,7), test_buckets=range(7,10) or equivalent lists) so
pigeonhole ensures non-empty splits given n_buckets=10, or (B) replace the loose
>0 checks with exact expected sizes for train/val/test computed from the current
scaffold set (calculate expected lengths once and assert equality against train,
val, test). Modify the call site in the test
(TrainUtils.train_val_test_split_bms invocation and the subsequent assertions on
train, val, test) accordingly.
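Option (B) from the last prompt can be sketched like this. The `bucket` helper mirrors the SHA-256-mod-`n_buckets` scheme described in the comments, and the `scaffold_0`..`scaffold_4` literals are the ones the test is said to use; both are assumptions for illustration, not the repository's code:

```python
import hashlib

def bucket(scaffold: str, n_buckets: int = 10) -> int:
    # Deterministic across interpreter runs, unlike the built-in hash()
    # with randomized seeds.
    return int(hashlib.sha256(scaffold.encode("utf-8")).hexdigest(), 16) % n_buckets

# Compute the bucket assignment once for the five scaffold literals;
# printing it lets you pin exact expected split sizes in the test.
scaffolds = [f"scaffold_{i}" for i in range(5)]
assignment = {s: bucket(s) for s in scaffolds}
```

Hard-coding the resulting split sizes in the assertions makes a later rename of the scaffold literals (or a change to `n_buckets`) fail loudly instead of silently emptying a split.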
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 06efa8e6-93ac-453a-9f92-0239bce622d3
📒 Files selected for processing (11)
- README.md
- simba/commands/preprocess.py
- simba/configs/preprocessing/default.yaml
- simba/configs/preprocessing/fast_dev.yaml
- simba/core/data/loaders/loaders.py
- simba/core/training/train_utils.py
- simba/workflows/preprocessing.py
- simba/workflows/utils.py
- tests/integration/test_training_pipeline.py
- tests/unit/test_train_utils.py
- tools/compare_preprocessing_runs.py
💤 Files with no reviewable changes (1)
- simba/configs/preprocessing/fast_dev.yaml
```diff
     n_buckets = cfg.preprocessing.n_buckets
     train_b = set(cfg.preprocessing.train_buckets)
     val_b = set(cfg.preprocessing.val_buckets)
     test_b = set(cfg.preprocessing.test_buckets)
     all_buckets = train_b | val_b | test_b
     if len(train_b & val_b) or len(train_b & test_b) or len(val_b & test_b):
         click.echo(
-            "❌ Error: val_split + test_split must be less than 1.0",
+            "Error: train_buckets, val_buckets and test_buckets must not overlap",
             err=True,
         )
         raise click.Abort()
     if max(all_buckets) >= n_buckets:
         click.echo(
             f"Error: bucket numbers must be in [0, {n_buckets}), "
             f"got max={max(all_buckets)}",
             err=True,
         )
         raise click.Abort()
```
Validation crashes when all bucket lists are empty, and misses negative-index checks.
- If a user (by mistake) passes empty lists for all three bucket configs, `all_buckets = set()` and `max(all_buckets)` raises `ValueError: max() arg is an empty sequence`, producing an unhelpful traceback instead of a clear CLI error.
- The validation only checks the upper bound. Negative bucket indices pass through silently: they won't match any scaffold's computed bucket (which is always in `[0, n_buckets)`), so negative entries in `val_buckets`/`test_buckets` silently route those spectra to train.
- `n_buckets <= 0` is similarly not validated; it would blow up later in `_scaffold_bucket` with `ZeroDivisionError`.
🛡️ Suggested tightened validation
```diff
     n_buckets = cfg.preprocessing.n_buckets
+    if n_buckets <= 0:
+        click.echo(f"Error: n_buckets must be positive, got {n_buckets}", err=True)
+        raise click.Abort()
     train_b = set(cfg.preprocessing.train_buckets)
     val_b = set(cfg.preprocessing.val_buckets)
     test_b = set(cfg.preprocessing.test_buckets)
     all_buckets = train_b | val_b | test_b
+    if not all_buckets:
+        click.echo(
+            "Error: at least one of train_buckets/val_buckets/test_buckets must be non-empty",
+            err=True,
+        )
+        raise click.Abort()
     if len(train_b & val_b) or len(train_b & test_b) or len(val_b & test_b):
         click.echo(
             "Error: train_buckets, val_buckets and test_buckets must not overlap",
             err=True,
         )
         raise click.Abort()
-    if max(all_buckets) >= n_buckets:
+    if max(all_buckets) >= n_buckets or min(all_buckets) < 0:
         click.echo(
             f"Error: bucket numbers must be in [0, {n_buckets}), "
-            f"got max={max(all_buckets)}",
+            f"got min={min(all_buckets)}, max={max(all_buckets)}",
             err=True,
         )
         raise click.Abort()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@simba/commands/preprocess.py` around lines 106 - 123, Validation currently
assumes bucket sets are non-empty and n_buckets > 0; when
train_buckets/val_buckets/test_buckets are all empty max(all_buckets) raises
ValueError, negative indices slip through, and n_buckets <= 0 isn't checked
leading to failures in _scaffold_bucket later. Fix by (1) validating n_buckets
is an int > 0 (check n_buckets <= 0 and abort with a clear message), (2) require
that at least one of train_b/val_b/test_b is non-empty before calling max(...)
(or guard max(...) with a conditional and provide a clear error if all are
empty), and (3) ensure all bucket indices in train_b, val_b, test_b are within
the inclusive range [0, n_buckets-1] (reject negatives and >= n_buckets),
keeping the existing overlap check; update error messages to reference n_buckets
and offending values so users can correct configs.
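For context, the bucket assignment that this validation feeds into can be sketched as follows. This is a hypothetical mirror of the `_scaffold_bucket` helper described above (SHA-256 hex digest mod `n_buckets`, with the guard the comments ask for), not the project's actual implementation:

```python
import hashlib

def scaffold_bucket(scaffold: str, n_buckets: int) -> int:
    """Map a scaffold SMILES string to a stable bucket in [0, n_buckets)."""
    if not isinstance(n_buckets, int) or n_buckets <= 0:
        # Guarding here turns a cryptic ZeroDivisionError into a clear message.
        raise ValueError(f"n_buckets must be a positive int, got {n_buckets!r}")
    digest = hashlib.sha256(scaffold.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets
```

Because the result is always in `[0, n_buckets)`, any negative bucket index in a config can never match and would silently route spectra to train — which is why the CLI-level range check matters.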
```python
    logger.info(
        f"MGF Loading Summary: Processed {spectra_processed} spectra - "
        f"Passed validation: {spectra_yielded + spectra_failed_parsing + spectra_failed_nan_check}, "
        f"Failed validation: {spectra_failed_validation}, "
        f"Failed parsing: {spectra_failed_parsing}, "
        f"Failed NaN check: {spectra_failed_nan_check}, "
        f"Final yielded: {spectra_yielded}"
    )

    # Aggregate condition breakdown from all processed spectra
    if total_results and len(total_results) > 0:
        condition_fails = {
            "library_quality": 0,
            "charge": 0,
            "pepmass": 0,
            "mz_array": 0,
            "ion_mode": 0,
            "name_format": 0,
            "centroid": 0,
            "inchi_smiles": 0,
        }

        for res in total_results:
            if not res.get("cond_library", True):
                condition_fails["library_quality"] += 1
            if not res.get("cond_charge", True):
                condition_fails["charge"] += 1
            if not res.get("cond_pepmass", True):
                condition_fails["pepmass"] += 1
            if not res.get("cond_mz_array", True):
                condition_fails["mz_array"] += 1
            if not res.get("cond_ion_mode", True):
                condition_fails["ion_mode"] += 1
            if not res.get("cond_name", True):
                condition_fails["name_format"] += 1
            if not res.get("cond_centroid", True):
                condition_fails["centroid"] += 1
            if not res.get("cond_inchi_smiles", True):
                condition_fails["inchi_smiles"] += 1

        # Log conditions causing most failures
        logger.info("Validation condition failures (spectra rejected for each condition):")
        for condition, count in sorted(condition_fails.items(), key=lambda x: x[1], reverse=True):
            if count > 0:
                pct = 100.0 * count / spectra_processed
                logger.info(f"  {condition}: {count} spectra ({pct:.1f}%)")
```
Summary log is skipped when caller stops iterating early.
`get_spectra` is a generator, and the MGF Loading Summary / condition-breakdown logs sit after the `for mgf_index, spectrum in spectrum_it():` loop. They only execute if the generator is exhausted. In `get_all_spectra_mgf` (Lines 571-578) the caller iterates up to `num_samples` and then simply drops the generator reference; if the cap is reached before EOF, the underlying generator never resumes and the summary is never emitted — which defeats the point of the added observability for any capped run.

Options:

- Move the accounting into `get_all_spectra_mgf` (pass counters by reference, or collect them after a `try/finally` that runs on `GeneratorExit`).
- Use a `try: ... finally:` inside `get_spectra` so the summary is logged on both exhaustion and early close.
🛡️ Sketch using try/finally
```diff
-    for mgf_index, spectrum in spectrum_it():
-        spectra_processed += 1
-        ...
-
-    logger.info(
-        f"MGF Loading Summary: ..."
-    )
+    try:
+        for mgf_index, spectrum in spectrum_it():
+            spectra_processed += 1
+            ...
+    finally:
+        logger.info(
+            f"MGF Loading Summary: ..."
+        )
+        # condition breakdown block ...
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@simba/core/data/loaders/loaders.py` around lines 118 - 163, The
summary/logging block after the for-loop in get_spectra is skipped when the
caller (e.g., get_all_spectra_mgf) stops iterating early; wrap the generator
body in a try: ... finally: and move the "MGF Loading Summary" and
condition_fails aggregation into the finally so it always runs on normal
exhaustion or early close (GeneratorExit). Ensure the finally reads the same
counters/total_results variables used inside get_spectra and preserves their
values, and do not rely on code after the loop to emit the logs. This guarantees
the summary is emitted whether the generator is fully consumed or
garbage-collected by get_all_spectra_mgf.
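The try/finally behaviour is easy to verify in isolation. This standalone sketch (illustrative names, not the loader's real code) shows the summary firing even when a capped caller closes the generator early:

```python
import logging

logger = logging.getLogger(__name__)
summary = {}  # module-level so the effect of `finally` is observable

def get_spectra(items):
    processed = yielded = 0
    try:
        for item in items:
            processed += 1
            if item is not None:  # stand-in for spectrum validation
                yielded += 1
                yield item
    finally:
        # Runs on normal exhaustion AND on early close (GeneratorExit),
        # so the summary is always emitted.
        summary.update(processed=processed, yielded=yielded)
        logger.info("Loading summary: processed=%d yielded=%d", processed, yielded)

# Capped consumption, as in get_all_spectra_mgf: take 2 of 4, then close.
gen = get_spectra([1, None, 2, 3])
first_two = [next(gen) for _ in range(2)]
gen.close()  # raises GeneratorExit inside the generator; finally still runs
```

Note that merely dropping the reference also triggers the `finally` via garbage collection, but calling `close()` makes the timing deterministic.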
```python
    total = len(spectra)
    logger.info(
        f"Split sizes – Train: {len(spectrums_train)} ({100 * len(spectrums_train) / total:.1f}%), "
        f"Val: {len(spectrums_val)} ({100 * len(spectrums_val) / total:.1f}%), "
        f"Test: {len(spectrums_test)} ({100 * len(spectrums_test) / total:.1f}%) "
        f"[train buckets: {sorted(train_buckets)}, val: {sorted(val_buckets)}, test: {sorted(test_buckets)}]"
    )
```
ZeroDivisionError when spectra is empty.
If `spectra` is an empty list, `total = 0` and the percentage computations in the log line (`100 * len(...) / total`) will raise `ZeroDivisionError`. Guard the logging when `total == 0`.
🛡️ Suggested guard
```diff
-    total = len(spectra)
-    logger.info(
-        f"Split sizes – Train: {len(spectrums_train)} ({100 * len(spectrums_train) / total:.1f}%), "
-        f"Val: {len(spectrums_val)} ({100 * len(spectrums_val) / total:.1f}%), "
-        f"Test: {len(spectrums_test)} ({100 * len(spectrums_test) / total:.1f}%) "
-        f"[train buckets: {sorted(train_buckets)}, val: {sorted(val_buckets)}, test: {sorted(test_buckets)}]"
-    )
+    total = len(spectra)
+    if total:
+        logger.info(
+            f"Split sizes – Train: {len(spectrums_train)} ({100 * len(spectrums_train) / total:.1f}%), "
+            f"Val: {len(spectrums_val)} ({100 * len(spectrums_val) / total:.1f}%), "
+            f"Test: {len(spectrums_test)} ({100 * len(spectrums_test) / total:.1f}%) "
+            f"[train buckets: {sorted(train_buckets)}, val: {sorted(val_buckets)}, test: {sorted(test_buckets)}]"
+        )
+    else:
+        logger.info("Split called on empty spectra list")
```
Verify each finding against the current code and only fix it if needed.
In `@simba/core/training/train_utils.py` around lines 113 - 119, The logging line
in train_utils.py can raise ZeroDivisionError when spectra is empty because
total = len(spectra) may be 0; update the block around logger.info so it checks
if total == 0 and uses safe values (e.g., 0 or "N/A") for the percentage
calculations instead of performing 100 * len(...) / total; adjust the message
construction for logger.info (referencing total, spectrums_train, spectrums_val,
spectrums_test, train_buckets, val_buckets, test_buckets) to use the guarded
percentages so no division occurs when total is zero.
```python
    logger.info(
        f"Found {len(unique_smiles)} unique SMILES from {len(all_spectra)} spectra (compression: {len(all_spectra) / len(unique_smiles):.2f}x)"
    )
```
ZeroDivisionError on empty input.
`len(all_spectra) / len(unique_smiles)` will divide by zero when `all_spectra` is empty (so `unique_smiles` is also empty). Worth short-circuiting the log when there are no spectra.
🛡️ Suggested guard
-    logger.info(
-        f"Found {len(unique_smiles)} unique SMILES from {len(all_spectra)} spectra (compression: {len(all_spectra) / len(unique_smiles):.2f}x)"
-    )
+    if len(unique_smiles):
+        logger.info(
+            f"Found {len(unique_smiles)} unique SMILES from {len(all_spectra)} spectra "
+            f"(compression: {len(all_spectra) / len(unique_smiles):.2f}x)"
+        )
+    else:
+        logger.info(f"Found 0 unique SMILES from {len(all_spectra)} spectra")

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    if len(unique_smiles):
        logger.info(
            f"Found {len(unique_smiles)} unique SMILES from {len(all_spectra)} spectra "
            f"(compression: {len(all_spectra) / len(unique_smiles):.2f}x)"
        )
    else:
        logger.info(f"Found 0 unique SMILES from {len(all_spectra)} spectra")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@simba/core/training/train_utils.py` around lines 212 - 214, The logger.info
call that computes compression using len(all_spectra) / len(unique_smiles) can
raise ZeroDivisionError when all_spectra (and thus unique_smiles) is empty;
update the code around the logger.info (the block using variables all_spectra
and unique_smiles) to short-circuit when not all_spectra (e.g., if not
all_spectra: logger.info("No spectra found.") or log counts with compression
"N/A") so you never perform the division, otherwise keep the existing formatted
message when there is at least one spectrum.
    unique_molecules_total = len({s.params.get("smiles", "N/A") for s in all_spectra})
    logger.info(
        f"Loaded {len(all_spectra)} spectra with {unique_molecules_total} unique molecules"
    )
Unique-molecule count may be inflated by missing SMILES.
Using "N/A" as the default for spectra lacking a smiles param means every such spectrum collapses into a single synthetic "unique molecule", silently inflating unique_molecules_total by up to 1 and masking data-quality issues. Consider filtering out falsy values instead and/or logging the count of spectra without SMILES. The same pattern is repeated at Lines 176-178 for the per-split counts.
🔧 Proposed fix
-    unique_molecules_total = len({s.params.get("smiles", "N/A") for s in all_spectra})
+    smiles_set = {s.params.get("smiles") for s in all_spectra}
+    n_missing = sum(1 for s in all_spectra if not s.params.get("smiles"))
+    smiles_set.discard(None)
+    smiles_set.discard("")
+    unique_molecules_total = len(smiles_set)
+    if n_missing:
+        logger.warning(f"{n_missing} spectra missing SMILES were excluded from unique count")
     logger.info(
         f"Loaded {len(all_spectra)} spectra with {unique_molecules_total} unique molecules"
     )
Verify each finding against the current code and only fix it if needed.
In `@simba/workflows/preprocessing.py` around lines 148 - 151, The current
unique_molecules_total uses a sentinel "N/A" which collapses all missing SMILES
into one fake molecule; change the set comprehension to only include truthy
SMILES (e.g., set(s.params.get("smiles") for s in all_spectra if
s.params.get("smiles"))) and compute a missing_smiles count (e.g., sum(1 for s
in all_spectra if not s.params.get("smiles"))) then update the logger.info call
to report both Loaded {len(all_spectra)} spectra, {unique_molecules_total}
unique molecules, and {missing_smiles} spectra with missing SMILES; apply the
same filtering and missing count logic to the per-split unique-molecule
calculations that mirror this pattern.
 from rdkit import Chem as _Chem

 from simba.core.chemistry.edit_distance.edit_distance import (
     filter_cache_by_smiles,
 )

 canon_smiles = set()
 for s in all_smiles:
     canon = _Chem.CanonSmiles(s)
     if canon:
         canon_smiles.add(canon)
     else:
         canon_smiles.add(s)

 precomputed_cache = filter_cache_by_smiles(
-    precomputed_cache, list(all_smiles)
+    precomputed_cache, list(canon_smiles)
 )
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Confirm how filter_cache_by_smiles interprets its keys and how keys were produced.
fd -t f edit_distance.py
rg -nP -C4 '\bdef\s+filter_cache_by_smiles\b'
rg -nP -C3 '\bload_precomputed_distances_cache\b'
rg -nP -C3 'CanonSmiles|canon_smiles' --type=py
Repository: bittremieuxlab/simba
Length of output: 31215
🏁 Script executed:
cd simba && head -60 core/chemistry/edit_distance/edit_distance.py
Repository: bittremieuxlab/simba
Length of output: 1748
🏁 Script executed:
rg -nP -A15 '\bdef\s+filter_cache_by_smiles\b' simba/core/chemistry/edit_distance/edit_distance.py
Repository: bittremieuxlab/simba
Length of output: 609
🏁 Script executed:
rg -nP -A30 '\bdef\s+load_precomputed_distances_cache\b' simba/core/chemistry/edit_distance/edit_distance.py
Repository: bittremieuxlab/simba
Length of output: 1156
🏁 Script executed:
rg -nP 'cache\[|cache\.get|cache\.' simba/core/chemistry/edit_distance/edit_distance.py | head -20
Repository: bittremieuxlab/simba
Length of output: 303
🏁 Script executed:
rg -nP -A100 '\bdef\s+load_precomputed_distances_cache\b' simba/core/chemistry/edit_distance/edit_distance.py | head -120
Repository: bittremieuxlab/simba
Length of output: 3980
🏁 Script executed:
rg -nP -A150 '\bdef\s+load_precomputed_distances_cache\b' simba/core/chemistry/edit_distance/edit_distance.py | tail -80
Repository: bittremieuxlab/simba
Length of output: 3367
🏁 Script executed:
rg -nP -A50 '\bdef\s+_load_distances_from_files\b' simba/core/chemistry/edit_distance/edit_distance.py
Repository: bittremieuxlab/simba
Length of output: 1732
🌐 Web query:
RDKit Chem.CanonSmiles invalid SMILES behavior raises exception
💡 Result:
Yes, RDKit's Chem.CanonSmiles raises an exception when given an invalid SMILES string. It internally calls Chem.MolFromSmiles(smi), which returns None for invalid SMILES. Then, Chem.MolToSmiles(None) triggers an ArgumentError because no C++ Mol object is provided. This is shown in the source code and Python wrapper: CanonSmiles(smi) does m = MolFromSmiles(smi); return MolToSmiles(m, ...). If parsing fails, m is None, causing the exception on MolToSmiles. Examples from GitHub issues confirm this: invalid inputs lead to ArgumentError (e.g., issues #8670, #5443). MolFromSmiles itself returns None silently for parse failures (issue #22), but CanonSmiles propagates the error via exception. To handle safely, check if mol = Chem.MolFromSmiles(smi) is not None before calling Chem.MolToSmiles(mol). Chem.CanonSmiles does not return a string or empty for invalid input; it raises.
Citations:
- 1: https://www.rdkit.org/docs/source/rdkit.Chem.html
- 2: https://github.com/rdkit/rdkit/blob/master/Code/GraphMol/SmilesParse/SmilesWrite.cpp
- 3: Canonical smiles can't be kekulized by MolFromSmiles rdkit/rdkit#8670
- 4: Canonicalize valid, canonicalized SMILES raises ArgumenError exception rdkit/rdkit#5443
- 5: Can't parse SMILES strings rdkit/rdkit#3722
- 6: Smiles Parse Error after RDKit Canonicalization rdkit/rdkit#7696
- 7: cannot read SMILES which has been canonized rdkit/rdkit#2967
Unguarded CanonSmiles call will crash preprocessing on invalid SMILES.
Chem.CanonSmiles(s) raises an ArgumentError (it internally calls MolFromSmiles followed by MolToSmiles on the result) on invalid SMILES rather than returning an empty or falsy value. The current code has no exception handling, so a single invalid SMILES in the dataset will abort the entire preprocessing run. Additionally, the else: canon_smiles.add(s) branch at lines 284–285 is unreachable because the exception is raised before evaluation can continue.
Wrap the call in a try-except block to fall back to the original SMILES on canonicalization failure and log a warning. Also verify that filter_cache_by_smiles will match: the cache keys are built from whichever SMILES column exists in the preprocessing pickle (preferring canon_smiles if available, otherwise smiles). If the cache was created with raw SMILES keys but filtering now passes canonicalized SMILES, all entries will silently fail to match, causing zero cache hits without warning.
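The fallback pattern itself does not depend on RDKit. A sketch with the canonicalizer injected as a callable so the snippet runs without RDKit installed; in the real code the callable would be Chem.CanonSmiles, and stub_canon below is purely illustrative:

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger(__name__)


def canonicalize_all(
    smiles: Iterable[str], canonicalize: Callable[[str], str]
) -> set[str]:
    """Canonicalize each SMILES, falling back to the raw string on failure."""
    out = set()
    for s in smiles:
        try:
            out.add(canonicalize(s))
        except Exception:  # RDKit raises ArgumentError on invalid SMILES
            logger.warning("Could not canonicalize %r; keeping raw SMILES", s)
            out.add(s)
    return out


# Stub canonicalizer: uppercases alphanumeric strings, raises otherwise.
def stub_canon(s: str) -> str:
    if not s.isalnum():
        raise ValueError(f"invalid SMILES: {s}")
    return s.upper()


result = canonicalize_all(["cco", "c1cc???"], stub_canon)
assert result == {"CCO", "c1cc???"}
```

With this shape, a single malformed SMILES degrades to a warning plus a raw-SMILES cache key instead of aborting the whole preprocessing run.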
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@simba/workflows/preprocessing.py` around lines 273 - 289, The call to
_Chem.CanonSmiles(s) in the preprocessing loop is unguarded and will raise an
ArgumentError on invalid SMILES (making the else branch unreachable); wrap the
call in a try/except that catches the RDKit error, falls back to adding the
original s to canon_smiles, and emits a warning log (use the module logger or
processLogger). After building canon_smiles, ensure filter_cache_by_smiles will
match precomputed_cache keys by either passing both canonical and raw SMILES
(e.g., combine canon_smiles and original all_smiles) or detect the cache key
type and normalize the list accordingly before calling filter_cache_by_smiles so
cache hits are preserved.
def load_pickle(preprocessing_dir: str) -> dict:
    p = Path(preprocessing_dir)
    candidates = list(p.glob("mapping_unique_smiles.pkl"))
    if not candidates:
        candidates = list(p.glob("*.pkl"))
    if not candidates:
        raise FileNotFoundError(f"No pickle found in {preprocessing_dir}")
    path = candidates[0]
    print(f"  Loading {path}")
    with open(path, "rb") as f:
        return dill.load(f)
Non-deterministic fallback pickle selection.
When mapping_unique_smiles.pkl is absent, the fallback p.glob("*.pkl") returns results in arbitrary filesystem order and candidates[0] is used directly. On runs that produce multiple pickles (e.g., both "lightweight" and "full" format artifacts), the two directories being compared could silently load different representative files, producing misleading comparisons. Consider sorted(...) for determinism and/or failing loudly when multiple candidates exist.
♻️ Suggested change
-    candidates = list(p.glob("mapping_unique_smiles.pkl"))
-    if not candidates:
-        candidates = list(p.glob("*.pkl"))
+    candidates = sorted(p.glob("mapping_unique_smiles.pkl"))
+    if not candidates:
+        candidates = sorted(p.glob("*.pkl"))
     if not candidates:
         raise FileNotFoundError(f"No pickle found in {preprocessing_dir}")
+    if len(candidates) > 1:
+        print(f"  Warning: multiple pickles found, using {candidates[0].name}")
     path = candidates[0]
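The stricter fail-loud variant can be exercised against a throwaway directory using only the standard library; pick_pickle is a hypothetical helper name for illustration:

```python
import tempfile
from pathlib import Path


def pick_pickle(preprocessing_dir: str) -> Path:
    """Pick a pickle deterministically; fail loudly when ambiguous."""
    p = Path(preprocessing_dir)
    candidates = sorted(p.glob("mapping_unique_smiles.pkl"))
    if not candidates:
        candidates = sorted(p.glob("*.pkl"))
    if not candidates:
        raise FileNotFoundError(f"No pickle found in {preprocessing_dir}")
    if len(candidates) > 1:
        raise FileExistsError(
            f"Ambiguous pickles {[c.name for c in candidates]} in {preprocessing_dir}"
        )
    return candidates[0]


with tempfile.TemporaryDirectory() as d:
    (Path(d) / "b_full.pkl").touch()
    (Path(d) / "a_light.pkl").touch()
    try:
        pick_pickle(d)  # two .pkl files, no preferred name: must refuse
    except FileExistsError:
        print("ambiguous fallback rejected")
    (Path(d) / "mapping_unique_smiles.pkl").touch()
    assert pick_pickle(d).name == "mapping_unique_smiles.pkl"
```

Raising on ambiguity (rather than warning) guarantees the two compared directories can never silently load different representative files.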
Verify each finding against the current code and only fix it if needed.
In `@tools/compare_preprocessing_runs.py` around lines 31 - 41, The load_pickle
function currently picks the first match from p.glob("*.pkl") which is
non-deterministic; change it to use a deterministic selection and fail loudly
when ambiguous: after collecting candidates from Path.glob, sort the candidate
paths (e.g., sorted(candidates)) and if more than one .pkl candidate exists
raise a clear FileExistsError (or similar) asking the caller to disambiguate
rather than silently using candidates[0]; keep the existing behavior of
preferring "mapping_unique_smiles.pkl" when present and only apply deterministic
sorting / ambiguity check to the fallback branch.
    pairs = {}
    for row in all_rows:
        i, j = int(row[0]), int(row[1])
        s0 = idx_to_smiles.get(i)
        s1 = idx_to_smiles.get(j)
        if s0 is None or s1 is None:
            continue
        pairs[frozenset([s0, s1])] = (float(row[2]), float(row[3]))
    return pairs
frozenset pair key silently collapses self-pairs and loses multiplicity.
Using frozenset([s0, s1]) as the key has two footguns worth flagging:
- A self-pair (i, i) (same SMILES on both sides) becomes a size-1 frozenset, which will still compare equal to any other self-pair of the same molecule but cannot be distinguished from an ordered pair involving only that one molecule.
- If the same {s0, s1} pair appears in multiple .npy chunks (or with both orderings), later rows silently overwrite earlier ones in pairs, so duplicate-detection and chunk-overlap bugs in the upstream pipeline won't surface here.
At minimum, consider detecting duplicate keys with conflicting (ed, mces) values during the build and reporting them, since that is exactly the kind of inconsistency this tool is meant to catch.
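One way to keep self-pairs distinct and surface conflicts is an order-normalized tuple key plus an explicit conflict list; a toy sketch with invented rows and SMILES:

```python
rows = [
    (0, 1, 2.0, 5.0),
    (1, 0, 2.0, 5.0),  # same pair, reversed order: harmless overwrite
    (0, 1, 3.0, 5.0),  # conflicting edit distance: should be flagged
    (2, 2, 0.0, 0.0),  # self-pair stays a distinct 2-tuple key
]
idx_to_smiles = {0: "CCO", 1: "CCN", 2: "c1ccccc1"}

pairs: dict[tuple[str, str], tuple[float, float]] = {}
conflicts = []
for i, j, ed, mces in rows:
    s0, s1 = idx_to_smiles[i], idx_to_smiles[j]
    key = tuple(sorted((s0, s1)))  # canonical order, but always length 2
    value = (ed, mces)
    if key in pairs and pairs[key] != value:
        conflicts.append((key, pairs[key], value))
    pairs[key] = value

assert ("c1ccccc1", "c1ccccc1") in pairs  # self-pair not collapsed
assert len(conflicts) == 1                # the 3.0-vs-2.0 mismatch
```

Reporting (or raising on) a non-empty conflicts list makes exactly the chunk-overlap inconsistencies this tool targets visible during the build instead of being overwritten.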
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tools/compare_preprocessing_runs.py` around lines 93 - 101, The current use
of frozenset([s0, s1]) as the dict key collapses self-pairs and hides
duplicate/conflicting entries; change the key to a stable ordered identifier
(e.g., (s0, s1) or include indices like (i, j, s0, s1)) in the loop that builds
pairs so self-pairs stay distinct and order is preserved, and add a
duplicate-detection check: when the key already exists in pairs compare the
existing (ed, mces) tuple to the new (float(row[2]), float(row[3])) and if they
differ, log or raise an error/report the conflict (use the variables pairs, row,
i, j, s0, s1, and the stored value) so upstream chunk-overlap or inconsistent
data is surfaced rather than silently overwritten.
    # Assertion: a pair of shared molecules must be in both or neither
    if only_in_r1 or only_in_r2:
        any_assertion_failed = True
        print("\n  ❌ ASSERTION FAILED: inconsistent pairs for shared molecules!")
        if only_in_r1:
            example = next(iter(only_in_r1))
            ed, mc = pairs1[example]
            print(
                f"    Example only in run1: {sorted(example)} ED={ed:.1f} MCES={mc:.1f}"
            )
        if only_in_r2:
            example = next(iter(only_in_r2))
            ed, mc = pairs2[example]
            print(
                f"    Example only in run2: {sorted(example)} ED={ed:.1f} MCES={mc:.1f}"
            )
    else:
        print("\n  ✅ Assertion passed: all shared-molecule pairs are consistent.")
Assertion does not verify ED/MCES equality as documented.
The module docstring (lines 15–16) states the assertion guarantees that shared-molecule pairs are present in both runs "with identical ED and MCES values". However, only_in_r1/only_in_r2 are computed purely via set membership on frozenset keys (lines 142–146), so two runs with the same pair keys but differing ED/MCES values will pass this check. The value mismatch is only surfaced visually via the ⚠️ marker in the sample loop (lines 179–181) and never flips any_assertion_failed, so the script will exit 0 despite inconsistent distances — defeating the primary purpose of the tool.
🛡️ Proposed fix: also assert value equality on common pairs
     common_pairs = cross_pairs1 & cross_pairs2
     only_in_r1 = cross_pairs1 - cross_pairs2
     only_in_r2 = cross_pairs2 - cross_pairs1
+
+    # Value mismatches on pairs present in both runs
+    value_mismatches = [
+        (p, pairs1[p], pairs2[p])
+        for p in common_pairs
+        if abs(pairs1[p][0] - pairs2[p][0]) > 1e-6
+        or abs(pairs1[p][1] - pairs2[p][1]) > 1e-6
+    ]
@@
-    if only_in_r1 or only_in_r2:
+    if only_in_r1 or only_in_r2 or value_mismatches:
         any_assertion_failed = True
         print("\n  ❌ ASSERTION FAILED: inconsistent pairs for shared molecules!")
+        if value_mismatches:
+            p, (ed1, mc1), (ed2, mc2) = value_mismatches[0]
+            print(
+                f"    Example value mismatch: {sorted(p)} "
+                f"ED={ed1:.3f}/{ed2:.3f} MCES={mc1:.3f}/{mc2:.3f} "
+                f"(total {len(value_mismatches)})"
+            )
+ )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tools/compare_preprocessing_runs.py` around lines 153 - 170, The current
check only compares keys (only_in_r1/only_in_r2) and misses cases where a pair
exists in both runs but with different ED/MCES values; update the assertion to
also compute common_keys = set(pairs1.keys()) & set(pairs2.keys()) and iterate
those keys to compare (pairs1[key] vs pairs2[key]), and if any ED or MCES differ
set any_assertion_failed = True and print a clear example (reusing the existing
print formatting for ED/MCES); keep the existing key-presence checks
(only_in_r1/only_in_r2) but ensure mismatched values trigger the failure path so
the script exits non-zero when distances differ.
Summary by CodeRabbit
Release Notes
New Features
Documentation
Tests