feature: add all recent metadata fixes #21

Merged
rukubrakov merged 8 commits into dev from feature/metadata-upd-clean on Mar 31, 2026

Conversation

@rukubrakov rukubrakov commented Mar 4, 2026

This pull request introduces a major feature to speed up preprocessing by enabling the reuse of precomputed molecular distances between related datasets. It adds support for automatically discovering, loading, and matching cached distance files, and integrates this cache into the main distance computation pipeline. Additionally, the update includes improvements to data augmentation routines, minor bug fixes, and dependency updates.

Precomputed Distance Reuse and Caching:

  • Added support for reusing precomputed molecular distances across preprocessing runs by introducing a cache mechanism that auto-discovers distance files and SMILES mappings from previous runs, matches molecules by SMILES, and logs cache hit/miss statistics. This feature is documented in README.md and configured via the precomputed_distances field in default.yaml.
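A cache of this kind can be sketched as follows. This is a hypothetical illustration of the described behavior (auto-discovery of distance files, SMILES matching, hit/miss logging); the function names and on-disk layout are assumptions, not simba's actual API:

```python
# Hypothetical sketch of a SMILES-keyed distance cache; file layout and
# function names are illustrative, not the actual simba implementation.
from pathlib import Path
import pickle


def load_distance_cache(run_dir: str) -> dict:
    """Map (smiles_a, smiles_b) -> distance from a previous run's artifacts."""
    cache = {}
    for path in Path(run_dir).glob("**/distances*.pkl"):
        with open(path, "rb") as fh:
            for (smi_a, smi_b), dist in pickle.load(fh).items():
                cache[(smi_a, smi_b)] = dist
    return cache


def lookup_pairs(cache: dict, pairs: list) -> tuple:
    """Split pairs into cache hits and misses, logging hit/miss statistics."""
    hits, misses = {}, []
    for smi_a, smi_b in pairs:
        # Distances are symmetric, so try both key orders.
        if (smi_a, smi_b) in cache:
            hits[(smi_a, smi_b)] = cache[(smi_a, smi_b)]
        elif (smi_b, smi_a) in cache:
            hits[(smi_a, smi_b)] = cache[(smi_b, smi_a)]
        else:
            misses.append((smi_a, smi_b))
    print(f"cache hits: {len(hits)}, misses: {len(misses)}")
    return hits, misses
```

Only the misses would then be sent to the (expensive) distance computation, which is where the preprocessing speedup comes from.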

Data Augmentation Improvements:

  • Improved the peak augmentation logic to better handle zeroing and packing of array values, introduced a static utility for this operation, and adjusted augmentation probabilities for more realistic data augmentation.
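The zero-and-pack operation described above (zeroing selected peaks and compacting the survivors to the front of the arrays) might look roughly like this sketch; the real helper's signature in simba/core/data/augmentation.py may differ:

```python
import numpy as np


def zero_and_pack(mz: np.ndarray, intensity: np.ndarray, drop_indices: np.ndarray):
    """Zero selected peaks, then pack surviving peaks to the front of the arrays.

    Hypothetical reconstruction of the helper's behavior: dropped entries are
    removed and the arrays are left-compacted, with zero padding at the end so
    the output shape matches the input shape.
    """
    keep = np.ones(len(intensity), dtype=bool)
    keep[drop_indices] = False
    packed_mz = np.zeros_like(mz)
    packed_int = np.zeros_like(intensity)
    n_keep = int(keep.sum())
    packed_mz[:n_keep] = mz[keep]
    packed_int[:n_keep] = intensity[keep]
    return packed_mz, packed_int
```

Packing (rather than leaving zeros scattered in place) keeps the peak arrays in the dense front-aligned layout the downstream encoder expects.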

Algorithm and Dependency Updates:

  • Switched the default MCES solver to "HiGHS" for better performance and added highspy as a dependency.

Configuration and Error Handling:

  • Added a use_ion_mode flag to the model config for more flexible feature engineering.
  • Improved error handling in analog_discovery.py to display full tracebacks for easier debugging.

Other Technical Updates:

  • Added dill import for robust pickle loading in distance cache routines.

These changes collectively enable faster and more robust preprocessing for related datasets, improve data augmentation, and enhance overall usability and maintainability.

Summary by CodeRabbit

  • New Features

    • Added use_ion_mode and expanded metadata support (ion mode, adduct, collision energy, ion activation, ionization method) across data, models, encoders, and inference.
    • MAE (mean absolute error) now reported alongside correlation for edit-distance and similarity evaluations.
  • Bug Fixes / Improvements

    • Better error diagnostics with full tracebacks in discovery runs.
    • Data augmentation made more conservative and more robust (reworked peak masking/dropout and metadata masking).
    • Training/inference handle missing spectra with warnings and remapping.
  • Chores

    • Updated ignore patterns and test command flags.


coderabbitai bot commented Mar 4, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds ion-mode/adduct/CE/activation/method metadata plumbing across encoding, datasets, encoders, models, builders, and workflows; refactors peak augmentation to use a new zero-and-pack helper and lowers augmentation probabilities; renames adduct encoding to return mass lists; propagates use_ion_mode through training/inference/CLI and prints full tracebacks on analog-discovery errors.

Changes

Cohort / File(s) Summary
Config & Ignore
./.gitignore, simba/configs/model/simba_default.yaml
Add ignore patterns (test_full_workflow/, simba/configs/*/tfs*.yaml, run_scripts_tfs*) and new config flag features.use_ion_mode: false.
Augmentation
simba/core/data/augmentation.py
Add zero_and_pack() helper; replace direct zeroing with packing across multiple augmentations; lower several augmentation probabilities (defaults → 0.5); remove inversion augmentation; convert peak_augmentation_max_peaks from staticmethod to instance method.
Encoding
simba/core/data/encoding.py
Rename encode_adduct to encode_adduct_mass; return a list of masses (one element per known adduct index) instead of a single float.
Datasets & Builders
simba/core/data/datasets/encoder_dataset_builder.py, simba/core/data/datasets/multitask_dataset.py, simba/core/data/datasets/multitask_dataset_builder.py
Always allocate metadata arrays (ionmode, adduct, ce, ion_activation, ion_method); add and propagate use_ion_mode flag; change adduct handling to use encode_adduct_mass; populate or zero-fill fields based on corresponding use_* flags.
Models & Encoders
simba/core/models/spectrum_encoder.py, simba/core/models/similarity_models.py
Add use_ion_mode constructor parameter; add optional per-feature encoders path; standardize nan->num normalization and propagate ionmode/adduct/ce/ion_activation/ion_method into model kwargs; adjust multitask noise sigma initialization to learnable nn.Parameters.
Workflows & CLI
simba/workflows/training.py, simba/workflows/inference.py, simba/commands/analog_discovery.py, simba/workflows/preprocessing.py, test_all_commands.sh
Propagate use_ion_mode through data prep and model load; handle missing spectra remapping in lightweight loaders; add MAE evaluation for ED/MCES and logging; print full tracebacks on analog-discovery exceptions; add test flags enabling use_adduct.
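The MAE reported alongside correlation in the evaluation changes is plain mean absolute error; the workflow imports sklearn.metrics.mean_absolute_error, which behaves like this minimal sketch:

```python
import numpy as np


def mean_absolute_error(y_true, y_pred) -> float:
    """MAE as reported alongside correlation for ED/MCES evaluations
    (sketch; the workflow itself uses sklearn.metrics.mean_absolute_error)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))
```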

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI (train / infer / analog-discovery)
    participant Builder as MultitaskDataBuilder
    participant Dataset as CustomDatasetMultitasking
    participant Dataloader as DataLoader
    participant Model as SimilarityModelMultitask
    participant Encoder as SpectrumTransformerEncoderCustom
    participant Eval as Evaluator

    CLI->>Builder: from_molecule_pairs_to_dataset(..., use_ion_mode)
    Builder->>Dataset: allocate arrays (mz,intensity,ionmode,adduct,ce,ion_activation,ion_method)
    Dataset->>Dataloader: create batches
    CLI->>Model: load_from_checkpoint(..., use_ion_mode)
    Model->>Encoder: init(use_ion_mode, use_encoders?)
    Dataloader->>Model: batch (mz,intensity,ionmode,adduct,ce,ion_activation,ion_method)
    Model->>Model: nan_to_num / normalize metadata
    Model->>Encoder: forward(mz,intensity, kwargs: metadata)
    Encoder->>Model: embeddings
    Model->>Eval: predict (ED / MCES)
    Eval->>CLI: metrics (correlation, MAE)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • chevi1989
  • Janne98
  • bittremieux

Poem

🐰 I hopped through arrays, packed peaks nice and neat,

Adduct masses lined up, ion modes tap their feet.
I nudged metadata into tidy rows,
The pipeline hummed as the rabbit goes.
Hooray for packed peaks and tidy model flows!

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 69.70%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check ❓ Inconclusive: The title 'feature: add all recent metadata fixes' is vague and generic, using non-descriptive terms like 'all recent' and 'fixes' that don't convey meaningful information about the changeset. Resolution: provide a more specific title that describes the main feature or fix, such as 'feature: add ion mode support and metadata field encoding' or 'feature: enhance metadata handling with ion mode flag and adduct encoding'.

✅ Passed checks (1 passed)

  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.


@rukubrakov rukubrakov requested review from Janne98 and chevi1989 March 4, 2026 13:48

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
simba/core/data/augmentation.py (1)

33-35: ⚠️ Potential issue | 🔴 Critical

Guard max_peaks before arithmetic to prevent runtime TypeError.

Line 34 can pass None into peak_augmentation_max_peaks, and Line 77 multiplies by max_peaks. If that branch is hit, augmentation crashes.

🐛 Proposed fix
 @staticmethod
 def peak_augmentation_max_peaks(data_sample, p_augmentation=0.50, max_peaks=100):
+    if max_peaks is None:
+        max_peaks = 100
     # first normalize to maximum

     ## half of the time select maximum 20, the other half something between 5 and the maximum number of peaks

Also applies to: 68-78

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@simba/core/data/augmentation.py` around lines 33 - 35, The call to
Augmentation.peak_augmentation_max_peaks can receive max_peaks=None which later
causes a TypeError when the code multiplies by max_peaks; update the caller(s)
and/or peak_augmentation_max_peaks to guard against None by either (a)
validating max_peaks before arithmetic and returning early or using a sensible
int default (e.g., 0) when None, or (b) replacing expressions like something *
max_peaks with a safe expression that only multiplies when max_peaks is not
None; apply this fix around the call sites that supply max_peaks (the call to
Augmentation.peak_augmentation_max_peaks) and inside the
peak_augmentation_max_peaks implementation to ensure no multiplication occurs
when max_peaks is None.
simba/workflows/inference.py (1)

3-25: ⚠️ Potential issue | 🟡 Minor

Ruff warnings remain in this file.

Import block ordering and trailing whitespace are still flagged by the code-quality pipeline.

Also applies to: 345-346

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@simba/workflows/inference.py` around lines 3 - 25, The import block in
inference.py violates Ruff/PEP8 ordering and has trailing whitespace; reorder
and clean up the top-level imports (the block that includes imports like copy,
os, sys, Path, dill, lightning.pytorch as pl, numpy as np, DictConfig,
spearmanr, DataLoader, and all simba.* imports) into the canonical groups
(standard library, third-party, then local/package imports), remove any trailing
whitespace and any unused imports, and ensure imports are alphabetized within
each group to satisfy Ruff/isort expectations.
simba/core/models/spectrum_encoder.py (1)

1-6: ⚠️ Potential issue | 🟡 Minor

Ruff warnings in this hunk are still unresolved.

CI reports import ordering plus whitespace/newline violations in this file; these should be cleaned before merge.

Also applies to: 79-150

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@simba/core/models/spectrum_encoder.py` around lines 1 - 6, The imports in
spectrum_encoder.py are misordered and have whitespace/newline issues flagged by
Ruff; reorder imports into standard library, third-party, then local groups with
single blank lines between groups, put torch, depthcharge imports
(SpectrumTransformerEncoder) and depthcharge.encoders (FloatEncoder) in the
correct third-party group, remove extra trailing commas/blank lines and ensure
file ends with a single newline; apply the same ordering/whitespace fixes to the
subsequent import section around lines 79-150 and run ruff/isort to verify no
remaining violations.
simba/core/data/datasets/multitask_dataset.py (1)

109-111: ⚠️ Potential issue | 🔴 Critical

ionmode_* dictionary keys are initialized under the wrong feature flag.

The ionmode_0/ionmode_1 entries are written when self.use_ion_mode is enabled, but the keys are only created when self.use_adduct is enabled, so use_ion_mode=True with use_adduct=False fails at runtime.

Proposed fix
-        if self.use_adduct:
+        if self.use_ion_mode:
             dictionary["ionmode_0"] = np.zeros((len_data, 1), dtype=np.float32)
             dictionary["ionmode_1"] = np.zeros((len_data, 1), dtype=np.float32)
+
+        if self.use_adduct:
             dictionary["adduct_0"] = np.zeros(
                 (len_data, len(ADDUCT_TO_MASS.keys())), dtype=np.float32
             )
             dictionary["adduct_1"] = np.zeros(
                 (len_data, len(ADDUCT_TO_MASS.keys())), dtype=np.float32
             )

Also applies to: 187-193

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@simba/core/data/datasets/multitask_dataset.py` around lines 109 - 111, The
code currently creates dictionary["ionmode_0"] and dictionary["ionmode_1"] under
the wrong flag (created only when self.use_adduct is true); change the condition
so these keys are initialized whenever self.use_ion_mode is true (not
self.use_adduct), keeping the same shape and dtype (np.zeros((len_data, 1),
dtype=np.float32)); apply the same fix to the other occurrence that mirrors this
block so both places create ionmode_0/1 when self.use_ion_mode is enabled.
🧹 Nitpick comments (3)
.gitignore (1)

92-95: Remove duplicated ignore pattern for run_scripts_tfs*.

run_scripts_tfs* is already listed at Line 70, so the entry at Line 94 is redundant. Keeping only one instance will make the ignore file easier to maintain.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.gitignore around lines 92 - 95, Remove the duplicate ignore entry for the
pattern run_scripts_tfs* in the .gitignore contents: locate both occurrences of
the pattern (one around Line 70 and the other in the shown diff) and delete the
redundant one so only a single run_scripts_tfs* entry remains to keep the ignore
list tidy.
test_all_commands.sh (1)

117-159: Add one metadata workflow run with model.features.use_ion_mode=true.

The new ion-mode path isn’t exercised by this script yet. Please enable it in at least one of the metadata train/inference flows.

Suggested test coverage tweak
 uv run simba train \
     training=fast_dev \
     paths.preprocessing_dir_train=./test_full_workflow/preprocessed/ \
     paths.checkpoint_dir=./test_full_workflow/checkpoints_metadata/ \
     model.features.use_adduct=true \
+    model.features.use_ion_mode=true \
     model.features.use_ce=true \
     model.features.use_ion_activation=true \
     model.features.use_ion_method=true \
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test_all_commands.sh` around lines 117 - 159, The metadata workflows for
training/inference/analog-discovery currently set
model.features.use_adduct/ce/ion_activation/ion_method but never enable
model.features.use_ion_mode; update at least one of the metadata flows (e.g.,
the "uv run simba train" block, or the matching "uv run simba inference" block,
or the "uv run simba analog-discovery" block) to include
model.features.use_ion_mode=true so the ion-mode code path is exercised during
the test run.
simba/core/models/spectrum_encoder.py (1)

33-55: self.use_encoders is hardcoded to False, so encoder branch is dead code.

Either wire this to config/init input or remove the unused branch to reduce complexity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@simba/core/models/spectrum_encoder.py` around lines 33 - 55, The code
hardcodes self.use_encoders = False which makes the encoder initialization block
in __init__ dead; either add a constructor parameter (e.g., use_encoders=False)
and set self.use_encoders = use_encoders so the conditional in __init__ can be
enabled, or remove the outer if self.use_encoders and always conditionally
create encoders based on the individual flags (use_adduct, use_ce,
use_ion_activation, use_ion_method); update the signature and assignments around
SpectrumEncoder.__init__ (referenced symbols: self.use_encoders, use_adduct,
use_ce, use_ion_activation, use_ion_method, adduct_encoder, ce_encoder,
ion_activation_encoder, ion_method_encoder, precursor_mz_encoder) accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@simba/core/data/augmentation.py`:
- Around line 90-92: The slice used to compute indexes_to_be_erased is
off-by-one: replace intensity_ordered_indexes[max_augmented_peaks:-1] with
intensity_ordered_indexes[max_augmented_peaks:] so all peaks after the cutoff
are removed; update the expression in the augmentation routine that assigns
indexes_to_be_erased (referencing intensity_ordered_indexes and
max_augmented_peaks) accordingly.
- Around line 227-272: The code zeroes paired metadata fields in both the
random-branch and the else branch without ensuring each key exists, which can
raise KeyError for partial samples; update the masking to be key-safe by
iterating over the explicit metadata keys (e.g.
"ionmode_0","ionmode_1","precursor_charge_0","precursor_charge_1","adduct_0","adduct_1","ce_0","ce_1","ion_activation_0","ion_activation_1")
and only assign 0 * data_sample[k] when k is present (or check membership in
keys_found before each assignment) in both the conditional blocks that
manipulate data_sample so every write is guarded against missing keys.
- Around line 208-213: Add a guard to skip dropout when n_drop == 0 and ensure
drop_indices is integer dtype before passing to Augmentation.zero_and_pack:
check n_drop (calculated from n_peaks and dropout_rate) and return or skip the
zero_and_pack call when it's zero, and when constructing drop_indices from
random.sample convert it to an integer array (e.g., dtype=np.int64) so NumPy
fancy indexing in Augmentation.zero_and_pack always receives integer indices;
refer to variables n_drop, drop_indices, intensity_array and the
Augmentation.zero_and_pack call to locate the change.

In `@simba/core/data/datasets/encoder_dataset_builder.py`:
- Around line 69-70: The current check can raise TypeError when spectrum.params
exists but is None; update the guard before testing membership so you only check
'adduct' when params is a non-None mapping. Replace the condition around the
adduct assignment (the block that references spectrum.params, adduct[i] and
encode_adduct_mass) with a safe guard such as using getattr(spectrum, 'params',
None) or explicit checks (spectrum.params is not None and
isinstance(spectrum.params, dict)) before doing "'adduct' in spectrum.params",
then call encode_adduct_mass(spectrum.params['adduct']) only when that guard
passes.
- Around line 63-65: The current ionmode assignment in
encoder_dataset_builder.py (inside the loop handling spectrum.ionmode, updating
ionmode[i]) treats any non-"positive" string as negative; change this to an
explicit mapping: if spectrum.ionmode (lowercased) == "positive" set
ionmode[i]=1.0, elif == "negative" set ionmode[i]=-1.0, else set ionmode[i]=0.0
to preserve unknown/malformed values; keep the existing None/"None" checks and
ensure you call .lower() safely only when spectrum.ionmode is a string.
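The explicit mapping this prompt asks for can be sketched as a small helper (the helper name is hypothetical; the suggestion applies the logic inline in the builder loop):

```python
def encode_ionmode(ionmode) -> float:
    """Explicit ion-mode mapping sketched from the review suggestion:
    positive -> 1.0, negative -> -1.0, anything else (unknown, malformed,
    or non-string, e.g. None) -> 0.0. Hypothetical helper, not simba's API.
    """
    if not isinstance(ionmode, str):
        return 0.0
    value = ionmode.lower()
    if value == "positive":
        return 1.0
    if value == "negative":
        return -1.0
    return 0.0
```

Mapping unknown values to 0.0 instead of -1.0 keeps malformed metadata distinguishable from a genuine negative-mode label.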

In `@simba/core/data/datasets/multitask_dataset_builder.py`:
- Around line 149-150: The code naively accesses spec.params["adduct"] when
use_adduct is true, which can raise KeyError for MGF entries missing that
metadata; update the block in multitask_dataset_builder.py to first check for
the key (e.g., use spec.params.get("adduct") or "adduct" in spec.params) and
only call encode_adduct_mass when a value exists, otherwise set a sensible
default for adduct[i] (e.g., 0 or NaN) or skip encoding; ensure you reference
the existing variables/funcs adduct, use_adduct, spec.params and
encode_adduct_mass so the fix is localized.

In `@simba/core/data/encoding.py`:
- Around line 10-25: encode_adduct_mass currently returns a list but its
docstring says float and it always writes the mass to response[0], collapsing
all adducts; fix by making the adduct order explicit (convert
ADDUCT_TO_MASS.keys() to a list to preserve deterministic index order), create
response = [0.0] * len(adduct_list), find the index of the input adduct (e.g.,
via adduct_list.index(adduct) or enumerate to match) and set response[index] =
ADDUCT_TO_MASS[adduct]; update the docstring/return type to reflect that it
returns a list[float] (or change to return a single float if you prefer that
behavior) and keep returning the zero-vector when adduct is not recognized.
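Under the fix described above, encode_adduct_mass would behave roughly like this sketch (the ADDUCT_TO_MASS table here is a toy stand-in, not the project's real mapping):

```python
# Sketch of the corrected encode_adduct_mass described in the prompt above;
# the adduct table is a toy stand-in with made-up entries.
ADDUCT_TO_MASS = {"[M+H]+": 1.00728, "[M+Na]+": 22.98922, "[M-H]-": -1.00728}
ADDUCT_ORDER = list(ADDUCT_TO_MASS.keys())  # deterministic index order


def encode_adduct_mass(adduct: str) -> list:
    """Return a list with one slot per known adduct; the matching slot holds
    that adduct's mass and all others stay 0.0. Unknown adducts yield all zeros.
    """
    response = [0.0] * len(ADDUCT_ORDER)
    if adduct in ADDUCT_TO_MASS:
        response[ADDUCT_ORDER.index(adduct)] = ADDUCT_TO_MASS[adduct]
    return response
```

This fixes the collapse the prompt describes, where every adduct's mass was written to response[0] regardless of which adduct was seen.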

In `@simba/core/models/similarity_models.py`:
- Around line 431-432: Remove the trailing/blank-line whitespace from the
assignments to initial_log_sigma1 and initial_log_sigma2 in similarity_models.py
(the lines containing "initial_log_sigma1= 0.0" and "initial_log_sigma2 = -5.3")
and also remove the trailing whitespace at the occurrence around line with the
other reported issue (near the block at the second occurrence around line 792);
ensure spacing around the equals signs is normalized (e.g., single space before
and after '=') and no extra spaces at end of those lines.
- Around line 108-119: The code builds kwargs_0/kwargs_1 for the encoder but
only sets "precursor_charge" when self.use_ion_mode is true, which can omit a
required kwarg; always add "precursor_charge" to kwargs_0 and kwargs_1 using
batch["precursor_charge_0"].float() and batch["precursor_charge_1"].float(),
while keeping the "ionmode" addition conditional on self.use_ion_mode; update
the block that currently references kwargs_0/kwargs_1 and self.use_ion_mode so
precursor_charge is set unconditionally and ionmode remains guarded.
- Around line 469-513: The code assumes keys like "ionmode_0"/"1",
"adduct_0"/"1", "ce_0"/"1", "ion_activation_0"/"1", and "ion_method_0"/"1"
always exist in batch which causes KeyError; update the block in
similarity_models.py (the area that builds kwargs_0 and kwargs_1 from batch) to
first check for each key and if missing substitute a zero tensor matching the
expected shape/device/dtype (e.g., use torch.zeros or torch.zeros_like based on
another batch tensor or a batch_size reference), then apply torch.nan_to_num on
the tensor and assign float() into kwargs_0 and kwargs_1 (refer to the existing
symbols batch, torch.nan_to_num, kwargs_0, kwargs_1 and the individual keys like
"ce_0"/"ce_1") so absent metadata gracefully falls back to zeros instead of
raising KeyError.

In `@simba/core/models/spectrum_encoder.py`:
- Around line 66-111: The method that builds the metadata placeholder (inside
SpectrumEncoder / precursor_hook when not self.use_encoders) unconditionally
indexes kwargs keys like "precursor_charge", "ionmode", "adduct", "ce",
"ion_activation", and "ion_method", which can raise KeyError when feature flags
are off; change these accesses to only read the corresponding kwargs when the
matching flags (self.use_ion_mode, self.use_adduct, self.use_ce,
self.use_ion_activation, self.use_ion_method) are true (or use kwargs.get(...,
default_tensor) after validating shape) and ensure current_idx updates only when
the field was actually inserted so placeholder and current_idx remain consistent
(refer to placeholder, current_idx, and the specific kwargs keys in this block
to locate changes).

In `@simba/workflows/inference.py`:
- Around line 349-353: The MAE logging always multiplies mae_model_mces by
cfg.data.mces20_max_value which misreports results in Tanimoto mode; change the
logger.info call that reports MCES/Tanimoto mean absolute error so it only
multiplies by cfg.data.mces20_max_value when the run is in MCES20 mode (e.g.,
check a config flag indicating MCES20 vs Tanimoto) and otherwise log the raw
mae_model_mces; update the logger.info line that references mae_model_mces and
cfg.data.mces20_max_value accordingly.

---

Outside diff comments:
In `@simba/core/data/augmentation.py`:
- Around line 33-35: The call to Augmentation.peak_augmentation_max_peaks can
receive max_peaks=None which later causes a TypeError when the code multiplies
by max_peaks; update the caller(s) and/or peak_augmentation_max_peaks to guard
against None by either (a) validating max_peaks before arithmetic and returning
early or using a sensible int default (e.g., 0) when None, or (b) replacing
expressions like something * max_peaks with a safe expression that only
multiplies when max_peaks is not None; apply this fix around the call sites that
supply max_peaks (the call to Augmentation.peak_augmentation_max_peaks) and
inside the peak_augmentation_max_peaks implementation to ensure no
multiplication occurs when max_peaks is None.

In `@simba/core/data/datasets/multitask_dataset.py`:
- Around line 109-111: The code currently creates dictionary["ionmode_0"] and
dictionary["ionmode_1"] under the wrong flag (created only when self.use_adduct
is true); change the condition so these keys are initialized whenever
self.use_ion_mode is true (not self.use_adduct), keeping the same shape and
dtype (np.zeros((len_data, 1), dtype=np.float32)); apply the same fix to the
other occurrence that mirrors this block so both places create ionmode_0/1 when
self.use_ion_mode is enabled.

In `@simba/core/models/spectrum_encoder.py`:
- Around line 1-6: The imports in spectrum_encoder.py are misordered and have
whitespace/newline issues flagged by Ruff; reorder imports into standard
library, third-party, then local groups with single blank lines between groups,
put torch, depthcharge imports (SpectrumTransformerEncoder) and
depthcharge.encoders (FloatEncoder) in the correct third-party group, remove
extra trailing commas/blank lines and ensure file ends with a single newline;
apply the same ordering/whitespace fixes to the subsequent import section around
lines 79-150 and run ruff/isort to verify no remaining violations.

In `@simba/workflows/inference.py`:
- Around line 3-25: The import block in inference.py violates Ruff/PEP8 ordering
and has trailing whitespace; reorder and clean up the top-level imports (the
block that includes imports like copy, os, sys, Path, dill, lightning.pytorch as
pl, numpy as np, DictConfig, spearmanr, DataLoader, and all simba.* imports)
into the canonical groups (standard library, third-party, then local/package
imports), remove any trailing whitespace and any unused imports, and ensure
imports are alphabetized within each group to satisfy Ruff/isort expectations.

---

Nitpick comments:
In @.gitignore:
- Around line 92-95: Remove the duplicate ignore entry for the pattern
run_scripts_tfs* in the .gitignore contents: locate both occurrences of the
pattern (one around Line 70 and the other in the shown diff) and delete the
redundant one so only a single run_scripts_tfs* entry remains to keep the ignore
list tidy.

In `@simba/core/models/spectrum_encoder.py`:
- Around line 33-55: The code hardcodes self.use_encoders = False which makes
the encoder initialization block in __init__ dead; either add a constructor
parameter (e.g., use_encoders=False) and set self.use_encoders = use_encoders so
the conditional in __init__ can be enabled, or remove the outer if
self.use_encoders and always conditionally create encoders based on the
individual flags (use_adduct, use_ce, use_ion_activation, use_ion_method);
update the signature and assignments around SpectrumEncoder.__init__ (referenced
symbols: self.use_encoders, use_adduct, use_ce, use_ion_activation,
use_ion_method, adduct_encoder, ce_encoder, ion_activation_encoder,
ion_method_encoder, precursor_mz_encoder) accordingly.

In `@test_all_commands.sh`:
- Around line 117-159: The metadata workflows for
training/inference/analog-discovery currently set
model.features.use_adduct/ce/ion_activation/ion_method but never enable
model.features.use_ion_mode; update at least one of the metadata flows (e.g.,
the "uv run simba train" block, or the matching "uv run simba inference" block,
or the "uv run simba analog-discovery" block) to include
model.features.use_ion_mode=true so the ion-mode code path is exercised during
the test run.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d83799ec-f551-45d8-8853-79095f0e1177

📥 Commits

Reviewing files that changed from the base of the PR and between acbfa0a and 541f344.

📒 Files selected for processing (14)
  • .gitignore
  • simba/commands/analog_discovery.py
  • simba/configs/model/simba_default.yaml
  • simba/core/data/augmentation.py
  • simba/core/data/datasets/encoder_dataset_builder.py
  • simba/core/data/datasets/multitask_dataset.py
  • simba/core/data/datasets/multitask_dataset_builder.py
  • simba/core/data/encoding.py
  • simba/core/models/similarity_models.py
  • simba/core/models/spectrum_encoder.py
  • simba/workflows/inference.py
  • simba/workflows/preprocessing.py
  • simba/workflows/training.py
  • test_all_commands.sh


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (1)
simba/workflows/inference.py (1)

381-382: ⚠️ Potential issue | 🟡 Minor

Scale MCES MAE only for MCES20 runs.

Line 382 still multiplies by cfg.data.mces20_max_value unconditionally, so Tanimoto runs log an inflated MAE.

♻️ Minimal fix
-    logger.info(f"MCES/Tanimoto mean absolute error: {cfg.data.mces20_max_value * mae_model_mces:.4f}")
+    mae_scale = cfg.data.mces20_max_value if not cfg.data.use_tanimoto else 1.0
+    logger.info(f"MCES/Tanimoto mean absolute error: {mae_scale * mae_model_mces:.4f}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@simba/workflows/inference.py` around lines 381 - 382, The MAE logging
currently always multiplies mae_model_mces by cfg.data.mces20_max_value,
inflating values for non-MCES20 runs; update the logger.info call that uses
mae_model_mces so it only multiplies by cfg.data.mces20_max_value when the run
is an MCES20 run (e.g., check a flag like cfg.data.mces20 or cfg.data.task ==
"mces20"), otherwise log mae_model_mces unscaled; modify the line with
logger.info(f"MCES/Tanimoto mean absolute error: {cfg.data.mces20_max_value *
mae_model_mces:.4f}") to conditionally apply the scaling based on that flag.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@simba/workflows/inference.py`:
- Around line 58-61: Remove the extraneous blank lines and trailing spaces that
are causing Ruff to fail in inference.py: specifically tidy the context around
the preprocessing config usage where use_only_protonized =
getattr(cfg.preprocessing, 'use_only_protonized_adducts', True) is defined and
the other occurrences referenced (around lines with similar whitespace at the
blocks that set/consume cfg.preprocessing and use_only_protonized), ensuring
there are no whitespace-only lines or trailing spaces in those areas so the file
passes linting.
- Around line 89-99: After filtering df_smiles and remapping its "indexes" using
idx_map (the block that builds valid_rows and calls df_smiles =
df_smiles.loc[valid_rows].reset_index(drop=True)), you must also update the
pair_distances data to the new index space: locate where pair_distances (and any
related pair indices) are attached later (lines around the block that processes
pair_distances) and either remap each index in pair_distances via the same
idx_map-to-new-position mapping or drop any pair entries that reference removed
spectra; ensure the remapped indices match the post-filter df_smiles index
positions (use the same valid_rows/idx_map transformation) so no pair_distances
entries point to old/out-of-bounds indices.
- Line 24: The import from sklearn (from sklearn.metrics import
mean_absolute_error) is in the wrong block; move this import into the existing
third-party imports block in simba/workflows/inference.py so ruff sees correct
import ordering—locate the top-of-file import groups and relocate the
mean_absolute_error import into the third-party section alongside other
external-library imports.
- Around line 59-60: The code reads use_only_protonized_adducts via
getattr(cfg.preprocessing, 'use_only_protonized_adducts', True) which will raise
AttributeError if cfg.preprocessing is missing; change to safely access the
preprocessing section first (e.g., preproc = getattr(cfg, 'preprocessing', None)
or hasattr(cfg, 'preprocessing') check) and then read
use_only_protonized_adducts from preproc with a default True so the fallback
applies even when cfg.preprocessing is absent; update the code around the cfg
and use_only_protonized_adducts references to use this two-step guarded lookup.

---

Duplicate comments:
In `@simba/workflows/inference.py`:
- Around line 381-382: The MAE logging currently always multiplies
mae_model_mces by cfg.data.mces20_max_value, inflating values for non-MCES20
runs; update the logger.info call that uses mae_model_mces so it only multiplies
by cfg.data.mces20_max_value when the run is an MCES20 run (e.g., check a flag
like cfg.data.mces20 or cfg.data.task == "mces20"), otherwise log mae_model_mces
unscaled; modify the line with logger.info(f"MCES/Tanimoto mean absolute error:
{cfg.data.mces20_max_value * mae_model_mces:.4f}") to conditionally apply the
scaling based on that flag.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ae7f607d-6ebf-404d-abac-637ba7ad1967

📥 Commits

Reviewing files that changed from the base of the PR and between 4ebc042 and c2365f4.

📒 Files selected for processing (1)
  • simba/workflows/inference.py


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (3)
simba/workflows/inference.py (3)

58-69: ⚠️ Potential issue | 🟠 Major

Guard the preprocessing section before reading from it.

Same issue as in training.py: getattr(cfg.preprocessing, ...) will raise AttributeError if cfg.preprocessing doesn't exist.

🛡️ Proposed defensive fix
-        use_only_protonized = getattr(
-            cfg.preprocessing, "use_only_protonated_adducts", True
-        )
+        preprocessing_cfg = getattr(cfg, "preprocessing", None)
+        use_only_protonized = (
+            getattr(preprocessing_cfg, "use_only_protonized_adducts", True)
+            if preprocessing_cfg is not None
+            else True
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@simba/workflows/inference.py` around lines 58 - 69, The code reads
cfg.preprocessing without ensuring it exists, so change the logic that sets
use_only_protonized to first guard cfg.preprocessing (e.g., get the
preprocessing object via getattr(cfg, "preprocessing", None) or check
hasattr(cfg, "preprocessing") ) and then call getattr(...) on that object with
default True; update the block that computes use_only_protonized (and any
similar accesses) so load_spectra is always called with a safe boolean even when
cfg.preprocessing is absent, referencing the variables cfg, preprocessing,
use_only_protonized_adducts, use_only_protonized, and the call to
load_spectra/all_spectra.
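The two-step guarded lookup can be verified standalone; `SimpleNamespace` stands in here for the real Hydra/OmegaConf config object:

```python
from types import SimpleNamespace


def resolve_use_only_protonized(cfg) -> bool:
    """Read cfg.preprocessing.use_only_protonized_adducts, defaulting to True.

    Returns a safe boolean even when the whole `preprocessing` section is
    absent -- exactly the case where a direct getattr(cfg.preprocessing, ...)
    would raise AttributeError.
    """
    preprocessing_cfg = getattr(cfg, "preprocessing", None)
    if preprocessing_cfg is None:
        return True
    return getattr(preprocessing_cfg, "use_only_protonized_adducts", True)
```

The result can then be passed straight to `load_spectra` without any further guards at the call site.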

79-103: ⚠️ Potential issue | 🟠 Major

Missing spectra handling doesn't remap indexes when no spectra are missing.

Same issue as in training.py: when missing is empty, the remapping via idx_map is skipped entirely, but indexes may still need remapping if spectra were filtered during loading.

Additionally, as noted in a past review, pair_distances (loaded later at lines 131-144) still references the pre-filter index space and is not remapped, which can cause index mismatches or out-of-bounds errors.

🐛 Proposed fix to always remap indexes
             if missing:
                 logger.warning(
                     f"[test] Missing {len(missing)} spectra (e.g., MGF index {missing[0]})"
                 )
-                # Filter df_smiles to keep only rows with valid spectra
-                valid_rows = []
-                for i in df_smiles.index:
-                    old_idxs = df_smiles.loc[i, "indexes"]
-                    if all(idx in idx_map for idx in old_idxs):
-                        # Remap to new positions
-                        df_smiles.at[i, "indexes"] = [idx_map[idx] for idx in old_idxs]
-                        valid_rows.append(i)
-                df_smiles = df_smiles.loc[valid_rows]
+
+            # Always remap indexes (needed even when missing is empty if filtering occurred)
+            valid_rows = []
+            for i in df_smiles.index:
+                old_idxs = df_smiles.loc[i, "indexes"]
+                if all(idx in idx_map for idx in old_idxs):
+                    df_smiles.at[i, "indexes"] = [idx_map[idx] for idx in old_idxs]
+                    valid_rows.append(i)
+            if missing:
+                df_smiles = df_smiles.loc[valid_rows]

Note: The pair_distances remapping issue from the past review also needs to be addressed separately.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@simba/workflows/inference.py` around lines 79 - 103, Build idx_map
unconditionally while loading spectra (map every old_idx to new sequential index
only for those mgf_idx found in spectra_by_idx), then always remap
df_smiles["indexes"] using that idx_map (filter out rows whose old indexes are
not all present in idx_map and replace their indexes with [idx_map[idx] for idx
in old_idxs]); do not skip this when missing is empty. After loading
pair_distances, also remap its index references from the original index space to
the new one using idx_map (e.g., translate any index keys or index arrays via
idx_map and drop entries that reference missing indices) so pair_distances and
df_smiles consistently reference the new original_spectra positions.
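The unconditional remapping described above can be sketched with plain dicts and lists (toy data; the real code operates on a pandas DataFrame and MGF spectra):

```python
# Toy stand-ins: MGF index 1 was filtered out during loading.
spectra_by_idx = {0: "spec0", 2: "spec2", 3: "spec3"}
idx_map = {old: new for new, old in enumerate(sorted(spectra_by_idx))}
# idx_map maps surviving old indices to new sequential positions: {0: 0, 2: 1, 3: 2}

rows = [
    {"smiles": "A", "indexes": [0, 2]},  # all spectra survived -> remap
    {"smiles": "B", "indexes": [1, 3]},  # references a filtered spectrum -> drop
]

# Always remap, regardless of whether `missing` is empty.
remapped = []
for row in rows:
    if all(idx in idx_map for idx in row["indexes"]):
        remapped.append({**row, "indexes": [idx_map[i] for i in row["indexes"]]})

# Pair-level structures must go through the same translation so they
# never point into the old index space.
pair_distances = {(0, 2): 0.4, (1, 3): 0.9}
pair_distances = {
    (idx_map[a], idx_map[b]): d
    for (a, b), d in pair_distances.items()
    if a in idx_map and b in idx_map
}
```

Because the same `idx_map` drives both translations, `remapped` rows and `pair_distances` keys end up in one consistent index space.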

380-388: ⚠️ Potential issue | 🟡 Minor

Scale MAE conditionally for Tanimoto vs MCES20 mode.

Line 387 always multiplies MAE by cfg.data.mces20_max_value, which misreports MAE in Tanimoto mode where the scale is already 0-1.

🐛 Proposed fix
     else:
         corr_model_mces, _ = spearmanr(mces_true, pred_mces_mces_flat)
         mae_model_mces = mean_absolute_error(mces_true, pred_mces_mces_flat)
     logger.info(f"MCES/Tanimoto correlation: {corr_model_mces:.4f}")
-    logger.info(
-        f"MCES/Tanimoto mean absolute error: {cfg.data.mces20_max_value * mae_model_mces:.4f}"
-    )
+    mae_scale = cfg.data.mces20_max_value if not cfg.data.use_tanimoto else 1.0
+    logger.info(f"MCES/Tanimoto mean absolute error: {mae_scale * mae_model_mces:.4f}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@simba/workflows/inference.py` around lines 380 - 388, The MAE is always
multiplied by cfg.data.mces20_max_value even in Tanimoto mode; change the
logging to scale mae_model_mces only when MCES20 scaling is in effect (e.g.,
when cfg.data.mces20_max_value != 1.0 or when a specific MCES20 mode flag is
set). Update the lines that log MCES/Tanimoto mean absolute error to
conditionally use cfg.data.mces20_max_value * mae_model_mces when MCES20 scaling
applies, otherwise log mae_model_mces unscaled; refer to the variables
mae_model_mces and cfg.data.mces20_max_value to locate and implement this
conditional formatting.
🧹 Nitpick comments (1)
simba/workflows/training.py (1)

291-291: Minor style: remove spaces around = in keyword arguments.

PEP 8 recommends no spaces around = for keyword arguments.

✨ Proposed fix
-        use_ion_mode = cfg.model.features.use_ion_mode,
+        use_ion_mode=cfg.model.features.use_ion_mode,

Apply to both lines 291 and 301.

Also applies to: 301-301

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@simba/workflows/training.py` at line 291, Remove the extra spaces around the
equals sign in keyword arguments so they follow PEP8 (e.g., change use_ion_mode
= cfg.model.features.use_ion_mode to
use_ion_mode=cfg.model.features.use_ion_mode) in the call/site where training
config kwargs are being passed; apply the same formatting fix to the other
keyword argument in that same call (the similar kwarg on the other nearby line).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@simba/workflows/training.py`:
- Line 89: The code reads getattr(cfg.preprocessing,
'use_only_protonized_adducts', True) but will raise AttributeError if
cfg.preprocessing is missing; update the logic around use_only_protonized to
first guard the preprocessing section (e.g., check hasattr(cfg, 'preprocessing')
or use getattr(cfg, 'preprocessing', None)) and then read
use_only_protonized_adducts with a default True; modify the assignment for
use_only_protonized in simba.workflows.training (the symbol use_only_protonized)
to handle a missing cfg.preprocessing gracefully and fall back to True.
- Around line 114-121: There are whitespace-only blank lines around the loop
that iterates over spectrum_indexes (the block touching symbols
spectrum_indexes, spectra_by_idx, idx_map, original_spectra, missing) — remove
the trailing spaces or empty lines at those locations so no lines contain only
whitespace; ensure the loop block has no extra blank lines before/after it and
run the repo linter or trim trailing whitespace to prevent reintroduction.
- Around line 87-96: Remove the stray whitespace characters and trailing spaces
around the block that calls load_spectra: delete any spaces on otherwise-blank
lines (the blank lines around the assignment to use_only_protonized and the
load_spectra call) and remove the trailing whitespace after the cfg, argument
(the trailing space after "cfg,") so the lines around use_only_protonized,
mgf_path, cfg, and the load_spectra(...) invocation contain no extraneous
spaces.
- Around line 110-132: The code only applies the idx_map remapping when missing
is non-empty, which leaves df_smiles["indexes"] pointing to old positions when
load_spectra (e.g., via use_only_protonized_adducts) filtered out spectra;
always apply the remapping after building original_spectra and idx_map: after
the loop that populates original_spectra/idx_map (using spectrum_indexes and
spectra_by_idx) iterate all df_smiles rows, replace each df_smiles.at[i,
"indexes"] with [idx_map[idx] for idx in old_idxs] only keeping rows whose
old_idxs all exist in idx_map, and then filter df_smiles to those valid_rows—do
this unconditionally regardless of the value of missing so indexes are
consistent with original_spectra.

---

Duplicate comments:
In `@simba/workflows/inference.py`:
- Around line 58-69: The code reads cfg.preprocessing without ensuring it
exists, so change the logic that sets use_only_protonized to first guard
cfg.preprocessing (e.g., get the preprocessing object via getattr(cfg,
"preprocessing", None) or check hasattr(cfg, "preprocessing") ) and then call
getattr(...) on that object with default True; update the block that computes
use_only_protonized (and any similar accesses) so load_spectra is always called
with a safe boolean even when cfg.preprocessing is absent, referencing the
variables cfg, preprocessing, use_only_protonized_adducts, use_only_protonized,
and the call to load_spectra/all_spectra.
- Around line 79-103: Build idx_map unconditionally while loading spectra (map
every old_idx to new sequential index only for those mgf_idx found in
spectra_by_idx), then always remap df_smiles["indexes"] using that idx_map
(filter out rows whose old indexes are not all present in idx_map and replace
their indexes with [idx_map[idx] for idx in old_idxs]); do not skip this when
missing is empty. After loading pair_distances, also remap its index references
from the original index space to the new one using idx_map (e.g., translate any
index keys or index arrays via idx_map and drop entries that reference missing
indices) so pair_distances and df_smiles consistently reference the new
original_spectra positions.
- Around line 380-388: The MAE is always multiplied by cfg.data.mces20_max_value
even in Tanimoto mode; change the logging to scale mae_model_mces only when
MCES20 scaling is in effect (e.g., when cfg.data.mces20_max_value != 1.0 or when
a specific MCES20 mode flag is set). Update the lines that log MCES/Tanimoto
mean absolute error to conditionally use cfg.data.mces20_max_value *
mae_model_mces when MCES20 scaling applies, otherwise log mae_model_mces
unscaled; refer to the variables mae_model_mces and cfg.data.mces20_max_value to
locate and implement this conditional formatting.

---

Nitpick comments:
In `@simba/workflows/training.py`:
- Line 291: Remove the extra spaces around the equals sign in keyword arguments
so they follow PEP8 (e.g., change use_ion_mode = cfg.model.features.use_ion_mode
to use_ion_mode=cfg.model.features.use_ion_mode) in the call/site where training
config kwargs are being passed; apply the same formatting fix to the other
keyword argument in that same call (the similar kwarg on the other nearby line).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: be743799-82c4-46bf-99c8-71938406e57f

📥 Commits

Reviewing files that changed from the base of the PR and between c2365f4 and ed8b3c6.

📒 Files selected for processing (2)
  • simba/workflows/inference.py
  • simba/workflows/training.py

@rukubrakov rukubrakov merged commit b400dba into dev Mar 31, 2026
7 checks passed
@rukubrakov rukubrakov deleted the feature/metadata-upd-clean branch March 31, 2026 14:43