Balance across ionmodes #278

Merged

niekdejonge merged 49 commits into main from balance_across_ionmodes

Jan 27, 2026

CHANGELOG.md

@@ Expand Up @@
     and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
     ## [Unreleased]
+    ### Added
+    - The training pair sampling for both ionmodes is now balanced over the different ionmode pairs.
     ### Fixed
    - The data split into test, train and val is no longer done separately per ionmode.
     ### Changed
    - Settings now include the file name of the spectra. This makes tracking runs easier and gives more flexibility for the results folder.
+    - Split the different data generators into separate files; before, they were all in data_generators.py.
+    - Renamed SpectrumPairGenerator -> TrainingBatchGenerator, which better captures what the class does.
+    - Moved the data augmentation to a separate file out of the TrainingBatchGenerator.
+    - Refactored the data augmentation to make it a bit more modular and testable (also added extra tests).
+    - Moved the spectrum picking from TrainingBatchGenerator into InchikeyPairGenerator and renamed InchikeyPairGenerator to SpectrumPairGenerator.
+    - Turned the new SpectrumPairGenerator (InchikeyPairGenerator before) into a real generator; before, we had a generator method returning a generator.
     ## [2.6.0] - 2025-12-05
     ### Changed
@@ Expand Down @@
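
The changelog entry on balanced pair sampling can be illustrated with a small sketch. The sampling code itself is not part of the files shown here, so this is only one possible way to balance ionmode pairs, not the implementation in this PR. The function name `balanced_pair_stream`, the group labels and the toy inchikeys are all assumptions made for illustration.

```python
import itertools
import random
from collections import Counter
from typing import Dict, Iterator, List, Tuple

# (inchikey1, inchikey2, tanimoto_score), mirroring the tuple layout used in this PR.
Pair = Tuple[str, str, float]


def balanced_pair_stream(pairs_per_group: Dict[str, List[Pair]],
                         seed: int = 0) -> Iterator[Tuple[str, Pair]]:
    """Yield (group_name, pair) tuples, visiting the groups in turn.

    Hypothetical helper: every group contributes the same number of pairs,
    regardless of how many pairs it holds.
    """
    rng = random.Random(seed)
    group_names = sorted(pairs_per_group)
    while True:
        for name in group_names:
            yield name, rng.choice(pairs_per_group[name])


# Toy data (made up): the pos-pos group is much larger than the other two.
pairs_per_group = {
    "pos-pos": [(f"POSKEY{i:08d}", f"POSKEY{i + 1:08d}", 0.5) for i in range(100)],
    "neg-neg": [("NEGKEY00000001", "NEGKEY00000002", 0.9)],
    "pos-neg": [("POSKEY00000001", "NEGKEY00000001", 0.1),
                ("POSKEY00000002", "NEGKEY00000002", 0.7)],
}

sampled = list(itertools.islice(balanced_pair_stream(pairs_per_group), 300))
group_counts = Counter(name for name, _ in sampled)
# Each of the three groups is drawn equally often, despite the size difference.
print(group_counts)
```

A round-robin over groups oversamples the small groups, which may be related to the drop in pos-pos accuracy mentioned in the settings docstring, although that link is not confirmed by the source.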

README.md

@@ Expand Up / @@ -65,6 +65,7 @@ Alternatively there are some example scripts below. @@
     ## 1) Compute spectral similarities
     We provide a model which was trained on > 500,000 MS/MS combined spectra from [GNPS](https://gnps.ucsd.edu/), [Mona](https://mona.fiehnlab.ucdavis.edu/), MassBank and MSnLib.
    This model can be downloaded [from zenodo here](https://zenodo.org/records/17826815). Only the ms2deepscore_model.pt is needed.
     The model works for spectra in both positive and negative ionization modes and even predictions across ionization modes can be made by this model.
@@ Expand Down @@

ms2deepscore/SettingsMS2Deepscore.py

@@ Expand Up / @@ -87,6 +87,12 @@ class SettingsMS2Deepscore: @@
                 The in between layers to be used. Default = (2000, 2000, 2000)
             embedding_dim:
                 The dimension of the final embedding. Default = 400
+            ionisation_mode:
+                The ionisation mode that is used for training the model.
+            balanced_sampling_across_ionmodes:
+                If True, the model will do separate pair sampling for training for each ionmode.
+                This gives a better balance over the ionmodes. Initial results showed a decrease in pos-pos prediction
+                accuracy, which you can find in the notebook model_benchmarking/Compare balanced cross ion moe sampling.ipynb
             additional_metadata:
                 Additional metadata that should be used in training the model. e.g. precursor_mz
             dropout_rate:
@@ Expand Down Expand Up / @@ -184,6 +190,7 @@ def __init__(self, validate_settings=True, **settings): @@
            self.embedding_dim = 400
             self.ionisation_mode = "positive"
             self.activation_function = "relu"
+            self.balanced_sampling_across_ionmodes = False
             # additional model structure options
             self.train_binning_layer: bool = False
@@ Expand Down Expand Up / @@ -295,6 +302,8 @@ def validate_settings(self): @@
             if self.loss_function.lower() not in LOSS_FUNCTIONS:
                 raise ValueError(f"Unknown loss function. Must be one of: {LOSS_FUNCTIONS.keys()}")
             validate_bin_order(self.same_prob_bins)
+            if self.balanced_sampling_across_ionmodes and self.ionisation_mode != "both":
+                raise ValueError("Balanced sampling across ionmodes only works if you train on both ionmodes")
         def create_model_directory_name(self):
             """Creates a directory name using metadata, it will contain the metadata, the binned spectra and final model"""
@@ Expand Down @@
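
The new check in `validate_settings` can be tried out in isolation with a minimal sketch. `MiniSettings` and `is_valid` are hypothetical stand-ins that only carry the two fields involved; they are not the real `SettingsMS2Deepscore` class, which has many more settings and checks.

```python
from dataclasses import dataclass


@dataclass
class MiniSettings:
    """Hypothetical stand-in for SettingsMS2Deepscore with only two fields."""
    ionisation_mode: str = "positive"
    balanced_sampling_across_ionmodes: bool = False

    def validate(self) -> None:
        # Same condition and message as the check added in this PR.
        if self.balanced_sampling_across_ionmodes and self.ionisation_mode != "both":
            raise ValueError("Balanced sampling across ionmodes only works if you train on both ionmodes")


def is_valid(settings: MiniSettings) -> bool:
    """Return True if the settings pass validation, False if they raise."""
    try:
        settings.validate()
    except ValueError:
        return False
    return True


results = {
    "defaults": is_valid(MiniSettings()),
    "balanced, positive only": is_valid(
        MiniSettings(ionisation_mode="positive", balanced_sampling_across_ionmodes=True)),
    "balanced, both ionmodes": is_valid(
        MiniSettings(ionisation_mode="both", balanced_sampling_across_ionmodes=True)),
}
for name, ok in results.items():
    print(f"{name}: {'ok' if ok else 'rejected'}")
```

Only the combination of balanced sampling with a single ionmode is rejected; the defaults stay valid because balanced sampling is off by default.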

ms2deepscore/models/EmbeddingEvaluatorModel.py

@@ Expand Up / @@ -9,8 +9,7 @@ @@
     from ms2deepscore.models.helper_functions import initialize_device
     from ms2deepscore.models.io_utils import _settings_to_json
     from ms2deepscore.SettingsMS2Deepscore import SettingsEmbeddingEvaluator
-    from ms2deepscore.train_new_model.data_generators import \
-        DataGeneratorEmbeddingEvaluation
+    from ms2deepscore.train_new_model.DataGeneratorEmbeddingEvaluation import DataGeneratorEmbeddingEvaluation
     from ms2deepscore.models.__model_format__ import __model_format__
@@ Expand Down @@

ms2deepscore/train_new_model/DataGeneratorEmbeddingEvaluation.py

@@ -0,0 +1,127 @@
+    from typing import List
+    import numpy as np
+    import pandas as pd
+    import torch
+    from matchms import Spectrum
+    from matchms.similarity.vector_similarity_functions import jaccard_similarity_matrix
+    from ms2deepscore.SettingsMS2Deepscore import SettingsEmbeddingEvaluator
+    from ms2deepscore.tensorize_spectra import tensorize_spectra
+    from ms2deepscore.train_new_model.inchikey_pair_selection import compute_fingerprints_for_training
+    from ms2deepscore.vector_operations import cosine_similarity_matrix
+    class DataGeneratorEmbeddingEvaluation:
+        """Generates data for training an embedding evaluation model.
+        This class provides the data for training an embedding evaluation model.
+        It follows a simple strategy: iterate through all spectra and randomly pick another
+        spectrum for comparison. This does not compensate for the usually drastic biases
+        in Tanimoto similarity and is hence not meant for training the prediction of those
+        scores.
+        The purpose is rather to show a high number of spectra to a model to learn
+        embedding evaluations.
+        Spectra are sampled in groups of size batch_size. Before every epoch the indexes are
+        shuffled at random. For selected spectra the tanimoto scores, ms2deepscore scores and
+        embeddings are returned.
+        """
+        def __init__(self, spectrums: List[Spectrum],
+                     ms2ds_model,
+                     settings: SettingsEmbeddingEvaluator,
+                     device="cpu",
+                     ):
+            """
+            Parameters
+            ----------
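+            spectrums
+                List of matchms Spectrum objects.
+            ms2ds_model
+                The MS2DeepScore model used to compute the embeddings. It is moved to `device`.
+            settings
+                The available settings can be found in SettingsMS2Deepscore
+            device
+                The device the model and tensors are moved to. Default = "cpu"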
+            """
+            self.current_index = 0
+            self.settings = settings
+            self.spectrums = spectrums
+            self.inchikey14s = [s.get("inchikey")[:14] for s in spectrums]
+            self.ms2ds_model = ms2ds_model
+            self.device = device
+            self.ms2ds_model.to(self.device)
+            self.indexes = np.arange(len(self.spectrums))
+            self.batch_size = self.settings.evaluator_distribution_size
+            self.fingerprint_df = self.compute_fingerprint_dataframe(
+                self.spectrums,
+                fingerprint_type=self.ms2ds_model.model_settings.fingerprint_type,
+                fingerprint_nbits=self.ms2ds_model.model_settings.fingerprint_nbits
+                )
+            # Initialize random number generator
+            self.rng = np.random.default_rng(self.settings.random_seed)
+            self.on_epoch_end()
+        def __len__(self):
+            return int(np.floor(len(self.spectrums) / self.batch_size))
+        def __iter__(self):
+            return self
+        def __next__(self):
+            if self.current_index < self.__len__():
+                batch = self.__getitem__(self.current_index)
+                self.current_index += 1
+                return batch
+            self.current_index = 0  # make generator executable again
+            self.on_epoch_end()
+            raise StopIteration
+        def _compute_embeddings_and_scores(self, batch_index: int):
+            batch_size = self.batch_size
+            indexes = self.indexes[batch_index * batch_size:((batch_index + 1) * batch_size)]
+            spec_tensors, meta_tensors = tensorize_spectra([self.spectrums[i] for i in indexes],
+                                                           self.ms2ds_model.model_settings)
+            embeddings = self.ms2ds_model.encoder(spec_tensors.to(self.device), meta_tensors.to(self.device))
+            ms2ds_scores = cosine_similarity_matrix(embeddings.cpu().detach().numpy(), embeddings.cpu().detach().numpy())
+            # Compute true scores
+            inchikeys = [self.inchikey14s[i] for i in indexes]
+            fingerprints = self.fingerprint_df.loc[inchikeys].to_numpy()
+            tanimoto_scores = jaccard_similarity_matrix(fingerprints, fingerprints)
+            return torch.tensor(tanimoto_scores), torch.tensor(ms2ds_scores), embeddings.cpu().detach()
+        def on_epoch_end(self):
+            """Updates indexes after each epoch."""
+            self.rng.shuffle(self.indexes)
+        def __getitem__(self, batch_index: int):
+            """Generate one batch of data.
+            """
+            return self._compute_embeddings_and_scores(batch_index)
+        def compute_fingerprint_dataframe(self,
+                                          spectrums: List[Spectrum],
+                                          fingerprint_type,
+                                          fingerprint_nbits,
+                                          ) -> pd.DataFrame:
+            """Returns a dataframe with one fingerprint per row, indexed by the unique 14-character inchikeys
+            spectrums:
+                A list of spectra
+            fingerprint_type:
+                The fingerprint type, passed on to compute_fingerprints_for_training.
+            fingerprint_nbits:
+                The number of bits of the fingerprint, passed on to compute_fingerprints_for_training.
+            """
+            fingerprints, inchikeys14_unique = compute_fingerprints_for_training(
+                spectrums,
+                fingerprint_type,
+                fingerprint_nbits
+                )
+            fingerprints_df = pd.DataFrame(fingerprints, index=inchikeys14_unique)
+            return fingerprints_df
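
The iteration scheme of `DataGeneratorEmbeddingEvaluation` (fixed-size batches, an incomplete last batch dropped, indexes reshuffled after each epoch) can be sketched without any model or spectra. `ShuffledBatchIndexIterator` is a hypothetical helper that yields only the index batches; the real class additionally computes embeddings and scores per batch.

```python
import numpy as np


class ShuffledBatchIndexIterator:
    """Hypothetical helper yielding index batches, reshuffled after every epoch."""

    def __init__(self, n_items: int, batch_size: int, random_seed: int = 0):
        self.indexes = np.arange(n_items)
        self.batch_size = batch_size
        self.rng = np.random.default_rng(random_seed)
        self.current_index = 0
        self.rng.shuffle(self.indexes)

    def __len__(self) -> int:
        # An incomplete trailing batch is dropped.
        return len(self.indexes) // self.batch_size

    def __iter__(self):
        return self

    def __next__(self) -> np.ndarray:
        if self.current_index < len(self):
            start = self.current_index * self.batch_size
            self.current_index += 1
            # Copy, so a later in-place shuffle cannot change a batch already handed out.
            return self.indexes[start:start + self.batch_size].copy()
        # End of epoch: reset and reshuffle so the object can be iterated again.
        self.current_index = 0
        self.rng.shuffle(self.indexes)
        raise StopIteration


batches = ShuffledBatchIndexIterator(n_items=10, batch_size=3)
first_epoch = [batch.tolist() for batch in batches]
second_epoch = [batch.tolist() for batch in batches]
print(len(first_epoch), len(second_epoch))
```

Resetting the state right before raising `StopIteration` is what makes the same object usable in a second `for` loop, at the cost of one item (here index 9 of 10) being skipped per epoch when the data size is not a multiple of the batch size.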

ms2deepscore/train_new_model/SpectrumPairGenerator.py

@@ -0,0 +1,89 @@
+    import json
+    from collections import Counter
+    from typing import List, Tuple
+    import numpy as np
+    from matchms import Spectrum
+    class SpectrumPairGenerator:
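+        def __init__(self, selected_inchikey_pairs: List[Tuple[str, str, float]], spectra,
+                     shuffle: bool = True, random_seed: int = 0):
+            """
+            Parameters
+            ----------
+            selected_inchikey_pairs:
+                A list with tuples encoding inchikey pairs like: (inchikey1, inchikey2, tanimoto_score)
+            spectra:
+                The spectra from which a random spectrum is picked for each inchikey of a pair.
+            shuffle:
+                If True, the pairs are shuffled at initialization and after every full pass.
+            random_seed:
+                Seed for the random number generator used for shuffling and spectrum picking.
+            """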
+            self.selected_inchikey_pairs = selected_inchikey_pairs
+            self.spectra = spectra
+            self.spectrum_inchikeys = np.array([s.get("inchikey")[:14] for s in self.spectra])
+            self.shuffle = shuffle
+            self.random_nr_generator = np.random.default_rng(random_seed)
+            self._idx = 0
+            if self.shuffle:
+                self.random_nr_generator.shuffle(self.selected_inchikey_pairs)
+        def __iter__(self):
+            return self
+        def __next__(self):
+            # reshuffle when we've gone through everything
+            if self._idx >= len(self.selected_inchikey_pairs):
+                self._idx = 0
+                if self.shuffle:
+                    self.random_nr_generator.shuffle(self.selected_inchikey_pairs)
+            inchikey1, inchikey2, tanimoto_score = self.selected_inchikey_pairs[self._idx]
+            spectrum1 = self._get_spectrum_with_inchikey(inchikey1, self.random_nr_generator)
+            spectrum2 = self._get_spectrum_with_inchikey(inchikey2, self.random_nr_generator)
+            self._idx += 1
+            return spectrum1, spectrum2, tanimoto_score
+        def __len__(self):
+            return len(self.selected_inchikey_pairs)
+        def __str__(self):
+            return f"SpectrumPairGenerator with {len(self.selected_inchikey_pairs)} pairs available"
+        def get_scores(self):
+            return [score for _, _, score in self.selected_inchikey_pairs]
+        def get_inchikey_counts(self) -> Counter:
+            """returns the frequency each inchikey occurs"""
+            inchikeys = Counter()
+            for inchikey_1, inchikey_2, _ in self.selected_inchikey_pairs:
+                inchikeys[inchikey_1] += 1
+                inchikeys[inchikey_2] += 1
+            return inchikeys
+        def get_scores_per_inchikey(self):
+            inchikey_scores = {}
+            for inchikey_1, inchikey_2, score in self.selected_inchikey_pairs:
+                # setdefault also stores the score of the first pair seen for an inchikey
+                inchikey_scores.setdefault(inchikey_1, []).append(score)
+                inchikey_scores.setdefault(inchikey_2, []).append(score)
+            return inchikey_scores
+        def save_as_json(self, file_name):
+            data_for_json = [(item[0], item[1], float(item[2])) for item in self.selected_inchikey_pairs]
+            with open(file_name, "w", encoding="utf-8") as f:
+                json.dump(data_for_json, f)
+        def _get_spectrum_with_inchikey(self, inchikey: str, random_number_generator) -> Spectrum:
+            """
+            Get a random spectrum matching the `inchikey` argument.
+            NB: A compound (identified by an inchikey) can have multiple measured
+            spectra in a binned spectrum dataset.
+            """
+            matching_spectrum_id = np.where(self.spectrum_inchikeys == inchikey)[0]
+            if len(matching_spectrum_id) <= 0:
+                raise ValueError("No matching inchikey found (note: expected first 14 characters)")
+            return self.spectra[random_number_generator.choice(matching_spectrum_id)]
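
Because `__next__` in `SpectrumPairGenerator` wraps around instead of raising `StopIteration`, the iterator is endless and a caller has to decide how many pairs to draw. The sketch below shows this with a simplified stand-in: `ToyPairGenerator` is hypothetical, uses plain dicts instead of matchms `Spectrum` objects, and leaves out shuffling. The inchikeys and `spectrum_id` values are made up.

```python
import itertools

import numpy as np


class ToyPairGenerator:
    """Hypothetical, simplified stand-in for SpectrumPairGenerator."""

    def __init__(self, inchikey_pairs, spectra, random_seed=0):
        self.inchikey_pairs = list(inchikey_pairs)
        self.spectra = spectra
        # Compare on the first 14 characters of the inchikey only.
        self.spectrum_inchikeys = np.array([s["inchikey"][:14] for s in spectra])
        self.rng = np.random.default_rng(random_seed)
        self._idx = 0

    def __iter__(self):
        return self

    def __next__(self):
        # Wrap around instead of raising StopIteration: the iterator never ends.
        if self._idx >= len(self.inchikey_pairs):
            self._idx = 0
        inchikey1, inchikey2, score = self.inchikey_pairs[self._idx]
        self._idx += 1
        return self._pick(inchikey1), self._pick(inchikey2), score

    def _pick(self, inchikey):
        """Pick a random spectrum among all spectra of the given compound."""
        matching = np.where(self.spectrum_inchikeys == inchikey)[0]
        if len(matching) == 0:
            raise ValueError("No matching inchikey found (note: expected first 14 characters)")
        return self.spectra[self.rng.choice(matching)]


spectra = [
    {"inchikey": "AAAAAAAAAAAAAA-XXXXXXXXXX-N", "spectrum_id": "a1"},
    {"inchikey": "AAAAAAAAAAAAAA-XXXXXXXXXX-N", "spectrum_id": "a2"},
    {"inchikey": "BBBBBBBBBBBBBB-XXXXXXXXXX-N", "spectrum_id": "b1"},
]
pairs = [("AAAAAAAAAAAAAA", "BBBBBBBBBBBBBB", 0.3),
         ("AAAAAAAAAAAAAA", "AAAAAAAAAAAAAA", 1.0)]

generator = ToyPairGenerator(pairs, spectra)
# islice bounds the endless iterator; a bare for-loop would never terminate.
drawn = list(itertools.islice(generator, 5))
scores = [score for _, _, score in drawn]
print(scores)
```

Since `len()` still reports the number of pairs, a consumer could also use `itertools.islice(generator, len(generator))` to draw exactly one pass.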
