Merged
49 commits
83e084b
Add split by ionmode to utils.py
niekdejonge Aug 20, 2025
efd56c8
Make inchikey pair selection clearer
niekdejonge Aug 20, 2025
31c2fd3
Create inchikey_pair_selection_cross_ionmode.py
niekdejonge Aug 20, 2025
46fb9c4
Add first test for select_compound_pairs_wrapper_with_resampling_acro…
niekdejonge Aug 20, 2025
528b0e5
Move InchikeyPairGenerator to separate file
niekdejonge Aug 21, 2025
0aaa092
Move DataGeneratorEmbeddingEvaluation to separate file
niekdejonge Aug 21, 2025
f5ff5e2
Rename SpectrumPairGenerator.py
niekdejonge Aug 21, 2025
2ae0dc3
Fix InchikeyPairGenerator import
niekdejonge Aug 21, 2025
1ed2816
Factor out data augmentation from SpectrumPairGenerator
niekdejonge Aug 21, 2025
5ea6027
Fix bug in peak_removal_for_data_augmentation, picking with replacing
niekdejonge Aug 21, 2025
7a3c9b2
Add test for peak_removal_for_data_augmentation
niekdejonge Aug 21, 2025
2b61290
Add docstring to peak_addition_for_data_augmentation
niekdejonge Aug 21, 2025
187df44
Add test for peak_addition_for_data_augmentation
niekdejonge Aug 21, 2025
d71392e
Make change_peak_intensity in place
niekdejonge Aug 21, 2025
e01cd1f
Add basic check for change_peak_intensity_for_data_augmentation
niekdejonge Aug 21, 2025
2b17678
Remove unnecessary imports
niekdejonge Aug 21, 2025
4101843
Remove unnecessary imports
niekdejonge Aug 21, 2025
64cef6b
Add some typehinting
niekdejonge Aug 21, 2025
112674d
Include Spectrum selection in InchikeyPairGenerator
niekdejonge Aug 22, 2025
d73e746
Update tests to handle new InchikeyPairGenerator
niekdejonge Aug 22, 2025
b3d2a18
Rename SpectrumPairGenerator to TrainingBatchGenerator.py
niekdejonge Aug 22, 2025
9c39a44
Rename SpectrumPairGenerator to TrainingBatchGenerator.py
niekdejonge Aug 22, 2025
417e890
Make SpectrumPairGenerator a real generator
niekdejonge Aug 22, 2025
1eaae23
Rename self.spectrum_pair_generator in TrainingBatchGenerator
niekdejonge Aug 22, 2025
b514b11
Move create data generator to train_ms2deepscore.py
niekdejonge Aug 22, 2025
ad9425d
Directly return a SpectrumPairGenerator instead of list of pairs from…
niekdejonge Aug 22, 2025
b158ecf
Remove option for saving the inchikey pairs when training model
niekdejonge Aug 22, 2025
6061441
Remove create_data_generator function
niekdejonge Aug 22, 2025
7cb6d31
Remove cross ionmode function from train_ms2ds_model
niekdejonge Aug 22, 2025
9df6593
Derive nr_of_unique inchikeys from the nr of pairs
niekdejonge Aug 22, 2025
2cfbe1a
Fix test to calculate unique number of inchikeys correctly again
niekdejonge Aug 22, 2025
aeea0af
Add cross ionization mode generators
niekdejonge Aug 22, 2025
555f0d6
Fix the order of pairs in convert_to_selected_pairs_list, so pos is a…
niekdejonge Aug 22, 2025
df50155
Change test training wrapper function to both ionization modes
niekdejonge Aug 22, 2025
7fc9c24
Make train ms2deepscore handle both and single ion mode model training
niekdejonge Aug 22, 2025
0cc1cb8
Fix bug in variable naming
niekdejonge Aug 25, 2025
301bdba
added basic tests for spectrum pair generation across ionmodes.
niekdejonge Aug 27, 2025
4cd43ef
Add SpectrumPairGenerator to init
niekdejonge Aug 27, 2025
b17f82b
Remove unused import
niekdejonge Aug 27, 2025
ded5a1f
Remove duplicated test
niekdejonge Aug 27, 2025
608f292
Linting
niekdejonge Aug 27, 2025
38a85a3
Move create_data_generator_across_ionmodes to top of file
niekdejonge Aug 27, 2025
92d0470
Update CHANGELOG.md
niekdejonge Aug 27, 2025
c281a7d
Update link to zenodo for model to always point to the latest version
niekdejonge Aug 27, 2025
5efaa23
Merge branch 'main' into balance_across_ionmodes
niekdejonge Jan 26, 2026
e99a31e
Change select_compound_pairs_wrapper to create_spectrum_pair_generato…
niekdejonge Jan 27, 2026
4417ee8
Add balanced_sampling_across_ionmodes setting, to have the default us…
niekdejonge Jan 27, 2026
a77c84d
Update pair sampling tutorial to match changes made to the sampling a…
niekdejonge Jan 27, 2026
cfc1921
Add Compare balanced cross ion mode sampling.ipynb
niekdejonge Jan 27, 2026
9 changes: 9 additions & 0 deletions CHANGELOG.md
@@ -6,11 +6,20 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]
### Added
- The training pair sampling for both ionmodes is now balanced over the different ionmode pairs.

### Fixed
- The data split into train, validation and test sets is no longer done separately per ionmode.

### Changed
- Settings now include the file name of the spectra. This makes it easier to track runs and gives more flexibility for the results folder.
- Split the different data generators into separate files; before, they were all in data_generators.py.
- Renamed SpectrumPairGenerator -> TrainingBatchGenerator, which better captures what the class does.
- Moved the data augmentation out of the TrainingBatchGenerator into a separate file.
- Refactored the data augmentation to make it more modular and testable (also added extra tests).
- Moved the spectrum picking from TrainingBatchGenerator into InchikeyPairGenerator and renamed InchikeyPairGenerator to SpectrumPairGenerator.
- Turned the new SpectrumPairGenerator (InchikeyPairGenerator before) into a real generator; before, it had a generator method returning a generator.

## [2.6.0] - 2025-12-05
### Changed
1 change: 1 addition & 0 deletions README.md
@@ -65,6 +65,7 @@ Alternatively there are some example scripts below.

## 1) Compute spectral similarities
We provide a model which was trained on > 500,000 MS/MS combined spectra from [GNPS](https://gnps.ucsd.edu/), [Mona](https://mona.fiehnlab.ucdavis.edu/), MassBank and MSnLib.

This model can be downloaded [from Zenodo here](https://zenodo.org/records/17826815). Only the ms2deepscore_model.pt file is needed.
The model works for spectra in both positive and negative ionization modes and even predictions across ionization modes can be made by this model.

9 changes: 9 additions & 0 deletions ms2deepscore/SettingsMS2Deepscore.py
@@ -87,6 +87,12 @@ class SettingsMS2Deepscore:
        The in between layers to be used. Default = (2000, 2000, 2000)
    embedding_dim:
        The dimension of the final embedding. Default = 400
    ionisation_mode:
        The ionisation mode that is used for training the model. Default = "positive"
    balanced_sampling_across_ionmodes:
        If True, pair sampling for training is done separately for each ionmode, which gives a better balance
        over the ionmodes. Only works if ionisation_mode is "both". Default = False. Initial results showed a decrease
        in pos-pos prediction accuracy, see the notebook model_benchmarking/Compare balanced cross ion mode sampling.ipynb
    additional_metadata:
        Additional metadata that should be used in training the model. e.g. precursor_mz
    dropout_rate:
@@ -184,6 +190,7 @@ def __init__(self, validate_settings=True, **settings):
        self.embedding_dim = 500
        self.ionisation_mode = "positive"
        self.activation_function = "relu"
        self.balanced_sampling_across_ionmodes = False

        # additional model structure options
        self.train_binning_layer: bool = False
@@ -295,6 +302,8 @@ def validate_settings(self):
        if self.loss_function.lower() not in LOSS_FUNCTIONS:
            raise ValueError(f"Unknown loss function. Must be one of: {LOSS_FUNCTIONS.keys()}")
        validate_bin_order(self.same_prob_bins)
        if self.balanced_sampling_across_ionmodes and self.ionisation_mode != "both":
            raise ValueError("Balanced sampling across ionmodes only works if you train on both ionmodes")

    def create_model_directory_name(self):
        """Creates a directory name using metadata, it will contain the metadata, the binned spectra and final model"""
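
The new check in validate_settings can be tried in isolation. The sketch below uses a hypothetical `MiniSettings` class (not part of ms2deepscore) that carries only the two fields involved, with the defaults shown in this diff. How `**settings` overrides are applied here is an assumption made for illustration.

```python
class MiniSettings:
    """Hypothetical stand-in for SettingsMS2Deepscore with only two fields."""

    def __init__(self, validate_settings=True, **settings):
        # Defaults as shown in the diff above.
        self.ionisation_mode = "positive"
        self.balanced_sampling_across_ionmodes = False
        # Assumption: keyword arguments simply overwrite known attributes.
        for key, value in settings.items():
            if not hasattr(self, key):
                raise ValueError(f"Unknown setting: {key}")
            setattr(self, key, value)
        if validate_settings:
            self.validate_settings()

    def validate_settings(self):
        # Same rule as the two lines added in this diff.
        if self.balanced_sampling_across_ionmodes and self.ionisation_mode != "both":
            raise ValueError("Balanced sampling across ionmodes only works if you train on both ionmodes")


# Accepted: defaults, and balanced sampling combined with both ionmodes.
MiniSettings()
MiniSettings(ionisation_mode="both", balanced_sampling_across_ionmodes=True)

# Rejected: balanced sampling while training on a single ionmode.
try:
    MiniSettings(balanced_sampling_across_ionmodes=True)
except ValueError as error:
    print(error)
```

Because the default ionisation_mode is "positive", switching on balanced sampling without also setting ionisation_mode to "both" fails at construction time rather than later during pair sampling.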
3 changes: 1 addition & 2 deletions ms2deepscore/models/EmbeddingEvaluatorModel.py
@@ -9,8 +9,7 @@
 from ms2deepscore.models.helper_functions import initialize_device
 from ms2deepscore.models.io_utils import _settings_to_json
 from ms2deepscore.SettingsMS2Deepscore import SettingsEmbeddingEvaluator
-from ms2deepscore.train_new_model.data_generators import \
-    DataGeneratorEmbeddingEvaluation
+from ms2deepscore.train_new_model.DataGeneratorEmbeddingEvaluation import DataGeneratorEmbeddingEvaluation
 from ms2deepscore.models.__model_format__ import __model_format__


127 changes: 127 additions & 0 deletions ms2deepscore/train_new_model/DataGeneratorEmbeddingEvaluation.py
@@ -0,0 +1,127 @@
from typing import List

import numpy as np
import pandas as pd
import torch
from matchms import Spectrum
from matchms.similarity.vector_similarity_functions import jaccard_similarity_matrix

from ms2deepscore.SettingsMS2Deepscore import SettingsEmbeddingEvaluator
from ms2deepscore.tensorize_spectra import tensorize_spectra
from ms2deepscore.train_new_model.inchikey_pair_selection import compute_fingerprints_for_training
from ms2deepscore.vector_operations import cosine_similarity_matrix


class DataGeneratorEmbeddingEvaluation:
    """Generates data for training an embedding evaluation model.

    This class provides data for the training of an embedding evaluation model.
    It follows a simple strategy: iterate through all spectra and randomly pick another
    spectrum for comparison. This will not compensate for the usually drastic biases
    in Tanimoto similarity and is hence not meant for training the prediction of those
    scores.
    The purpose is rather to show a high number of spectra to a model to learn
    embedding evaluations.

    Spectra are sampled in groups of size batch_size. Before every epoch the indexes are
    shuffled at random. For selected spectra the Tanimoto scores, MS2DeepScore scores and
    embeddings are returned.
    """

    def __init__(self, spectrums: List[Spectrum],
                 ms2ds_model,
                 settings: SettingsEmbeddingEvaluator,
                 device="cpu",
                 ):
        """

        Parameters
        ----------
        spectrums
            List of matchms Spectrum objects.
        settings
            The available settings can be found in SettingsMS2Deepscore
        """
        self.current_index = 0
        self.settings = settings
        self.spectrums = spectrums
        self.inchikey14s = [s.get("inchikey")[:14] for s in spectrums]
        self.ms2ds_model = ms2ds_model
        self.device = device
        self.ms2ds_model.to(self.device)
        self.indexes = np.arange(len(self.spectrums))
        self.batch_size = self.settings.evaluator_distribution_size
        self.fingerprint_df = self.compute_fingerprint_dataframe(
            self.spectrums,
            fingerprint_type=self.ms2ds_model.model_settings.fingerprint_type,
            fingerprint_nbits=self.ms2ds_model.model_settings.fingerprint_nbits
        )

        # Initialize random number generator
        self.rng = np.random.default_rng(self.settings.random_seed)

        self.on_epoch_end()

    def __len__(self):
        return int(np.floor(len(self.spectrums) / self.batch_size))

    def __iter__(self):
        return self

    def __next__(self):
        if self.current_index < self.__len__():
            batch = self.__getitem__(self.current_index)
            self.current_index += 1
            return batch
        self.current_index = 0  # make generator executable again
        self.on_epoch_end()
        raise StopIteration

    def _compute_embeddings_and_scores(self, batch_index: int):
        batch_size = self.batch_size
        indexes = self.indexes[batch_index * batch_size:((batch_index + 1) * batch_size)]

        spec_tensors, meta_tensors = tensorize_spectra([self.spectrums[i] for i in indexes],
                                                       self.ms2ds_model.model_settings)
        embeddings = self.ms2ds_model.encoder(spec_tensors.to(self.device), meta_tensors.to(self.device))

        ms2ds_scores = cosine_similarity_matrix(embeddings.cpu().detach().numpy(), embeddings.cpu().detach().numpy())

        # Compute true scores
        inchikeys = [self.inchikey14s[i] for i in indexes]
        fingerprints = self.fingerprint_df.loc[inchikeys].to_numpy()

        tanimoto_scores = jaccard_similarity_matrix(fingerprints, fingerprints)

        return torch.tensor(tanimoto_scores), torch.tensor(ms2ds_scores), embeddings.cpu().detach()

    def on_epoch_end(self):
        """Updates indexes after each epoch."""
        self.rng.shuffle(self.indexes)

    def __getitem__(self, batch_index: int):
        """Generate one batch of data.
        """
        return self._compute_embeddings_and_scores(batch_index)

    def compute_fingerprint_dataframe(self,
                                      spectrums: List[Spectrum],
                                      fingerprint_type,
                                      fingerprint_nbits,
                                      ) -> pd.DataFrame:
        """Returns a dataframe of fingerprints, indexed by the unique 14-character inchikeys.

        spectrums:
            A list of spectra
        fingerprint_type, fingerprint_nbits:
            The fingerprint type and the number of bits per fingerprint. Both are passed on
            to compute_fingerprints_for_training.
        """
        fingerprints, inchikeys14_unique = compute_fingerprints_for_training(
            spectrums,
            fingerprint_type,
            fingerprint_nbits
        )

        fingerprints_df = pd.DataFrame(fingerprints, index=inchikeys14_unique)
        return fingerprints_df
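
The epoch handling in this class (shuffle, hand out full batches, then reset and reshuffle when StopIteration is raised) can be seen without a model or spectra. The sketch below is a hypothetical `ShuffledBatches` class that returns index batches only. The names and the plain-index output are assumptions for illustration; it does not compute embeddings or scores.

```python
import numpy as np


class ShuffledBatches:
    """Hypothetical stand-in: yields batches of indexes instead of scores and embeddings."""

    def __init__(self, n_items: int, batch_size: int, random_seed: int = 0):
        self.indexes = np.arange(n_items)
        self.batch_size = batch_size
        self.rng = np.random.default_rng(random_seed)
        self.current_index = 0
        self.rng.shuffle(self.indexes)

    def __len__(self):
        # Floor division: items that do not fill a complete batch are skipped this epoch.
        return len(self.indexes) // self.batch_size

    def __iter__(self):
        return self

    def __next__(self):
        if self.current_index < len(self):
            start = self.current_index * self.batch_size
            self.current_index += 1
            # Copy, so a later in-place shuffle does not change a batch already handed out.
            return self.indexes[start:start + self.batch_size].copy()
        # Reset and reshuffle so the same object can be iterated for another epoch.
        self.current_index = 0
        self.rng.shuffle(self.indexes)
        raise StopIteration


batches = ShuffledBatches(n_items=10, batch_size=4)
first_epoch = list(batches)   # two batches of four indexes
second_epoch = list(batches)  # works again because __next__ reset the state
print(len(first_epoch), len(second_epoch))
```

With 10 items and a batch size of 4, each epoch yields 2 batches and 2 items are left out. Which items are left out changes per epoch because the indexes are reshuffled.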
89 changes: 89 additions & 0 deletions ms2deepscore/train_new_model/SpectrumPairGenerator.py
@@ -0,0 +1,89 @@
import json
from collections import Counter
from typing import List, Tuple

import numpy as np
from matchms import Spectrum


class SpectrumPairGenerator:
    def __init__(self, selected_inchikey_pairs: List[Tuple[str, str, float]], spectra,
                 shuffle: bool = True, random_seed: int = 0):
        """
        Parameters
        ----------
        selected_inchikey_pairs:
            A list with tuples encoding inchikey pairs like: (inchikey1, inchikey2, tanimoto_score)
        """
        self.selected_inchikey_pairs = selected_inchikey_pairs
        self.spectra = spectra
        self.spectrum_inchikeys = np.array([s.get("inchikey")[:14] for s in self.spectra])
        self.shuffle = shuffle
        self.random_nr_generator = np.random.default_rng(random_seed)
        self._idx = 0
        if self.shuffle:
            self.random_nr_generator.shuffle(self.selected_inchikey_pairs)  # shuffles the given list in place

    def __iter__(self):
        return self

    def __next__(self):
        # reshuffle when we've gone through everything (this generator never raises StopIteration)
        if self._idx >= len(self.selected_inchikey_pairs):
            self._idx = 0
            if self.shuffle:
                self.random_nr_generator.shuffle(self.selected_inchikey_pairs)

        inchikey1, inchikey2, tanimoto_score = self.selected_inchikey_pairs[self._idx]
        spectrum1 = self._get_spectrum_with_inchikey(inchikey1, self.random_nr_generator)
        spectrum2 = self._get_spectrum_with_inchikey(inchikey2, self.random_nr_generator)
        self._idx += 1
        return spectrum1, spectrum2, tanimoto_score

    def __len__(self):
        return len(self.selected_inchikey_pairs)

    def __str__(self):
        return f"SpectrumPairGenerator with {len(self.selected_inchikey_pairs)} pairs available"

    def get_scores(self):
        return [score for _, _, score in self.selected_inchikey_pairs]

    def get_inchikey_counts(self) -> Counter:
        """Returns how often each inchikey occurs in the selected pairs."""
        inchikeys = Counter()
        for inchikey_1, inchikey_2, _ in self.selected_inchikey_pairs:
            inchikeys[inchikey_1] += 1
            inchikeys[inchikey_2] += 1
        return inchikeys

    def get_scores_per_inchikey(self):
        inchikey_scores = {}
        for inchikey_1, inchikey_2, score in self.selected_inchikey_pairs:
            if inchikey_1 in inchikey_scores:
                inchikey_scores[inchikey_1].append(score)
            else:
                inchikey_scores[inchikey_1] = [score]  # keep the score of the first pair as well
            if inchikey_2 in inchikey_scores:
                inchikey_scores[inchikey_2].append(score)
            else:
                inchikey_scores[inchikey_2] = [score]  # keep the score of the first pair as well
        return inchikey_scores

    def save_as_json(self, file_name):
        data_for_json = [(item[0], item[1], float(item[2])) for item in self.selected_inchikey_pairs]

        with open(file_name, "w", encoding="utf-8") as f:
            json.dump(data_for_json, f)

    def _get_spectrum_with_inchikey(self, inchikey: str, random_number_generator) -> Spectrum:
        """
        Get a random spectrum matching the `inchikey` argument.

        NB: A compound (identified by an
        inchikey) can have multiple measured spectra in a dataset.
        """
        matching_spectrum_id = np.where(self.spectrum_inchikeys == inchikey)[0]
        if len(matching_spectrum_id) <= 0:
            raise ValueError("No matching inchikey found (note: expected first 14 characters)")
        return self.spectra[random_number_generator.choice(matching_spectrum_id)]
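
Since `__next__` in this class wraps around instead of raising StopIteration, a plain `for` loop over the generator would not terminate, so callers need to bound the number of draws themselves. The sketch below shows one way to do that with `itertools.islice`. `EndlessPairSampler` is a hypothetical, condensed stand-in for the class above, and plain dicts replace matchms Spectrum objects (both offer `.get("inchikey")`); the inchikeys and scores are made-up illustration values.

```python
from itertools import islice

import numpy as np


class EndlessPairSampler:
    """Hypothetical, condensed stand-in for the SpectrumPairGenerator above."""

    def __init__(self, pairs, spectra, random_seed=0):
        self.pairs = list(pairs)
        self.spectra = spectra
        self.inchikeys = np.array([s.get("inchikey")[:14] for s in spectra])
        self.rng = np.random.default_rng(random_seed)
        self._idx = 0
        self.rng.shuffle(self.pairs)

    def __iter__(self):
        return self

    def __next__(self):
        # Wrap around and reshuffle instead of raising StopIteration.
        if self._idx >= len(self.pairs):
            self._idx = 0
            self.rng.shuffle(self.pairs)
        key1, key2, score = self.pairs[self._idx]
        self._idx += 1
        return self._pick(key1), self._pick(key2), score

    def _pick(self, inchikey):
        # A compound can have several spectra; pick one of them at random.
        matches = np.where(self.inchikeys == inchikey)[0]
        if len(matches) == 0:
            raise ValueError("No matching inchikey found")
        return self.spectra[self.rng.choice(matches)]


# Made-up example data: two spectra for compound A, one for compound B.
spectra = [
    {"inchikey": "AAAAAAAAAAAAAA-UHFFFAOYSA-N", "id": 0},
    {"inchikey": "AAAAAAAAAAAAAA-UHFFFAOYSA-N", "id": 1},
    {"inchikey": "BBBBBBBBBBBBBB-UHFFFAOYSA-N", "id": 2},
]
pairs = [
    ("AAAAAAAAAAAAAA", "BBBBBBBBBBBBBB", 0.3),
    ("AAAAAAAAAAAAAA", "AAAAAAAAAAAAAA", 1.0),
]

sampler = EndlessPairSampler(pairs, spectra)
# Draw five pairs although only two inchikey pairs exist; the sampler wraps around.
drawn = list(islice(sampler, 5))
print(len(drawn))
```

Bounding the draws on the caller side keeps the generator itself stateless with respect to epochs, which fits a batch generator that decides how many pairs go into each batch.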