This repository contains the implementations created in my master's thesis, titled "Mining biochemistry in metabolomics MS/MS data".

This repository contains the R code created as part of my master's thesis. The "R"-folder contains the final implementations, while one of the datasets used is stored in the "data"-folder. This dataset was created by simulating structures of modified compounds and predicting their in-silico spectra with CFM-ID predict (https://doi.org/10.1007/s11306-014-0676-4 , https://doi.org/10.1093/nar/gkac383). The folder "scrips_results" contains all additional R-scripts created to produce the results displayed and discussed in the thesis, as well as some additional figures.

This README.md will only present the R-scripts within the "R"-folder. The process of the data creation is explained in data/Data_creation.md.

data_cleaning.R

This file contains a method to clean spectral data stored in Spectra-objects by removing peaks with an intensity below a defined threshold and spectra with fewer than a defined number of peaks. If a spectrum contains more peaks than a given number, the peaks with the lowest intensity are removed. The file additionally contains a method to normalize a spectrum by dividing its intensities by the highest intensity in the spectrum.
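The cleaning steps described above can be illustrated with a minimal base-R sketch on a plain peak list. The function name `clean_peaks()`, its parameters and the order of the steps (normalize, threshold, then cap the peak count) are assumptions made for illustration and are not taken from data_cleaning.R, which operates on Spectra-objects.

```r
# Hypothetical helper (not the repository's function): cleans one peak list.
clean_peaks <- function(mz, intensity, min_intensity = 0.01, max_peaks = 100) {
  stopifnot(length(mz) == length(intensity))
  if (length(intensity) == 0) {
    return(data.frame(mz = numeric(), intensity = numeric()))
  }
  # Normalize by the highest intensity so that the base peak equals 1.
  rel <- intensity / max(intensity)
  # Remove peaks below the (here: relative) intensity threshold.
  keep <- rel >= min_intensity
  mz <- mz[keep]
  rel <- rel[keep]
  # If too many peaks remain, drop the least intense ones.
  if (length(rel) > max_peaks) {
    top <- sort(order(rel, decreasing = TRUE)[seq_len(max_peaks)])
    mz <- mz[top]
    rel <- rel[top]
  }
  data.frame(mz = mz, intensity = rel)
}

peaks <- clean_peaks(
  mz = c(50.1, 75.2, 100.3, 150.4),
  intensity = c(5, 200, 1000, 40),
  min_intensity = 0.01,
  max_peaks = 2
)
print(peaks)
```

A caller could then discard a spectrum entirely when `nrow(peaks)` falls below a chosen minimum number of peaks.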

data_prefiltering.R

This file uses and extends some existing Spectra methods to

filter the dataset for spectra with a specific precursor mass and shifted precursor mass using the method Spectra::filterPrecursorMzValues()
filter the dataset for spectra with specific peaks or shifted peaks using the method Spectra::containsMz()
filter the dataset for spectra with specific neutral losses using the method Spectra::containsNeutralLoss(). The returned spectra either contain one or all of the given neutral losses.
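The "one or all" logic for neutral losses can be sketched in base R on plain numeric vectors. The helper `has_neutral_loss()` and its arguments are hypothetical and only mirror the idea; the actual script works on Spectra-objects and builds on Spectra::containsNeutralLoss().

```r
# Hypothetical helper: does a peak list show the given neutral losses?
has_neutral_loss <- function(mz, precursor_mz, losses,
                             tolerance = 0.005, which = c("any", "all")) {
  which <- match.arg(which)
  # A neutral loss is observed if a fragment sits at precursor m/z minus loss.
  expected <- precursor_mz - losses
  found <- vapply(
    expected,
    function(x) any(abs(mz - x) <= tolerance),
    logical(1)
  )
  if (which == "any") any(found) else all(found)
}

mz <- c(85.03, 163.06, 283.17, 301.18)
losses <- c(water = 18.0106, carbon_dioxide = 43.9898)

# The water loss is present (283.17), the CO2 loss is not.
any_loss <- has_neutral_loss(mz, 301.18, losses, which = "any")
all_losses <- has_neutral_loss(mz, 301.18, losses, which = "all")
print(c(any = any_loss, all = all_losses))
```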

plotSpectra.R

There are three different possibilities to plot Spectra objects using Spectra. This R-script was created to combine attributes of the Spectra::plotSpectra() and Spectra::plotSpectraMirror() methods, to display two or four spectra within one figure. The intensities of the spectra on the bottom are mirrored, while the spectra on top are displayed normally. The m/z range is automatically set for both spectra to align peaks with an identical m/z value. A figure created with these methods can be found in the "data"-folder. The peak colors and the lines between matching peaks were added manually afterwards.

get_modifications.R

This script can be used to create DataFrames containing possible modifications. Two sets of expected modifications are given. Both contain hydroxylation, oxidation, hydroxylation+oxidation, loss of H2 and glycosylation; the second set additionally contains methylation, ethylation and acetylation. Combinations of these modifications are created within the given methods. The returned DataFrame contains a description of the modifications in the first column, the total mass of the combination in the second and the masses of the individual modifications in the remaining columns.

The method possible_modifications() creates all combinations of 1 to n of the given modifications and, by default, only returns unique combinations. A duplicate in this case is a pair of rows that contain the same modifications in a different order, for example hydroxylation and oxidation in one row and oxidation and hydroxylation in another.
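A minimal sketch of building order-independent combinations is shown below. It is not the repository's possible_modifications(): the function name `combine_modifications()`, the choice of `combn()` (which never repeats a modification within one combination) and the monoisotopic mass shifts are assumptions made for illustration, and the per-modification mass columns of the real DataFrame are omitted.

```r
# Illustrative mass shifts in Da (assumed values: +O, +CH2, +C2H2O).
mods <- c(hydroxylation = 15.9949, methylation = 14.0157, acetylation = 42.0106)

# Hypothetical helper: all combinations of 1 to n modifications.
combine_modifications <- function(mods, n = length(mods)) {
  rows <- list()
  for (k in seq_len(min(n, length(mods)))) {
    # combn() returns each unordered set once, so no duplicates arise.
    for (s in combn(names(mods), k, simplify = FALSE)) {
      rows[[length(rows) + 1]] <- data.frame(
        description = paste(s, collapse = "+"),
        total_mass = sum(mods[s]),
        stringsAsFactors = FALSE
      )
    }
  }
  do.call(rbind, rows)
}

combos <- combine_modifications(mods, n = 2)
print(combos)
```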

The method set_modifications() creates a list of commonly observed modifications in diterpenes with an abietane backbone as described by Frey et al. (https://doi.org/10.1016/j.ymben.2024.02.006). Hydroxylations were additionally added, but their co-occurrence with hydroxylations and oxidations was excluded. The loss of H2 can optionally be included in the list of possible modifications.

calc_mMod_cossim.R

This script contains the main implementations of this thesis. The multiple modified cosine score was created as part of my thesis to extend the modified cosine and overcome its known limitation. The modified cosine score takes the precursor mass difference into account when matching peaks, which can result in higher spectral similarities when comparing non-identical but related spectra. However, the precursor mass difference may correspond to multiple modifications, so that peaks are shifted by the mass of individual modifications. The modified cosine cannot capture this. The multiple modified cosine is applied if the precursor mass difference is equal to the mass of possible modifications; a list of such possible modifications can be created using the get_modifications.R-script. In addition to matching peaks with identical m/z values and peaks that are shifted by the precursor mass difference, the multiple modified cosine score matches peaks that are shifted by the mass of individual modifications or of combinations of multiple modifications.
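The difference between the modified cosine and the multiple modified cosine can be illustrated by listing candidate peak pairs for a set of allowed mass shifts. The helper `match_with_shifts()`, the toy peak lists and the fixed absolute tolerance are hypothetical; this sketch only shows the matching idea, not the implementation in calc_mMod_cossim.R.

```r
# Hypothetical helper: find peak pairs (a, b) with mz_a - mz_b equal to a shift.
match_with_shifts <- function(mz_a, mz_b, shifts, tolerance = 0.001) {
  pairs <- expand.grid(
    i = seq_along(mz_a),
    j = seq_along(mz_b),
    s = seq_along(shifts)
  )
  delta <- mz_a[pairs$i] - mz_b[pairs$j] - shifts[pairs$s]
  hit <- abs(delta) <= tolerance
  data.frame(
    a = pairs$i[hit],
    b = pairs$j[hit],
    shift = names(shifts)[pairs$s[hit]],
    stringsAsFactors = FALSE
  )
}

# Spectrum a is the modified compound (higher precursor mass), b the original.
mz_b <- c(100.0000, 150.0000, 200.0000)
mz_a <- c(100.0000, 165.9949, 230.0106)

# Assumed modifications: hydroxylation + methylation = precursor difference.
shifts <- c(none = 0, hydroxylation = 15.9949, methylation = 14.0157,
            precursor = 30.0106)

hits <- match_with_shifts(mz_a, mz_b, shifts)
print(hits)
```

With only the "none" and "precursor" shifts, as in the modified cosine, the second peak pair would stay unmatched; allowing the shift of the individual hydroxylation recovers it.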

The main method of the calc_mMod_cossim.R-script is calc_cossim(), which can be used to calculate the cosine, modified cosine or multiple modified cosine score for two spectra stored in individual Spectra-objects. The method additionally requires the relative and absolute mass tolerance, the score to calculate ("PlainCosine" for the cosine, "ModCosine" for the modified cosine or "mModCosine" for the multiple modified cosine) and the method to use when creating the final pairs of peaks (matches can be "single", "multiple" or "max_intensity"). When calculating the multiple modified cosine score, a list containing all possible modifications is additionally required. The optional parameter incl_prec is explained below. For each m/z value, calc_cossim() uses the maximum of the absolute and the relative mass tolerance; the defaults are 5 ppm for the relative and 0.001 Da for the absolute mass tolerance.
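The tolerance rule can be written down in one line. The helper name `mz_tolerance()` is hypothetical; the defaults are the ones stated above.

```r
# Hypothetical helper: effective tolerance in Da for each m/z value,
# taken as the larger of the absolute and the relative (ppm) tolerance.
mz_tolerance <- function(mz, ppm = 5, tolerance = 0.001) {
  pmax(tolerance, mz * ppm * 1e-6)
}

tol <- mz_tolerance(c(100, 500))
print(tol)
```

With these defaults the absolute tolerance dominates below roughly m/z 200 and the relative tolerance above it.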

The final spectral similarity score is calculated using the following equation:
$\frac{\sum_i \sqrt{a_i} \times \sqrt{b_i}}{\sqrt{a_{total}} \times \sqrt{b_{total}}}$.
$a_{total}$ and $b_{total}$ correspond to the total sum of intensities in the two spectra, while $a_i$ and $b_i$ describe the intensity of the matched peaks in the i-th pair.
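As a small worked example of this equation, the following sketch computes the score from the intensities of the matched pairs and the intensities of all peaks. The function name `similarity_score()` and the toy intensities are assumptions for illustration.

```r
# Hypothetical helper implementing the equation above.
# a_matched / b_matched: intensities of the matched peaks, pair by pair.
# a_all / b_all: intensities of all peaks in the two spectra.
similarity_score <- function(a_matched, b_matched, a_all, b_all) {
  stopifnot(length(a_matched) == length(b_matched))
  sum(sqrt(a_matched) * sqrt(b_matched)) /
    (sqrt(sum(a_all)) * sqrt(sum(b_all)))
}

a_all <- c(4, 9, 3)
b_all <- c(4, 9, 12)

# The first two peaks of each spectrum are matched, the third ones are not:
# (2 * 2 + 3 * 3) / (sqrt(16) * sqrt(25)) = 13 / 20
score <- similarity_score(a_all[1:2], b_all[1:2], a_all, b_all)
print(score)
```

If every peak of a spectrum is matched to its own copy, the numerator equals the total intensity and the score is 1.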

The matches parameter of the calc_cossim() method defines the method used to select the final set of matched peaks.

"single" can be used to select the final pairs of peaks based on their mass difference, while only allowing single matches.
"multiple" uses all possible pairs of peaks to calculate the final similarity score. It is possible that a peak is matched more than once when using different mass shifts to compare peaks.
"max_intensity" calculates the subscore of each possible pair of peaks and uses the method clue::solveLSAP() to select the final pairs. This is also done in MsCoreUtils::gnps(). This option also only allows single matches.

The incl_prec parameter can be set to "TRUE" or "FALSE", with the default being "FALSE". It controls whether the precursor ion peaks, if they exist, are matched when calculating the cosine or modified cosine similarity. If this parameter is set to

"TRUE" the precursor ion peaks are allowed to match each other and their intensities are used in the final score calculation. This is done in MsCoreUtils::gnps().
"FALSE" the precursor ion peaks are not allowed to match each other and their intensity is only included in the total sum of intensities if they are matched to any other peak. This option was included to exclude the match between the precursor ion peaks, which can almost always be observed when calculating the cosine and modified cosine similarity because their mass difference corresponds to the precursor mass difference.

The calc_mMod_cossim.R-script contains some additional methods. Some of them are used by the calc_cossim() method, while others can be used to create similarity matrices or to get all multiple modified cosine scores and identified modifications.

The method calc_dismat() can be used to create the similarity matrix of spectra within one Spectra-object, by calculating the pairwise similarities of all pairs of spectra. This method only fills the upper triangle and the main diagonal of the similarity matrix. The spectrum with the higher precursor mass is always used as the first spectrum when calling the calc_cossim() method.
The method calc_distmat_different() can be used to create the similarity matrix of two Spectra-objects, by calculating all pairwise similarities. This method also always selects the spectrum with the higher precursor mass as the first spectrum.
The method create_edges_multimod() can be used to compare the multiple modified cosine score between a pair of spectra for different matched multiple modifications. It is possible that different combinations of modifications result in the same total mass; when used to calculate the multiple modified cosine, these different possibilities can result in different scores.
The method calc_edges() is identical to calc_dismat() but it calls the create_edges_multimod() method instead of the calc_cossim() method. The same applies for calc_edges_different() and calc_distmat_different().
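The upper-triangle filling and the ordering by precursor mass described for calc_dismat() can be sketched as follows. The helper `pairwise_upper()` and the placeholder `score_fun` are hypothetical; in the repository the score would come from calc_cossim() applied to Spectra-objects.

```r
# Hypothetical helper: fill the upper triangle and diagonal of a matrix,
# always passing the entry with the higher precursor mass first.
pairwise_upper <- function(precursor_mz, score_fun) {
  n <- length(precursor_mz)
  sim <- matrix(NA_real_, nrow = n, ncol = n)
  for (i in seq_len(n)) {
    for (j in i:n) {
      first <- if (precursor_mz[i] >= precursor_mz[j]) i else j
      second <- if (first == i) j else i
      sim[i, j] <- score_fun(first, second)
    }
  }
  sim
}

precursor_mz <- c(300, 316, 286)

# Placeholder score (not a similarity): the precursor mass difference,
# which is never negative if the ordering is applied correctly.
mass_diff <- function(first, second) precursor_mz[first] - precursor_mz[second]

sim <- pairwise_upper(precursor_mz, mass_diff)
print(sim)
```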
