Skip to content
This repository was archived by the owner on Oct 20, 2025. It is now read-only.

Conversation

@dsammour
Copy link

  • Added a new MassSpectrumOnDisk class which inherits AbstractMassObject class and whose Mass and Intensity slots are of "matter::matter_vec" class.
  • The Mass and Intensity slots of AbstractMassObject have a classUnion type of "numeric" and "matter_vec".
  • Edited methods of AbstractMassObject to comply with MassSpectrumOnDisk objects.
  • trimming and subsetting create in-memory MassSpectrum objects when called on the new type.
  • Edited documentation accordingly.

* Added a new MassSpectrumOnDisk class  which inherits AbstractMassObject class and whose Mass and Intensity slots are of "matter::matter_vec" class.
* The Mass and Intensity slots of AbstractMassObject have a classUnion type of "numeric" and "matter_vec".
* Edited methods of AbstractMassObject to comply with MassSpectrumOnDisk objects.
* trimming and subsetting create in-memory MassSpectrum objects when called on the new type.
Copy link
Owner

@sgibb sgibb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dear Denis,

thanks for your effort making this great PR. While reviewing it I developed some new thoughts that I like to discuss with you. Please don't be upset because of these many comments/request for changes.

  1. MALDIquant is on CRAN. If we depend on an Bioconductor package nobody can install MALDIquant with install.packages anymore (without setting the repositories manually or maybe the biocViews: trick works but I am not sure).
  2. Do we really need matter here? matter is a great package that offers a lot of cool stuff (chunking, basic operations like mean, sum, etc. on files, ...) but for our use case we just read/write single spectra to disk/memory (a single spectrum should always fit into memory). We could easily mimic matter with a few lines of code and avoid the Bioconductor/CRAN hassle and don't add other unnecessary dependencies (BiocParallel, DBI, biglm, ...). The speed is comparable: https://gist.github.com/sgibb/8c8ace205ae566680cd08022b49fc790
  3. I don't like the current focus on IMS data. Normal MS profile data could be very large, too. If we provide an on-disk mode I would welcome a solution that works for all kind of data.
  4. Please add some basic unit test to compare the functionality of the new data type with the old one.

General problems:
What happens if the user does the following?:

odms <- createMassSpectrumOnDisk(1:10, 1:10)
odmsCopy <- odms
## both operations will write to the same files without informing the user
odms <- calibrateIntensity(odms)
odmsCopy <- removeBaseline(odmsCopy)

BTW: We are working on something similar for MSnbase (https://github.com/lgatto/MSnbase/projects/2) but use different file based backends (namely mzML, Hdf5, all of the introduce new Bioconductor dependencies of course.) I don't think MALDIquant has do follow the same way.
BTW2: I didn't read the documentation you wrote because I guess it has to be changed after modifying this PR.

email="mail@sebastiangibb.de",
comment=c(ORCID="0000-0001-7406-4443")), person("Korbinian",
"Strimmer", role="ths"))
biocViews:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is necessary to trick R to install dependencies from Bioconductor, right? (I am not sure that this works/will be accepted on CRAN)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your reply/review Sebastian.

I just tested this thing; it does not work, although I found a lot of people suggesting this trick! I see your point on removing the dependency on matter. As you showed in your onDiskVec example, we just need to write and read one vector at a time. The onDiskVec should have path, offset, size and length slots to account for imzML datatype.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, if that don't work I don't really want to "Depend" or "Import" a Bioconductor package if it is not really necessary. If it would be possible to move it to "Suggests" it would be fine (but I am not sure that this is really possible).

I add a new branch: https://github.com/sgibb/MALDIquant/tree/OnDiskVector

If you like you could use and play with the new OnDiskVector class. It should work for both non-imzML and imzML data (it supports single files and offsets):

library("MALDIquant")

odv <- OnDiskVector(1:10)
odv@path
# [1] "/tmp/Rtmpaz3SQG/file13ce3a31c136"
odv[1:5]
# [1] 1 2 3 4 5

f <- system.file(
    file.path("exampledata", "tiny_continuous.ibd"),
    package="MALDIquantForeign"
)
odv2 <- OnDiskVector(path=f, offset=56, n=5)
odv2@path
# [1] "/home/sebastian/opt/R/lib/R/library/MALDIquantForeign/exampledata/tiny_continuous.ibd"
odv2[1:5]
# [1]  6  7  8  9 10

Implementation could be found in https://github.com/sgibb/MALDIquant/blob/OnDiskVector/R/OnDiskVector-class.R

(Currently I have no time to really finish this thing. There is still the problem with odv2 <- odv and a subsequently different change of both vectors on the same file.)

Copy link
Owner

@sgibb sgibb Mar 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I use a similar approach as we discussed for MSnbase in lgatto/MSnbase#429. TLDR; we store a modification counter in the object and in an additional file (in MSnbase we use the same file but that won't work for existing idb files). The modification counter is incremented if [<- is called, e.g.:

odv <- OnDiskVector(1:3)
odv2 <- odv
odv2[] <- 4:6
odv[]
# Error in .isModified.OnDiskVector(x) :
#   /tmp/RtmpKh7TLm/file54ec578fcbd3 was modified by a different object.
odv2[]
# [1] 4 5 6

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your reply Sebastian & sorry for the delayed reply.

So the way to go now, is to replicate whatever I did with the matter package but using the onDiskVector Class instead on the newly opened branch, right?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you satisfied with the OnDiskVector class that would be the way to go.


## represent an On-disk spectrum
setClass("MassSpectrumOnDisk",
slots = list(path = "character"),
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why this slot is needed? Isn't the file path part of the matter_vec object and easily accessible by matter::paths(matter_vec_object)?

Copy link
Author

@dsammour dsammour Mar 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, but I added it for convenience; not to call additional functions. If we opt to removing matter altogether, then it will have to stay.

}
TRUE

tmpMass <- mass(object)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should rethink the validity check for MassSpectrumOnDisk because reading all the data from disk, just to confirm there are of equal length and don't have negative values will slow down everything. I have to rethink it a little bit.

Copy link
Author

@dsammour dsammour Mar 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is possible to use length and any methods for finding length and negative values of matter_vec, respectively.

test = matter::matter_vec(seq(-5, 5))
length(test) # 11
any(test < 0) # TRUE

is.numeric though wont work. An alternative would be matter::datamode(test) but for this we need to have a separate method. is.unsorted wont work.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

odv <- OnDiskVector(seq(-5, 5))
length(odv)
# [1] 11
any(odv[] < 0)
# [1] TRUE

set.seed(1:3)
s <- sample(1e5)
o <- OnDiskVector(s)

library("matter")
m <- matter_vec(s)

library("microbenchmark")
microbenchmark(any(m < 0), any(o[] < 0))
# Unit: microseconds
#          expr      min       lq     mean   median       uq      max neval
#    any(m < 0) 3509.873 3855.356 4268.571 4281.766 4575.449 6589.845   100
#  any(o[] < 0)  821.973  888.878 1268.179 1056.025 1390.142 7355.973   100


importFrom("matter",
"matter_vec",
"matter_fc")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any code that uses matter_fc?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I remove, when building the package a warning is always thrown:

Warning: subclass "matter_fc" of class "matter_vec" is not local and cannot be updated for new inheritance information; consider setClassUnion()

sgibb and others added 2 commits March 4, 2019 11:26
Co-Authored-By: dsammour <44615117+dsammour@users.noreply.github.com>
Co-Authored-By: dsammour <44615117+dsammour@users.noreply.github.com>
@dsammour
Copy link
Author

dsammour commented Mar 4, 2019

Thanks for your reply/review Sebastian.

I see your point on removing the dependency on matter. As you showed in your onDiskVec example, we just need to write and read one vector at a time. The onDiskVec should have path, offset, size and length slots to account for imzML datatype. However, if we do this, we'll lose the meta data and length method available in a matter_vec object and therefore, every time we need to do some check on length, etc. the whole vector has to be read. Is there a good solution for that for onDiskVec?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants