-
Notifications
You must be signed in to change notification settings - Fork 24
MassSpectrumOnDisk Functionality #60
base: master
Are you sure you want to change the base?
MassSpectrumOnDisk Functionality #60
Conversation
dsammour
commented
Feb 26, 2019
- Added a new MassSpectrumOnDisk class which inherits AbstractMassObject class and whose Mass and Intensity slots are of "matter::matter_vec" class.
- The Mass and Intensity slots of AbstractMassObject have a classUnion type of "numeric" and "matter_vec".
- Edited methods of AbstractMassObject to comply with MassSpectrumOnDisk objects.
- trimming and subsetting create in-memory MassSpectrum objects when called on the new type.
- Edited documentation accordingly.
* Added a new MassSpectrumOnDisk class which inherits AbstractMassObject class and whose Mass and Intensity slots are of "matter::matter_vec" class. * The Mass and Intensity slots of AbstractMassObject have a classUnion type of "numeric" and "matter_vec". * Edited methods of AbstractMassObject to comply with MassSpectrumOnDisk objects. * trimming and subsetting create in-memory MassSpectrum objects when called on the new type.
sgibb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dear Denis,
thanks for your effort making this great PR. While reviewing it I developed some new thoughts that I like to discuss with you. Please don't be upset because of these many comments/request for changes.
MALDIquantis on CRAN. If we depend on an Bioconductor package nobody can installMALDIquantwithinstall.packagesanymore (without setting the repositories manually or maybe thebiocViews:trick works but I am not sure).- Do we really need
matterhere?matteris a great package that offers a lot of cool stuff (chunking, basic operations likemean,sum, etc. on files, ...) but for our use case we just read/write single spectra to disk/memory (a single spectrum should always fit into memory). We could easily mimicmatterwith a few lines of code and avoid the Bioconductor/CRAN hassle and don't add other unnecessary dependencies (BiocParallel,DBI,biglm, ...). The speed is comparable: https://gist.github.com/sgibb/8c8ace205ae566680cd08022b49fc790 - I don't like the current focus on IMS data. Normal MS profile data could be very large, too. If we provide an on-disk mode I would welcome a solution that works for all kind of data.
- Please add some basic unit test to compare the functionality of the new data type with the old one.
General problems:
What happens if the user does the following?:
odms <- createMassSpectrumOnDisk(1:10, 1:10)
odmsCopy <- odms
## both operations will write to the same files without informing the user
odms <- calibrateIntensity(odms)
odmsCopy <- removeBaseline(odmsCopy)BTW: We are working on something similar for MSnbase (https://github.com/lgatto/MSnbase/projects/2) but use different file based backends (namely mzML, Hdf5, all of the introduce new Bioconductor dependencies of course.) I don't think MALDIquant has do follow the same way.
BTW2: I didn't read the documentation you wrote because I guess it has to be changed after modifying this PR.
| email="mail@sebastiangibb.de", | ||
| comment=c(ORCID="0000-0001-7406-4443")), person("Korbinian", | ||
| "Strimmer", role="ths")) | ||
| biocViews: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is necessary to trick R to install dependencies from Bioconductor, right? (I am not sure that this works/will be accepted on CRAN)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your reply/review Sebastian.
I just tested this thing; it does not work, although I found a lot of people suggesting this trick! I see your point on removing the dependency on matter. As you showed in your onDiskVec example, we just need to write and read one vector at a time. The onDiskVec should have path, offset, size and length slots to account for imzML datatype.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, if that don't work I don't really want to "Depend" or "Import" a Bioconductor package if it is not really necessary. If it would be possible to move it to "Suggests" it would be fine (but I am not sure that this is really possible).
I add a new branch: https://github.com/sgibb/MALDIquant/tree/OnDiskVector
If you like you could use and play with the new OnDiskVector class. It should work for both non-imzML and imzML data (it supports single files and offsets):
library("MALDIquant")
odv <- OnDiskVector(1:10)
odv@path
# [1] "/tmp/Rtmpaz3SQG/file13ce3a31c136"
odv[1:5]
# [1] 1 2 3 4 5
f <- system.file(
file.path("exampledata", "tiny_continuous.ibd"),
package="MALDIquantForeign"
)
odv2 <- OnDiskVector(path=f, offset=56, n=5)
odv2@path
# [1] "/home/sebastian/opt/R/lib/R/library/MALDIquantForeign/exampledata/tiny_continuous.ibd"
odv2[1:5]
# [1] 6 7 8 9 10Implementation could be found in https://github.com/sgibb/MALDIquant/blob/OnDiskVector/R/OnDiskVector-class.R
(Currently I have no time to really finish this thing. There is still the problem with odv2 <- odv and a subsequently different change of both vectors on the same file.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I use a similar approach as we discussed for MSnbase in lgatto/MSnbase#429. TLDR; we store a modification counter in the object and in an additional file (in MSnbase we use the same file but that won't work for existing idb files). The modification counter is incremented if [<- is called, e.g.:
odv <- OnDiskVector(1:3)
odv2 <- odv
odv2[] <- 4:6
odv[]
# Error in .isModified.OnDiskVector(x) :
# /tmp/RtmpKh7TLm/file54ec578fcbd3 was modified by a different object.
odv2[]
# [1] 4 5 6There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your reply Sebastian & sorry for the delayed reply.
So the way to go now, is to replicate whatever I did with the matter package but using the onDiskVector Class instead on the newly opened branch, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you satisfied with the OnDiskVector class that would be the way to go.
|
|
||
| ## represent an On-disk spectrum | ||
| setClass("MassSpectrumOnDisk", | ||
| slots = list(path = "character"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why this slot is needed? Isn't the file path part of the matter_vec object and easily accessible by matter::paths(matter_vec_object)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, but I added it for convenience; not to call additional functions. If we opt to removing matter altogether, then it will have to stay.
| } | ||
| TRUE | ||
|
|
||
| tmpMass <- mass(object) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should rethink the validity check for MassSpectrumOnDisk because reading all the data from disk, just to confirm there are of equal length and don't have negative values will slow down everything. I have to rethink it a little bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is possible to use length and any methods for finding length and negative values of matter_vec, respectively.
test = matter::matter_vec(seq(-5, 5))
length(test) # 11
any(test < 0) # TRUE
is.numeric though wont work. An alternative would be matter::datamode(test) but for this we need to have a separate method. is.unsorted wont work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
odv <- OnDiskVector(seq(-5, 5))
length(odv)
# [1] 11
any(odv[] < 0)
# [1] TRUE
set.seed(1:3)
s <- sample(1e5)
o <- OnDiskVector(s)
library("matter")
m <- matter_vec(s)
library("microbenchmark")
microbenchmark(any(m < 0), any(o[] < 0))
# Unit: microseconds
# expr min lq mean median uq max neval
# any(m < 0) 3509.873 3855.356 4268.571 4281.766 4575.449 6589.845 100
# any(o[] < 0) 821.973 888.878 1268.179 1056.025 1390.142 7355.973 100|
|
||
| importFrom("matter", | ||
| "matter_vec", | ||
| "matter_fc") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any code that uses matter_fc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I remove, when building the package a warning is always thrown:
Warning: subclass "matter_fc" of class "matter_vec" is not local and cannot be updated for new inheritance information; consider setClassUnion()
Co-Authored-By: dsammour <44615117+dsammour@users.noreply.github.com>
Co-Authored-By: dsammour <44615117+dsammour@users.noreply.github.com>
|
Thanks for your reply/review Sebastian. I see your point on removing the dependency on matter. As you showed in your onDiskVec example, we just need to write and read one vector at a time. The onDiskVec should have path, offset, size and length slots to account for imzML datatype. However, if we do this, we'll lose the meta data and length method available in a |