In practice I've noticed that populating the DB can take 10 minutes for 1500 WSIs. This happens when initializing the datamodule, which initializes the datamanager, which immediately populates the DB. The biggest problem here was opening each slide with dlup to extract the mpp, width, and height, which is completely irrelevant for our task here. For now I've made the minimal image even more minimal; it only contains the fp to the slide. We may also want to think about when to populate the DB; doing it during datamodule initialization may be a bad design choice. Honestly, we can even forego the entire DB generation, and just within no database models, no engine, no session required. If we do want to keep the database, because it might add some more functionality later (e.g. when doing feature extraction w/ a mask?), we may want to open it, populate it, and close it, all during the call of
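A minimal sketch of the database-free variant suggested above, assuming the slides can be found with a glob pattern. `SlideRecord` and `iter_slides` are hypothetical names, not part of the repository. No slide is opened, so mpp, width, and height are never read.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator


@dataclass(frozen=True)
class SlideRecord:
    """Hypothetical stand-in for the minimal image: it stores only the file path."""

    filename: Path


def iter_slides(data_dir: Path, glob_pattern: str) -> Iterator[SlideRecord]:
    """Yield one record per matching file without opening any slide."""
    for path in sorted(Path(data_dir).glob(glob_pattern)):
        if path.is_file():
            yield SlideRecord(filename=path)
```

Because this is a generator, nothing is scanned at construction time; the directory is only walked when the records are actually iterated, which would keep datamodule initialization cheap.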
    assert current_dataset.slide_image.identifier
    self._dataset_sizes[current_dataset.slide_image.identifier] = len(current_dataset)
    curr_filename = current_dataset._path
    assert curr_filename
The assertion is redundant though, since a `TiledWsiDataset` always has a path.
Fixes #73.
The commit contains some minor comments that need quick fixing.
This PR implements generating an in-memory database on-the-fly.
This is a useful feature if you want to, e.g., run inference using a segmentation model on a set of slides that you do not wish to generate a complete database for.
That is exactly the use case it is designed for: running inference with a segmentation model on a glob of slides from a directory, taking only the slide as input (no masks, annotations, labels, patient information).
To achieve this, I have
- […] `OnTheFlyDataDescription` class, which contains fewer arguments than `DataDescription`
- […] `OnTheFlyDataDescription` is that the `data_dir` is used together with the `glob_pattern`. This searches for WSIs and populates the in-memory DB with them on-the-fly; that DB is then used by the inference pipeline.
- `DataManager` gets an `_on_the_fly` property that is set by checking the `data_description` class.
- […] `engine`, in contrast to the saved DB where a `uri` from the `data_description` is used to load the DB
- […] `engine` or a `uri` depending on the use-case
- […] `DataManager`'s `get_all_images` function, which is only implemented for a `MinimalImage` table
- […] `MinimalImage` table is the sole, and very minimal, part of the on-the-fly DB
- […] `data_description` now uses a wrapper function `create_datasets_from_data_description`, which, based on the `(OnTheFly)DataDescription` class, generates a dataset with either `datasets_from_data_description_with_uri`, which assumes a fully populated DB, or `datasets_from_on_the_fly_data_description`, which assumes a minimally populated DB with only the `MinimalImage` table
- […] `openslide-python` directories and decided to add them explicitly here, which is likely easier and may be used for any test

Possible limitations

- […] `DataManager` to use an engine for both `DataDescription`s, so that a session is opened from the same kind of input. This may be easier to read and may make it easier to add new features or refactors to the `DataManager`. In one case it creates an engine by creating a DB and populating it. In the other case it creates an engine by reading it from the `uri`. The session is then simply opened from the engine, instead of first creating an engine from the `uri` and then returning the session to which the engine is bound
- […] `DLUP` version has a bug concerning the pyramidal format of the generated segmentation map
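The "same input for both cases" idea can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: it uses the standard-library `sqlite3` module in place of the engine/session setup, and `SavedDbDescription`, `OnTheFlyDescription`, `open_connection`, and the `minimal_image` table layout are hypothetical names chosen for the example.

```python
import sqlite3
from dataclasses import dataclass
from pathlib import Path
from typing import List, Union


@dataclass
class SavedDbDescription:
    """Hypothetical: points at a database file that was populated earlier."""

    db_path: str


@dataclass
class OnTheFlyDescription:
    """Hypothetical: only a directory and a glob pattern are known."""

    data_dir: Path
    glob_pattern: str


def open_connection(
    description: Union[SavedDbDescription, OnTheFlyDescription],
) -> sqlite3.Connection:
    """Return a connection for either description type."""
    if isinstance(description, SavedDbDescription):
        # Saved case: just open the existing database file.
        return sqlite3.connect(description.db_path)

    # On-the-fly case: create an in-memory DB and fill it with file paths only.
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE minimal_image (filename TEXT PRIMARY KEY)")
    paths = sorted(Path(description.data_dir).glob(description.glob_pattern))
    connection.executemany(
        "INSERT INTO minimal_image (filename) VALUES (?)",
        [(str(path),) for path in paths],
    )
    connection.commit()
    return connection


def get_all_images(connection: sqlite3.Connection) -> List[str]:
    """Return every stored filename, sorted."""
    rows = connection.execute("SELECT filename FROM minimal_image ORDER BY filename")
    return [row[0] for row in rows]
```

With this shape, downstream code only ever receives a connection and does not need to know whether the database was loaded from disk or built in memory.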