Describe the bug
When running inference, AbstractWriterCallback loops over all datasets to construct the _dataset_sizes dict. This opens a slide from cache several times, which can take 1-3 seconds. For a dataset of 1500 WSIs this often takes 20 minutes.
To Reproduce
Run inference on-the-fly (#87) with your data_dir and glob_pattern set up to find many whole-slide images.
Expected behavior
You'll find that after printing the dataset statistics, it takes a long time to start setting up callback workers.
In my case, about five minutes pass between the dataset statistics and the first callback worker starting:
[2024-06-07 12:24:32,332][ahcore.data.dataset.DlupDataModule][INFO] - Dataset for stage predict has 773079 samples and the following statistics:
- Mean: 485.30
- Std: 145.56
- Min: 48.00
- Max: 1056.00
[2024-06-07 12:29:30,294][ahcore.callbacks.converters.common][INFO] - Starting worker for TiffConverterCallback
Environment
dlup version: 0.3.38
How installed: unsure
Python version: 3.11.9
Operating System: linux
Quick solution to reduce the time by half: in the loop
for current_dataset in self._total_dataset.datasets:  # type: ignore
change
assert current_dataset.slide_image.identifier
self._dataset_sizes[current_dataset.slide_image.identifier] = len(current_dataset)
to
current_dataset_slide_id = current_dataset.slide_image.identifier
assert current_dataset_slide_id
self._dataset_sizes[current_dataset_slide_id] = len(current_dataset)
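Each access to current_dataset.slide_image appears to open the slide from cache, so reading the identifier once per dataset instead of twice will likely reduce the time by half.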
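The effect of the change can be illustrated with a minimal sketch. FakeDataset and FakeSlideImage below are hypothetical stand-ins, not the dlup or ahcore classes, and the sketch assumes that every access to slide_image re-opens the slide. Under that assumption, the original loop opens each slide twice and the proposed loop opens it once.

```python
class FakeSlideImage:
    """Hypothetical stand-in for a slide image; not the dlup API."""

    def __init__(self, identifier):
        self.identifier = identifier


class FakeDataset:
    """Hypothetical per-slide dataset that counts how often the slide is opened."""

    def __init__(self, identifier, size):
        self._identifier = identifier
        self._size = size
        self.open_count = 0

    @property
    def slide_image(self):
        # Assumption: each property access re-opens the slide (the slow step).
        self.open_count += 1
        return FakeSlideImage(self._identifier)

    def __len__(self):
        return self._size


def sizes_original(datasets):
    """Mirror of the current loop: slide_image is accessed twice per dataset."""
    sizes = {}
    for current_dataset in datasets:
        assert current_dataset.slide_image.identifier
        sizes[current_dataset.slide_image.identifier] = len(current_dataset)
    return sizes


def sizes_cached(datasets):
    """Mirror of the proposed loop: slide_image is accessed once per dataset."""
    sizes = {}
    for current_dataset in datasets:
        current_dataset_slide_id = current_dataset.slide_image.identifier
        assert current_dataset_slide_id
        sizes[current_dataset_slide_id] = len(current_dataset)
    return sizes


if __name__ == "__main__":
    before = [FakeDataset(f"slide_{i}", 10 * (i + 1)) for i in range(3)]
    after = [FakeDataset(f"slide_{i}", 10 * (i + 1)) for i in range(3)]
    sizes_original(before)
    sizes_cached(after)
    print([d.open_count for d in before])  # opens per slide, original loop
    print([d.open_count for d in after])   # opens per slide, proposed loop
```

Both loops build the same dict; only the number of slide opens differs. If len(current_dataset) also touches the slide in the real code, the saving would be smaller than half.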