This repository was archived by the owner on Oct 19, 2024. It is now read-only.

improvement: reduce setup time of AbstractWriterCallback #88

@YoniSchirris

Describe the bug
When running inference, AbstractWriterCallback loops over all datasets to construct the _dataset_sizes dict. For each dataset this opens the slide from cache several times, and each open can take 1-3 seconds. For a dataset of 1500 WSIs this often adds up to 20 minutes.

To Reproduce
Run inference on-the-fly (#87) with your data_dir and glob_pattern set up to find many whole-slide images.

Expected behavior
You'll find that after the dataset statistics are printed, it takes a long time before the callback workers start being set up.

In my case, about five minutes pass between the dataset statistics and the first worker start:

[2024-06-07 12:24:32,332][ahcore.data.dataset.DlupDataModule][INFO] - Dataset for stage predict has 773079 samples and the following statistics:
 - Mean: 485.30
 - Std: 145.56
 - Min: 48.00
 - Max: 1056.00
[2024-06-07 12:29:30,294][ahcore.callbacks.converters.common][INFO] - Starting worker for TiffConverterCallback

Environment
dlup version: 0.3.38
How installed: unsure
Python version: 3.11.9
Operating System: linux

Quick solution that will likely halve the setup time: in the loop

for current_dataset in self._total_dataset.datasets: # type: ignore
change

assert current_dataset.slide_image.identifier
self._dataset_sizes[current_dataset.slide_image.identifier] = len(current_dataset)

to

current_dataset_slide_id = current_dataset.slide_image.identifier
assert current_dataset_slide_id
self._dataset_sizes[current_dataset_slide_id] = len(current_dataset)

This reads current_dataset.slide_image once per dataset instead of twice, which will likely reduce the time by half.
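To illustrate why the change should help, here is a minimal sketch that counts how often the slide is accessed under both variants. FakeDataset and FakeSlideImage are hypothetical stand-ins and not ahcore or dlup classes; the sketch assumes that every access to slide_image triggers one (expensive) slide open and that len() does not.

```python
from typing import Dict, List


class FakeSlideImage:
    """Hypothetical stand-in for a slide object exposing an identifier."""

    def __init__(self, identifier: str) -> None:
        self.identifier = identifier


class FakeDataset:
    """Hypothetical stand-in for a per-slide dataset.

    Assumption: each access to `slide_image` opens the slide once,
    so we count accesses instead of actually waiting 1-3 seconds.
    """

    def __init__(self, identifier: str, num_tiles: int) -> None:
        self._identifier = identifier
        self._num_tiles = num_tiles
        self.open_count = 0

    @property
    def slide_image(self) -> FakeSlideImage:
        self.open_count += 1  # one simulated slide open per access
        return FakeSlideImage(self._identifier)

    def __len__(self) -> int:
        return self._num_tiles


def sizes_original(datasets: List[FakeDataset]) -> Dict[str, int]:
    """Mirrors the current loop body: slide_image is accessed twice."""
    sizes: Dict[str, int] = {}
    for current_dataset in datasets:
        assert current_dataset.slide_image.identifier
        sizes[current_dataset.slide_image.identifier] = len(current_dataset)
    return sizes


def sizes_proposed(datasets: List[FakeDataset]) -> Dict[str, int]:
    """Mirrors the proposed loop body: slide_image is accessed once."""
    sizes: Dict[str, int] = {}
    for current_dataset in datasets:
        current_dataset_slide_id = current_dataset.slide_image.identifier
        assert current_dataset_slide_id
        sizes[current_dataset_slide_id] = len(current_dataset)
    return sizes
```

Both functions build the same dict; under the stated assumption, only the number of slide opens per dataset differs (two versus one). Whether the real setup time halves depends on how much of it is spent in slide_image as opposed to other calls such as len().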

Labels: bug, enhancement
