multithread digest'ing of zarr folders #913
Closed
ATM, if I run dandi digest on a hot folder (digested before, so IO is fast), it keeps the CPU only about 30% busy and takes > 20 sec, whereas a parallelized checksumming example takes about 20 times less time and goes above 100% CPU:
(dandi-devel) jovyan@jupyter-yarikoptic:/shared/ngffdata$ time DANDI_CACHE=ignore dandi digest -d zarr-checksum test64.ngff/0/0/0/0
test64.ngff/0/0/0/0: 53ad9d9fe8b4f34882fdaea76599da22
2022-02-16 18:51:44,202 [ INFO] Logs saved in /home/jovyan/.cache/dandi-cli/log/20220216185120Z-4990.log
real 0m24.034s
user 0m8.524s
sys 0m5.120s
(dandi-devel) jovyan@jupyter-yarikoptic:/shared/ngffdata$ time DANDI_CACHE=ignore dandi digest -d zarr-checksum test64.ngff/0/0/0/0
test64.ngff/0/0/0/0: 53ad9d9fe8b4f34882fdaea76599da22
2022-02-16 18:52:24,406 [ INFO] Logs saved in /home/jovyan/.cache/dandi-cli/log/20220216185202Z-5127.log
real 0m22.499s
user 0m8.369s
sys 0m6.190s
(dandi-devel) jovyan@jupyter-yarikoptic:/shared/ngffdata$ time python /shared/io-utils/fastio_md5.py test64.ngff/0/0/0/0
Total: 7200
real 0m1.358s
user 0m1.545s
sys 0m2.164s
related PR introducing multithreaded walk in fscacher (benefit is not yet 100% clear): con/fscacher#67
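For illustration, here is a minimal sketch of what a thread-parallel MD5 pass over a folder could look like. It is not the fastio_md5.py script timed above (its contents are not shown here), and it does not compute dandi's zarr-checksum; the names md5_file and md5_tree are hypothetical. It only assumes the standard library: hashlib releases the GIL while hashing buffers larger than 2047 bytes, so a thread pool can overlap hashing and IO across files.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def md5_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of one file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read until f.read() returns an empty bytes object (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_tree(root, max_workers=None):
    """Map each file path (relative to root, POSIX style) to its MD5 digest.

    Hypothetical helper: files are hashed concurrently in a thread pool.
    """
    root = Path(root)
    # Sort so that the result order does not depend on directory listing order.
    files = sorted(p for p in root.rglob("*") if p.is_file())
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so digests line up with files.
        digests = list(pool.map(md5_file, files))
    return {p.relative_to(root).as_posix(): d for p, d in zip(files, digests)}
```

Calling md5_tree("test64.ngff/0/0/0/0") would return one digest per file; combining them into a single folder checksum is left out, since that step depends on the zarr-checksum definition.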