multithread digest'ing of zarr folders #913

@yarikoptic

ATM, if I run dandi digest on a hot folder (it was digested before, so IO is fast), dandi digest keeps the CPU only about 30% busy and takes > 20 sec, whereas a parallelized checksumming example takes about 20 times less and goes above 100% CPU.

(dandi-devel) jovyan@jupyter-yarikoptic:/shared/ngffdata$ time DANDI_CACHE=ignore dandi digest -d zarr-checksum test64.ngff/0/0/0/0
test64.ngff/0/0/0/0: 53ad9d9fe8b4f34882fdaea76599da22
2022-02-16 18:51:44,202 [    INFO] Logs saved in /home/jovyan/.cache/dandi-cli/log/20220216185120Z-4990.log

real    0m24.034s
user    0m8.524s
sys     0m5.120s
(dandi-devel) jovyan@jupyter-yarikoptic:/shared/ngffdata$ time DANDI_CACHE=ignore dandi digest -d zarr-checksum test64.ngff/0/0/0/0
test64.ngff/0/0/0/0: 53ad9d9fe8b4f34882fdaea76599da22
2022-02-16 18:52:24,406 [    INFO] Logs saved in /home/jovyan/.cache/dandi-cli/log/20220216185202Z-5127.log

real    0m22.499s
user    0m8.369s
sys     0m6.190s

(dandi-devel) jovyan@jupyter-yarikoptic:/shared/ngffdata$ time python /shared/io-utils/fastio_md5.py test64.ngff/0/0/0/0
Total: 7200

real    0m1.358s
user    0m1.545s
sys     0m2.164s

Related PR introducing a multithreaded walk in fscacher (the benefit is not yet 100% clear): con/fscacher#67
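
For illustration, here is a minimal sketch of the kind of parallelized checksumming meant above. It is not the fastio_md5.py script (its contents are not shown here), and it computes plain per-file MD5 digests rather than the zarr-checksum. The names md5_file and md5_tree and the max_workers default are hypothetical, chosen only for this example.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor


def md5_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of one file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_tree(root, max_workers=8):
    """Map each file under root (path relative to root) to its MD5 digest.

    Hypothetical helper: files are hashed concurrently in a thread pool.
    """
    paths = [
        os.path.join(dirpath, name)
        for dirpath, _dirs, files in os.walk(root)
        for name in files
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so digests line up with paths.
        digests = list(pool.map(md5_file, paths))
    return {os.path.relpath(p, root): d for p, d in zip(paths, digests)}
```

Threads can help here because file reads release the GIL, and hashlib also releases it while hashing larger buffers, so several files can be read and hashed at once. Calling md5_tree("test64.ngff/0/0/0/0") would return one digest per file; combining those into a single zarr checksum is left out of this sketch.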
