Summary
This is a follow-up to my recent post, which followed the worktree exploration work by @just-meng. That boiled down to me not finding any worktree use when searching babs, hence the suggestion to investigate that git feature, for which git-annex support already exists!
git worktree also has "support" for sparse checkout:
❯ git worktree --help | grep -A4 -e '--\[no.*checkout'
--[no-]checkout
By default, add checks out <commit-ish>, however, --no-checkout can be used to suppress checkout in
order to make customizations, such as configuring sparse-checkout. See "Sparse checkout" in git-
read-tree(1).
So it might boil down to just giving it a shot: e.g. cloning a "large" dataset with annexed content, like https://github.com/OpenNeuroDatasets/ds000003, and doing some rudimentary datalad run (e.g. an md5sum invocation) with --input being one sub-*/ folder...
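The git-only part of that idea can be tried on a throwaway repository first. Below is a minimal sketch, assuming a reasonably recent git (one that ships `git sparse-checkout`); the toy repository layout, file names, and the demo identity are made up for illustration, and neither git-annex nor datalad is involved.

```shell
#!/usr/bin/env bash
# Toy illustration with plain git only (no git-annex, no datalad).
# Repository name, files, and identity below are hypothetical.
set -eu

tmp=$(mktemp -d)
cd "$tmp"
git init -q toy
cd toy
git config user.email "demo@example.com"
git config user.name "Demo"

# Fake a tiny "dataset": two subject folders and one top-level file.
mkdir -p sub-01 sub-02
echo a > sub-01/data.txt
echo b > sub-02/data.txt
echo readme > README
git add .
git commit -q -m "initial content"

# Create the worktree without populating it, restrict it to sub-01,
# and only then do the first checkout.
git worktree add -q --no-checkout .worktrees/wt-sub-01
cd .worktrees/wt-sub-01
git sparse-checkout init --cone
git sparse-checkout set sub-01
git checkout -q

# Cone mode keeps top-level files, so README shows up next to sub-01.
ls
```

In cone mode the top-level files are always included, which is why only the other subject folder stays out of the worktree.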
OK, I gave this to Claude and it created this demo, where it even figured out to use --explicit for datalad run:
https://www.oneukrainian.com/tmp/sparse-worktree-datalad-run.sh
which outputs ...
> git worktree add --no-checkout .worktrees/wt-sub-01
Preparing worktree (new branch 'wt-sub-01')
> cd .worktrees/wt-sub-01
> git sparse-checkout init --cone
> git sparse-checkout set sub-01 code
> git checkout
> echo '=== Worktree sparse content (top-level + sub-01 + code/) ==='
=== Worktree sparse content (top-level + sub-01 + code/) ===
> ls
CHANGES dataset_description.json participants.tsv README sub-01 task-rhymejudgment_bold.json
> echo '=== Subject folder ==='
=== Subject folder ===
> ls sub-01/
anat func
> git annex info
trusted repositories: 0
semitrusted repositories: 6
00000000-0000-0000-0000-000000000001 -- web
00000000-0000-0000-0000-000000000002 -- bittorrent
7caccbd2-81a6-49e5-a339-66fb9e9a4f36 -- [s3-PUBLIC]
a95d8e05-b3fd-49a3-834f-9feb936b7c1e -- root@107c430b4146:/datalad/ds000003
e0770190-db4e-4e8c-a9a1-4d83fb56acfd -- s3-PRIVATE
e8b6cb2e-5774-44ed-99bc-a1f6605c0131 -- yoh@bilena:~/.tmp/dl-sparse-wt-XXWN8j1/ds000003 [here]
untrusted repositories: 0
transfers in progress: none
available local disk space: 7.31 gigabytes (+0 bytes reserved)
local annex keys: 0
local annex size: 0 bytes
annexed files in working tree: 3
size of annexed files in working tree: 31.74 megabytes
combined annex size of all repositories: 826.61 megabytes
annex sizes of repositories:
413.3 MB: 7caccbd2-81a6-49e5-a339-66fb9e9a4f36 -- [s3-PUBLIC]
413.3 MB: a95d8e05-b3fd-49a3-834f-9feb936b7c1e -- root@107c430b4146:/datalad/ds000003
backend usage:
MD5E: 3
bloom filter size: 32 mebibytes (0% full)
> git annex init
init ok
> datalad get sub-01/
/home/yoh/proj/datalad/trash/.venv/lib/python3.13/site-packages/requests/__init__.py:113: RequestsDependencyWarning: urllib3 (2.6.3) or chardet (6.0.0.post1)/charset_normalizer (3.4.4) doesn't match a supported version!
warnings.warn(
get(ok): sub-01/anat/sub-01_inplaneT2.nii.gz (file) [from s3-PUBLIC...]
get(ok): sub-01/anat/sub-01_T1w.nii.gz (file) [from s3-PUBLIC...]
get(ok): sub-01/func/sub-01_task-rhymejudgment_bold.nii.gz (file) [from s3-PUBLIC...]
get(ok): sub-01 (directory)
action summary:
get (ok: 4)
> echo '=== After get ==='
=== After get ===
> ls -la sub-01/
total 0
drwxrwsr-x 1 yoh yoh 16 Mar 2 11:51 .
drwxrwsr-x 1 yoh yoh 210 Mar 2 11:51 ..
drwxrwsr-x 1 yoh yoh 80 Mar 2 11:51 anat
drwxrwsr-x 1 yoh yoh 146 Mar 2 11:51 func
> ls -la sub-01/anat/ sub-01/func/
sub-01/anat/:
total 8
drwxrwsr-x 1 yoh yoh 80 Mar 2 11:51 .
drwxrwsr-x 1 yoh yoh 16 Mar 2 11:51 ..
lrwxrwxrwx 1 yoh yoh 138 Mar 2 11:51 sub-01_inplaneT2.nii.gz -> ../../.git/annex/objects/J3/q7/MD5E-s664614--0f8bc47f9c3047b340abfcd3ce1fb021.nii.gz/MD5E-s664614--0f8bc47f9c3047b340abfcd3ce1fb021.nii.gz
lrwxrwxrwx 1 yoh yoh 140 Mar 2 11:51 sub-01_T1w.nii.gz -> ../../.git/annex/objects/jJ/2v/MD5E-s5712417--0d1e0a7ff7063250404f45a955a66203.nii.gz/MD5E-s5712417--0d1e0a7ff7063250404f45a955a66203.nii.gz
sub-01/func/:
total 8
drwxrwsr-x 1 yoh yoh 146 Mar 2 11:51 .
drwxrwsr-x 1 yoh yoh 16 Mar 2 11:51 ..
lrwxrwxrwx 1 yoh yoh 142 Mar 2 11:51 sub-01_task-rhymejudgment_bold.nii.gz -> ../../.git/annex/objects/29/WX/MD5E-s25362403--d72afc284b7608e29ab1e92c75513f69.nii.gz/MD5E-s25362403--d72afc284b7608e29ab1e92c75513f69.nii.gz
-rw-rw-r-- 1 yoh yoh 1418 Mar 2 11:51 sub-01_task-rhymejudgment_events.tsv
> mkdir -p derivatives/checksums
> datalad run -m 'Compute md5sums for sub-01' --explicit --input 'sub-01/**' --output derivatives/checksums/sub-01_md5sums.txt -- bash -c 'find -L sub-01/ -type f | sort | xargs md5sum > derivatives/checksums/sub-01_md5sums.txt'
/home/yoh/proj/datalad/trash/.venv/lib/python3.13/site-packages/requests/__init__.py:113: RequestsDependencyWarning: urllib3 (2.6.3) or chardet (6.0.0.post1)/charset_normalizer (3.4.4) doesn't match a supported version!
warnings.warn(
[INFO ] Making sure inputs are available (this may take some time)
[INFO ] == Command start (output follows) =====
[INFO ] == Command exit (modification check follows) =====
run(ok): /home/yoh/.tmp/dl-sparse-wt-XXWN8j1/ds000003/.worktrees/wt-sub-01 (dataset) [bash -c 'find -L sub-01/ -type f | sort ...]
add(ok): derivatives/checksums/sub-01_md5sums.txt (file)
save(ok): . (dataset)
> echo '=== Result ==='
=== Result ===
> cat derivatives/checksums/sub-01_md5sums.txt
0f8bc47f9c3047b340abfcd3ce1fb021 sub-01/anat/sub-01_inplaneT2.nii.gz
0d1e0a7ff7063250404f45a955a66203 sub-01/anat/sub-01_T1w.nii.gz
d72afc284b7608e29ab1e92c75513f69 sub-01/func/sub-01_task-rhymejudgment_bold.nii.gz
6c0f223452837e1878466eb935ddffc8 sub-01/func/sub-01_task-rhymejudgment_events.tsv
> echo '=== git log ==='
=== git log ===
> git log --oneline -5
4954b7b (HEAD -> wt-sub-01) [DATALAD RUNCMD] Compute md5sums for sub-01
c090537 (tag: 1.0.0, origin/master, origin/HEAD, master) [DATALAD] added content
d9bf260 [DATALAD] added content
9f7816c [DATALAD] added content
571d873 (tag: 57fed018cce88d000ac1757f, tag: 00001) [DATALAD] added content
> echo '=== Verify sparse-excluded files were NOT deleted ==='
=== Verify sparse-excluded files were NOT deleted ===
> git status
On branch wt-sub-01
You are in a sparse checkout with 13% of tracked files present.
nothing to commit, working tree clean
>> pwd
> echo 'DONE. Worktree at /home/yoh/.tmp/dl-sparse-wt-XXWN8j1/ds000003/.worktrees/wt-sub-01'
DONE. Worktree at /home/yoh/.tmp/dl-sparse-wt-XXWN8j1/ds000003/.worktrees/wt-sub-01
bash -x sparse-worktree-datalad-run.sh 15,81s user 5,22s system 100% cpu 20,911 total
❯ cd /home/yoh/.tmp/dl-sparse-wt-XXWN8j1/ds000003/.worktrees/wt-sub-01
CHANGES README dataset_description.json derivatives/ participants.tsv sub-01/ task-rhymejudgment_bold.json
❯ git annex list
here
|origin
||s3-PUBLIC
|||web
||||bittorrent
|||||
X____ derivatives/checksums/sub-01_md5sums.txt
X_X__ sub-01/anat/sub-01_T1w.nii.gz
X_X__ sub-01/anat/sub-01_inplaneT2.nii.gz
X_X__ sub-01/func/sub-01_task-rhymejudgment_bold.nii.gz
❯ git log derivatives/checksums/sub-01_md5sums.txt
commit 4954b7b3cc7e53ea596db12feae70e840cdd96c3 (HEAD -> wt-sub-01)
Author: Yaroslav Halchenko <debian@onerussian.com>
Date: Mon Mar 2 11:51:44 2026 -0500
[DATALAD RUNCMD] Compute md5sums for sub-01
=== Do not change lines below ===
{
"chain": [],
"cmd": "bash -c 'find -L sub-01/ -type f | sort | xargs md5sum > derivatives/checksums/sub-01_md5sums.txt'",
"exit": 0,
"extra_inputs": [],
"inputs": [
"sub-01/**"
],
"outputs": [
"derivatives/checksums/sub-01_md5sums.txt"
],
"pwd": "."
}
^^^ Do not change lines above ^^^
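To check such a run record from a script, the JSON between the two marker lines can be cut out of the commit message. This is a rough sketch only: it relies on the marker lines being exactly as printed above, the helper name `run_record` is hypothetical, and the toy commit here merely mimics the message format so the snippet runs without datalad.

```shell
#!/usr/bin/env bash
set -eu

# Throwaway repo with one commit whose message mimics the record above.
tmp=$(mktemp -d)
cd "$tmp"
git init -q toy
cd toy
git config user.email "demo@example.com"
git config user.name "Demo"
echo x > file.txt
git add file.txt
git commit -q -F - <<'EOF'
[DATALAD RUNCMD] toy record

=== Do not change lines below ===
{
 "cmd": "true",
 "exit": 0
}
^^^ Do not change lines above ^^^
EOF

# Hypothetical helper: print only the lines strictly between the markers
# of the commit message of the given commit.
run_record() {
  git log -1 --format=%B "$1" |
    awk '$0 == "^^^ Do not change lines above ^^^" { inside = 0 }
         inside { print }
         $0 == "=== Do not change lines below ===" { inside = 1 }'
}

run_record HEAD
```

In the sparse worktree from the demo, `run_record wt-sub-01` would be the analogous call, assuming the branch name shown in the log above.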
Next steps
I am not even sure we need anything in git-annex or datalad to support this here; maybe it just needs to add a worktree checkout? But in the longer run we had better support worktree explicitly in datalad for various scenarios, e.g.