Support git worktree #338

@yarikoptic

Summary

This is a follow-up to my recent post:

https://git-annex.branchable.com/forum/support_for_git_sparse_checkout/#comment-56834ee8e126c80c48b0297c566919c1

That post followed the worktree exploration work by @just-meng. It boiled down to me searching babs for "worktree" and finding nothing, hence the suggestion to investigate that git feature, for which git-annex support already exists!

And git worktree has "support" for sparse checkout:

❯ git worktree --help | grep -A4 -e '--\[no.*checkout'
       --[no-]checkout
           By default, add checks out <commit-ish>, however, --no-checkout can be used to suppress checkout in
           order to make customizations, such as configuring sparse-checkout. See "Sparse checkout" in git-
           read-tree(1).
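
The flow that help text hints at can be tried with plain git on a throwaway repository, without git-annex or datalad. This is only a minimal sketch: the directory names (sub-01, sub-02, code), the worktree name wt-sub-01, and the demo identity passed to `git commit` are made up for illustration, and it assumes a git recent enough to provide `git sparse-checkout`.

```shell
#!/usr/bin/env bash
# Sketch: create a sparse worktree in a throwaway repository (plain git only).
# All file and directory names below are illustrative assumptions.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .
mkdir -p sub-01 sub-02 code
echo one > sub-01/data.txt
echo two > sub-02/data.txt
echo 'echo hi' > code/run.sh
echo readme > README
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m 'initial content'

# Create the worktree without populating it, so sparsity can be configured first
git worktree add --no-checkout .worktrees/wt-sub-01
wt="$repo/.worktrees/wt-sub-01"
cd "$wt"

# Cone mode keeps top-level files plus the listed directories
git sparse-checkout init --cone
git sparse-checkout set sub-01 code

# Populate the worktree according to the sparsity patterns
git checkout -q

ls
```

In cone mode the top-level files stay present, so only the directories that were not listed (here sub-02) should be missing from the worktree.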

So it might boil down to just giving it a shot: e.g. clone a "large" dataset with annexed content, like https://github.com/OpenNeuroDatasets/ds000003, and do some rudimentary datalad run (e.g. an md5sum invocation) with --input being one sub/ folder...

OK, I gave it to Claude and it created this demo, where it even figured out to use --explicit for datalad run:

https://www.oneukrainian.com/tmp/sparse-worktree-datalad-run.sh

which outputs:
> git worktree add --no-checkout .worktrees/wt-sub-01
Preparing worktree (new branch 'wt-sub-01')
> cd .worktrees/wt-sub-01
> git sparse-checkout init --cone
> git sparse-checkout set sub-01 code
> git checkout
> echo '=== Worktree sparse content (top-level + sub-01 + code/) ==='
=== Worktree sparse content (top-level + sub-01 + code/) ===
> ls
CHANGES  dataset_description.json  participants.tsv  README  sub-01  task-rhymejudgment_bold.json
> echo '=== Subject folder ==='
=== Subject folder ===
> ls sub-01/
anat  func
> git annex info
trusted repositories: 0
semitrusted repositories: 6
	00000000-0000-0000-0000-000000000001 -- web
	00000000-0000-0000-0000-000000000002 -- bittorrent
	7caccbd2-81a6-49e5-a339-66fb9e9a4f36 -- [s3-PUBLIC]
	a95d8e05-b3fd-49a3-834f-9feb936b7c1e -- root@107c430b4146:/datalad/ds000003
	e0770190-db4e-4e8c-a9a1-4d83fb56acfd -- s3-PRIVATE
	e8b6cb2e-5774-44ed-99bc-a1f6605c0131 -- yoh@bilena:~/.tmp/dl-sparse-wt-XXWN8j1/ds000003 [here]
untrusted repositories: 0
transfers in progress: none
available local disk space: 7.31 gigabytes (+0 bytes reserved)
local annex keys: 0
local annex size: 0 bytes
annexed files in working tree: 3
size of annexed files in working tree: 31.74 megabytes
combined annex size of all repositories: 826.61 megabytes
annex sizes of repositories: 
	413.3 MB: 7caccbd2-81a6-49e5-a339-66fb9e9a4f36 -- [s3-PUBLIC]
	413.3 MB: a95d8e05-b3fd-49a3-834f-9feb936b7c1e -- root@107c430b4146:/datalad/ds000003
backend usage: 
	MD5E: 3
bloom filter size: 32 mebibytes (0% full)
> git annex init
init  ok
> datalad get sub-01/
/home/yoh/proj/datalad/trash/.venv/lib/python3.13/site-packages/requests/__init__.py:113: RequestsDependencyWarning: urllib3 (2.6.3) or chardet (6.0.0.post1)/charset_normalizer (3.4.4) doesn't match a supported version!
  warnings.warn(
get(ok): sub-01/anat/sub-01_inplaneT2.nii.gz (file) [from s3-PUBLIC...]
get(ok): sub-01/anat/sub-01_T1w.nii.gz (file) [from s3-PUBLIC...]
get(ok): sub-01/func/sub-01_task-rhymejudgment_bold.nii.gz (file) [from s3-PUBLIC...]
get(ok): sub-01 (directory)
action summary:
  get (ok: 4)
> echo '=== After get ==='
=== After get ===
> ls -la sub-01/
total 0
drwxrwsr-x 1 yoh yoh  16 Mar  2 11:51 .
drwxrwsr-x 1 yoh yoh 210 Mar  2 11:51 ..
drwxrwsr-x 1 yoh yoh  80 Mar  2 11:51 anat
drwxrwsr-x 1 yoh yoh 146 Mar  2 11:51 func
> ls -la sub-01/anat/ sub-01/func/
sub-01/anat/:
total 8
drwxrwsr-x 1 yoh yoh  80 Mar  2 11:51 .
drwxrwsr-x 1 yoh yoh  16 Mar  2 11:51 ..
lrwxrwxrwx 1 yoh yoh 138 Mar  2 11:51 sub-01_inplaneT2.nii.gz -> ../../.git/annex/objects/J3/q7/MD5E-s664614--0f8bc47f9c3047b340abfcd3ce1fb021.nii.gz/MD5E-s664614--0f8bc47f9c3047b340abfcd3ce1fb021.nii.gz
lrwxrwxrwx 1 yoh yoh 140 Mar  2 11:51 sub-01_T1w.nii.gz -> ../../.git/annex/objects/jJ/2v/MD5E-s5712417--0d1e0a7ff7063250404f45a955a66203.nii.gz/MD5E-s5712417--0d1e0a7ff7063250404f45a955a66203.nii.gz

sub-01/func/:
total 8
drwxrwsr-x 1 yoh yoh  146 Mar  2 11:51 .
drwxrwsr-x 1 yoh yoh   16 Mar  2 11:51 ..
lrwxrwxrwx 1 yoh yoh  142 Mar  2 11:51 sub-01_task-rhymejudgment_bold.nii.gz -> ../../.git/annex/objects/29/WX/MD5E-s25362403--d72afc284b7608e29ab1e92c75513f69.nii.gz/MD5E-s25362403--d72afc284b7608e29ab1e92c75513f69.nii.gz
-rw-rw-r-- 1 yoh yoh 1418 Mar  2 11:51 sub-01_task-rhymejudgment_events.tsv
> mkdir -p derivatives/checksums
> datalad run -m 'Compute md5sums for sub-01' --explicit --input 'sub-01/**' --output derivatives/checksums/sub-01_md5sums.txt -- bash -c 'find -L sub-01/ -type f | sort | xargs md5sum > derivatives/checksums/sub-01_md5sums.txt'
/home/yoh/proj/datalad/trash/.venv/lib/python3.13/site-packages/requests/__init__.py:113: RequestsDependencyWarning: urllib3 (2.6.3) or chardet (6.0.0.post1)/charset_normalizer (3.4.4) doesn't match a supported version!
  warnings.warn(
[INFO   ] Making sure inputs are available (this may take some time) 
[INFO   ] == Command start (output follows) ===== 
[INFO   ] == Command exit (modification check follows) ===== 
run(ok): /home/yoh/.tmp/dl-sparse-wt-XXWN8j1/ds000003/.worktrees/wt-sub-01 (dataset) [bash -c 'find -L sub-01/ -type f | sort ...]
add(ok): derivatives/checksums/sub-01_md5sums.txt (file)
save(ok): . (dataset)
> echo '=== Result ==='
=== Result ===
> cat derivatives/checksums/sub-01_md5sums.txt
0f8bc47f9c3047b340abfcd3ce1fb021  sub-01/anat/sub-01_inplaneT2.nii.gz
0d1e0a7ff7063250404f45a955a66203  sub-01/anat/sub-01_T1w.nii.gz
d72afc284b7608e29ab1e92c75513f69  sub-01/func/sub-01_task-rhymejudgment_bold.nii.gz
6c0f223452837e1878466eb935ddffc8  sub-01/func/sub-01_task-rhymejudgment_events.tsv
> echo '=== git log ==='
=== git log ===
> git log --oneline -5
4954b7b (HEAD -> wt-sub-01) [DATALAD RUNCMD] Compute md5sums for sub-01
c090537 (tag: 1.0.0, origin/master, origin/HEAD, master) [DATALAD] added content
d9bf260 [DATALAD] added content
9f7816c [DATALAD] added content
571d873 (tag: 57fed018cce88d000ac1757f, tag: 00001) [DATALAD] added content
> echo '=== Verify sparse-excluded files were NOT deleted ==='
=== Verify sparse-excluded files were NOT deleted ===
> git status
On branch wt-sub-01
You are in a sparse checkout with 13% of tracked files present.

nothing to commit, working tree clean
>> pwd
> echo 'DONE. Worktree at /home/yoh/.tmp/dl-sparse-wt-XXWN8j1/ds000003/.worktrees/wt-sub-01'
DONE. Worktree at /home/yoh/.tmp/dl-sparse-wt-XXWN8j1/ds000003/.worktrees/wt-sub-01
bash -x sparse-worktree-datalad-run.sh  15,81s user 5,22s system 100% cpu 20,911 total
❯ cd /home/yoh/.tmp/dl-sparse-wt-XXWN8j1/ds000003/.worktrees/wt-sub-01
CHANGES  README  dataset_description.json  derivatives/  participants.tsv  sub-01/  task-rhymejudgment_bold.json
❯ git annex list
here
|origin
||s3-PUBLIC
|||web
||||bittorrent
|||||
X____ derivatives/checksums/sub-01_md5sums.txt
X_X__ sub-01/anat/sub-01_T1w.nii.gz
X_X__ sub-01/anat/sub-01_inplaneT2.nii.gz
X_X__ sub-01/func/sub-01_task-rhymejudgment_bold.nii.gz
❯ git log derivatives/checksums/sub-01_md5sums.txt
commit 4954b7b3cc7e53ea596db12feae70e840cdd96c3 (HEAD -> wt-sub-01)
Author: Yaroslav Halchenko <debian@onerussian.com>
Date:   Mon Mar 2 11:51:44 2026 -0500

    [DATALAD RUNCMD] Compute md5sums for sub-01
    
    === Do not change lines below ===
    {
     "chain": [],
     "cmd": "bash -c 'find -L sub-01/ -type f | sort | xargs md5sum > derivatives/checksums/sub-01_md5sums.txt'",
     "exit": 0,
     "extra_inputs": [],
     "inputs": [
      "sub-01/**"
     ],
     "outputs": [
      "derivatives/checksums/sub-01_md5sums.txt"
     ],
     "pwd": "."
    }
    ^^^ Do not change lines above ^^^
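
The run record in that commit message sits between two marker lines, so it can be pulled out with standard tools for inspection. Below is a rough sketch using sed; `extract_run_record` is a hypothetical helper name, the sample message is an abridged copy of the one above, and `datalad rerun` remains the proper consumer of this record.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the JSON run record embedded in a
# [DATALAD RUNCMD] commit message read from stdin.
extract_run_record() {
    # Keep the lines between the two markers, then drop the markers themselves
    sed -n '/=== Do not change lines below ===/,/\^\^\^ Do not change lines above \^\^\^/p' |
        sed '1d;$d'
}

# Abridged copy of the commit message shown above, used as sample input
sample_message='[DATALAD RUNCMD] Compute md5sums for sub-01

=== Do not change lines below ===
{
 "chain": [],
 "exit": 0,
 "inputs": [
  "sub-01/**"
 ],
 "outputs": [
  "derivatives/checksums/sub-01_md5sums.txt"
 ],
 "pwd": "."
}
^^^ Do not change lines above ^^^'

record=$(printf '%s\n' "$sample_message" | extract_run_record)
printf '%s\n' "$record"
```

Against a real repository the input would come from something like `git log -1 --format=%B 4954b7b | extract_run_record`, since `--format=%B` prints the raw message without the indentation that plain `git log` adds.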

Next steps

I am not even sure we need anything in git-annex or datalad to support this here -- maybe it just needs to add a worktree checkout? But in the longer run we had better support worktrees explicitly in datalad for various scenarios, e.g.
