Practical Strategies for YODA in the Wild #2
Description
One incredible benefit of YODA (or annex-enabled repos in general) is that the resulting repo is very lightweight: the overhead of importing an entire repo as a submodule (code, inputs, outputs, environments, and all) is quite small, as long as one doesn't also fetch all the annexed file contents.
However, for large codebases or involved analyses, the git repo itself might get quite hefty. In that case, it can be advisable to save data outputs to a submodule of their own: a 'raw'-style repo that just annexes data files, with no processing or output layers. These 'raw' repos are then maximally lightweight (just a git tree of symlinks).
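The split can be sketched with plain git plumbing (repo names like `raw-data` and `analysis` are made up for illustration; in practice you would layer `git annex init` / `datalad create` on top so the data files become annexed symlinks rather than regular files):

```shell
set -eu
tmp=$(mktemp -d); cd "$tmp"

# 1. A 'raw' repo that only ever holds data files.
#    With git-annex, you would follow `git init` with `git annex init`
#    (or do both at once with `datalad create`), so that large files are
#    committed as tiny symlinks and the git tree stays lightweight.
git init -q raw-data
( cd raw-data
  echo "placeholder for annexed content" > sample.dat
  git add sample.dat
  git -c user.email=a@b -c user.name=demo commit -qm "add raw data"
)

# 2. The analysis repo imports only the raw repo's git history, not the
#    file contents - those are fetched on demand with `git annex get`.
git init -q analysis
( cd analysis
  git -c protocol.file.allow=always submodule --quiet add "$tmp/raw-data" inputs/raw
  git -c user.email=a@b -c user.name=demo commit -qm "link raw dataset as submodule"
)
```

The `protocol.file.allow=always` override is only needed because the submodule here is a local path; with a hosted URL it is unnecessary.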
Granted, taking this to the extreme would leave lots of tiny repos floating around, so I've considered solutions like grouping a bunch of 'raw' repos into one repo as branches. This, of course, introduces questions of maintaining provenance and readability (several branches to check out just to navigate the files).
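The branch-grouping idea amounts to pushing each independent raw repo's history to its own branch of one shared repo. A minimal sketch (the names `raws.git`, `proj-a`, `proj-b` are hypothetical):

```shell
set -eu
tmp=$(mktemp -d); cd "$tmp"

# A bare repo that aggregates several independent 'raw' repos as branches.
git init -q --bare raws.git

# Two unrelated raw repos, each pushed to its own branch of the aggregate.
for name in proj-a proj-b; do
  git init -q -b main "$name"
  ( cd "$name"
    echo "data for $name" > sample.dat
    git add sample.dat
    git -c user.email=a@b -c user.name=demo commit -qm "raw data for $name"
    # Unrelated histories coexist fine as long as each gets its own ref.
    git push -q ../raws.git main:refs/heads/"$name"
  )
done

# The aggregate now carries one branch per raw dataset.
git --git-dir=raws.git branch
```

Navigating the files then means checking out the right branch, which is exactly the readability cost mentioned above.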
Has anyone run into the issue where large datasets become a pain to import into subsequent stages of the analysis? This happens when the git repo itself grows quite large (hundreds of commits or thousands of files). How do you store your repos and file contents, and what conventions would you suggest for storing raw repos so you don't get too inundated with them?
For me, I use GitHub for repos, with Dropbox (via the rclone special remote) and a research server (a bare annex-enabled repo) for file content storage. I've considered storing raw repos on GitHub under dot-prefixed names: if the dataset is at Collab/Proj-Phase.git, the raw repo(s) for that set live at Collab/.Proj-Phase.git, with multiple raws saved as branches within that repo. I also quite like git-annex's new git-remote-annex, which stores a repo's git guts in a special remote with push/pull support; it also allows storing an arbitrary number of separate repos in the same place (alongside the content itself). Note that it doesn't yet work with rclone.
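A minimal git-remote-annex sketch, assuming git-annex >= 10.20240430 is installed (the block skips itself otherwise). A `type=directory` special remote stands in for a real remote, since git-remote-annex does not yet support rclone; the remote name `store` is made up:

```shell
set -eu
if command -v git-annex >/dev/null 2>&1; then
  tmp=$(mktemp -d); cd "$tmp"; mkdir store
  git init -q -b main proj && cd proj
  git -c user.email=a@b -c user.name=demo commit -q --allow-empty -m init
  git annex init

  # A special remote that will hold the repo's git history
  # alongside any annexed content.
  git annex initremote store type=directory directory=../store encryption=none

  # Point the matching git remote at it via the annex:: URL scheme,
  # then push the git guts into the special remote.
  git config remote.store.url annex::
  git push -q store main git-annex
  status=pushed
else
  status=skipped   # git-annex not installed; the commands above are the sketch
fi
echo "$status"
```

Cloning back out works the same way in reverse (`git clone annex::...` with the remote's parameters), which is what makes it viable for parking many small raw repos in one place.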
For readability I am also considering exporttree annexes - especially to Dropbox - so the majority of file contents is stored under human-readable names. As for versioning file contents, I try to give different variants of a file separate names, and only update an annexed file in place when the old version should no longer be referenced.
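An exporttree remote mirrors the working tree under real filenames instead of the hashed key layout a normal special remote uses. A sketch, again with `type=directory` standing in for the rclone/Dropbox remote (block skips itself if git-annex is absent; `humanreadable` and `results.csv` are made-up names):

```shell
set -eu
if command -v git-annex >/dev/null 2>&1; then
  tmp=$(mktemp -d); cd "$tmp"; mkdir export
  git init -q -b main proj && cd proj
  git annex init
  echo "results v1" > results.csv
  git annex add results.csv
  git -c user.email=a@b -c user.name=demo commit -qm "add results"

  # exporttree=yes makes the remote hold a browsable copy of a tree,
  # with files under their real names rather than annex key paths.
  git annex initremote humanreadable type=directory directory=../export \
      exporttree=yes encryption=none
  git annex export main --to humanreadable

  ls ../export   # the exported tree, browsable by filename
  status=exported
else
  status=skipped
fi
```

Re-running `git annex export` after a new commit updates the remote to match the new tree, which pairs well with the name-each-variant versioning convention above.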
These are some of the ways I use YODA in the wild - whether writing my dissertation or performing analyses - so that provenance is maintained, duplication of stray files is minimized, and file contents remain generally discoverable during the aggregation phase of e.g. writing a report. I would love to hear more about others' strategies, in particular for when datasets get large, or for how to use YODA when splitting processing/analysis into subdatasets for e.g. parallel processing.