Summary
Currently BABS always zips outputs. For my workflows, raw BIDS-derivative output would be preferable because it enables cleaner integration into BIDS-study layouts with better provenance tracking.
In particular, zipping works fine for individual usage where some manual intervention is acceptable, but for larger automation workflows with many datasets, it would be ideal to write the outputs directly to their final location.
BIDS-study output structure goal
See BIDS common principles and PR #1741 for background on study-level dataset organization.
my-study/
  sourcedata/raw/                # BIDS raw dataset (subdataset)
  code/
    containers/                  # containers should be considered part of code
  derivatives/
    mriqc/                       # MRIQC outputs (subdataset)
      sourcedata/raw/            # reference to input (YODA)
      code/                      # processing scripts (provenance)
      sub-01/
      sub-02/
      dataset_description.json   # single dataset_description instead of 1 from each subject
    fmriprep/                    # fMRIPrep outputs (subdataset)
      sourcedata/raw/
      code/
      sub-01/
      sub-02/
      dataset_description.json
With raw outputs (no zip), after babs merge you could clone the output_ria directly into the BIDS-study:
datalad clone ria+file://output_ria#~data derivatives/mriqc
This gets very close to the desired structure - the clone is the final derivative, ready to use, with full provenance intact.
(Note: BABS's default input path is inputs/data, but this is overridable via path_in_babs config to use sourcedata/raw for BIDS-study compliance.)
Problems with current zip-only approach
- Provenance obscured: Unzipping with datalad run records "unzip" as provenance rather than the original BIDS app command. The true provenance isn't lost, but it's obscured.
- Duplicate large files: After unzipping, git-annex has both the zips and the raw files. These can be dropped, but it's an extra step that is easily forgotten. Minor but potentially wasteful.
- Unzip conflicts: Each per-subject zip contains shared files like dataset_description.json and .bidsignore. When unzipping, these conflict. They're probably identical (same container version = same output), but with raw outputs stored in git (not zips), merge would catch any surprises.
For raw outputs to work
For per-subject branches to merge cleanly, JSON and TSV files should be stored in git (not annex) - either via text2git config or .gitattributes. This allows git's octopus merge to handle identical files, and surface conflicts if they unexpectedly differ.
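As a rough sketch of the .gitattributes route, the rules below tell git-annex to treat JSON and TSV files (and .bidsignore) as ordinary git files. The throwaway repository and the exact pattern list are assumptions for illustration only; in practice the same lines would go into the .gitattributes of the dataset that receives the outputs.

```shell
set -eu

# Throwaway repository, only so the rules can be checked in isolation.
repo=$(mktemp -d)
cd "$repo"
git init -q .

# annex.largefiles=nothing means "never annex files matching this pattern",
# so they are committed to git directly and can take part in a normal merge.
cat >> .gitattributes <<'EOF'
*.json annex.largefiles=nothing
*.tsv annex.largefiles=nothing
.bidsignore annex.largefiles=nothing
EOF

# Show which rule applies to a shared file and to a file in a subject folder.
git check-attr annex.largefiles -- dataset_description.json sub-01/sub-01_scans.tsv
```

Patterns without a slash match at any depth, so per-subject sidecars are covered as well as the top-level dataset_description.json.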
Suggested config option
Ideally the user could choose:
- Raw outputs committed (no zip)
- Zipped outputs (current behavior)
Implementation approaches
Least invasive: Modify existing script generation template to optionally include the zip step. The generated *_zip.sh script already handles both execution and zipping - it could conditionally skip the zip.
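A minimal sketch of what the conditional zip step might look like inside the generated script. The function name, its arguments, the true/false flag, and the use of 7z for zipping are all assumptions for illustration, not BABS's actual template.

```shell
# Hypothetical helper: decide what to hand to the commit step.
#   $1 = subject ID, $2 = output directory, $3 = "true" to zip, anything else to keep raw
finalize_outputs() {
    subid=$1
    outdir=$2
    zip_outputs=$3
    if [ "$zip_outputs" = "true" ]; then
        # Current behavior (assumed): archive the directory, then remove it.
        7z a "${subid}_${outdir}.zip" "$outdir" > /dev/null
        rm -rf "$outdir"
        echo "${subid}_${outdir}.zip"
    else
        # Proposed behavior: leave the raw derivative tree in place.
        echo "$outdir"
    fi
}

# With zipping disabled, the path to save is the output directory itself.
finalize_outputs sub-01 mriqc false
```

The returned path is what the script would then pass on to its save/commit step, so the rest of the template would not need to know which mode is active.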
With containers-run (see #328): Separating containers-run from zip enables two explicit commits - one for the BIDS app, one for zipping. This gives cleaner provenance and makes the zip step trivially optional, but is a larger change.