Feature Request: Make zipping optional #327

@asmacdo

Summary

Currently, BABS always zips outputs. For my workflows, raw BIDS-derivative output would be preferable because it integrates more cleanly into BIDS-study layouts and gives better provenance tracking.

In particular, zipping works fine for individual use where some manual intervention is acceptable, but for larger automated workflows spanning many datasets, it would be ideal to write the outputs directly to their final location.

BIDS-study output structure goal

See BIDS common principles and PR #1741 for background on study-level dataset organization.

my-study/
  sourcedata/raw/               # BIDS raw dataset (subdataset)
  code/
    containers/                 # containers should be considered part of code
  derivatives/
    mriqc/                      # MRIQC outputs (subdataset)
      sourcedata/raw/           # reference to input (YODA)
      code/                     # processing scripts (provenance)
      sub-01/
      sub-02/
      dataset_description.json  # single dataset_description instead of 1 from each subject
    fmriprep/                   # fMRIPrep outputs (subdataset)
      sourcedata/raw/
      code/
      sub-01/
      sub-02/
      dataset_description.json

With raw outputs (no zip), after babs merge you could clone the output_ria directly into the BIDS-study:

datalad clone ria+file://output_ria#~data derivatives/mriqc

This gets very close to the desired structure - the clone is the final derivative, ready to use, with full provenance intact.

(Note: BABS's default input path is inputs/data, but this is overridable via path_in_babs config to use sourcedata/raw for BIDS-study compliance.)

Problems with current zip-only approach

  • Provenance obscured: Unzipping with datalad run records "unzip" as provenance rather than the original BIDS app command. The true provenance isn't lost, but it's obscured.
  • Duplicate large files: After unzipping, git-annex holds both the zips and the raw files. The zips can be dropped, but that is an extra step that is easily forgotten. Minor, but potentially wasteful.
  • Unzip conflicts: Each per-subject zip contains shared files like dataset_description.json and .bidsignore. When several zips are unzipped into the same directory, these files collide. They're probably identical (same container version = same output), but with raw outputs stored in git (not zips), the merge would catch any surprises.

For raw outputs to work

For per-subject branches to merge cleanly, JSON and TSV files should be stored in git (not annex) - either via text2git config or .gitattributes. This allows git's octopus merge to handle identical files, and surface conflicts if they unexpectedly differ.
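A minimal sketch of the .gitattributes route, assuming it is run from the root of the output dataset. `annex.largefiles=nothing` is the standard git-annex attribute for keeping matching files in plain git; the choice of exactly these two patterns is my assumption about what BABS would need, not something BABS does today.

```shell
# Run from the root of the output dataset (location is an assumption).
# annex.largefiles=nothing tells git-annex to leave matching files in git,
# so identical copies on per-subject branches merge without conflict.
cat >> .gitattributes <<'EOF'
*.json annex.largefiles=nothing
*.tsv annex.largefiles=nothing
EOF
```

The file would then need to be committed (for example with `datalad save`) before any job branches are created, so that every per-subject branch inherits the same rules.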

Suggested config option

Ideally the user could choose:

  • Raw outputs committed (no zip)
  • Zipped outputs (current behavior)

Implementation approaches

Least invasive: Modify the existing script-generation template so that the zip step is optional. The generated *_zip.sh script already handles both execution and zipping, so it could conditionally skip the zip.
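To illustrate the least invasive approach, here is a rough sketch of what the conditional could look like inside a generated script. Everything named here is hypothetical: `ZIP_OUTPUTS` and `finalize_outputs` do not exist in BABS, and I am assuming `7z` is the archiver available on PATH.

```shell
# Hypothetical switch; BABS has no such option today.
ZIP_OUTPUTS="${ZIP_OUTPUTS:-true}"

# finalize_outputs OUTPUT_DIR ZIP_NAME
# When ZIP_OUTPUTS is "true", archive OUTPUT_DIR into ZIP_NAME and remove
# the raw directory. Otherwise leave the raw directory untouched so it can
# be committed as-is.
finalize_outputs() {
    output_dir="$1"
    zip_name="$2"
    if [ "$ZIP_OUTPUTS" = "true" ]; then
        # Assumes 7z is on PATH.
        7z a "$zip_name" "$output_dir" && rm -rf "$output_dir"
    else
        echo "Skipping zip; keeping raw outputs in $output_dir"
    fi
}
```

With `ZIP_OUTPUTS=false`, a call such as `finalize_outputs mriqc sub-01_mriqc.zip` (illustrative names) would leave the `mriqc` directory in place for the subsequent commit. The default of `true` keeps current behavior for existing users.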

With containers-run (see #328): Separating containers-run from zip enables two explicit commits - one for the BIDS app, one for zipping. This gives cleaner provenance and makes the zip step trivially optional, but is a larger change.

Labels: enhancement (New feature or request)
