Formalize YODA principles #3
Open
Description
YODA has also been proposed as a standard/best practice for ReproNim (ReproNim/repronim.org#206).
IMO, YODA should clearly separate the principles from the suggestions, and should be fully decoupled from DataLad.
"Standards speak" would need to be expounded and explained to make sense to the unfamiliar, but this is what I have in mind for the formal bit. wdyt @yarikoptic?
YODA IDEALS
- "YODA compliant datasets" contain well-defined, portable computational environments to compute analysis results.
- "YODA compliant datasets" preserve provenance of the computational procedures that produce or alter derivative data.
- "YODA compliant datasets" strive for reproducibility.
YODA PRINCIPLES:
- All assets essential to replicate computational execution MUST be included
- All assets essential to replicate computational execution MUST be version controlled
- All assets essential to replicate computational execution SHOULD be version controlled using the same version control system
- All assets essential to replicate computational execution MAY be linked (subdataset) or included directly in the dataset
- Provenance of all modifications to the assets MUST be annotated
- Dataset structure SHOULD accommodate domain standards
- Assets SHOULD be organized in a modular structure
YODA ASSETS:
(This part could probably be left out of the formal section and discussed in the detailed explanation)
MUST:
- input data
- custom analysis code/scripts (upstream or custom code)
- computational environments (e.g. as container images)
- Documentation
SHOULD:
- Test scripts
- Automation
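The MUST/SHOULD split above lends itself to a mechanical check. The following is a minimal sketch only, not part of any YODA or DataLad tooling: the function name `check_assets` and every path in the two tables (`inputs`, `code`, `envs`, `README.md`, `tests`, `Makefile`, and so on) are hypothetical placeholders, since the proposal leaves the actual layout to domain standards.

```python
from pathlib import Path

# Hypothetical mapping from asset category to paths that would satisfy it.
# None of these names are prescribed by YODA; adapt them to the dataset.
REQUIRED = {
    "input data": ["inputs"],
    "analysis code": ["code"],
    "computational environments": ["envs", "containers"],
    "documentation": ["README.md", "README"],
}
RECOMMENDED = {
    "test scripts": ["tests"],
    "automation": ["Makefile"],
}


def check_assets(root):
    """Report asset categories with no matching path under `root`.

    Missing MUST assets are returned as errors, missing SHOULD assets
    as warnings. Both lists are sorted by category name.
    """
    root = Path(root)

    def missing(spec):
        return sorted(
            name
            for name, candidates in spec.items()
            if not any((root / candidate).exists() for candidate in candidates)
        )

    return {"errors": missing(REQUIRED), "warnings": missing(RECOMMENDED)}
```

A dataset would count as compliant on this axis when `check_assets(path)["errors"]` is empty; warnings are informational, matching the SHOULD wording.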
NOTES
Original Organigram: https://f1000research.com/posters/7-1965
Top level
Track all input data, code, and computational environments needed to produce analysis outputs in
version controlled datasets — and reproducibility you will achieve!
Learn control you must.
Size matters not!
- Subdataset references in a dataset are
extremely lightweight yet guarantee data identity via cryptographic hashes. Subdatasets can be
detached without losing this information, yielding massively improved storage efficiency and
reduced archive costs.
- Publicly shared data compliant with a common standard are an optimal element in a modular study
setup. From mid-2018 OpenNeuro (previously OpenFMRI) will offer DataLad datasets for direct
download.
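To illustrate why a hash-based reference can be tiny yet still pin content exactly, here is a small sketch of how Git derives an object id for file content (a "blob"): SHA-1 over a short header plus the bytes. This is only an illustration of content addressing. A subdataset reference records a commit id rather than a blob id, and the helper name `git_blob_id` is made up for this example.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the id Git assigns to a blob holding `content`.

    Git hashes the header b"blob <size>\\0" followed by the raw bytes,
    so the 40-character id depends only on the content itself.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def matches_reference(content: bytes, reference: str) -> bool:
    """Check retrieved content against a previously recorded id."""
    return git_blob_id(content) == reference
```

The recorded id stays 40 characters regardless of file size, and `matches_reference` can confirm identity after the content has been dropped and re-obtained from elsewhere.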
Principles
*P1* Use well-defined, portable computational environments to compute analysis results
*P2* Exhaustively track ALL analysis inputs in the same version control system
as the computed results, including:
- input data
- custom analysis code/scripts
- required computational environments (e.g. as container images)
*P3* Structure study elements (data, code, environments) in modular
components to facilitate reuse within or outside the context of the
original study
Dataset Layout
Dataset structure is fully flexible to be able to accommodate domain standards (e.g. BIDS). Element
location/name can be discovered from configuration.
Required (3rd-party) code repositories can be referenced as subdatasets just like datasets with data
files. The repository state is an unambiguous version record.
Images of containerized computational environments are tracked in version control just like any
other data file. Actual storage can be local or in the cloud.
Any input data is referenced via the dataset that contains it. Dataset state provides an unambiguous
version specification for any data dependency.
DataLad can obtain required subdataset content on demand. Only content elements actually required
for an analysis are present. The directory structure is expanded recursively as needed.
Test scripts can be used to check analysis code, verify data integrity, and assess computational
reproducibility.
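As one possible shape for such a test script, the sketch below verifies data integrity against a checksum manifest. The manifest format (a mapping from relative path to SHA-256 hex digest) and the names `sha256_of` and `verify_manifest` are assumptions made for this example; they are not defined by YODA or DataLad.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large inputs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(root, manifest):
    """Compare files under `root` with the expected digests.

    `manifest` maps relative paths to SHA-256 hex digests (hypothetical
    format). Returns a list of (path, problem) tuples, empty if all match.
    """
    root = Path(root)
    problems = []
    for rel_path, expected in sorted(manifest.items()):
        target = root / rel_path
        if not target.is_file():
            problems.append((rel_path, "missing"))
        elif sha256_of(target) != expected:
            problems.append((rel_path, "checksum mismatch"))
    return problems
```

A test runner could simply assert that `verify_manifest(dataset_root, manifest)` returns an empty list before and after an analysis run.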
DataLad Handbook
https://handbook.datalad.org/en/latest/basics/101-127-yoda.html
Principles
P1: One thing, one dataset
P2: Record where you got it from, and where it is now
P3: Record what you did to it, and with what