Replies: 5 comments
-
|
@akhanf, @jordandekraker, @zihuaihuai. I think this solution should be quite workable, please let me know what you think. Moving into Christmas and the new year, I'll have quite a bit more time, so I think finishing this is quite feasible. |
Beta Was this translation helpful? Give feedback.
-
|
@Dhananjhay, would appreciate any input you have as well! |
Beta Was this translation helpful? Give feedback.
-
|
Looks interesting! Alot to digest and haven't given it the most thorough read yet, but seems like a promising avenue.. |
Beta Was this translation helpful? Give feedback.
-
|
Looks exciting! Will give it a read over the break, and get back to you soon. |
Beta Was this translation helpful? Give feedback.
-
|
Thanks all. I'm planning to start implementation next week, so please get back with any major objections before then. |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
Summary
The proposal addresses Snakebids' inability to index and process files with missing or variably ordered entities, a limitation caused by its reliance on fixed templates. It utilizes Snakemake's input functions and constrained wildcard templates to allow potentially missing entities, enabling flexible handling of heterogeneous filenames. Optional entities are paired with dummy wildcards that contain the tag. For users, the main effect is the appearance of these dummy wildcards in dry-run output, but workflows gain the ability to support datasets with inconsistent or partially specified entities without requiring significant migration.
Theory
The proposed solution replaces the coherent template
BidsComponent.pathwith an input function. The initialBidsComponentwould retain thezip_listtable of wildcard values, but instead of a unified template (.path), it would save the paths for each file in raw form. Missing entities would be encoded in thezip_listasNone. The initial snakemakerulewould collect itsinputfrom thisBidsComponentvia a method. Itsoutputwould be a new template standardized bybids()that supports potentially missing entities.Such a template is possible using snakemake's
wildcard_constraints, as in the following example:r"sub-{subject}/ses-{session}/sub-{subject}_ses-{session}{_acq_,(?:_acq\-)?}{acq,(?:(?<=_acq\-)[^_]+)?}{_dir_,(?:_dir\-)?}{dir,(?:(?<=_dir\-)[^_]+)?}_suffix.nii.gz"This ghastly form simplified by ignoring the
,constraintportion of each{wildcard}:r"sub-{subject}/ses-{session}/sub-{subject}_ses-{session}{_acq_}{acq}{_dir_}{dir}_suffix.nii.gz"Clearly, both
acqanddirare optional wildcards.{_acq_}is intended to match the tag portion of the path (_acq-) while{acq}matches its value. This intent is enforced in snakemake via the constraints:This can be tested using the
regex_from_filepatternutility function used in snakemake to processoutputtemplates:Importantly, these constrained patterns can be used in
input, which ignores the constraints, as well asoutput. Furthermore, all outputs accessed later byrules.RULE_NAME.output.OUTPUTwill have been converted into ordinary strings with no constraint syntax, so an ordinary Python.format()will work without issue.Although snakemake's constraint syntax was not likely intended to allow such optional components, I do not believe this approach to be an unstable "hack". Wildcard restraints have been a long-standing feature and have used python regex expression since the beginning. The above constraints use perfectly valid regex.
In practice, this results in extra, dummy wildcards containing the tag portion of the path when present. By convention, these wildcards are named the same as the entity, wrapped in single underscores (e.g.
_acq_). Within the workflow, I do not anticipate this causing any issues. The main consequence to users will be the appearance of these dummy wildcards in the dry run summary, but I think this is a tolerable side effect. However, snakebids will need to meet three new requirements:Requirements
Implementation
Template-free components
Currently, all
BidsComponents are initialized with a template stored inBidsComponent.path, the format of which is not compatible with optional entities. Three different situations must be handled.Heterogeneous components
These are components with entries of differing roots, entities in differing orders, or additional path elements unaccounted for by pybids (e.g. extra suffixes). In principle, with sufficiently expansive wildcard selection, an entire dataset could be fit within a single such component.
Potential handling methods are as follows:
In any case where a heterogeneous component is indexed (options 2 or 3), no coherent template would exist for
BidsComponent.path, so on invocation, it must error.Coherent template with optional entities
Coherent means all entries have the same root, the same directory structures, and a set of entities in the same order. Optional entities are absent from the path.
In this case,
BidsComponent.pathcould:Return a template with constrained wildcards as shown above.
Advantages:
Disadvantages:
str.format().BidsComponent.pathwould fail after indexing heterogeneous inputs.Raise an error. This would ensure applications can handle any potential dataset. Practically, this would deprecate
BidsComponent.path.Coherent templates without optional entities
These are the components currently supported by snakebids. This will obviously continue to be handled, however, if
BidsComponent.pathis essentially deprecated per above, it should eventually be removed entirely even for this standard case.Retrieving paths
Template-free components require a means of retrieving paths. Replacing
BidsComponent.path, a method (working name.get) shall be devised as follows:This method could be used as an input function for snakemake rules in lieu of a template, seamlessly handling all the above cases. As specified in I, entities set to the blank string would be interpreted as absent, potentially only if accompanied by a corresponding dummy entity (e.g.
_acq_) also set to the blank string.Template generation
Requirement II will be satisfied by endowing the
bidsfunction with a convention to indicate optional wildcards. Note that, in its current state,bids()hard-codes all_tags-, whereas optional entities require a dummy wildcard for the tag. A few options are available:A keyword-only argument (e.g.
optional_entities) that takes a list of entities to be made optional.BidsComponent.wildcardsmust include this key with the associated list in lieu of"key": "{value}"pairings for each optional entity. For instance, if"subject"and"session"were mandatory and"acq"optional:bids()would format these optional entities into the path using the appropriately constrained wildcard entries.This breaks the pure meaning of
BidsComponent.wildcards. A modification would move"optional_entities"to a new property:BidsComponent.optional_entities, but this would inconveniently need to be specified manually in every call tobids().It also adds an additional magic keyword argument to
bids(), something that we have been trying to avoid.A sentinel object indicating an optional entity (e.g.
OPTIONAL_WILDCARD). When given tobids()as an entity value, instead of being formatted as a string, it would cause the entity to be inserted using constrained wildcard syntax. Thewildcardsdict would be as follows:This also breaks the
dict[str, str]type of.wildcards, but is less invasive than the previous option and retains the nominal meaning of the keys.As a variation,
OPTIONAL_WILDCARDcould be replaced with an existing Python object, such as.... This would simplify the implementation but obscure the meaning.Generic expansion
For requirement III, the
BidsComponent.expand()implementation will be modified to correctly detect and handle both dummy wildcards and wildcards with snakemake constraints.Templates would be sanitized of any constraints (wrapped in braces and preceded by a comma:
{acq,restraint}.)When formatting a path with an absent entity (e.g.
"acq": None),.expand()will expect to find a corresponding dummy wildcard ({_acq_}), erroring if otherwise.Additional entities may not be dummy-formatted (i.e. wrapped in single underscores, e.g.
_acq_). For instance,component.expand(TEMPLATE, _acq_="foo")will raise an error, regardless whether"{_acq_}{acq}"appears in the template.Allowance for dummy wildcards
It is not uncommon for workflows to call
BidsComponent.filter()within an input function using thewildcardsargument. As this argument will potentially include dummy wildcards, allBidsComponentmethods must be able to handle them. As with.get(), the solution will generally be to ignore dummy wildcards and intepret wildcards set to the blank string as absent.I do not necessarily propose we modify the
filter_listfunction, as its usage should generally give way to the object-oriented.filter()method.Migration
Amazingly, existing workflows using the most recent snakebids API could begin supporting generic datasets with minimal to no migration. Supporting truly heterogeneous datasets would require using
BidsComponent.getinstead ofBidsComponent.path. Such a switch may be generally required ifBidsComponent.pathis deprecated. Otherwise, workflows correctly usingBidsComponent.wildcardswithinbids()calls would immediately get access to generic templates.BidsComponent.expand()would begin handling these templates without any user change required.Note that workflows using the
expand()function from Snakemake would not get this benefit, as this function will not support generic templates.Rejected Alternative
Previous issue #470 attempted to correct missing entities by adding a generic label. I have found this approach to be relatively unfeasible compared to the above.
It can essentially be conceptualized as entity remapping: paths with an entity initially set to
Noneare remapped to a given value. This operation could be performed with an input function such asBidsComponent.get. The problem is that the function would need knowledge of the remapping. I would be very hesitant to further complexify the internal state ofBidsComponentby saving knowledge of remapping operations. This precludes any automatic remapping withingenerate_inputs. Potentially a wrapper function could be devised that translates the remapped wildcards into the input values before callingBidsComponent.get. But this would boilerplate code in workflows to be supported.Thus, this solution would require fairly significant migration before being useful. It is certainly possible, and it is not mutually exclusive with the generic template solution, but I do not think it should be a priority at this time.
Beta Was this translation helpful? Give feedback.
All reactions