Description
We've discussed this before in person, but I wanted to make a test case to demonstrate why the operational model this test pipeline suggests isn't viable in most of our pipelines.
So far
This repo contains a range of Stages and Jobs
- each Stage has a corresponding Job file with logic in it
- each block of job logic generates a Hail Batch Job containing some bash code
This is fine if all a pipeline does is run third-party tools - the commands are baked into the Job, the Job runs in an image with the tool installed, and Hail mediates the input/output. That will be true of some of our pipelines.
Issue
Most of our pipelines contain custom logic in both the stages and the jobs. This is crucial: we want to apply that logic to data as it is generated within the pipeline, not in the Driver job alone (e.g. custom code to parse a VCF that hasn't been generated until the pipeline is halfway complete).
This isn't unique to us by any means; GATK-SV is an example of another pipeline we dabble in which ships a whole directory of scripts baked into the pipeline docker images, so the command issued in each stage is just an instruction to run a script that already exists in the container (example).
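As a rough illustration of that pattern, the sketch below simulates an image with a script baked in at build time; the stage then only has to emit a command string pointing at the script's in-container path, and no code travels with the command (the script name and path here are made up for illustration, not taken from GATK-SV):

```python
import os
import subprocess
import sys
import tempfile

# Simulate an image with a script baked in at build time
# (script name and contents are hypothetical).
image_root = tempfile.mkdtemp()
script_path = os.path.join(image_root, "parse_vcf.py")
with open(script_path, "w") as f:
    f.write("import sys; print(f'parsing {sys.argv[1]}')\n")

# The stage only emits a command string; the code it runs
# already exists inside the "container".
result = subprocess.run(
    [sys.executable, script_path, "sample.vcf.gz"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())  # → parsing sample.vcf.gz
```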
Example workflow
https://batch.hail.populationgenomics.org.au/batches/590698
Code executed from this PR #11
This is a trivially simple PythonJob - open a text file and read the contents - and it's impossible with this model. The code to be executed is present only in the driver image (via a repository checkout) and is absent from the child jobs, so the required import fails inside the Job and the Job fails.
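The failure mode can be reproduced with plain `pickle` as a plausible analogue of what the PythonJob does (an assumption about Hail's serialization, which I believe behaves similarly for functions imported from a module): a module-level function is serialized by reference (module name plus qualified name), not by value, so deserializing it in a container that lacks the repo checkout fails at import time. The module name `repo_logic` below is hypothetical:

```python
import importlib
import os
import pickle
import sys
import tempfile

# "Driver container": a repo checkout provides repo_logic.py (hypothetical name).
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "repo_logic.py"), "w") as f:
    f.write("def read_file(path):\n    return open(path).read()\n")

sys.path.insert(0, workdir)
read_file = importlib.import_module("repo_logic").read_file

# pickle records only a reference (module + qualname), not the function body.
payload = pickle.dumps(read_file)

# "Child job container": no repo checkout, so the module is unimportable.
os.remove(os.path.join(workdir, "repo_logic.py"))
sys.path.remove(workdir)
del sys.modules["repo_logic"]

failure = ""
try:
    pickle.loads(payload)  # roughly what the job does before calling the function
except ModuleNotFoundError as exc:
    failure = str(exc)
print(f"job-side failure: {failure}")  # → job-side failure: No module named 'repo_logic'
```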
So...
I'm not really sure what this is supposed to accomplish. We've spoken in abstract terms about the implications of requiring code in both the driver and job containers, and I figured a specific example would lead to a more productive discussion.
n.b.
I don't think cpg-flow needs to be changed. As it exists in v0.1.2, cpg-flow has all the functionality of cpg_workflows, so long as it's used correctly. My quibble is with this being the suggested usage pattern for cpg-flow (cloning a pipeline repository into a cpg-flow container at runtime).