Description
We've discussed this before in person, but I wanted to make a test case to demonstrate why the operational model this test pipeline suggests isn't viable in most of our pipelines.
So far
This repo contains a range of Stages and Jobs
- each Stage has a corresponding Job file with logic in it
- each block of job logic generates a Hail Batch Job containing some bash code
This is fine if all a pipeline does is run third-party tools - the commands are baked into the Job, the Job runs in an image with the tool installed, and Hail mediates the input/output. That will be true of some of our pipelines.
Issue
Most of our pipelines contain custom logic in both the stages and the jobs. This is crucial: we want to apply that logic to data as it is generated within the pipeline, not in the Driver job alone (e.g. custom code to parse a VCF that hasn't been generated until the pipeline is halfway complete).
This isn't unique to us by any means; GATK-SV is an example of another pipeline we dabble in which ships a whole directory of scripts baked into the pipeline docker images, so the command issued in each stage is just an instruction to run a script that already exists in the container (example).
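As a rough illustration of that pattern, the sketch below simulates an image with a script baked in at build time; the stage then only has to emit a command string pointing at the script's in-container path, and no code travels with the command (the script name and path here are made up for illustration, not taken from GATK-SV):

```python
import os
import subprocess
import sys
import tempfile

# Simulate an image with a script baked in at build time
# (script name and contents are hypothetical).
image_root = tempfile.mkdtemp()
script_path = os.path.join(image_root, "parse_vcf.py")
with open(script_path, "w") as f:
    f.write("import sys; print(f'parsing {sys.argv[1]}')\n")

# The stage only emits a command string; the code it runs
# already exists inside the "container".
result = subprocess.run(
    [sys.executable, script_path, "sample.vcf.gz"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())  # → parsing sample.vcf.gz
```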
Example workflow
https://batch.hail.populationgenomics.org.au/batches/590698
Code executed from this PR #11
This is a trivially simple PythonJob - open a text file and read the contents - and it's impossible with this model. The code to be executed is present only in the driver image (via a repository checkout) and is absent from the child jobs, so the required import fails inside the Job and the Job fails.
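The failure mode can be reproduced with plain `pickle` as a plausible analogue of what the PythonJob does (an assumption about Hail's serialization, which I believe behaves similarly for functions imported from a module): a module-level function is serialized by reference (module name plus qualified name), not by value, so deserializing it in a container that lacks the repo checkout fails at import time. The module name `repo_logic` below is hypothetical:

```python
import importlib
import os
import pickle
import sys
import tempfile

# "Driver container": a repo checkout provides repo_logic.py (hypothetical name).
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "repo_logic.py"), "w") as f:
    f.write("def read_file(path):\n    return open(path).read()\n")

sys.path.insert(0, workdir)
read_file = importlib.import_module("repo_logic").read_file

# pickle records only a reference (module + qualname), not the function body.
payload = pickle.dumps(read_file)

# "Child job container": no repo checkout, so the module is unimportable.
os.remove(os.path.join(workdir, "repo_logic.py"))
sys.path.remove(workdir)
del sys.modules["repo_logic"]

failure = ""
try:
    pickle.loads(payload)  # roughly what the job does before calling the function
except ModuleNotFoundError as exc:
    failure = str(exc)
print(f"job-side failure: {failure}")  # → job-side failure: No module named 'repo_logic'
```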
So...
I'm not really sure what this is supposed to accomplish. We've spoken in abstract terms about the implications of requiring code in both the driver and job containers, and I figured a specific example would lead to a more productive discussion.
n.b.
I don't think cpg-flow needs to be changed. As it exists in v0.1.2, cpg-flow has all the functionality of cpg_workflows, so long as it's used correctly. My quibble is with this being the suggested usage pattern for cpg-flow (cloning a pipeline repository into a cpg-flow container at runtime).