Merged

33 commits
696b193
`Machine` class for GCP
aednichols Oct 9, 2025
e2f5bbb
Test cases
aednichols Oct 9, 2025
83d50e5
Merge remote-tracking branch 'origin/develop' into aen_an_751
aednichols Oct 9, 2025
2b6b199
New validation
aednichols Oct 14, 2025
29a0441
Merge remote-tracking branch 'origin/develop' into aen_an_751
aednichols Oct 14, 2025
775b1ed
`scalafmtAll`
aednichols Oct 14, 2025
9153f00
Enhanced `toString`
aednichols Oct 15, 2025
2eca2f1
Enhance tests
aednichols Oct 15, 2025
192172a
Disable no-op Scaladoc generation
aednichols Oct 15, 2025
68ce96e
Enhance tests to check instance metadata
aednichols Oct 15, 2025
678e29d
Add GPU test
aednichols Oct 15, 2025
c95c551
Docs
aednichols Oct 15, 2025
a5af024
Remove Life Sciences references
aednichols Oct 15, 2025
9359832
Fix markdown syntax
aednichols Oct 15, 2025
11791ce
Maybe this fixes syntax?
aednichols Oct 15, 2025
a21c860
Changelog
aednichols Oct 15, 2025
dd6fa13
Further clean up `nvidiaDriverVersion`
aednichols Oct 15, 2025
19966c6
Extra explain `cpuPlatform`
aednichols Oct 15, 2025
721bd0c
Clarify comment
aednichols Oct 15, 2025
67dc13d
Rename: camelCase to match other attrs
aednichols Oct 15, 2025
aa8dbc4
Just Say No to stack traces
aednichols Oct 15, 2025
5f799c8
Test orderly failure for invalid type
aednichols Oct 15, 2025
c5e9025
`e2-medium` is cheapest sensible VM
aednichols Oct 16, 2025
750328b
Docs
aednichols Oct 16, 2025
112dedd
Boop tests by updating docs
aednichols Oct 16, 2025
dbe3b17
Merge branch 'develop' into aen_an_751
aednichols Oct 17, 2025
63b5f72
Fix RTD anchor
aednichols Oct 17, 2025
f9a948e
Fix RTD
aednichols Oct 17, 2025
3ede6e9
Merge remote-tracking branch 'origin/develop' into aen_an_751
aednichols Oct 17, 2025
ee131a4
Merge remote-tracking branch 'origin/aen_an_751' into aen_an_751
aednichols Oct 17, 2025
1029f9f
Fix Changelog
aednichols Oct 17, 2025
47fd950
Oh good gravy
aednichols Oct 17, 2025
8c82dcf
Moar doc
aednichols Oct 17, 2025
10 changes: 10 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,16 @@
* WDL 1.1 support is in progress. Users that would like to try out the current partial support can do so by using WDL version `development-1.1`. In Cromwell 92, `development-1.1` has been enhanced to include:
* Support for passthrough syntax for call inputs, e.g. `{ input: foo }` rather than `{ input: foo = foo }`.

### GPU changes on Google Cloud backend

#### Removed `nvidiaDriverVersion`

In GCP Batch, the `nvidiaDriverVersion` attribute is ignored. Now that Life Sciences has retired, the attribute is fully deprecated and can be removed from workflows.

#### Added `predefinedMachineType` (alpha)

The new `predefinedMachineType` attribute is introduced in experimental status. See [the attribute's docs](https://cromwell.readthedocs.io/en/develop/RuntimeAttributes/#predefinedmachinetype-alpha) for details.

### Database Migration
The index `IX_METADATA_ENTRY_WEU_CFQN_JSI_JRA_MK` is added to `METADATA_ENTRY`. In pre-release testing, the migration proceeded at about 3 million rows per minute. Please plan downtime accordingly.

@@ -0,0 +1,3 @@
{
"minimal_hello_world.machine_type": "e2-medium"
}
@@ -0,0 +1,18 @@
name: e2-medium
testFormat: workflowsuccess
backends: [GCPBATCH]

files {
workflow: gcp_machine_type.wdl
inputs: e2-medium.json
}

# `e2-medium` is the cheapest machine that works decently in Batch, costing 20% less
# than the next alternative. May be suitable for a variety of "I just need a VM" tasks.
# https://cloud.google.com/compute/docs/general-purpose-machines#sharedcore
metadata {
"calls.minimal_hello_world.hello_world.runtimeAttributes.predefinedMachineType": "e2-medium"
"calls.minimal_hello_world.hello_world.runtimeAttributes.preemptible": "0"
"outputs.minimal_hello_world.actual_machine_type": ~~"machineTypes/e2-medium"
"outputs.minimal_hello_world.is_preemptible": "FALSE"
}
@@ -0,0 +1,3 @@
{
"minimal_hello_world.machine_type": "banana"
}
@@ -0,0 +1,14 @@
name: gcp_machine_type
testFormat: workflowsuccess
backends: [GCPBATCH]

files {
workflow: gcp_machine_type.wdl
}

metadata {
"calls.minimal_hello_world.hello_world.runtimeAttributes.predefinedMachineType": "n2-standard-2"
"calls.minimal_hello_world.hello_world.runtimeAttributes.preemptible": "0"
"outputs.minimal_hello_world.actual_machine_type": ~~"machineTypes/n2-standard-2"
"outputs.minimal_hello_world.is_preemptible": "FALSE"
}
@@ -0,0 +1,61 @@
version 1.0

workflow minimal_hello_world {
input {
String image = "rockylinux/rockylinux:10"
String machine_type = "n2-standard-2"
Int preemptible = 0
String zones = "northamerica-northeast1-a northamerica-northeast1-b northamerica-northeast1-c"
}

call hello_world {
input:
image = image,
machine_type = machine_type,
preemptible = preemptible,
zones = zones
}

output {
String stdout = hello_world.stdout
String actual_machine_type = hello_world.actual_machine_type
String is_preemptible = hello_world.is_preemptible
}
}

task hello_world {

input {
String image
String machine_type
Int preemptible
String zones
}

# Check machine specs by querying instance metadata
# https://cloud.google.com/compute/docs/metadata/predefined-metadata-keys#instance-metadata
command <<<
cat /etc/os-release
uname -a
cat /proc/cpuinfo
curl --header "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/machine-type > actual_machine_type.txt
curl --header "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/scheduling/preemptible > is_preemptible.txt
>>>

runtime {
docker: image
predefinedMachineType: machine_type
preemptible: preemptible
zones: zones
}

meta {
volatile: true
}

output {
String stdout = read_string(stdout())
String actual_machine_type = read_string("actual_machine_type.txt")
String is_preemptible = read_string("is_preemptible.txt")
}
}
@@ -0,0 +1,13 @@
name: gcp_machine_type_fail
testFormat: workflowfailure
backends: [GCPBATCH]

files {
workflow: gcp_machine_type.wdl
inputs: fail_inputs.json
}

# Batch rejects the task and Cromwell fails it in an orderly manner
metadata {
"failures.0.causedBy.0.message": ~~"GCP Batch task exited with Success(0). "
}
@@ -0,0 +1,19 @@
name: gcp_machine_type_gpu
testFormat: workflowsuccess
backends: [GCPBATCH]

# Creates a `g2-standard-4` VM: 1 NVIDIA L4 GPU, 4 vCPUs, 16GB RAM
# This is the cheapest machine type under the new type-based GPU model, replacing the older machine type + gpu type scheme.
# For more information, see https://broadworkbench.atlassian.net/browse/AN-758

files {
workflow: gcp_machine_type.wdl
inputs: gpu_inputs.json
}

metadata {
"calls.minimal_hello_world.hello_world.runtimeAttributes.predefinedMachineType": "g2-standard-4"
"calls.minimal_hello_world.hello_world.runtimeAttributes.preemptible": "0"
"outputs.minimal_hello_world.actual_machine_type": ~~"machineTypes/g2-standard-4"
"outputs.minimal_hello_world.is_preemptible": "FALSE"
}
@@ -0,0 +1,15 @@
name: gcp_machine_type_preemptible
testFormat: workflowsuccess
backends: [GCPBATCH]

files {
workflow: gcp_machine_type.wdl
inputs: preemptible_inputs.json
}

metadata {
"calls.minimal_hello_world.hello_world.runtimeAttributes.predefinedMachineType": "n2-standard-2"
"calls.minimal_hello_world.hello_world.runtimeAttributes.preemptible": "5"
"outputs.minimal_hello_world.actual_machine_type": ~~"machineTypes/n2-standard-2"
"outputs.minimal_hello_world.is_preemptible": "TRUE"
}
@@ -0,0 +1,4 @@
{
"minimal_hello_world.machine_type": "g2-standard-4",
"minimal_hello_world.zones": "us-east4-a us-east4-c"
}
@@ -0,0 +1,3 @@
{
"minimal_hello_world.preemptible": 5
}
48 changes: 39 additions & 9 deletions docs/RuntimeAttributes.md
@@ -60,9 +60,10 @@ There are a number of additional runtime attributes that apply to the Google Clo

- [zones](#zones)
- [preemptible](#preemptible)
- [predefinedMachineType](#predefinedmachinetype-alpha)
- [bootDiskSizeGb](#bootdisksizegb)
- [noAddress](#noaddress)
- [gpuCount, gpuType, and nvidiaDriverVersion](#gpucount-gputype-and-nvidiadriverversion)
- [gpuCount and gpuType](#gpucount-and-gputype)
- [cpuPlatform](#cpuplatform)


@@ -315,6 +316,38 @@ runtime {

Defaults to the configuration setting `genomics.default-zones` in the Google Cloud configuration block, which in turn defaults to using `us-central1-b`.

### `predefinedMachineType` (alpha)

*Default: none*

**This attribute is in experimental status. See the limitations below for details.**

Select a specific GCP machine type, such as `n2-standard-2` or `a2-highgpu-1g`.

Setting `predefinedMachineType` overrides `cpu`, `memory`, `gpuCount`, and `gpuType`.

`predefinedMachineType` _is_ compatible with `cpuPlatform` so long as the platform is [a valid option](https://cloud.google.com/compute/docs/cpu-platforms) for the specified type.

```
runtime {
predefinedMachineType: "n2-standard-2"
}
```
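The override behavior described above amounts to a simple precedence rule: a predefined type, when present, wins over the `cpu`/`memory`-derived custom type. A minimal sketch with hypothetical names (not Cromwell's actual classes; the custom-type string is simplified):

```scala
object MachineTypeResolution {
  // Hypothetical model of the attribute precedence: predefinedMachineType,
  // when set, replaces whatever cpu/memory would otherwise produce.
  final case class Attrs(cpu: Int, memoryMb: Int, predefinedMachineType: Option[String])

  def resolve(attrs: Attrs): String =
    // Without a predefined type, fall back to a GCP-style custom machine
    // type string derived from cpu and memory (simplified here).
    attrs.predefinedMachineType.getOrElse(s"custom-${attrs.cpu}-${attrs.memoryMb}")
}
```

For example, `resolve(Attrs(2, 8192, Some("n2-standard-2")))` yields `n2-standard-2`, while `resolve(Attrs(2, 8192, None))` falls back to `custom-2-8192`, the custom-type path that incurs the 5% surcharge mentioned below.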

Possible benefits:

* Access [GPU machine types](https://cloud.google.com/compute/docs/gpus#gpu-models) such as Ampere, Lovelace, and other newer models
* Avoid [5% surcharge](https://cloud.google.com/compute/docs/instances/creating-instance-with-custom-machine-type#custom_machine_type_pricing) on custom machine types (Cromwell default)
* Reduce preemption by using predefined types with [better availability](https://cloud.google.com/compute/docs/instances/create-use-preemptible#best_practices)
* Run basic tasks at the lowest possible cost with [shared-core machines](https://cloud.google.com/compute/docs/general-purpose-machines#sharedcore) like `e2-medium`

Limitations:

* Cost estimation not yet supported
* GPU availability may be limited due to resource or quota exhaustion
* GCP types are non-portable and proprietary to Google Cloud Platform
* GCP Batch job details display incorrect "Cores", "Memory" values (cosmetic)

### `preemptible`

*Default: _0_*
@@ -395,10 +428,10 @@ Configure your Google network to use "Private Google Access". This will allow yo

That's it! You can now run with `noAddress` runtime attribute and it will work as expected.

### `gpuCount`, `gpuType`, and `nvidiaDriverVersion`
### `gpuCount` and `gpuType`

Attach GPUs to the instance when running on the Pipelines API([GPU documentation](https://cloud.google.com/compute/docs/gpus/)).
Make sure to choose a zone for which the type of GPU you want to attach is available.
Attach [GPUs](https://cloud.google.com/compute/docs/gpus/) to the [GCP Batch instance](https://cloud.google.com/batch/docs/create-run-job-gpus).
Make sure to choose a zone in which the type of GPU you want is available.

The types of compute GPU supported are:

@@ -407,19 +440,16 @@ The types of compute GPU supported are:
* `nvidia-tesla-p4`
* `nvidia-tesla-t4`

On Life Sciences API, the default driver is `418.87.00`. You may specify your own via the `nvidiaDriverVersion` key. Make sure that driver exists in the `nvidia-drivers-us-public` beforehand, per the [Google Pipelines API documentation](https://cloud.google.com/genomics/reference/rest/Shared.Types/Metadata#VirtualMachine).

On GCP Batch, `nvidiaDriverVersion` is currently ignored; Batch selects the correct driver version automatically.

```
runtime {
gpuType: "nvidia-tesla-t4"
gpuCount: 2
nvidiaDriverVersion: "418.87.00"
zones: ["us-central1-c"]
}
```

`nvidiaDriverVersion` is deprecated and ignored; GCP Batch selects the correct driver version automatically.

### `cpuPlatform`

This option is specific to the Google Cloud backend, specifically [this](https://cloud.google.com/compute/docs/instances/specify-min-cpu-platform) feature when a certain minimum CPU platform is desired.
2 changes: 1 addition & 1 deletion src/ci/bin/testCheckPublish.sh
@@ -10,6 +10,6 @@ cromwell::build::setup_common_environment
cromwell::build::pip_install mkdocs
mkdocs build -s

sbt -Dsbt.supershell=false --warn +package assembly dockerPushCheck +doc
sbt -Dsbt.supershell=false --warn +package assembly dockerPushCheck
Collaborator (author): Disable no-op Scaladoc generation: We write the HTML docs to disk on the CI instance and then throw them away.

Collaborator (author): (My block comment in GcpBatchRequestFactoryImpl.scala is invalid scaladoc, which is how I found out we did scaladoc)

git secrets --scan-history
@@ -81,7 +81,7 @@ object GcpBatchAsyncBackendJobExecutionActor {

new Exception(
s"Task $jobTag failed. $returnCodeMessage GCP Batch task exited with ${errorCode}(${errorCode.code}). ${message}"
)
) with NoStackTrace
Contributor: Why the no stack trace?

Collaborator (author): When we deliberately create exceptions in the program flow, my opinion is that they should never have a stack trace as it clutters the log and is not relevant for debugging.

A second order issue is that users often diligently copy-paste entire stack traces, rendering Slack threads and Zendesk cases unreadable.

After:

2025-10-16 14:38:12 cromwell-system-akka.dispatchers.engine-dispatcher-5 INFO  -
  WorkflowManagerActor: Workflow 974aa6ec-eccf-4267-8e83-65f230967dd6 failed (during ExecutingWorkflowState): cromwell.backend.google.batch.actors.GcpBatchAsyncBackendJobExecutionActor$$anon$1: Task minimal_hello_world.hello_world:NA:1 failed.
  The job was stopped before the command finished. GCP Batch task exited with Success(0).

Before:

2025-10-16 16:58:09 cromwell-system-akka.dispatchers.engine-dispatcher-111 INFO  -
  WorkflowManagerActor: Workflow 70e6cac9-e991-48a6-92e9-da333c209e1e failed (during ExecutingWorkflowState): java.lang.Exception: Task minimal_hello_world.hello_world:NA:1 failed.
  The job was stopped before the command finished. GCP Batch task exited with Success(0). 
	at cromwell.backend.google.batch.actors.GcpBatchAsyncBackendJobExecutionActor$.StandardException(GcpBatchAsyncBackendJobExecutionActor.scala:83)
	at cromwell.backend.google.batch.actors.GcpBatchAsyncBackendJobExecutionActor.handleFailedRunStatus$1(GcpBatchAsyncBackendJobExecutionActor.scala:1152)
	at cromwell.backend.google.batch.actors.GcpBatchAsyncBackendJobExecutionActor.$anonfun$handleExecutionFailure$1(GcpBatchAsyncBackendJobExecutionActor.scala:1168)
	at scala.util.Try$.apply(Try.scala:210)
	at cromwell.backend.google.batch.actors.GcpBatchAsyncBackendJobExecutionActor.handleExecutionFailure(GcpBatchAsyncBackendJobExecutionActor.scala:1160)
	at cromwell.backend.google.batch.actors.GcpBatchAsyncBackendJobExecutionActor.handleExecutionFailure(GcpBatchAsyncBackendJobExecutionActor.scala:144)
	at cromwell.backend.standard.StandardAsyncExecutionActor$$anonfun$handleExecutionResult$11.applyOrElse(StandardAsyncExecutionActor.scala:1506)
	at cromwell.backend.standard.StandardAsyncExecutionActor$$anonfun$handleExecutionResult$11.applyOrElse(StandardAsyncExecutionActor.scala:1503)
	at scala.concurrent.impl.Promise$Transformation.run(Promise.scala:490)
	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:49)
	at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
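The before/after logs above come down to Scala's `NoStackTrace` mixin, which overrides `fillInStackTrace` to skip frame collection. A minimal standalone sketch (illustrative, not Cromwell code):

```scala
import scala.util.control.NoStackTrace

// Mixing NoStackTrace into an exception created for control flow leaves
// getStackTrace empty, so a logger prints only the message line instead
// of a full trace.
object NoStackTraceDemo {
  def main(args: Array[String]): Unit = {
    val quiet = new Exception("GCP Batch task exited with Success(0).") with NoStackTrace
    val noisy = new Exception("GCP Batch task exited with Success(0).")
    println(quiet.getStackTrace.length)   // 0: fillInStackTrace is a no-op
    println(noisy.getStackTrace.nonEmpty) // true: frames recorded as usual
  }
}
```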

}

// GCS path regexes comments:
Expand Down
@@ -25,7 +25,7 @@ import com.google.cloud.batch.v1.{
import com.google.protobuf.Duration
import cromwell.backend.google.batch.io.GcpBatchAttachedDisk
import cromwell.backend.google.batch.models.GcpBatchConfigurationAttributes.GcsTransferConfiguration
import cromwell.backend.google.batch.models.{GcpBatchRequest, VpcAndSubnetworkProjectLabelValues}
import cromwell.backend.google.batch.models.{GcpBatchRequest, MachineType, VpcAndSubnetworkProjectLabelValues}
import cromwell.backend.google.batch.runnable._
import cromwell.backend.google.batch.util.{BatchUtilityConversions, GcpBatchMachineConstraints}
import cromwell.core.labels.{Label, Labels}
@@ -256,14 +256,33 @@ class GcpBatchRequestFactoryImpl()(implicit gcsTransferConfiguration: GcsTransfe
isBackground = _.getBackground
)

/**
* The "compute resource" concept is a suggestion to Batch regarding how many jobs can fit on a single VM.
Contributor: I am not sure I 100% understand. Why do we supply the compute resource if it is not used and leads to UI confusion?

Collaborator (author), Oct 16, 2025: Because otherwise Google displays default values that are even more wrong.

If we make a one-line change to develop to omit setComputeResource() we still get the right machine shape, we just get Google's default values in the UI.

In the future we could enhance the code to calculate a CPU and memory size for each predefinedMachineShape and set them in the UI as well. As far as I can tell this is a nice-to-have, maybe it will happen as part of the cost enhancements.

Contributor: Ah gotcha, yeah definitely a follow-up thing to do if you can add that to the cost ticket
* The Cromwell backend currently creates VMs at a 1:1 ratio with jobs, so the compute resource is effectively ignored.
*
* That said, it has a cosmetic effect in the Batch web UI, where it drives the "Cores" and "Memory" readouts.
* The machine type is the "real" VM shape; one can set bogus cores/memory in the compute resource,
* and it will have no effect other than the display.
*/
val computeResource = createComputeResource(cpuCores, memory, gcpBootDiskSizeMb)
val taskSpec = createTaskSpec(sortedRunnables, computeResource, durationInSeconds, allVolumes)
val taskGroup: TaskGroup = createTaskGroup(taskCount, taskSpec)
val machineType = GcpBatchMachineConstraints.machineType(runtimeAttributes.memory,
runtimeAttributes.cpu,
cpuPlatformOption = runtimeAttributes.cpuPlatform,
jobLogger = jobLogger
)

val machineType = runtimeAttributes.machine match {
case Some(m: MachineType) =>
// Allow users to select predefined machine types, such as `n2-standard-4`.
// Overrides CPU count and memory attributes.
// We still pass platform when a machine is specified; it is the user's responsibility to select a valid type/platform combination
m.machineType
case None =>
// CPU platform drives selection of machine type, but is not encoded in the `machineType` return value itself
GcpBatchMachineConstraints.machineType(runtimeAttributes.memory,
runtimeAttributes.cpu,
cpuPlatformOption = runtimeAttributes.cpuPlatform,
jobLogger = jobLogger
)
}

val instancePolicy =
createInstancePolicy(cpuPlatform = cpuPlatform, spotModel, accelerators, allDisks, machineType = machineType)
val locationPolicy = LocationPolicy.newBuilder.addAllAllowedLocations(zones.asJava).build