From ffdca2c1fc67a1e7fa0c576d2326ccf9111bc73f Mon Sep 17 00:00:00 2001
From: "David W.H. Swenson"
Date: Fri, 19 Apr 2024 15:52:52 -0500
Subject: [PATCH 01/12] Add first few features

---
 features/hard_kill.feature     |  9 +++++++++
 features/run_manual.feature    | 26 ++++++++++++++++++++++++++
 features/run_scheduled.feature |  9 +++++++++
 3 files changed, 44 insertions(+)
 create mode 100644 features/hard_kill.feature
 create mode 100644 features/run_manual.feature
 create mode 100644 features/run_scheduled.feature

diff --git a/features/hard_kill.feature b/features/hard_kill.feature
new file mode 100644
index 0000000..d411f38
--- /dev/null
+++ b/features/hard_kill.feature
@@ -0,0 +1,9 @@
+Feature:
+
+  Scenario: Manual kill
+    Given a long-running workflow
+    And I am logged in as an authorized user
+    And the workflow is running
+    When I kill the workflow using the GitHub UI
+    Then the workflow should stop
+    And the instance should terminate
diff --git a/features/run_manual.feature b/features/run_manual.feature
new file mode 100644
index 0000000..ab8e219
--- /dev/null
+++ b/features/run_manual.feature
@@ -0,0 +1,26 @@
+Feature: Manual runs of the workflow
+
+  A user should be able to manually launch a workflow from the web UI.
+  [Mechanism: workflow_dispatch and run workflow]
+
+  Scenario: Authorized users should see the run workflow button
+    Given I have a workflow generated with our tool
+    And I am logged in as an authorized user
+    When I load the workflow's page
+    Then I should see the Run Workflow button
+
+  Scenario: Unauthorized users should not see the run workflow button
+    Given I have a workflow generated with our tool
+    And I am logged in as an unauthorized user
+    When I load the workflow's page
+    Then I should not see the Run Workflow button
+
+  Scenario: Running the Run Workflow button should run the workflow
+    Given I have a workflow generated with our tool
+    And I am logged in as an authorized user
+    When I load the workflow's page
+    And I press the Run Workflow button
+    Then the workflow should complete a manual run
+
+
+
diff --git a/features/run_scheduled.feature b/features/run_scheduled.feature
new file mode 100644
index 0000000..50ea5d2
--- /dev/null
+++ b/features/run_scheduled.feature
@@ -0,0 +1,9 @@
+Feature: Scheduled runs of the workflow
+
+  A user should be able to run scheduled runs of a workflow
+
+  Scenario: A scheduled run should run
+    Given I have a workflow generated with our tool
+    When I wait until after the scheduled run time
+    Then the workflow should have completed a scheduled run
+

From cf20dd746a4737d34b10ff8c8197cfe6f1fe6479 Mon Sep 17 00:00:00 2001
From: "David W.H.
Swenson"
Date: Fri, 19 Apr 2024 16:28:54 -0500
Subject: [PATCH 02/12] Added select_platform (covers a few user stories)

---
 features/README.md               |  5 +++++
 features/hard_kill.feature       |  5 ++++-
 features/select_platform.feature | 35 ++++++++++++++++++++++++++++++++
 3 files changed, 44 insertions(+), 1 deletion(-)
 create mode 100644 features/README.md
 create mode 100644 features/select_platform.feature

diff --git a/features/README.md b/features/README.md
new file mode 100644
index 0000000..1cf9a99
--- /dev/null
+++ b/features/README.md
@@ -0,0 +1,5 @@
+# features
+
+This directory contains feature descriptions written in
+[Gherkin](https://cucumber.io/docs/gherkin/). Please note that we should have
+one feature per file.
diff --git a/features/hard_kill.feature b/features/hard_kill.feature
index d411f38..79c6f3d 100644
--- a/features/hard_kill.feature
+++ b/features/hard_kill.feature
@@ -1,4 +1,7 @@
-Feature:
+Feature: Hard kill a runaway workflow job
+
+  A user should be able to kill a running job, and that should also
+  terminate the associated instance.
 
   Scenario: Manual kill
     Given a long-running workflow
diff --git a/features/select_platform.feature b/features/select_platform.feature
new file mode 100644
index 0000000..f7699c1
--- /dev/null
+++ b/features/select_platform.feature
@@ -0,0 +1,35 @@
+Feature: Select platform to run on
+
+  A user should be able to select the hardware that suits the needs of
+  their run.
+
+  Scenario: Running with large memory
+    Given a workflow that requires and requests a large-memory host
+    When I run the workflow
+    Then it should run on the appropriate large-memory host
+
+  Scenario: Running with a single CUDA GPU
+    Given a workflow that requires and requests a single CUDA GPU
+    When I run the workflow
+    Then it should run on hardware with a GPU
+    And my software should be able to interact with the CUDA drivers
+
+  Scenario: Running with multiple GPUs
+    Given a workflow that requires and requests multiple GPUs
+    When I run the workflow
+    Then it should run on hardware with multiple GPUs
+    And my software should be able to interact with all requested GPUs
+
+  Scenario: Running with smaller hardware
+    Given a workflow that requests lower-cost hardware
+    When I run the workflow
+    Then it should run on the appropriate hardware
+
+  Scenario: Running with preemptible instances
+    Given a workflow that can run on preemptible hosts
+    When I run the workflow
+    Then it should run on a preemptible host
+    # NOTE: anything about recovering from preemption is the
+    # responsibility of the workflow writer
+
+  #Scenario: Running with ROCM stack

From c1334bee0be4e283bbc880b2533d0776773ed766 Mon Sep 17 00:00:00 2001
From: "David W.H.
Swenson"
Date: Mon, 22 Apr 2024 16:02:02 -0500
Subject: [PATCH 03/12] more features

---
 features/code_coverage.feature    | 22 +++++++++++++++++
 features/physical_cost.feature    | 12 ++++++++++
 features/prevent_abuse.feature    | 39 +++++++++++++++++++++++++++++++
 features/quickstart.feature       | 10 ++++++++
 features/reproducible_env.feature | 10 ++++++++
 features/set_gpu_mode.feature     | 12 ++++++++++
 6 files changed, 105 insertions(+)
 create mode 100644 features/code_coverage.feature
 create mode 100644 features/physical_cost.feature
 create mode 100644 features/prevent_abuse.feature
 create mode 100644 features/quickstart.feature
 create mode 100644 features/reproducible_env.feature
 create mode 100644 features/set_gpu_mode.feature

diff --git a/features/code_coverage.feature b/features/code_coverage.feature
new file mode 100644
index 0000000..f995dae
--- /dev/null
+++ b/features/code_coverage.feature
@@ -0,0 +1,22 @@
+Feature: Code coverage
+
+  Our workflow should be able to report code coverage to external
+  services. (For testing, we'll just be sure we can integrate with
+  CodeCov.)
+
+  Scenario: Report default coverage
+    # Note that this is probably NOT what most users will want. Imagine
+    # that our runner, because it is on GPU, runs more code paths than
+    # the basic runs, and runs less frequently. This means that PRs (not
+    # using our runner) will see spurious decrease in coverage.
+    Given a workflow that uses CodeCov for coverage
+    When I run the workflow
+    Then coverage should successfully be updated on CodeCov
+
+  Scenario: Report coverage with CodeCov flags
+    # Using CodeCov flags may help solve the problem mentioned in the
+    # default coverage scenario, but we should play with it a bit to
+    # determine a recommended practice. (Out of scope for MVP.)
+    Given a workflow that uses CodeCov flags for coverage
+    When I run the workflow
+    Then the correct flag should be updated on CodeCov
diff --git a/features/physical_cost.feature b/features/physical_cost.feature
new file mode 100644
index 0000000..4487727
--- /dev/null
+++ b/features/physical_cost.feature
@@ -0,0 +1,12 @@
+Feature: Track physical cost of running
+
+  The amount of time that has been used (or ideally, the actual cost
+  incurred) should be easily accessible.
+  [Possible mechanisms: (1) Refer to AWS billing info; (2) use an API to
+  extract stuff from AWS billing / CloudTrail; (3) have some custom
+  cloud-independent approach -- probably (1) or (2)]
+
+  Scenario:
+    # TODO: having trouble with this one because I feel like it depends
+    # on the specific mechanism
+
diff --git a/features/prevent_abuse.feature b/features/prevent_abuse.feature
new file mode 100644
index 0000000..9d8fb3e
--- /dev/null
+++ b/features/prevent_abuse.feature
@@ -0,0 +1,39 @@
+Feature: Safeguards to prevent abuse of self-hosted runners
+
+  Compute resources should be protected from abuse by malicious actors.
+  This includes preventing forks from accessing our resources and includes
+  preventing runs on untrusted PRs.
+
+  Scenario: Forks should not be able to use our runners
+    # This should be guaranteed by the fact that secrets don't propagate
+    # to forks.
+    Given a malicious actor's fork of our repository
+    When the actor tries to run using workflow dispatch
+    Then the workflow should give an error due to authorization
+    And the workflow should fail to start instances on AWS
+
+  Scenario: Pull requests from first-time contributors should not start runners
+    # With default repo settings, first-time contributors should require
+    # approval to run CI at all.
+    Given a malicious actor's fork of our repository
+    And the actor has not previously contributed to our repository
+    And the actor has changed our workflow to run on PRs
+    When the actor creates a pull request to our repository
+    Then the workflow should give an error due to authorization
+    And the workflow should fail to start instances on AWS
+
+  Scenario: Pull requests from previous contributors should not start runners
+    # With default repo settings, an external contributor who has
+    # previously contributed no longer requires approval for CI to run.
+    # However, this should be guaranteed because PRs from forks don't
+    # have access to secrets.
+    Given a malicious actor's fork of our repository
+    And the actor has previously contributed to our repository
+    And the actor has changed our workflow to run on PRs
+    When the actor creates a pull request to our repository
+    Then the workflow should give an error due to authorization
+    And the workflow should fail to start instances on AWS
+
+  # Non-tested scenario: AWS tokens (as secrets) should not leak in PRs
+  # from forks because forks don't see secrets. (Leaking AWS tokens is
+  # a different attack vector from the ones described above.)
diff --git a/features/quickstart.feature b/features/quickstart.feature
new file mode 100644
index 0000000..13e1463
--- /dev/null
+++ b/features/quickstart.feature
@@ -0,0 +1,10 @@
+Feature: Quickstart guide
+
+  There should be a quick and easy way to set up workflows, and a simple
+  demo workflow.
+
+  Scenario: Easy set-up of for first-time users
+    Given I have AWS credentials
+    And I have not previously set up AWS infra for this tool
+    When I use the quickstart command
+    Then I should have a working workflow
diff --git a/features/reproducible_env.feature b/features/reproducible_env.feature
new file mode 100644
index 0000000..669d703
--- /dev/null
+++ b/features/reproducible_env.feature
@@ -0,0 +1,10 @@
+Feature: Reproducible workflow environment
+
+  Within a version of our tool and a specific cloud machine image, the
+  starting environment for all workflows should be the same.
+
+  Scenario: Reproducible workflow environment
+    Given a fixed version of our tool and of a cloud machine image
+    When I start the workflow
+    Then the versions of important libraries should be as expected
+    And the versions of important software tools should be as expected
diff --git a/features/set_gpu_mode.feature b/features/set_gpu_mode.feature
new file mode 100644
index 0000000..d4ece05
--- /dev/null
+++ b/features/set_gpu_mode.feature
@@ -0,0 +1,12 @@
+Feature: Workflow should be able to set the GPU compute mode
+
+  A given workflow should be able to use different GPU compute modes
+  (e.g., EXCLUSIVE_PROCESS).
+  [Mechanism: This might be either via machine selection or by setting
+  mode in the workflow]
+
+  Scenario: Run in EXCLUSIVE_PROCESS
+    Given a workflow that should run with EXCLUSIVE_PROCESS set
+    When I run the workflow
+    Then my main process should take the GPU
+    And any other process should error if it tries to use the GPU

From 3c8214ead0070c78ae088138d9f407d564cd5a10 Mon Sep 17 00:00:00 2001
From: "David W.H.
Swenson"
Date: Mon, 22 Apr 2024 18:11:52 -0500
Subject: [PATCH 04/12] yet more gherkin features

---
 features/retrieve_results.feature | 17 +++++++++++++++++
 features/run_matrix.feature       | 11 +++++++++++
 features/run_pr.feature           | 14 ++++++++++++++
 3 files changed, 42 insertions(+)
 create mode 100644 features/retrieve_results.feature
 create mode 100644 features/run_matrix.feature
 create mode 100644 features/run_pr.feature

diff --git a/features/retrieve_results.feature b/features/retrieve_results.feature
new file mode 100644
index 0000000..ef39a01
--- /dev/null
+++ b/features/retrieve_results.feature
@@ -0,0 +1,17 @@
+Feature: Retrieve results of a benchmarking run
+
+  A user may generate data during a run that they want to save somewhere
+  long-term. This will require that the user explicitly store that data
+  somewhere; in this, we will test that we can store it.
+
+  Scenario: Store results to an S3 bucket
+    Given a workflow that intends to upload a file to an S3 bucket
+    When I run the workflow
+    Then the file should be uploaded to the S3 bucket
+
+  Scenario: Store results to Dropbox
+    # we do a separate test for Dropbox just to ensure that there's
+    # nothing special happening because S3 and EC2 are both AWS
+    Given a workflow that intends to upload a file to Dropbox
+    When I run the workflow
+    Then the file should be uploaded to Dropbox
diff --git a/features/run_matrix.feature b/features/run_matrix.feature
new file mode 100644
index 0000000..96df69e
--- /dev/null
+++ b/features/run_matrix.feature
@@ -0,0 +1,11 @@
+Feature: Run a matrix build
+
+  A user should be able to run a full build matrix (ideally in parallel).
+
+  Scenario: Run a matrix
+    Given a workflow that involves a complicated matrix
+    When I run the workflow
+    Then all builds in the matrix should complete
+    # maybe this too:
+    # And an instance should be launched for each job
+    # And all jobs should run on different instances
diff --git a/features/run_pr.feature b/features/run_pr.feature
new file mode 100644
index 0000000..a5419b9
--- /dev/null
+++ b/features/run_pr.feature
@@ -0,0 +1,14 @@
+Feature: Run on pull requests
+
+  A user should be able to run a workflow on self-hosted runners prior to
+  merging a pull request. NOTE: This will *not* use the normal
+  pull_request trigger for workflows. Instead, this will be a
+  workflow_dispatch caused by some external decision. This is because we
+  don't expect to want to run expensive CI on every commit, but rather
+  when an admin chooses to.
+
+  Scenario: Choose to run a workflow on a PR
+    Given I have a workflow generated with our tool
+    And a pull request is open against that repository
+    When I [trigger the workflow to run on the PR] (how? TBD)
+    Then the workflow runs on our runner using code in the PR

From 09123a1d06eb02a8318903f384c726520d6182b7 Mon Sep 17 00:00:00 2001
From: "David W.H. Swenson"
Date: Tue, 23 Apr 2024 12:27:14 -0500
Subject: [PATCH 05/12] more features, marking a TODO

---
 features/external_users.feature | 12 ++++++++++++
 features/quickstart.feature     |  3 +++
 2 files changed, 15 insertions(+)
 create mode 100644 features/external_users.feature

diff --git a/features/external_users.feature b/features/external_users.feature
new file mode 100644
index 0000000..22f3eae
--- /dev/null
+++ b/features/external_users.feature
@@ -0,0 +1,12 @@
+Feature: Allow external contributors to use resources
+
+  ...
+
+  # NOTE: this is essentially the same as a scenario from the run_pr
+  # feature; might not ever fill it in
+  #Scenario: Authorized user permits a PR from unauthorized user to run
+
+  Scenario: Adding a new authorized user
+    Given an unauthorized user who should become authorized
+    When I give the user committer access to the repository
+    Then the user should have the ability to launch self-hosted workflows
diff --git a/features/quickstart.feature b/features/quickstart.feature
index 13e1463..3304a13 100644
--- a/features/quickstart.feature
+++ b/features/quickstart.feature
@@ -3,6 +3,9 @@ Feature: Quickstart guide
   There should be a quick and easy way to set up workflows, and a simple
   demo workflow.
 
+  # TODO: There should be a scenario here about documentation, maybe? or
+  # is that another feature? Up-to-date getting started documentation.
+
   Scenario: Easy set-up of for first-time users
     Given I have AWS credentials
     And I have not previously set up AWS infra for this tool

From bcf83d2625dcfeb047f0690e7430aed64abfe0d0 Mon Sep 17 00:00:00 2001
From: Ethan Holz
Date: Tue, 23 Apr 2024 15:45:39 -0500
Subject: [PATCH 06/12] feat: added scenario for up to date and tested docs

---
 features/quickstart.feature | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/features/quickstart.feature b/features/quickstart.feature
index 3304a13..ff51678 100644
--- a/features/quickstart.feature
+++ b/features/quickstart.feature
@@ -11,3 +11,8 @@ Feature: Quickstart guide
     And I have not previously set up AWS infra for this tool
     When I use the quickstart command
     Then I should have a working workflow
+
+  Scenario: Up-to-date documentation
+    Given I have the latest version of the tool
+    When I look at the documentation
+    Then I should see up-to-date and tested information

From 0f7df09451824809137c72c3df64c72a56005c45 Mon Sep 17 00:00:00 2001
From: Ethan Holz
Date: Tue, 23 Apr 2024 16:02:54 -0500
Subject: [PATCH 07/12] feat: added a scenario for cost

---
 features/physical_cost.feature | 8
++++++++
 1 file changed, 8 insertions(+)

diff --git a/features/physical_cost.feature b/features/physical_cost.feature
index 4487727..9b3da6c 100644
--- a/features/physical_cost.feature
+++ b/features/physical_cost.feature
@@ -10,3 +10,11 @@ Feature: Track physical cost of running
     # TODO: having trouble with this one because I feel like it depends
     # on the specific mechanism
 
+  # WIP: I think this is the generic form of this information
+  # the mechanism for tracking the cost is not specified here.
+  Scenario: When I run a test, I can see how much it costs
+    Given I have a test that runs for X amount of time
+    And I have a cost of Y per unit time
+    And I have a mechanism for tracking the cost
+    When I run the test
+    Then I receive a calculated cost of running the test.

From b9e57e3f832d31cb9654ea37005ed0378df76998 Mon Sep 17 00:00:00 2001
From: "David W.H. Swenson"
Date: Wed, 24 Apr 2024 09:32:45 -0500
Subject: [PATCH 08/12] Removed assumption that fork owner is malicious

---
 features/prevent_abuse.feature | 27 ++++++++++++++-------------
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/features/prevent_abuse.feature b/features/prevent_abuse.feature
index 9d8fb3e..c34d741 100644
--- a/features/prevent_abuse.feature
+++ b/features/prevent_abuse.feature
@@ -1,24 +1,25 @@
 Feature: Safeguards to prevent abuse of self-hosted runners
 
-  Compute resources should be protected from abuse by malicious actors.
-  This includes preventing forks from accessing our resources and includes
-  preventing runs on untrusted PRs.
+  Compute resources should be protected from use outside of intended runs,
+  either due to accidental triggering or due to intentional abuse by
+  malicious actors. This includes preventing forks from accessing our
+  resources and includes preventing runs on untrusted PRs.
 
   Scenario: Forks should not be able to use our runners
     # This should be guaranteed by the fact that secrets don't propagate
     # to forks.
-    Given a malicious actor's fork of our repository
-    When the actor tries to run using workflow dispatch
+    Given a fork of a repository with a self-hosted workflow
+    When the fork owner tries to run (within fork) using workflow dispatch
     Then the workflow should give an error due to authorization
     And the workflow should fail to start instances on AWS
 
   Scenario: Pull requests from first-time contributors should not start runners
     # With default repo settings, first-time contributors should require
     # approval to run CI at all.
-    Given a malicious actor's fork of our repository
-    And the actor has not previously contributed to our repository
-    And the actor has changed our workflow to run on PRs
-    When the actor creates a pull request to our repository
+    Given a fork of a repository with a self-hosted workflow
+    And the fork owner has not previously contributed to the repository
+    And the fork owner has changed our workflow to run on PRs
+    When the fork owner creates a pull request to our repository
     Then the workflow should give an error due to authorization
     And the workflow should fail to start instances on AWS
 
@@ -27,10 +28,10 @@ Feature: Safeguards to prevent abuse of self-hosted runners
     # previously contributed no longer requires approval for CI to run.
     # However, this should be guaranteed because PRs from forks don't
     # have access to secrets.
-    Given a malicious actor's fork of our repository
-    And the actor has previously contributed to our repository
-    And the actor has changed our workflow to run on PRs
-    When the actor creates a pull request to our repository
+    Given a fork of a repository with a self-hosted workflow
+    And the fork owner has previously contributed to the repository
+    And the fork owner has changed our workflow to run on PRs
+    When the fork owner creates a pull request to our repository
     Then the workflow should give an error due to authorization
     And the workflow should fail to start instances on AWS
 

From 25494f0f105f8481fdf2d51961aa620903d1bb0a Mon Sep 17 00:00:00 2001
From: Ethan Holz
Date: Wed, 24 Apr 2024 14:01:51 -0500
Subject: [PATCH 09/12] feat: added ROCm scenario

---
 features/select_platform.feature | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/features/select_platform.feature b/features/select_platform.feature
index f7699c1..50a16a0 100644
--- a/features/select_platform.feature
+++ b/features/select_platform.feature
@@ -32,4 +32,8 @@ Feature: Select platform to run on
     # NOTE: anything about recovering from preemption is the
     # responsibility of the workflow writer
 
-  #Scenario: Running with ROCM stack
+  # NOTE: This is not an MVP requirement
+  Scenario: Running with ROCm stack
+    Given a workflow that requires an ROCm stack
+    When I run the workflow
+    Then it should run on hardware with the appropriate ROCm stack

From 2326f89dab0639d6a7caa8fd65dd2fffe9553630 Mon Sep 17 00:00:00 2001
From: Ethan Holz
Date: Wed, 24 Apr 2024 14:03:15 -0500
Subject: [PATCH 10/12] feat: added inferencing/ML scenarios

---
 features/select_platform.feature | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/features/select_platform.feature b/features/select_platform.feature
index 50a16a0..177a5c2 100644
--- a/features/select_platform.feature
+++ b/features/select_platform.feature
@@ -33,7 +33,19 @@ Feature: Select platform to run on
     #
responsibility of the workflow writer
 
   # NOTE: This is not an MVP requirement
-  Scenario: Running with ROCm stack
-    Given a workflow that requires an ROCm stack
+  Scenario: Running with ROCM stack
+    Given a workflow that requires an ROCM stack
     When I run the workflow
-    Then it should run on hardware with the appropriate ROCm stack
+    Then it should run on hardware with the appropriate ROCM stack
+
+  Scenario: Running with an inference stack with various hardware
+    Given a workflow that requires an inference stack
+    When I run the workflow
+    Then it should run on hardware with the appropriate inference stack
+    And my software should be able to interact with the inference stack
+
+  Scenario: Running a small ML training run
+    Given a workflow that requires an inference stack
+    And the workflow is a small ML training run
+    When I run the workflow
+    Then it should run on hardware with the appropriate inference stack

From 9dbc279040a46b0b46776b76f032d093a715accc Mon Sep 17 00:00:00 2001
From: "David W.H.
Swenson"
Date: Wed, 24 Apr 2024 20:54:21 -0500
Subject: [PATCH 11/12] commenting out ROCM (non-MVP) feature

---
 features/select_platform.feature | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/features/select_platform.feature b/features/select_platform.feature
index 177a5c2..a871f9a 100644
--- a/features/select_platform.feature
+++ b/features/select_platform.feature
@@ -33,10 +33,10 @@ Feature: Select platform to run on
     # responsibility of the workflow writer
 
   # NOTE: This is not an MVP requirement
-  Scenario: Running with ROCM stack
-    Given a workflow that requires an ROCM stack
-    When I run the workflow
-    Then it should run on hardware with the appropriate ROCM stack
+  #Scenario: Running with ROCM stack
+  #  Given a workflow that requires an ROCM stack
+  #  When I run the workflow
+  #  Then it should run on hardware with the appropriate ROCM stack
 
   Scenario: Running with an inference stack with various hardware
     Given a workflow that requires an inference stack

From 994c34bd680c4320594b65853037298f3b044804 Mon Sep 17 00:00:00 2001
From: "David W.H.
Swenson"
Date: Wed, 24 Apr 2024 21:15:50 -0500
Subject: [PATCH 12/12] Better scenarios on workflow preemption

---
 features/select_platform.feature | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/features/select_platform.feature b/features/select_platform.feature
index a871f9a..6b0bc0d 100644
--- a/features/select_platform.feature
+++ b/features/select_platform.feature
@@ -29,9 +29,21 @@ Feature: Select platform to run on
     Given a workflow that can run on preemptible hosts
     When I run the workflow
     Then it should run on a preemptible host
-    # NOTE: anything about recovering from preemption is the
+    # NOTE: anything about continuing from preemption is the
     # responsibility of the workflow writer
 
+  Scenario: A run on a preemptible instance is preempted
+    Given a workflow that can run on preemptible hosts
+    And the workflow is running
+    When the workflow is preempted
+    Then the workflow should be retried (up to a specified retry limit)
+
+  Scenario: True failures should not be retried on preemptible instances
+    Given a workflow that can run on preemptible hosts
+    And the workflow is running
+    When the workflow fails
+    Then the workflow should not be retried
+
   # NOTE: This is not an MVP requirement
   #Scenario: Running with ROCM stack
   #  Given a workflow that requires an ROCM stack