docs/internal/adr/0007-cicd-deployment-strategy.md (new file, 129 additions)
# ADR-0007: CI/CD Deployment Strategy for Firewalled OpenShift Clusters

**Date:** 2026-03-04
**Authors:** Ken Dreyer (with Gemini 3 Pro)
**Deciders:** Platform Team

## Context

We currently host our app on public GitHub and deploy to a public ROSA cluster using GitHub Actions and the `oc` CLI. We are moving to a firewalled OpenShift cluster (MP+), which breaks our current workflow of deploying from GitHub-hosted runners.

We need to preserve two properties:

1. **Immediacy:** When a developer merges code to `main`, we must deploy immediately without waiting for a polling cycle.
2. **Visibility:** Developers need direct access to deployment logs to troubleshoot their own issues without Ops help.

## Decision

We will run self-hosted GitHub Actions runners inside our firewalled OpenShift cluster. The runners make outbound connections to GitHub, pick up jobs, and execute them locally. Because they live inside the cluster and firewall, they can talk directly to the OpenShift API via `oc`.

We are rolling this out in two phases.

We have already begun trialing Phase 1 this week.

### Phase 1: Standalone Runner

Deploy a standalone GitHub Actions runner as a regular `Deployment` — no CRDs, no `ClusterRoles`, no operator.

**How it works:**

* A `Deployment` with `replicas: 1` runs the [GitHub Actions runner agent](https://github.com/actions/runner/pkgs/container/actions-runner).
* At startup, the runner uses GitHub App credentials (App ID + private key) to generate a short-lived registration token and register itself with GitHub.
* The runner's `ServiceAccount` only needs the permissions our CI jobs require (e.g. `oc apply` to target namespaces). It does not need any cluster-level permissions.
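
As a concrete sketch, the Phase 1 manifests might look like the following. The names, namespace, image tag, and secret wiring are illustrative assumptions, and the runner image is assumed to be wrapped with an entrypoint that performs the App-credential token exchange described above.

```yaml
# Hypothetical Phase 1 manifests; names and namespace are illustrative.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gha-runner
  namespace: ci            # assumed namespace
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gha-runner
  namespace: ci
spec:
  replicas: 1              # one runner, one job at a time
  selector:
    matchLabels:
      app: gha-runner
  template:
    metadata:
      labels:
        app: gha-runner
    spec:
      serviceAccountName: gha-runner   # namespace-scoped RBAC only
      containers:
        - name: runner
          image: ghcr.io/actions/actions-runner:latest
          envFrom:
            - secretRef:
                name: gha-app-credentials   # App ID + private key (assumed secret)
```

The `ServiceAccount` would be bound to a namespace-scoped `Role` granting only the verbs that our `oc apply` jobs need in the target namespaces.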
Comment on lines +18 to +32 (Contributor):
⚠️ Potential issue | 🟠 Major

Document the trust boundary for these in-cluster runners.

Because the repo is public, any workflow that can land on this self-hosted runner inherits cluster reachability plus the ServiceAccount permissions from Line 32. The ADR should make the guardrails part of the decision: only trusted deploy workflows/branches may target these runners, PR/fork jobs must stay off them, and protected environments/approvals should gate cluster access. Without that, this design creates a direct path from untrusted workflow code into the firewalled cluster.

Suggested ADR addition

```diff
 **How it works:**

 * A `Deployment` with `replicas: 1` runs the [GitHub Actions runner agent](https://github.com/actions/runner/pkgs/container/actions-runner).
 * At startup, the runner uses GitHub App credentials (App ID + private key) to generate a short-lived registration token and register itself with GitHub.
 * The runner's `ServiceAccount` only needs the permissions our CI jobs require (e.g. `oc apply` to target namespaces). It does not need any cluster-level permissions.
+* **Security guardrails:** Only trusted deploy workflows on protected branches/environments may target this runner. PRs, fork-based workflows, and other untrusted jobs must use GitHub-hosted runners and must not receive cluster credentials.
```

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
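
A hedged sketch of what such a guard could look like at the job level (job name, runner label, environment name, and deploy step are assumptions, not part of the ADR):

```yaml
# Hypothetical job-level guard: only push events on main reach the
# in-cluster runner; PR and fork workflows stay on GitHub-hosted runners.
deploy:
  if: github.event_name == 'push' && github.ref == 'refs/heads/main'
  runs-on: self-hosted        # assumed runner label
  environment: production     # protected environment gating cluster access
  steps:
    - uses: actions/checkout@v4
    - run: oc apply -f manifests/   # illustrative deploy step
```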



**Pros:**

* No CRDs or cluster-level RBAC — deploys with namespace-scoped permissions only
* No IT approval needed — can deploy immediately
* Architecturally simple — a single long-running pod
* Identical developer experience — jobs appear in the GitHub Actions UI the same way
* Serialized deploys — jobs run one at a time, so concurrent merges cannot trample each other in prod

**Cons:**

* No auto-scaling — the runner pod is always running regardless of job queue depth
* Single point of failure — if the pod crashes, jobs silently queue instead of running
Contributor Author:
  • No job isolation. Every job runs in the environment (pod) left behind by all previous jobs, until something or someone outside GitHub restarts that pod.


### Phase 2: Actions Runner Controller (ARC)

If IT approves CRD installation in preprod/prod (requires a ServiceNow ticket), we can upgrade to the [Actions Runner Controller (ARC)](https://github.com/actions/actions-runner-controller) operator. ARC dynamically creates and destroys runner pods based on the job queue.

**What ARC adds over the standalone runner:**

* Auto-scaling — runner pods scale up and down based on demand, saving compute resources
* Multi-runner — can run multiple jobs concurrently. *We would need to investigate how to prevent concurrent deploy jobs from trampling each other in prod.*

**What ARC requires:**

* Custom Resource Definitions installed in the cluster (IT approval)
* Cluster-level RBAC for the operator
* Ongoing maintenance and patching of the ARC operator
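
For later reference, ARC's newer runner scale set mode is typically installed via Helm; a minimal values sketch for the `gha-runner-scale-set` chart might look like this (the repo URL and secret name are assumptions):

```yaml
# Hypothetical Helm values for the gha-runner-scale-set chart.
githubConfigUrl: https://github.com/our-org/our-app    # assumed repo URL
githubConfigSecret: gha-app-credentials                # GitHub App credentials
minRunners: 0        # scale to zero when the queue is empty
maxRunners: 4        # cap on concurrent runner pods
```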

Phase 2 depends on IT approving CRDs:

* If IT *does not* approve, we remain on Phase 1.
* If IT *does* approve, we will retire the standalone GitHub Actions runner `Deployment` and replace it with ARC.

## Considered Options

### Option 1: Self-hosted GitHub Actions runners (standalone or ARC) — chosen

See Phase 1 and Phase 2 above.

### Option 2: OpenShift GitOps (ArgoCD) with an Ingress Tunnel — rejected

Rejected because it forces developers to learn a new UI (ArgoCD) to view their logs. It also requires punching a hole in the firewall for the webhook, which is needlessly complex and less secure than the outbound-only model.

### Option 3: VPN/SSH/Network Overlay with GitHub Actions — rejected

Rejected due to operational complexity and security concerns with maintaining persistent network tunnels into the firewalled cluster.

## Consequences

**Positive:**

* Developers keep their existing GitHub Actions workflow and can debug deployments without Ops
* No inbound firewall ports — aligns with Infosec standards
* Instant job pickup preserves deployment velocity
* Phase 1 can deploy immediately with no IT dependencies

**Negative:**

* We must pay for and manage the compute resources for the runner(s)
* Dependency on GitHub App credentials with periodic rotation
* Phase 1 has no auto-scaling and no built-in redundancy

**Risks:**

* No alerting on standalone runner failure. If the pod crashes, deployments silently stop and GitHub Actions jobs queue instead of running. We need monitoring to detect this.
* IT may not approve CRDs for Phase 2, leaving us on Phase 1 permanently.
* Moving to Phase 2 introduces concurrent deploy jobs. We would need to investigate serialization or locking to prevent jobs from trampling each other in prod.

## Risks of Remaining on Phase 1

If we cannot move to Phase 2, the standalone runner carries ongoing operational risks:

* **Single point of failure.** One pod handles all CI jobs. If it crashes or is evicted, no jobs run until the pod restarts, and no person or automated process is notified.
* **No concurrency.** Jobs run sequentially, which prevents deploy races but increases latency when multiple PRs merge quickly.
* **No auto-scaling.** The runner pod runs continuously regardless of load — wasting resources when idle, unable to scale during bursts.
* **Manual recovery.** If the runner loses its GitHub registration (e.g. after a credential rotation or a prolonged outage), someone must re-register it manually.
* **No built-in high availability.** Running multiple replicas of the standalone runner may cause conflicts with job pickup and GitHub registration. A high-availability solution would require further investigation.
Comment on lines +95 to +109 (Contributor):
⚠️ Potential issue | 🟠 Major

Make runner health monitoring a Phase 1 requirement, not just a listed risk.

The ADR currently accepts a failure mode where deployments stop silently. For the selected Phase 1 design, minimum observability should be part of the decision itself: alert on runner offline/registration failure, queued-job age, and pod unavailability. Otherwise releases can stall indefinitely without anyone noticing.

Suggested ADR addition

```diff
 **Risks:**

-* No alerting on standalone runner failure. If the pod crashes, deployments silently stop and GitHub Actions jobs queue instead of running. We need monitoring to detect this.
+* Standalone runner failure must be monitored explicitly. Before rollout, we need alerting for runner pod availability, runner registration/online status, and queued-job age so that deployment outages are detected quickly.
 * IT may not approve CRDs for Phase 2, leaving us on Phase 1 permanently.
 * Moving to Phase 2 introduces concurrent deploy jobs. We would need to investigate serialization or locking to prevent jobs from trampling each other in prod.
```
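
One lightweight way to approximate this without cluster-side monitoring infrastructure would be a scheduled workflow on a GitHub-hosted runner, so it still fires when the in-cluster runner is down; the cadence and threshold below are assumptions:

```yaml
# Hypothetical stale-queue check; a failure here triggers GitHub's
# normal workflow-failure notifications.
name: runner-health
on:
  schedule:
    - cron: "*/30 * * * *"   # assumed cadence
jobs:
  check-queue:
    runs-on: ubuntu-latest   # GitHub-hosted, independent of our runner
    steps:
      - run: |
          # Fail if any run has been queued longer than 30 minutes.
          stale=$(gh run list --repo "${{ github.repository }}" \
            --status queued --json createdAt \
            --jq '[.[] | select(now - (.createdAt | fromdate) > 1800)] | length')
          test "$stale" -eq 0
        env:
          GH_TOKEN: ${{ github.token }}
```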

Contributor Author:

We're not at a point in this project's lifetime where we have monitoring infrastructure into which this information could flow, nor a staff of AI devops agents or human engineers to act on it.

It's a good idea, but we don't have it presently.

We probably need a separate ADR for this.


## Concurrent Deploys

Phase 1 serializes deploys naturally — one runner, one job at a time. If we move to Phase 2 (ARC), concurrent runners could deploy conflicting changes simultaneously.

Before enabling concurrent runners, we must serialize deploy jobs to prevent parallel deploys from overwriting each other in prod. The GitHub Actions [`concurrency`](https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/control-the-concurrency-of-workflows-and-jobs) key handles this:

```yaml
concurrency:
  group: deploy-prod
  cancel-in-progress: false
```

This makes deploys sequential again, even with multiple runners — negating most of ARC's benefit. It is a safe starting point, not a long-term solution. If we later need parallel deploys, we will need a broader strategy to prevent conflicts (e.g. environment locking, progressive rollouts). We have not yet scoped that work.
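
In context, the key sits at the top level of the deploy workflow. A minimal sketch, where the workflow name, trigger, runner label, and deploy step are illustrative:

```yaml
# Hypothetical deploy workflow showing where the concurrency key lives.
name: deploy
on:
  push:
    branches: [main]
concurrency:
  group: deploy-prod
  cancel-in-progress: false   # queue new deploys rather than cancel in-flight ones
jobs:
  deploy:
    runs-on: self-hosted      # assumed runner label
    steps:
      - uses: actions/checkout@v4
      - run: oc apply -f manifests/   # illustrative deploy step
```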

## References

* [Actions Runner Controller](https://github.com/actions/actions-runner-controller)
* [GitHub ARC authentication docs](https://docs.github.com/en/actions/tutorials/use-actions-runner-controller/authenticate-to-the-api#deploying-using-github-app-authentication)
* [GitHub Actions runner image](https://github.com/actions/runner/pkgs/container/actions-runner)