Anantys-oss/ci-orchestration

CI Orchestration - GitHub Actions Runner Autoscaler

Scale GitHub Actions self-hosted runners on Railway from idle to N replicas based on demand. Near-zero cost when no jobs are running.

The problem

Running self-hosted runners on Railway means paying for containers 24/7, even when no CI jobs are running. With 8 replicas, that's 8x the cost sitting idle most of the day.

The solution

This tiny service (< 5 MB RAM) sits between GitHub and Railway. It listens for GitHub webhooks and scales your runner service up and down automatically:

                     workflow_job webhook
  GitHub  ──────────────────────────────────>  Scaler (always on, tiny)
                                                  │
                                                  │ Railway API
                                                  v
                                               Runner service
                                               1 replica (idle)
                                               ──────────────────
                                               N replicas (CI running)

  • No jobs queued --> 1 idle replica (Railway minimum), containers dead, near-zero cost
  • Jobs queued --> scales up to match demand, restarts containers
  • Jobs done --> scales back to 1 after a grace period

Quick start

See INSTALL.md for the full step-by-step setup guide covering:

  • GitHub side: PAT creation, webhook configuration
  • Railway side: runner service, scaler service, environment variables
  • Verification and troubleshooting

Configuration

All configuration is done via environment variables on the scaler service.

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| RAILWAY_API_TOKEN | Yes | - | Railway project token to control runner replicas |
| TARGET_SERVICE_ID | Yes | - | Railway service ID of the runner service |
| TARGET_ENVIRONMENT_ID | No | - | Railway environment ID (overrides auto-injected RAILWAY_ENVIRONMENT_ID) |
| WEBHOOK_SECRET | No | - | GitHub webhook secret for signature verification |
| MAX_REPLICAS | No | 8 | Maximum number of runner replicas |
| SCALE_DOWN_DELAY_MS | No | 30000 | Wait time (ms) before scaling down after last job |
| RUNNER_LABEL | No | railway | Label to filter which jobs trigger scaling |
| PORT | No | 3000 | HTTP port for the scaler service |
| GITHUB_TOKEN | No | - | GitHub PAT for startup sync and periodic reconciliation |
| GITHUB_REPO | No | - | GitHub repo (owner/repo), required with GITHUB_TOKEN |
| SYNC_INTERVAL_MS | No | 900000 | Reconciliation interval in ms (default 15 min) |

Note on environment ID: Railway auto-injects RAILWAY_ENVIRONMENT_ID into every service with the service's own environment ID. If the scaler and runner are in the same environment (typical), you don't need to set TARGET_ENVIRONMENT_ID -- the auto-injected value works. Set TARGET_ENVIRONMENT_ID only if the runner is in a different environment.
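As an illustration of the table and the note above, here is a minimal Python sketch of how these variables could be read, including the environment ID fallback. The `ScalerConfig` and `load_config` names are hypothetical and the service's actual implementation may differ; only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ScalerConfig:
    """Hypothetical container for the settings documented above."""
    railway_api_token: str
    target_service_id: str
    environment_id: Optional[str]
    max_replicas: int
    scale_down_delay_ms: int
    runner_label: str


def load_config(env: Mapping[str, str] = os.environ) -> ScalerConfig:
    return ScalerConfig(
        # Required: a missing key raises KeyError at startup.
        railway_api_token=env["RAILWAY_API_TOKEN"],
        target_service_id=env["TARGET_SERVICE_ID"],
        # The explicit override wins; otherwise use Railway's auto-injected value.
        environment_id=env.get("TARGET_ENVIRONMENT_ID")
        or env.get("RAILWAY_ENVIRONMENT_ID"),
        # Optional values fall back to the documented defaults.
        max_replicas=int(env.get("MAX_REPLICAS", "8")),
        scale_down_delay_ms=int(env.get("SCALE_DOWN_DELAY_MS", "30000")),
        runner_label=env.get("RUNNER_LABEL", "railway"),
    )
```

Passing a plain dict instead of `os.environ` makes the fallback easy to check locally.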

API endpoints

| Method | Path | Description |
| --- | --- | --- |
| POST | /webhook | GitHub webhook receiver |
| GET | /health | Current state: job counts, replica count, config |
| GET | /logs | Last 50 scaling events with timestamps |
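When WEBHOOK_SECRET is set, the /webhook endpoint can check GitHub's `X-Hub-Signature-256` header, which carries an HMAC-SHA256 of the raw request body. The Python sketch below shows that check in isolation; the `verify_signature` name is hypothetical and this is not necessarily how the scaler implements it.

```python
import hashlib
import hmac
from typing import Optional


def verify_signature(secret: str, body: bytes, signature_header: Optional[str]) -> bool:
    """Return True if the header matches the HMAC-SHA256 of the raw body."""
    if not signature_header or not signature_header.startswith("sha256="):
        return False
    digest = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    expected = "sha256=" + digest
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(
        expected.encode("utf-8"), signature_header.encode("utf-8")
    )
```

The check must run against the raw bytes of the request, before any JSON parsing, because re-serializing the payload can change the digest.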

Health check example

curl https://<your-scaler>.up.railway.app/health
{
  "status": "ok",
  "uptime": 3600,
  "queuedJobs": 0,
  "activeJobs": 0,
  "currentReplicas": 1,
  "maxReplicas": 8,
  "scaleDownPending": false,
  "syncEnabled": true
}

How it works under the hood

Scaling lifecycle

  1. You configure a GitHub webhook that sends workflow_job events to this service
  2. When a job targeting your self-hosted runners is queued, the scaler adds one replica (incremental, not jump-to-total)
  3. The scaler then restarts the runner deployment so containers come alive (ephemeral runners exit after each job)
  4. Each runner starts with EPHEMERAL=true -- it registers with GitHub, picks up one job, executes it, then exits cleanly
  5. When a job is completed, the scaler decrements its job count and schedules a gradual scale-down
  6. Scale-down removes one replica every SCALE_DOWN_DELAY_MS until replicas match the remaining job count (minimum 1)
  7. New jobs arriving during scale-down cancel the timer, keeping spare replicas warm
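The lifecycle above can be sketched as a small state model. This Python sketch is an approximation under stated assumptions: `ScalerState` and the function names are hypothetical, "adds one replica" is taken literally (the service may treat the idle replica differently), and timers and Railway API calls are left out.

```python
from dataclasses import dataclass


@dataclass
class ScalerState:
    jobs: int = 0      # queued + running jobs the scaler knows about
    replicas: int = 1  # Railway minimum: one idle replica


def on_job_queued(state: ScalerState, max_replicas: int) -> None:
    """Step 2: add one replica per queued job, capped at MAX_REPLICAS."""
    state.jobs += 1
    state.replicas = min(state.replicas + 1, max_replicas)


def on_job_completed(state: ScalerState) -> None:
    """Step 5: decrement the job count (never below zero)."""
    state.jobs = max(state.jobs - 1, 0)


def scale_down_step(state: ScalerState) -> bool:
    """Step 6: remove one replica; return True if another step is needed.

    A caller would invoke this once per SCALE_DOWN_DELAY_MS tick.
    """
    floor = max(state.jobs, 1)
    if state.replicas > floor:
        state.replicas -= 1
    return state.replicas > floor
```

Keeping the transitions as plain functions makes the rules easy to test without a webhook or a Railway project.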

Concurrency handling

Multiple webhooks arriving simultaneously are handled safely:

  • A scaling mutex prevents concurrent Railway API calls -- extra requests are deferred and coalesced
  • During scale-down, new jobs correctly detect that replicas are being reduced and trigger a scale-up
  • Deployment restarts only fire after the final replica count is committed, preventing partial-scale restarts
  • State is in-memory only -- with GITHUB_TOKEN/GITHUB_REPO configured, startup reconciliation restores accurate state; without them, the next webhook self-corrects
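One way to get the "deferred and coalesced" behaviour described above is a busy flag plus a single pending slot. The Python asyncio sketch below is illustrative only: `CoalescingScaler` and the `apply` callback (standing in for the Railway API call) are hypothetical names, not the service's code.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional


class CoalescingScaler:
    """Serialize scaling calls; while one runs, keep only the latest request."""

    def __init__(self, apply: Callable[[int], Awaitable[None]]) -> None:
        self._apply = apply
        self._busy = False
        self._pending: Optional[int] = None

    async def request(self, replicas: int) -> None:
        if self._busy:
            # A call is in flight: defer, overwriting any older deferred target.
            self._pending = replicas
            return
        self._busy = True
        try:
            target: Optional[int] = replicas
            while target is not None:
                await self._apply(target)
                # Pick up whatever arrived while the call was running.
                target, self._pending = self._pending, None
        finally:
            # Reset so a stale deferred target is never replayed after an error.
            self._busy = False
            self._pending = None


async def demo() -> List[int]:
    calls: List[int] = []

    async def fake_apply(n: int) -> None:
        calls.append(n)
        await asyncio.sleep(0.01)  # simulate API latency

    scaler = CoalescingScaler(fake_apply)
    await asyncio.gather(scaler.request(2), scaler.request(3), scaler.request(4))
    return calls
```

In `demo`, the requests for 3 and 4 arrive while the call for 2 is in flight, so only the latest one (4) is applied afterwards.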

State reconciliation (optional)

Set GITHUB_TOKEN and GITHUB_REPO to enable automatic state sync:

  • Startup sync: On boot, the scaler queries Railway for the current replica count and GitHub for queued/active jobs. This prevents premature scale-down if the scaler restarts mid-build.
  • Periodic reconciliation: Every SYNC_INTERVAL_MS (default 15 min), the scaler re-syncs with both APIs and adjusts replicas if they've drifted from reality (e.g., due to missed webhooks).
  • Graceful degradation: If the GitHub token is invalid or the API is down, sync failures are logged as warnings and the scaler continues with its current state.
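The reconciliation rule can be summarised as "replicas should equal queued plus active jobs, clamped between 1 and MAX_REPLICAS". The Python sketch below assumes that reading; `target_replicas`, `reconcile`, and the `fetch_job_counts` callback (standing in for the GitHub API query) are hypothetical names.

```python
import logging
from typing import Callable, Tuple

log = logging.getLogger("scaler")


def target_replicas(queued: int, active: int, max_replicas: int) -> int:
    """Clamp the job count to the range [1, max_replicas]."""
    return max(1, min(queued + active, max_replicas))


def reconcile(
    current: int,
    fetch_job_counts: Callable[[], Tuple[int, int]],
    max_replicas: int,
) -> int:
    """Return the replica count to converge on after one sync pass."""
    try:
        queued, active = fetch_job_counts()
    except Exception as exc:
        # Graceful degradation: log a warning and keep the current state.
        log.warning("sync failed, keeping current state: %s", exc)
        return current
    return target_replicas(queued, active, max_replicas)
```

A caller would run `reconcile` once at startup and then every SYNC_INTERVAL_MS, scaling only when the result differs from the current count.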

The same GitHub PAT used for the runner's ACCESS_TOKEN works here -- it just needs read access to workflow runs.

Railway healthcheck

This service ships with a railway.json that configures:

  • Healthcheck path: /health -- Railway pings this on every deploy to confirm the service is up before routing traffic
  • Healthcheck timeout: 30 seconds (the app starts in under 2 seconds, so this is generous)
  • Restart policy: ALWAYS with 3 max retries -- if the scaler crashes, Railway restarts it automatically

These settings are picked up automatically when you deploy from this directory. No manual configuration needed in the Railway dashboard.

Workflow configuration

Your GitHub Actions workflows must target the self-hosted label together with your runner label:

jobs:
  build:
    runs-on: [self-hosted, railway]
    steps:
      - uses: actions/checkout@v4
      - run: echo "Running on an autoscaled Railway runner"

The second label (railway) must match the LABELS env var on the runner service and the RUNNER_LABEL env var on the scaler.
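The label filter can be pictured as a check on the `labels` array that GitHub includes under `workflow_job` in each webhook payload. This Python sketch assumes an exact string match; `should_handle` is a hypothetical name and the scaler's real matching rules may differ.

```python
from typing import Any, Mapping


def should_handle(payload: Mapping[str, Any], runner_label: str = "railway") -> bool:
    """Return True if the job's labels include the configured RUNNER_LABEL."""
    job = payload.get("workflow_job") or {}
    labels = job.get("labels") or []
    return runner_label in labels
```

With the workflow above, the payload carries `["self-hosted", "railway"]`, so the job triggers scaling; a job that runs on GitHub-hosted runners is ignored.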
