ROCm · coketaste · Apr 26, 2026 · Copilot
@@ -20,6 +20,7 @@ madengine is a modern CLI tool for running Large Language Models (LLMs) and Deep
 
 - [Key Features](#-key-features)
 - [Quick Start](#-quick-start)
+- [Smoke Testing](#-smoke-testing)
 - [Commands](#-commands)
 - [Documentation](#-documentation)
 - [Architecture](#-architecture)
@@ -80,6 +81,31 @@ madengine run --tags dummy --rocm-path /path/to/rocm
 
 **Results:** Performance data is written to `perf.csv` (and optionally `perf_entry.csv`). The file is created automatically if missing. Failed runs (including pre-run setup failures) are recorded with status `FAILURE` so every attempted model appears in the table. See [Exit Codes](docs/cli-reference.md#exit-codes) for CI/script usage.
 
+## 🧪 Smoke Testing
+
+Use the prebuilt smoke configs and wrapper script under `examples/` to quickly validate:
+
+- RDMA recommender on SLURM + Kubernetes
+- GCM preflight/collector on SLURM (phase-1 scope)
+
+```bash
+# SLURM smoke (build + run) + artifact verification
+examples/run-smoke.sh slurm MODEL_DIR=/path/to/model MODEL_TAG=your_tag
+examples/run-smoke.sh verify-slurm
+
+# Kubernetes smoke (build + run) + artifact verification
+examples/run-smoke.sh k8s MODEL_DIR=/path/to/model MODEL_TAG=your_tag
+examples/run-smoke.sh verify-k8s
+```
+
+Smoke assets:
+
+- `examples/run-smoke.sh`
+- `examples/Makefile.smoke`
+- `examples/slurm-configs/configs/smoke-rdma-gcm-slurm.json`
+- `examples/k8s-configs/configs/smoke-rdma-k8s.json`
+- `examples/cluster-smoke-checklist.md`
+
 ## 📋 Commands
 
 madengine provides five main commands for model automation and benchmarking:

@@ -119,6 +119,93 @@ madengine run --tags my_unit_test_suite \
 
 Disabling the scan does **not** change performance metric extraction from the log; it only affects the post-hoc grep used to set `has_errors` for status.
 
+## Cluster Feature Layer (`additional_context.cluster`)
+
+`cluster` is an additive feature-flag namespace for RDMA and (SLURM-only) GCM integration.
+Nothing changes unless you explicitly set `cluster.*.enabled: true`.
+
+### Schema
+
+```json
+{
+  "cluster": {
+    "rdma": {
+      "enabled": false,
+      "strict": false,
+      "mode": "recommend",
+      "apply_env": true,
+      "artifact_name": "rdma_recommendation.json"
+    },
+    "gcm": {
+      "enabled": false,
+      "enabled_platforms": ["slurm"],
+      "source": {
+        "repo": "https://github.com/coketaste/gcm",
+        "ref": "9fed02cd0721d3937f8749672951185f31955bd4"
+      },
+      "strict": false,
+      "health_checks": ["check-hca", "check-ibstat"],
+      "collector": {
+        "enabled": false,
+        "command": "slurm_job_monitor",
+        "once": true,
+        "sink": "file",
+        "timeout_sec": 120,
+        "max_retries": 1,
+        "best_effort": true
+      },
+      "artifacts": {
+        "dir": "./slurm_results/cluster_artifacts",
+        "files": {
+          "health_summary_json": "gcm_health_summary.json",
+          "health_raw_log": "gcm_health_raw.log",
+          "collector_output": "gcm_collector_output.log"
+        }
+      }
+    }
+  }
+}
+```
+
+### RDMA behavior (SLURM + Kubernetes)
+
+- `mode: "recommend"` keeps user `env_vars` precedence; only missing RDMA vars are injected.
+- `mode: "enforce"` lets recommender output override existing conflicting RDMA env vars.
+- `strict: true` fails the workload when no valid RDMA recommendation can be produced.
+- Artifacts are written per node/pod and included in deployment result summaries.
+
+### GCM behavior (SLURM only in this phase)
+
+- Health checks run in preflight; only the allowlisted checks `check-hca` and `check-ibstat` are accepted.
+- `strict: true` gates submission on health-check failures; `strict: false` warns and continues.
+- Collector runs as one-shot `gcm slurm_job_monitor --once` during result collection.
+- Collector defaults to best effort (`best_effort: true`) and does not gate workload success.
+- The source is pinned to `coketaste/gcm` at a fixed commit ref for reproducibility.
+
+### Rollout guidance
+
+1. Start with `cluster.rdma.enabled=true`, `strict=false`, `mode="recommend"`.
+2. Validate RDMA artifacts and selected env vars on single-node, then multi-node.
+3. Enable `cluster.gcm.enabled=true` with `strict=false` to observe health output.
+4. Turn on `cluster.gcm.strict=true` only after the cluster baseline is stable.
+5. Keep collector best-effort initially; tighten only after runtime overhead is validated.
+
+### Smoke configs and one-line runner
+
+Prebuilt smoke configs are available at:
+
+- `examples/slurm-configs/configs/smoke-rdma-gcm-slurm.json`
+- `examples/k8s-configs/configs/smoke-rdma-k8s.json`
+
+Run them with:
+
+```bash
+examples/run-smoke.sh slurm MODEL_DIR=/path/to/model MODEL_TAG=your_tag
+examples/run-smoke.sh k8s MODEL_DIR=/path/to/model MODEL_TAG=your_tag
+```
+
+Artifact verification commands are documented in `examples/cluster-smoke-checklist.md`.
+
 ## Basic Configuration
 
 **gpu_vendor** (case-insensitive):
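
The precedence rules listed under "RDMA behavior" in the hunk above can be sketched as follows. This is a minimal illustration under stated assumptions, not madengine's implementation: `merge_rdma_env` is a hypothetical helper, and the recommender output is assumed to be a flat dict of env vars, or `None` when no valid recommendation could be produced.

```python
from typing import Dict, Optional


def merge_rdma_env(
    user_env: Dict[str, str],
    recommended: Optional[Dict[str, str]],
    mode: str = "recommend",
    strict: bool = False,
) -> Dict[str, str]:
    """Combine user env vars with recommender output (hypothetical helper).

    recommend: user values win; only missing vars are injected.
    enforce:   recommender values override conflicting user values.
    strict:    a missing recommendation is an error instead of a no-op.
    """
    if recommended is None:
        if strict:
            raise RuntimeError("no valid RDMA recommendation (strict mode)")
        return dict(user_env)

    if mode == "recommend":
        merged = dict(recommended)
        merged.update(user_env)  # user-supplied values take precedence
    elif mode == "enforce":
        merged = dict(user_env)
        merged.update(recommended)  # recommender output takes precedence
    else:
        raise ValueError(f"unknown RDMA mode: {mode!r}")
    return merged
```

With illustrative values, a user setting of `{"NCCL_IB_HCA": "mlx5_0"}` and a recommendation of `{"NCCL_IB_HCA": "mlx5_1", "NCCL_SOCKET_IFNAME": "eth0"}` keep `mlx5_0` and gain `NCCL_SOCKET_IFNAME` under `recommend`, while `enforce` switches `NCCL_IB_HCA` to `mlx5_1`.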

@@ -1,6 +1,7 @@
 # Deployment Guide
 
 Deploy madengine workloads to Kubernetes or SLURM clusters for distributed execution.
+For quick end-to-end validation commands, see the README [Smoke Testing](../README.md#-smoke-testing) section.
 
 ## Overview
 
@@ -11,6 +12,41 @@ madengine supports two deployment backends:
 
 Deployment is configured via `--additional-context` and happens automatically during the run phase.
 
+## Cluster Feature Stages
+
+When features under `additional_context.cluster` are enabled, they run at these stages:
+
+- **RDMA recommender**: runtime stage on both SLURM and Kubernetes (before workload launch).
+- **GCM health checks**: SLURM preflight stage before `sbatch` submission.
+- **GCM collector snapshot**: SLURM result-collection stage (post-run, one-shot).
+
+Current scope is intentionally phased:
+
+- RDMA: **SLURM + Kubernetes**
+- GCM: **SLURM only**
+
+## Cluster Smoke Runner
+
+For quick validation of the new cluster feature layer, use the smoke assets under `examples/`:
+
+- Configs:
+  - `examples/slurm-configs/configs/smoke-rdma-gcm-slurm.json`
+  - `examples/k8s-configs/configs/smoke-rdma-k8s.json`
+- One-line wrapper:
+  - `examples/run-smoke.sh`
+- Full checklist:
+  - `examples/cluster-smoke-checklist.md`
+
+Example:
+
+```bash
+examples/run-smoke.sh slurm MODEL_DIR=/path/to/model MODEL_TAG=your_tag
+examples/run-smoke.sh verify-slurm
+
+examples/run-smoke.sh k8s MODEL_DIR=/path/to/model MODEL_TAG=your_tag
+examples/run-smoke.sh verify-k8s
+```
+
 ## Deployment Workflow
 
 ```

@@ -0,0 +1,53 @@
+.PHONY: _require-model-vars smoke-slurm smoke-k8s smoke-slurm-verify smoke-k8s-verify
+
+SLURM_SMOKE_CONFIG ?= examples/slurm-configs/configs/smoke-rdma-gcm-slurm.json
+K8S_SMOKE_CONFIG ?= examples/k8s-configs/configs/smoke-rdma-k8s.json
+
+SLURM_SMOKE_MANIFEST ?= build_manifest.slurm.smoke.json
+K8S_SMOKE_MANIFEST ?= build_manifest.k8s.smoke.json
+
+_require-model-vars:
+	@test -n "$(MODEL_DIR)" || (echo "ERROR: MODEL_DIR is required"; exit 1)
+	@test -n "$(MODEL_TAG)" || (echo "ERROR: MODEL_TAG is required"; exit 1)
+
+smoke-slurm: _require-model-vars
+	MODEL_DIR="$(MODEL_DIR)" madengine build \
+		--tags "$(MODEL_TAG)" \
+		--additional-context-file "$(SLURM_SMOKE_CONFIG)" \
+		--manifest-output "$(SLURM_SMOKE_MANIFEST)"
+	MODEL_DIR="$(MODEL_DIR)" madengine run \
+		--manifest-file "$(SLURM_SMOKE_MANIFEST)" \
+		--timeout 3600
+
+smoke-k8s: _require-model-vars
+	MODEL_DIR="$(MODEL_DIR)" madengine build \
+		--tags "$(MODEL_TAG)" \
+		--additional-context-file "$(K8S_SMOKE_CONFIG)" \
+		--manifest-output "$(K8S_SMOKE_MANIFEST)"
+	MODEL_DIR="$(MODEL_DIR)" madengine run \
+		--manifest-file "$(K8S_SMOKE_MANIFEST)" \
+		--timeout 3600
+
+# The Python body lives in a make variable: heredoc lines without a leading
+# tab are not valid recipe lines and abort make with "missing separator".
+define SLURM_VERIFY_PY
+import glob, json
+health = glob.glob("slurm_results/cluster_artifacts/**/gcm_health_summary.json", recursive=True)
+collector = glob.glob("slurm_results/cluster_artifacts/**/gcm_collector_output.log", recursive=True)
+rdma = glob.glob("slurm_results/**/rdma_recommendation.json", recursive=True)
+print("gcm health:", health)
+print("gcm collector:", collector)
+print("rdma artifacts:", rdma)
+for p in rdma[:3]:
+    data = json.load(open(p))
+    print(p, data.get("status"))
+endef
+export SLURM_VERIFY_PY
+
+smoke-slurm-verify:
+	@python3 -c "$$SLURM_VERIFY_PY"
+
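+# Script body is kept in an exported make variable so it survives recipe parsing.
+define K8S_VERIFY_PY
+import glob, json
+rdma = glob.glob("k8s_results/**/rdma_recommendation.json", recursive=True)
+print("rdma artifacts:", rdma)
+for p in rdma[:3]:
+    data = json.load(open(p))
+    print(p, data.get("status"))
+endef
+export K8S_VERIFY_PY
+
+smoke-k8s-verify:
+	@python3 -c "$$K8S_VERIFY_PY"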
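
The verify targets above only print what they find. As a minimal sketch, the same artifact scan can be written as a reusable function that returns data a CI script could assert on. `collect_rdma_status` is a hypothetical name and not part of madengine; the artifact file name and the `status` field are taken from the targets above, and treating unreadable files as `None` is an assumption of this sketch.

```python
import json
from pathlib import Path
from typing import Dict, Optional


def collect_rdma_status(
    results_root: str,
    artifact_name: str = "rdma_recommendation.json",
) -> Dict[str, Optional[str]]:
    """Map each RDMA artifact under results_root to its "status" field.

    Hypothetical helper: walks the tree like the recursive glob in the
    verify targets. Unreadable or malformed files map to None.
    """
    statuses: Dict[str, Optional[str]] = {}
    for path in sorted(Path(results_root).rglob(artifact_name)):
        try:
            with path.open() as fh:
                data = json.load(fh)
        except (OSError, json.JSONDecodeError):
            statuses[str(path)] = None
            continue
        # Only a JSON object can carry a "status" key.
        statuses[str(path)] = data.get("status") if isinstance(data, dict) else None
    return statuses
```

Calling `collect_rdma_status("slurm_results")` or `collect_rdma_status("k8s_results")` covers the same paths as the two verify targets.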
@@ -0,0 +1,135 @@
+# Cluster Smoke Checklist (RDMA + GCM Phase 1)
+
+This checklist validates:
+
+- RDMA recommender on **SLURM + Kubernetes**
+- GCM health checks + one-shot collector on **SLURM only**
+
+If you prefer one-liners, use:
+
+```bash
+make -f examples/Makefile.smoke smoke-slurm MODEL_DIR=<path> MODEL_TAG=<tag>
+make -f examples/Makefile.smoke smoke-k8s MODEL_DIR=<path> MODEL_TAG=<tag>
+```
+
+Or use the wrapper script:
+
+```bash
+examples/run-smoke.sh slurm MODEL_DIR=<path> MODEL_TAG=<tag>
+examples/run-smoke.sh verify-slurm
+examples/run-smoke.sh k8s MODEL_DIR=<path> MODEL_TAG=<tag>
+examples/run-smoke.sh verify-k8s
+```
+
+## 0) Set shared variables
+
+```bash
+cd <repo-root>
+export MODEL_DIR="<path-to-your-model-dir>"
+export MODEL_TAG="<your-model-tag>"
+```
+
+---
+
+## 1) SLURM smoke (RDMA + GCM)
+
+### 1.1 Build
+
+```bash
+madengine build \
+  --tags "${MODEL_TAG}" \
+  --additional-context-file "examples/slurm-configs/configs/smoke-rdma-gcm-slurm.json" \
+  --manifest-output "build_manifest.slurm.smoke.json"
+```
+
+### 1.2 Run
+
+```bash
+madengine run \
+  --manifest-file "build_manifest.slurm.smoke.json" \
+  --timeout 3600
+```
+
+### 1.3 Verify artifacts
+
+```bash
+# GCM health summary
+python3 - <<'PY'
+import glob, json
+matches = glob.glob("slurm_results/cluster_artifacts/**/gcm_health_summary.json", recursive=True)
+print("gcm_health_summary files:", matches)
+for p in matches[:2]:
+    print(p, json.load(open(p)).get("status"))
+PY
+
+# GCM collector output
+python3 - <<'PY'
+import glob
+matches = glob.glob("slurm_results/cluster_artifacts/**/gcm_collector_output.log", recursive=True)
+print("gcm_collector_output files:", matches)
+PY
+
+# RDMA artifacts copied into each per-node collection directory
+python3 - <<'PY'
+import glob, json
+matches = glob.glob("slurm_results/**/rdma_recommendation.json", recursive=True)
+print("rdma_recommendation files:", matches)
+for p in matches[:3]:
+    data = json.load(open(p))
+    print(p, data.get("status"), sorted((data.get("recommended_env") or {}).keys())[:6])
+PY
+```
+
+---
+
+## 2) Kubernetes smoke (RDMA only)
+
+### 2.1 Build
+
+```bash
+madengine build \
+  --tags "${MODEL_TAG}" \
+  --additional-context-file "examples/k8s-configs/configs/smoke-rdma-k8s.json" \
+  --manifest-output "build_manifest.k8s.smoke.json"
+```
+
+### 2.2 Run
+
+```bash
+madengine run \
+  --manifest-file "build_manifest.k8s.smoke.json" \
+  --timeout 3600
+```
+
+### 2.3 Verify artifacts
+
+```bash
+python3 - <<'PY'
+import glob, json
+matches = glob.glob("k8s_results/**/rdma_recommendation.json", recursive=True)
+print("rdma_recommendation files:", matches)
+for p in matches[:3]:
+    data = json.load(open(p))
+    print(p, data.get("status"), sorted((data.get("recommended_env") or {}).keys())[:6])
+PY
+```
+
+---
+
+## 3) Optional strict-mode gate checks
+
+### 3.1 SLURM GCM strict gate
+
+Set in `examples/slurm-configs/configs/smoke-rdma-gcm-slurm.json`:
+
+- `cluster.gcm.strict: true`
+
+Then rerun section 1. If health checks fail, submission should fail early.
+
+### 3.2 RDMA strict gate
+
+Set in smoke config(s):
+
+- `cluster.rdma.strict: true`
+
+Then rerun section 1 or 2. The workload should fail when no valid RDMA recommendation can be produced.
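
The strict-gate behavior checked in section 3.1 can be sketched as a small decision function. This is an illustration only, assuming health-check results arrive as a mapping from check name to pass/fail; `gate_on_health` and `PreflightError` are hypothetical names, not madengine APIs.

```python
import logging
from typing import List, Mapping

logger = logging.getLogger("preflight")


class PreflightError(RuntimeError):
    """Raised when strict mode blocks job submission (hypothetical)."""


def gate_on_health(results: Mapping[str, bool], strict: bool) -> List[str]:
    """Return the names of failed checks, or raise when strict is set.

    strict=True:  any failed check aborts before submission.
    strict=False: failures are logged as warnings and the run continues.
    """
    failed = sorted(name for name, ok in results.items() if not ok)
    if failed and strict:
        raise PreflightError("health checks failed: " + ", ".join(failed))
    for name in failed:
        logger.warning("health check %s failed; continuing (strict=false)", name)
    return failed
```

With `{"check-hca": True, "check-ibstat": False}`, the non-strict call returns `["check-ibstat"]` after logging a warning, while the strict call raises before any job would be submitted.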
@@ -146,6 +146,24 @@ MODEL_DIR=tests/fixtures/dummy madengine run \
   --live-output
 ```
 
+## 🧪 Cluster Feature Smoke Config (RDMA)
+
+Use this phase-1 smoke config to validate RDMA recommender behavior on Kubernetes.
+(`cluster.gcm` remains SLURM-only in this phase.)
+
+Config file:
+
+- `examples/k8s-configs/configs/smoke-rdma-k8s.json`
+
+One-line runner:
+
+```bash
+examples/run-smoke.sh k8s MODEL_DIR=/path/to/model MODEL_TAG=your_tag
+examples/run-smoke.sh verify-k8s
+```
+
+For full command-by-command verification, see `examples/cluster-smoke-checklist.md`.
+
 ---
 
 ## 📁 Available Configurations