26 changes: 26 additions & 0 deletions README.md
@@ -20,6 +20,7 @@ madengine is a modern CLI tool for running Large Language Models (LLMs) and Deep

- [Key Features](#-key-features)
- [Quick Start](#-quick-start)
- [Smoke Testing](#-smoke-testing)
- [Commands](#-commands)
- [Documentation](#-documentation)
- [Architecture](#-architecture)
@@ -80,6 +81,31 @@ madengine run --tags dummy --rocm-path /path/to/rocm

**Results:** Performance data is written to `perf.csv` (and optionally `perf_entry.csv`). The file is created automatically if missing. Failed runs (including pre-run setup failures) are recorded with status `FAILURE` so every attempted model appears in the table. See [Exit Codes](docs/cli-reference.md#exit-codes) for CI/script usage.

## 🧪 Smoke Testing

Use the prebuilt smoke configs and wrapper script under `examples/` to quickly validate:

- RDMA recommender on SLURM + Kubernetes
- GCM preflight/collector on SLURM (phase-1 scope)

```bash
# SLURM smoke (build + run) + artifact verification
examples/run-smoke.sh slurm MODEL_DIR=/path/to/model MODEL_TAG=your_tag
examples/run-smoke.sh verify-slurm

# Kubernetes smoke (build + run) + artifact verification
examples/run-smoke.sh k8s MODEL_DIR=/path/to/model MODEL_TAG=your_tag
examples/run-smoke.sh verify-k8s
```

Smoke assets:

- `examples/run-smoke.sh`
- `examples/Makefile.smoke`
- `examples/slurm-configs/configs/smoke-rdma-gcm-slurm.json`
- `examples/k8s-configs/configs/smoke-rdma-k8s.json`
- `examples/cluster-smoke-checklist.md`

## 📋 Commands

madengine provides five main commands for model automation and benchmarking:
87 changes: 87 additions & 0 deletions docs/configuration.md
@@ -119,6 +119,93 @@ madengine run --tags my_unit_test_suite \

Disabling the scan does **not** change performance metric extraction from the log; it only affects the post-hoc grep used to set `has_errors` for status.

## Cluster Feature Layer (`additional_context.cluster`)

`cluster` is an additive feature-flag namespace for RDMA and (SLURM-only) GCM integration.
Nothing changes unless you explicitly set `cluster.*.enabled: true`.

### Schema

```json
{
  "cluster": {
    "rdma": {
      "enabled": false,
      "strict": false,
      "mode": "recommend",
      "apply_env": true,
      "artifact_name": "rdma_recommendation.json"
    },
    "gcm": {
      "enabled": false,
      "enabled_platforms": ["slurm"],
      "source": {
        "repo": "https://github.com/coketaste/gcm",
        "ref": "9fed02cd0721d3937f8749672951185f31955bd4"
      },
      "strict": false,
      "health_checks": ["check-hca", "check-ibstat"],
      "collector": {
        "enabled": false,
        "command": "slurm_job_monitor",
        "once": true,
        "sink": "file",
        "timeout_sec": 120,
        "max_retries": 1,
        "best_effort": true
      },
      "artifacts": {
        "dir": "./slurm_results/cluster_artifacts",
        "files": {
          "health_summary_json": "gcm_health_summary.json",
          "health_raw_log": "gcm_health_raw.log",
          "collector_output": "gcm_collector_output.log"
        }
      }
    }
  }
}
```

### RDMA behavior (SLURM + Kubernetes)

- `mode: "recommend"` keeps user `env_vars` precedence; only missing RDMA vars are injected.
- `mode: "enforce"` lets recommender output override existing conflicting RDMA env vars.
- `strict: true` fails the workload when no valid RDMA recommendation can be produced.
- Artifacts are written per node/pod and included in deployment result summaries.
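
The precedence rules above can be sketched in a few lines of Python. This is an illustration only, not madengine's implementation: `merge_rdma_env` is a hypothetical helper, and the variable names in the demo are placeholders for whatever the recommender actually emits.

```python
from typing import Dict, Optional


def merge_rdma_env(
    user_env: Dict[str, str],
    recommended_env: Optional[Dict[str, str]],
    mode: str = "recommend",
    strict: bool = False,
) -> Dict[str, str]:
    """Hypothetical helper: combine user env_vars with recommender output."""
    if not recommended_env:
        # No usable recommendation: strict fails, non-strict leaves env untouched.
        if strict:
            raise RuntimeError("no valid RDMA recommendation (strict mode)")
        return dict(user_env)
    if mode == "recommend":
        # User values win; only missing variables are injected.
        return {**recommended_env, **user_env}
    if mode == "enforce":
        # Recommender values win on conflict.
        return {**user_env, **recommended_env}
    raise ValueError(f"unknown mode: {mode!r}")


# Placeholder variable names and values, for illustration only.
user = {"NCCL_IB_HCA": "mlx5_0", "FOO": "1"}
rec = {"NCCL_IB_HCA": "mlx5_1", "NCCL_IB_GID_INDEX": "3"}
print("recommend:", merge_rdma_env(user, rec, mode="recommend"))
print("enforce:  ", merge_rdma_env(user, rec, mode="enforce"))
```

In `recommend` mode the user's `NCCL_IB_HCA` survives and only the missing variable is added; in `enforce` mode the recommended value replaces it.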

### GCM behavior (SLURM only in this phase)

- Health checks run in preflight; only the allowlisted checks `check-hca` and `check-ibstat` are permitted.
- `strict: true` gates submission on health-check failures; `strict: false` warns and continues.
- Collector runs as one-shot `gcm slurm_job_monitor --once` during result collection.
- Collector defaults to best effort (`best_effort: true`) and does not gate workload success.
- Source is pinned to `coketaste/gcm` at a fixed commit ref for reproducibility.
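
As a rough sketch of the strict/non-strict gating described above (not the actual madengine code), the preflight decision could look like the following. `preflight_gate` and its `results` mapping are hypothetical; only the two check names come from the schema.

```python
import logging
from typing import Dict

log = logging.getLogger("gcm_preflight_sketch")

# Allowlist taken from the schema's `health_checks` default.
ALLOWED_CHECKS = ("check-hca", "check-ibstat")


def preflight_gate(results: Dict[str, bool], strict: bool) -> bool:
    """Hypothetical gate: return True if job submission may proceed.

    `results` maps a health-check name to whether it passed.
    """
    unknown = sorted(name for name in results if name not in ALLOWED_CHECKS)
    if unknown:
        raise ValueError(f"health checks not in allowlist: {unknown}")
    failed = sorted(name for name, ok in results.items() if not ok)
    if not failed:
        return True
    if strict:
        # strict: true blocks submission on any failed check.
        log.error("health checks failed, blocking submission: %s", failed)
        return False
    # strict: false only warns and lets the job go ahead.
    log.warning("health checks failed, continuing anyway: %s", failed)
    return True


outcome = {"check-hca": True, "check-ibstat": False}
print("strict=False ->", preflight_gate(outcome, strict=False))
print("strict=True  ->", preflight_gate(outcome, strict=True))
```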

### Rollout guidance

1. Start with `cluster.rdma.enabled=true`, `strict=false`, `mode="recommend"`.
2. Validate RDMA artifacts and selected env vars on single-node, then multi-node.
3. Enable `cluster.gcm.enabled=true` with `strict=false` to observe health output.
4. Turn on `cluster.gcm.strict=true` only after the cluster baseline is stable.
5. Keep collector best-effort initially; tighten only after runtime overhead is validated.
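
One way to keep the rollout stages reproducible is to generate the `cluster` overrides per step rather than hand-editing JSON. The sketch below assumes the `cluster` block sits at the top level of the additional-context file, as in the schema above; `rollout_config` and its stage numbering (matching the list above) are hypothetical.

```python
import copy
import json

# Everything off, mirroring the schema defaults for the keys used here.
BASE = {
    "cluster": {
        "rdma": {"enabled": False, "strict": False, "mode": "recommend"},
        "gcm": {"enabled": False, "strict": False},
    }
}


def rollout_config(stage: int) -> dict:
    """Hypothetical helper: cluster overrides for a given rollout step."""
    cfg = copy.deepcopy(BASE)
    if stage >= 1:
        # Step 1: RDMA on, non-strict, recommend mode.
        cfg["cluster"]["rdma"]["enabled"] = True
    if stage >= 3:
        # Step 3: observe GCM health output without gating.
        cfg["cluster"]["gcm"]["enabled"] = True
    if stage >= 4:
        # Step 4: gate submission on health-check failures.
        cfg["cluster"]["gcm"]["strict"] = True
    return cfg


print(json.dumps(rollout_config(3), indent=2))
```

The printed JSON can be saved to a file and passed with `--additional-context-file`, merged with whatever deployment settings the run already needs.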

### Smoke configs and one-line runner

Prebuilt smoke configs are available at:

- `examples/slurm-configs/configs/smoke-rdma-gcm-slurm.json`
- `examples/k8s-configs/configs/smoke-rdma-k8s.json`

Run them with:

```bash
examples/run-smoke.sh slurm MODEL_DIR=/path/to/model MODEL_TAG=your_tag
examples/run-smoke.sh k8s MODEL_DIR=/path/to/model MODEL_TAG=your_tag
```

Artifact verification commands are documented in `examples/cluster-smoke-checklist.md`.

## Basic Configuration

**gpu_vendor** (case-insensitive):
36 changes: 36 additions & 0 deletions docs/deployment.md
@@ -1,6 +1,7 @@
# Deployment Guide

Deploy madengine workloads to Kubernetes or SLURM clusters for distributed execution.
For quick end-to-end validation commands, see the README [Smoke Testing](../README.md#-smoke-testing) section.

## Overview

@@ -11,6 +12,41 @@ madengine supports two deployment backends:

Deployment is configured via `--additional-context` and happens automatically during the run phase.

## Cluster Feature Stages

When `additional_context.cluster` is enabled, stage placement is:

- **RDMA recommender**: runtime stage on both SLURM and Kubernetes (before workload launch).
- **GCM health checks**: SLURM preflight stage before `sbatch` submission.
- **GCM collector snapshot**: SLURM result-collection stage (post-run, one-shot).

Current scope is intentionally phased:

- RDMA: **SLURM + Kubernetes**
- GCM: **SLURM only**

## Cluster Smoke Runner

For quick validation of the new cluster feature layer, use the smoke assets under `examples/`:

- Configs:
- `examples/slurm-configs/configs/smoke-rdma-gcm-slurm.json`
- `examples/k8s-configs/configs/smoke-rdma-k8s.json`
- One-line wrapper:
- `examples/run-smoke.sh`
- Full checklist:
- `examples/cluster-smoke-checklist.md`

Example:

```bash
examples/run-smoke.sh slurm MODEL_DIR=/path/to/model MODEL_TAG=your_tag
examples/run-smoke.sh verify-slurm

examples/run-smoke.sh k8s MODEL_DIR=/path/to/model MODEL_TAG=your_tag
examples/run-smoke.sh verify-k8s
```

## Deployment Workflow

```
53 changes: 53 additions & 0 deletions examples/Makefile.smoke
@@ -0,0 +1,53 @@
.PHONY: smoke-slurm smoke-k8s smoke-slurm-verify smoke-k8s-verify

SLURM_SMOKE_CONFIG ?= examples/slurm-configs/configs/smoke-rdma-gcm-slurm.json
K8S_SMOKE_CONFIG ?= examples/k8s-configs/configs/smoke-rdma-k8s.json

SLURM_SMOKE_MANIFEST ?= build_manifest.slurm.smoke.json
K8S_SMOKE_MANIFEST ?= build_manifest.k8s.smoke.json

_require-model-vars:
	@test -n "$(MODEL_DIR)" || (echo "ERROR: MODEL_DIR is required"; exit 1)
	@test -n "$(MODEL_TAG)" || (echo "ERROR: MODEL_TAG is required"; exit 1)

smoke-slurm: _require-model-vars
	MODEL_DIR="$(MODEL_DIR)" madengine build \
		--tags "$(MODEL_TAG)" \
		--additional-context-file "$(SLURM_SMOKE_CONFIG)" \
		--manifest-output "$(SLURM_SMOKE_MANIFEST)"
	MODEL_DIR="$(MODEL_DIR)" madengine run \
		--manifest-file "$(SLURM_SMOKE_MANIFEST)" \
		--timeout 3600

smoke-k8s: _require-model-vars
	MODEL_DIR="$(MODEL_DIR)" madengine build \
		--tags "$(MODEL_TAG)" \
		--additional-context-file "$(K8S_SMOKE_CONFIG)" \
		--manifest-output "$(K8S_SMOKE_MANIFEST)"
	MODEL_DIR="$(MODEL_DIR)" madengine run \
		--manifest-file "$(K8S_SMOKE_MANIFEST)" \
		--timeout 3600

smoke-slurm-verify:
	python3 - <<'PY'
	import glob, json
	health = glob.glob("slurm_results/cluster_artifacts/**/gcm_health_summary.json", recursive=True)
	collector = glob.glob("slurm_results/cluster_artifacts/**/gcm_collector_output.log", recursive=True)
	rdma = glob.glob("slurm_results/**/rdma_recommendation.json", recursive=True)
	print("gcm health:", health)
	print("gcm collector:", collector)
	print("rdma artifacts:", rdma)
	for p in rdma[:3]:
	    data = json.load(open(p))
	    print(p, data.get("status"))
	PY

smoke-k8s-verify:
	python3 - <<'PY'
	import glob, json
	rdma = glob.glob("k8s_results/**/rdma_recommendation.json", recursive=True)
	print("rdma artifacts:", rdma)
	for p in rdma[:3]:
	    data = json.load(open(p))
	    print(p, data.get("status"))
	PY
135 changes: 135 additions & 0 deletions examples/cluster-smoke-checklist.md
@@ -0,0 +1,135 @@
# Cluster Smoke Checklist (RDMA + GCM Phase 1)

This checklist validates:

- RDMA recommender on **SLURM + Kubernetes**
- GCM health checks + one-shot collector on **SLURM only**

If you prefer one-liners, use:

```bash
make -f examples/Makefile.smoke smoke-slurm MODEL_DIR=<path> MODEL_TAG=<tag>
make -f examples/Makefile.smoke smoke-k8s MODEL_DIR=<path> MODEL_TAG=<tag>
```

Or use the wrapper script:

```bash
examples/run-smoke.sh slurm MODEL_DIR=<path> MODEL_TAG=<tag>
examples/run-smoke.sh verify-slurm
examples/run-smoke.sh k8s MODEL_DIR=<path> MODEL_TAG=<tag>
examples/run-smoke.sh verify-k8s
```

## 0) Set shared variables

```bash
cd /home/ysha/amd/madengine

Copilot AI Apr 26, 2026

This checklist hard-codes a developer-specific path (/home/ysha/amd/madengine). Use a generic instruction (e.g., cd <repo-root>) so the doc is usable by others and in CI environments.

Suggested change
cd /home/ysha/amd/madengine
cd <repo-root>

export MODEL_DIR="<path-to-your-model-dir>"
export MODEL_TAG="<your-model-tag>"
```

---

## 1) SLURM smoke (RDMA + GCM)

### 1.1 Build

```bash
madengine build \
--tags "${MODEL_TAG}" \
--additional-context-file "examples/slurm-configs/configs/smoke-rdma-gcm-slurm.json" \
--manifest-output "build_manifest.slurm.smoke.json"
```

### 1.2 Run

```bash
madengine run \
--manifest-file "build_manifest.slurm.smoke.json" \
--timeout 3600
```

### 1.3 Verify artifacts

```bash
# GCM health summary
python3 - <<'PY'
import glob, json
matches = glob.glob("slurm_results/cluster_artifacts/**/gcm_health_summary.json", recursive=True)
print("gcm_health_summary files:", matches)
for p in matches[:2]:
    print(p, json.load(open(p)).get("status"))
PY

# GCM collector output
python3 - <<'PY'
import glob
matches = glob.glob("slurm_results/cluster_artifacts/**/gcm_collector_output.log", recursive=True)
print("gcm_collector_output files:", matches)
PY

# RDMA artifacts copied into each per-node collection directory
python3 - <<'PY'
import glob, json
matches = glob.glob("slurm_results/**/rdma_recommendation.json", recursive=True)
print("rdma_recommendation files:", matches)
for p in matches[:3]:
    data = json.load(open(p))
    print(p, data.get("status"), sorted((data.get("recommended_env") or {}).keys())[:6])
PY
```
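
The snippets above only print what they find. For CI it may be more useful to fail when an artifact is missing; a minimal sketch follows, using the same glob patterns as above. `missing_artifacts` and `main` are hypothetical names, not part of madengine.

```python
import glob
import os
from typing import Dict, List

# Same patterns as the manual checks above.
EXPECTED = {
    "gcm health summary": "slurm_results/cluster_artifacts/**/gcm_health_summary.json",
    "gcm collector output": "slurm_results/cluster_artifacts/**/gcm_collector_output.log",
    "rdma recommendation": "slurm_results/**/rdma_recommendation.json",
}


def missing_artifacts(patterns: Dict[str, str], root: str = ".") -> List[str]:
    """Return the labels whose glob pattern matches no file under `root`."""
    missing = []
    for label, pattern in patterns.items():
        if not glob.glob(os.path.join(root, pattern), recursive=True):
            missing.append(label)
    return missing


def main() -> int:
    """Return 0 when every expected artifact exists, 1 otherwise."""
    missing = missing_artifacts(EXPECTED)
    for label in missing:
        print("MISSING:", label)
    return 1 if missing else 0


# Report only; in CI, wrap this as `sys.exit(main())` to fail the job.
print("exit code would be:", main())
```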

---

## 2) Kubernetes smoke (RDMA only)

### 2.1 Build

```bash
madengine build \
--tags "${MODEL_TAG}" \
--additional-context-file "examples/k8s-configs/configs/smoke-rdma-k8s.json" \
--manifest-output "build_manifest.k8s.smoke.json"
```

### 2.2 Run

```bash
madengine run \
--manifest-file "build_manifest.k8s.smoke.json" \
--timeout 3600
```

### 2.3 Verify artifacts

```bash
python3 - <<'PY'
import glob, json
matches = glob.glob("k8s_results/**/rdma_recommendation.json", recursive=True)
print("rdma_recommendation files:", matches)
for p in matches[:3]:
    data = json.load(open(p))
    print(p, data.get("status"), sorted((data.get("recommended_env") or {}).keys())[:6])
PY
```

---

## 3) Optional strict-mode gate checks

### 3.1 SLURM GCM strict gate

Set in `examples/slurm-configs/configs/smoke-rdma-gcm-slurm.json`:

- `cluster.gcm.strict: true`

Then rerun section 1. If health checks fail, submission should fail early.

### 3.2 RDMA strict gate

Set in smoke config(s):

- `cluster.rdma.strict: true`

Then rerun section 1 or 2. The workload should fail when no valid RDMA recommendation can be produced.
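
To avoid editing the checked-in smoke configs by hand, a strict variant can be written to a separate file and passed to `--additional-context-file` instead. The sketch below assumes the `cluster` block sits at the top level of the config file, as in the documented schema; `write_strict_variant` is a hypothetical helper, and the demo uses a throwaway config rather than the real smoke file.

```python
import json
import tempfile
from pathlib import Path


def write_strict_variant(src, dst, feature: str) -> dict:
    """Hypothetical helper: copy a config and set cluster.<feature>.strict to true.

    `feature` is "gcm" or "rdma". Assumes `cluster` is a top-level key.
    """
    cfg = json.loads(Path(src).read_text())
    cfg.setdefault("cluster", {}).setdefault(feature, {})["strict"] = True
    Path(dst).write_text(json.dumps(cfg, indent=2) + "\n")
    return cfg


# Demo against a throwaway config so the original file stays untouched.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "smoke.json"
    dst = Path(tmp) / "smoke.strict.json"
    src.write_text(json.dumps({"cluster": {"gcm": {"enabled": True, "strict": False}}}))
    write_strict_variant(src, dst, "gcm")
    print(dst.read_text())
```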
18 changes: 18 additions & 0 deletions examples/k8s-configs/README.md
@@ -146,6 +146,24 @@ MODEL_DIR=tests/fixtures/dummy madengine run \
--live-output
```

## 🧪 Cluster Feature Smoke Config (RDMA)

Use this phase-1 smoke config to validate RDMA recommender behavior on Kubernetes.
(`cluster.gcm` remains SLURM-only in this phase.)

Config file:

- `examples/k8s-configs/configs/smoke-rdma-k8s.json`

One-line runner:

```bash
examples/run-smoke.sh k8s MODEL_DIR=/path/to/model MODEL_TAG=your_tag
examples/run-smoke.sh verify-k8s
```

For full command-by-command verification, see `examples/cluster-smoke-checklist.md`.

---

## 📁 Available Configurations