Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
34 changes: 34 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,18 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

- **E2E tests — `test_docker_gpus` pre-script OOM on MI350X**: The `run_rocenv_tool.sh` system-env pre-script was being OOM-killed (exit 137) inside Docker on gfx950 nodes with 6 GPUs bound, failing a test whose purpose is only GPU binding verification. Fixed by correcting the `gen_sys_env_details` condition in `container_runner.py` — the old `or` made the context key a no-op since `generate_sys_env_details` defaults to `True` — and passing `gen_sys_env_details: False` in the test's `additional_context`.

- **`--timeout 0` crashing with `signal.alarm(None)`**: `Timeout.__enter__` called `signal.alarm(None)` when `--timeout 0` was passed because the CLI correctly maps `0 → None` but `Timeout` had no guard for a falsy value. Added early-return in `__enter__`/`__exit__` when `seconds` is `None` or `0`. Also fixed the run command panels printing `Nones` for timeout when `--timeout 0` was used; they now display `disabled`.

- **Docker container name regex false positives**: The `docker ps --filter name=^/<name>$` exact-match filter embedded the container name directly into the regex without escaping, so names containing metacharacters (e.g. `.`, `[`) could match unintended containers. Applied `re.escape()` to the name before building the filter pattern.

- **`login_to_registry` type annotation**: The `registry` parameter was typed as `str` but the implementation handled `None` and callers (including tests) passed `None` to mean DockerHub. Corrected to `Optional[str]`.

- **Registry password process-list exposure**: `docker login` was invoked with the password in the argument list (visible via `/proc` or `ps`). Changed to pass it via a `MAD_REGISTRY_PASSWORD` environment variable consumed through `printf %s "$MAD_REGISTRY_PASSWORD" | docker login --password-stdin`.

- **`login_to_registry` — `raise_on_failure` not fully honoured**: Missing-key and invalid-format errors in `login_to_registry` always raised `RuntimeError` regardless of `raise_on_failure`. All three failure paths (missing registry key, invalid credential format, docker login error) are now gated on `raise_on_failure`, allowing `ContainerRunner` to fall through to public image pulls.

- **Kubernetes missing-package warning invisible**: `DeploymentFactory` raised `ImportWarning` when the `kubernetes` package was absent, which Python silences by default. Changed to `UserWarning` so the install hint is always visible.

### Changed

- **Model discovery — scope-based tag selection**: Replaced the `strict` mode flag on `DiscoverModels` with a cleaner scope-based rule that applies uniformly to both `madengine run` and `madengine build`:
Expand All @@ -33,6 +45,28 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `--tags all` and `--tags scope/all` continue to select all models globally or within a scope respectively.
- Removed `strict_discovery` parameter from `BuildOrchestrator.execute()` and the corresponding call in `RunOrchestrator._build_phase()` as they are no longer needed.

- **Shared `login_to_registry` utility**: Extracted duplicated Docker registry login logic (~120 lines) from `DockerBuilder` and `ContainerRunner` into `core/auth.py::login_to_registry()`. Both classes now delegate to it. `DockerBuilder` uses `raise_on_failure=True`; `ContainerRunner` uses `raise_on_failure=False` to allow fallback to public images.

- **Centralised credential loading**: Extracted `_load_credentials` from `BuildOrchestrator` and `RunOrchestrator` into `core/auth.py::load_credentials()`. Environment variables (`MAD_DOCKERHUB_USER`, `MAD_DOCKERHUB_PASSWORD`, `MAD_DOCKERHUB_REPO`) take precedence over `credential.json`.

- **Dead code removal**: Removed unused functions `find_and_replace_pattern` and `substring_found` (`utils/ops.py`), `highlight_log_section` (`utils/log_formatting.py`), `SessionTracker.get_session_start` and `SessionTracker.load_marker` (`utils/session_tracker.py`), and the unused `_filter_images_by_dockerfile_context` method from `RunOrchestrator`.

- **`ConfigurationError` instead of `SystemExit` in orchestrator config loading**: `BuildOrchestrator` now raises a structured `ConfigurationError` (with suggestions) instead of calling `sys.exit()` directly when configuration loading fails.

### Security

- **Registry password no longer in process argument list**: Docker login commands previously passed the password as a CLI argument visible to other users via `/proc` or `ps`. All registry logins now inject the password through a dedicated `MAD_REGISTRY_PASSWORD` environment variable and use `--password-stdin`.

- **`build-arg` values shell-quoted**: All Docker `--build-arg` key/value pairs are now wrapped with `str()` before `shlex.quote()` to prevent shell injection from non-string config values.

### Tests

- **New `TestTimeout` suite**: Covers `None`, `0`, and positive-second cases for `Timeout.__enter__`/`__exit__`, plus a `resolve_run_timeout` passthrough regression test.

- **New `TestLoginToRegistry` suite**: Covers all success and failure paths of `login_to_registry`, including `raise_on_failure=True/False` behaviour, missing registry key, invalid credential format, and `docker.io` normalisation.

- **Test suite cleanup**: Removed dead imports across 14 test files; replaced `try/assert False/except` antipattern with `pytest.raises()` (with `match=`); narrowed 5 bare `except:` clauses to `except Exception:`; deleted a pass-only dead test; removed duplicate tests; reclassified `test_profiling_tools_config.py` from unit to integration (reads real disk files) and `test_errors.py` from integration to unit (pure mocks).

## [2.0.0] - 2026-04-09

### Overview
Expand Down
129 changes: 129 additions & 0 deletions CLAUDE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,129 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Development Setup

```bash
# Install in development mode with all dependencies
pip install -e ".[dev]"

# Optional: install Kubernetes support
pip install -e ".[all]"

# Setup pre-commit hooks
pre-commit install
```

## Commands

```bash
# Run all tests
pytest

# Run specific test file
pytest tests/unit/test_error_handling.py -v

# Run specific test class or function
pytest tests/unit/test_error_handling.py::TestErrorPatternMatching -v

# Run tests with coverage
pytest --cov=src/madengine --cov-report=html

# Skip slow tests
pytest -m "not slow"

# Format code
black src/ tests/
isort src/ tests/

# Lint
flake8 src/ tests/

# Type check
mypy src/madengine

# Run all pre-commit checks
pre-commit run --all-files
```

## Architecture

madengine is a CLI tool for running AI/ML models in local Docker, Kubernetes, and SLURM environments. The entry point is `madengine.cli.app:cli_main` (registered as the `madengine` console script).

### Layer Structure

**CLI Layer** (`src/madengine/cli/`)
- `app.py` — Typer app wiring, registers 5 commands: `discover`, `build`, `run`, `report`, `database`
- `commands/` — One file per command (build, run, discover, report, database)
- `constants.py` — `ExitCode` enum (`SUCCESS=0`, `FAILURE=1`, `BUILD_FAILURE=2`, `RUN_FAILURE=3`, `INVALID_ARGS=4`)

**Orchestration Layer** (`src/madengine/orchestration/`)
- `build_orchestrator.py` — `BuildOrchestrator`: discovers models, builds Docker images, writes `build_manifest.json`
- `run_orchestrator.py` — `RunOrchestrator`: reads or triggers builds, infers deployment target, delegates to local or distributed execution

**Core Layer** (`src/madengine/core/`)
- `context.py` — `Context` class: merges `additional_context` with system detection (GPU vendor, architecture, OS, ROCm path). Uses `ast.literal_eval()` to parse additional_context strings (not `json.loads` — pass Python dict repr, not JSON)
- `console.py` — `Console`: shell execution wrapper with live output support
- `docker.py` — Docker command wrapper

**Execution Layer** (`src/madengine/execution/`)
- `container_runner.py` — `ContainerRunner`: runs models from manifest via `docker run`, writes results to `perf.csv`
- `docker_builder.py` — `DockerBuilder`: builds images from Dockerfiles
- `container_runner_helpers.py` — Log error pattern scanning, timeout resolution

**Deployment Layer** (`src/madengine/deployment/`)
- `factory.py` — `DeploymentFactory`: Factory pattern, registers `SlurmDeployment` and `KubernetesDeployment`
- `base.py` — `BaseDeployment` abstract class, `DeploymentConfig` dataclass
- `kubernetes.py` / `slurm.py` — Concrete deployments; target is inferred by Convention over Configuration: presence of `"k8s"` or `"kubernetes"` key → K8s; `"slurm"` key → SLURM; neither → local
- `presets/` — JSON preset files for K8s/SLURM default configurations; auto-merged with minimal user configs
- `config_loader.py` — Loads and merges preset JSON with user-supplied config

**Utils** (`src/madengine/utils/`)
- `discover_models.py` — `DiscoverModels`: three discovery methods: root `models.json`, `scripts/{dir}/models.json`, or `scripts/{dir}/get_models_json.py` (dynamic)
- `gpu_tool_factory.py` / `gpu_tool_manager.py` — GPU vendor abstraction (AMD/NVIDIA)
- `gpu_validator.py` — ROCm installation detection, GPU vendor detection
- `config_parser.py` — `ConfigParser`: parses `--additional-context` and tools config

**Reporting** (`src/madengine/reporting/`)
- `update_perf_csv.py` — Writes/appends to `perf.csv` and `perf_entry.csv`
- `csv_to_html.py` / `csv_to_email.py` — Report generation

### Key Data Flows

1. **Build flow**: CLI → `BuildOrchestrator` → `DiscoverModels` (finds models by tags) → `DockerBuilder` (builds images) → writes `build_manifest.json`

2. **Run flow**: CLI → `RunOrchestrator` → loads/generates `build_manifest.json` → infers target → `ContainerRunner` (local) or `DeploymentFactory` (K8s/SLURM) → writes `perf.csv`

3. **`additional_context`**: User JSON/Python-dict string merged into `Context.ctx`. Context is parsed with `ast.literal_eval()`, so values can use Python dict syntax. Keys like `k8s`, `slurm`, `distributed`, `tools`, `pre_scripts`, `post_scripts` drive behavior.

4. **Model definition**: Models defined in `models.json` with fields: `name`, `tags`, `dockerfile`, `scripts`, `n_gpus`, `args`, `timeout`, `skip_gpu_arch`, etc.

5. **Script isolation**: During run, `scripts/common/` is populated from the madengine package (pre_scripts, post_scripts, tools) and cleaned up afterwards. The MAD project's own `scripts/` and `docker/` directories are preserved.

### Deployment Target Inference

No explicit `"deploy"` field is needed. Target is inferred from config structure:
- `"k8s"` or `"kubernetes"` key present → Kubernetes deployment
- `"slurm"` key present → SLURM deployment
- Neither → local Docker execution

### Test Structure

```
tests/
├── unit/ # Fast isolated tests with mocking
├── integration/ # End-to-end with real Docker/system calls
├── e2e/ # Full workflow tests
└── fixtures/ # Dummy models, scripts, and data for testing
```

Pytest config is in `pyproject.toml` under `[tool.pytest.ini_options]`. Test markers: `slow`, `integration`.

### Code Style

- Black formatting, 88-character line length
- isort with `profile = "black"`
- Google-style docstrings
- Type hints required for public functions
- Conventional commits: `feat:`, `fix:`, `docs:`, `test:`, `refactor:`, `style:`, `perf:`, `chore:`
2 changes: 0 additions & 2 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -16,10 +16,8 @@ dependencies = [
"GitPython",
"jsondiff",
"sqlalchemy",
"setuptools-rust",
"paramiko",
"tqdm",
"pytest",
"typing-extensions",
"pymongo",
"toml",
Expand Down
8 changes: 6 additions & 2 deletions src/madengine/cli/app.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@
"""

import sys
from importlib.metadata import PackageNotFoundError, version as pkg_version

import typer
from rich.traceback import install
Expand Down Expand Up @@ -55,9 +56,12 @@ def main(
Built with Typer and Rich for a beautiful, production-ready experience.
"""
if version:
# You might want to get the actual version from your package
try:
_version = pkg_version("madengine")
except PackageNotFoundError:
_version = "unknown"
console.print(
"🚀 [bold cyan]madengine[/bold cyan] version [green]2.0.0[/green]"
f"🚀 [bold cyan]madengine[/bold cyan] version [green]{_version}[/green]"
)
raise typer.Exit()

Expand Down
11 changes: 8 additions & 3 deletions src/madengine/cli/commands/run.py
Original file line number Diff line number Diff line change
Expand Up @@ -174,7 +174,7 @@ def run(
# Input validation
if timeout < -1:
console.print(
"❌ [red]Timeout must be -1 (default) or a positive integer[/red]"
"❌ [red]Timeout must be -1 (default), 0 (no timeout), or a positive integer[/red]"
)
raise typer.Exit(ExitCode.INVALID_ARGS)

Expand All @@ -194,6 +194,11 @@ def run(
# Convert -1 (default) to actual default timeout value (7200 seconds = 2 hours)
if timeout == -1:
timeout = 7200
# 0 means "no timeout" per the help text — map to None so subprocess never expires
elif timeout == 0:
timeout = None
Comment thread
coketaste marked this conversation as resolved.

timeout_display = "disabled" if timeout is None else f"{timeout}s"

try:
# Check if we're doing execution-only or full workflow
Expand All @@ -211,7 +216,7 @@ def run(
f"🚀 [bold cyan]Running Models (Execution Only)[/bold cyan]\n"
f"Manifest: [yellow]{manifest_file}[/yellow]\n"
f"Registry: [yellow]{registry or 'Auto-detected'}[/yellow]\n"
f"Timeout: [yellow]{timeout if timeout != -1 else 'Default'}[/yellow]s",
f"Timeout: [yellow]{timeout_display}[/yellow]",
title="Execution Configuration",
border_style="green",
)
Expand Down Expand Up @@ -314,7 +319,7 @@ def run(
f"🔨🚀 [bold cyan]Complete Workflow (Build + Run)[/bold cyan]\n"
f"Tags: [yellow]{', '.join(processed_tags) if processed_tags else 'All models'}[/yellow]\n"
f"Registry: [yellow]{registry or 'Local only'}[/yellow]\n"
f"Timeout: [yellow]{timeout if timeout != -1 else 'Default'}[/yellow]s"
f"Timeout: [yellow]{timeout_display}[/yellow]"
f"{skip_note}",
title="Workflow Configuration",
border_style="magenta",
Expand Down
6 changes: 4 additions & 2 deletions src/madengine/cli/constants.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,11 +5,13 @@
Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
"""

from enum import IntEnum


# Exit codes
class ExitCode:
class ExitCode(IntEnum):
"""Exit codes for CLI commands."""

SUCCESS = 0
FAILURE = 1
BUILD_FAILURE = 2
Expand Down
4 changes: 3 additions & 1 deletion src/madengine/cli/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -425,8 +425,10 @@ def format_performance(perf):
return f"{val:,.0f}"
elif val >= 10:
return f"{val:.1f}"
elif val >= 0.01:
return f"{val:.4f}"
else:
return f"{val:.2f}"
return f"{val:.4g}"
except (ValueError, TypeError):
return str(perf)

Expand Down
2 changes: 2 additions & 0 deletions src/madengine/cli/validators.py
Original file line number Diff line number Diff line change
Expand Up @@ -395,6 +395,8 @@ def process_batch_manifest_entries(

# If the model was not built (build_new=false), create an entry for it
if not build_new:
# Initialize with a safe fallback so the except block can always reference it
dockerfile_matched = "unknown"
# Find the model configuration by discovering models with this tag
try:
# Create a temporary args object to discover the model
Expand Down
Loading