ROCm · coketaste · Apr 18, 2026 · Apr 18, 2026 · Apr 20, 2026 · Apr 20, 2026
@@ -25,6 +25,18 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 - **E2E tests — `test_docker_gpus` pre-script OOM on MI350X**: The `run_rocenv_tool.sh` system-env pre-script was being OOM-killed (exit 137) inside Docker on gfx950 nodes with 6 GPUs bound, failing a test whose purpose is only GPU binding verification. Fixed by correcting the `gen_sys_env_details` condition in `container_runner.py` — the old `or` made the context key a no-op since `generate_sys_env_details` defaults to `True` — and passing `gen_sys_env_details: False` in the test's `additional_context`.
 
+- **`--timeout 0` crashing with `signal.alarm(None)`**: `Timeout.__enter__` called `signal.alarm(None)` when `--timeout 0` was passed because the CLI correctly maps `0 → None` but `Timeout` had no guard for a falsy value. Added early-return in `__enter__`/`__exit__` when `seconds` is `None` or `0`. Also fixed the run command panels printing `Nones` for timeout when `--timeout 0` was used; they now display `disabled`.
+
+- **Docker container name regex false positives**: The `docker ps --filter name=^/<name>$` exact-match filter embedded the container name directly into the regex without escaping, so names containing metacharacters (e.g. `.`, `[`) could match unintended containers. Applied `re.escape()` to the name before building the filter pattern.
+
+- **`login_to_registry` type annotation**: The `registry` parameter was typed as `str` but the implementation handled `None` and callers (including tests) passed `None` to mean DockerHub. Corrected to `Optional[str]`.
+
+- **Registry password process-list exposure**: `docker login` was invoked with the password in the argument list (visible via `/proc` or `ps`). Changed to pass it via a `MAD_REGISTRY_PASSWORD` environment variable consumed through `printf %s "$MAD_REGISTRY_PASSWORD" | docker login --password-stdin`.
+
+- **`login_to_registry` — `raise_on_failure` not fully honoured**: Missing-key and invalid-format errors in `login_to_registry` always raised `RuntimeError` regardless of `raise_on_failure`. All three failure paths (missing registry key, invalid credential format, docker login error) are now gated on `raise_on_failure`, allowing `ContainerRunner` to fall through to public image pulls.
+
+- **Kubernetes missing-package warning invisible**: `DeploymentFactory` raised `ImportWarning` when the `kubernetes` package was absent, which Python silences by default. Changed to `UserWarning` so the install hint is always visible.
+
 ### Changed
 
 - **Model discovery — scope-based tag selection**: Replaced the `strict` mode flag on `DiscoverModels` with a cleaner scope-based rule that applies uniformly to both `madengine run` and `madengine build`:
@@ -33,6 +45,28 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
   - `--tags all` and `--tags scope/all` continue to select all models globally or within a scope respectively.
   - Removed `strict_discovery` parameter from `BuildOrchestrator.execute()` and the corresponding call in `RunOrchestrator._build_phase()` as they are no longer needed.
 
+- **Shared `login_to_registry` utility**: Extracted duplicated Docker registry login logic (~120 lines) from `DockerBuilder` and `ContainerRunner` into `core/auth.py::login_to_registry()`. Both classes now delegate to it. `DockerBuilder` uses `raise_on_failure=True`; `ContainerRunner` uses `raise_on_failure=False` to allow fallback to public images.
+
+- **Centralised credential loading**: Extracted `_load_credentials` from `BuildOrchestrator` and `RunOrchestrator` into `core/auth.py::load_credentials()`. Environment variables (`MAD_DOCKERHUB_USER`, `MAD_DOCKERHUB_PASSWORD`, `MAD_DOCKERHUB_REPO`) take precedence over `credential.json`.
+
+- **Dead code removal**: Removed unused functions `find_and_replace_pattern` and `substring_found` (`utils/ops.py`), `highlight_log_section` (`utils/log_formatting.py`), `SessionTracker.get_session_start` and `SessionTracker.load_marker` (`utils/session_tracker.py`), and the unused `_filter_images_by_dockerfile_context` method from `RunOrchestrator`.
+
+- **`ConfigurationError` instead of `SystemExit` in orchestrator config loading**: `BuildOrchestrator` now raises a structured `ConfigurationError` (with suggestions) instead of calling `sys.exit()` directly when configuration loading fails.
+
+### Security
+
+- **Registry password no longer in process argument list**: Docker login commands previously passed the password as a CLI argument visible to other users via `/proc` or `ps`. All registry logins now inject the password through a dedicated `MAD_REGISTRY_PASSWORD` environment variable and use `--password-stdin`.
+
+- **`build-arg` values shell-quoted**: All Docker `--build-arg` key/value pairs are now wrapped with `str()` before `shlex.quote()` to prevent shell injection from non-string config values.
+
+### Tests
+
+- **New `TestTimeout` suite**: Covers `None`, `0`, and positive-second cases for `Timeout.__enter__`/`__exit__`, plus a `resolve_run_timeout` passthrough regression test.
+
+- **New `TestLoginToRegistry` suite**: Covers all success and failure paths of `login_to_registry`, including `raise_on_failure=True/False` behaviour, missing registry key, invalid credential format, and `docker.io` normalisation.
+
+- **Test suite cleanup**: Removed dead imports across 14 test files; replaced `try/assert False/except` antipattern with `pytest.raises()` (with `match=`); narrowed 5 bare `except:` clauses to `except Exception:`; deleted a pass-only dead test; removed duplicate tests; reclassified `test_profiling_tools_config.py` from unit to integration (reads real disk files) and `test_errors.py` from integration to unit (pure mocks).
+
 ## [2.0.0] - 2026-04-09
 
 ### Overview

@@ -0,0 +1,129 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Development Setup
+
+```bash
+# Install in development mode with all dependencies
+pip install -e ".[dev]"
+
+# Optional: install Kubernetes support
+pip install -e ".[all]"
+
+# Setup pre-commit hooks
+pre-commit install
+```
+
+## Commands
+
+```bash
+# Run all tests
+pytest
+
+# Run specific test file
+pytest tests/unit/test_error_handling.py -v
+
+# Run specific test class or function
+pytest tests/unit/test_error_handling.py::TestErrorPatternMatching -v
+
+# Run tests with coverage
+pytest --cov=src/madengine --cov-report=html
+
+# Skip slow tests
+pytest -m "not slow"
+
+# Format code
+black src/ tests/
+isort src/ tests/
+
+# Lint
+flake8 src/ tests/
+
+# Type check
+mypy src/madengine
+
+# Run all pre-commit checks
+pre-commit run --all-files
+```
+
+## Architecture
+
+madengine is a CLI tool for running AI/ML models in local Docker, Kubernetes, and SLURM environments. The entry point is `madengine.cli.app:cli_main` (registered as the `madengine` console script).
+
+### Layer Structure
+
+**CLI Layer** (`src/madengine/cli/`)
+- `app.py` — Typer app wiring, registers 5 commands: `discover`, `build`, `run`, `report`, `database`
+- `commands/` — One file per command (build, run, discover, report, database)
+- `constants.py` — `ExitCode` enum (`SUCCESS=0`, `FAILURE=1`, `BUILD_FAILURE=2`, `RUN_FAILURE=3`, `INVALID_ARGS=4`)
+
+**Orchestration Layer** (`src/madengine/orchestration/`)
+- `build_orchestrator.py` — `BuildOrchestrator`: discovers models, builds Docker images, writes `build_manifest.json`
+- `run_orchestrator.py` — `RunOrchestrator`: reads or triggers builds, infers deployment target, delegates to local or distributed execution
+
+**Core Layer** (`src/madengine/core/`)
+- `context.py` — `Context` class: merges `additional_context` with system detection (GPU vendor, architecture, OS, ROCm path). Uses `ast.literal_eval()` to parse additional_context strings (not `json.loads` — pass Python dict repr, not JSON)
+- `console.py` — `Console`: shell execution wrapper with live output support
+- `docker.py` — Docker command wrapper
+
+**Execution Layer** (`src/madengine/execution/`)
+- `container_runner.py` — `ContainerRunner`: runs models from manifest via `docker run`, writes results to `perf.csv`
+- `docker_builder.py` — `DockerBuilder`: builds images from Dockerfiles
+- `container_runner_helpers.py` — Log error pattern scanning, timeout resolution
+
+**Deployment Layer** (`src/madengine/deployment/`)
+- `factory.py` — `DeploymentFactory`: Factory pattern, registers `SlurmDeployment` and `KubernetesDeployment`
+- `base.py` — `BaseDeployment` abstract class, `DeploymentConfig` dataclass
+- `kubernetes.py` / `slurm.py` — Concrete deployments; target is inferred by Convention over Configuration: presence of `"k8s"` or `"kubernetes"` key → K8s; `"slurm"` key → SLURM; neither → local
+- `presets/` — JSON preset files for K8s/SLURM default configurations; auto-merged with minimal user configs
+- `config_loader.py` — Loads and merges preset JSON with user-supplied config
+
+**Utils** (`src/madengine/utils/`)
+- `discover_models.py` — `DiscoverModels`: three discovery methods: root `models.json`, `scripts/{dir}/models.json`, or `scripts/{dir}/get_models_json.py` (dynamic)
+- `gpu_tool_factory.py` / `gpu_tool_manager.py` — GPU vendor abstraction (AMD/NVIDIA)
+- `gpu_validator.py` — ROCm installation detection, GPU vendor detection
+- `config_parser.py` — `ConfigParser`: parses `--additional-context` and tools config
+
+**Reporting** (`src/madengine/reporting/`)
+- `update_perf_csv.py` — Writes/appends to `perf.csv` and `perf_entry.csv`
+- `csv_to_html.py` / `csv_to_email.py` — Report generation
+
+### Key Data Flows
+
+1. **Build flow**: CLI → `BuildOrchestrator` → `DiscoverModels` (finds models by tags) → `DockerBuilder` (builds images) → writes `build_manifest.json`
+
+2. **Run flow**: CLI → `RunOrchestrator` → loads/generates `build_manifest.json` → infers target → `ContainerRunner` (local) or `DeploymentFactory` (K8s/SLURM) → writes `perf.csv`
+
+3. **`additional_context`**: User JSON/Python-dict string merged into `Context.ctx`. Context is parsed with `ast.literal_eval()`, so values can use Python dict syntax. Keys like `k8s`, `slurm`, `distributed`, `tools`, `pre_scripts`, `post_scripts` drive behavior.
+
+4. **Model definition**: Models defined in `models.json` with fields: `name`, `tags`, `dockerfile`, `scripts`, `n_gpus`, `args`, `timeout`, `skip_gpu_arch`, etc.
+
+5. **Script isolation**: During run, `scripts/common/` is populated from the madengine package (pre_scripts, post_scripts, tools) and cleaned up afterwards. The MAD project's own `scripts/` and `docker/` directories are preserved.
+
+### Deployment Target Inference
+
+No explicit `"deploy"` field is needed. Target is inferred from config structure:
+- `"k8s"` or `"kubernetes"` key present → Kubernetes deployment
+- `"slurm"` key present → SLURM deployment
+- Neither → local Docker execution
+
+### Test Structure
+
+```
+tests/
+├── unit/         # Fast isolated tests with mocking
+├── integration/  # End-to-end with real Docker/system calls
+├── e2e/          # Full workflow tests
+└── fixtures/     # Dummy models, scripts, and data for testing
+```
+
+Pytest config is in `pyproject.toml` under `[tool.pytest.ini_options]`. Test markers: `slow`, `integration`.
+
+### Code Style
+
+- Black formatting, 88-character line length
+- isort with `profile = "black"`
+- Google-style docstrings
+- Type hints required for public functions
+- Conventional commits: `feat:`, `fix:`, `docs:`, `test:`, `refactor:`, `style:`, `perf:`, `chore:`
@@ -16,10 +16,8 @@ dependencies = [
   "GitPython",
   "jsondiff",
   "sqlalchemy",
-  "setuptools-rust",
   "paramiko",
   "tqdm",
-  "pytest",
   "typing-extensions",
   "pymongo",
   "toml",

@@ -8,6 +8,7 @@
 """
 
 import sys
+from importlib.metadata import PackageNotFoundError, version as pkg_version
 
 import typer
 from rich.traceback import install
@@ -55,9 +56,12 @@ def main(
     Built with Typer and Rich for a beautiful, production-ready experience.
     """
     if version:
-        # You might want to get the actual version from your package
+        try:
+            _version = pkg_version("madengine")
+        except PackageNotFoundError:
+            _version = "unknown"
         console.print(
-            "🚀 [bold cyan]madengine[/bold cyan] version [green]2.0.0[/green]"
+            f"🚀 [bold cyan]madengine[/bold cyan] version [green]{_version}[/green]"
         )
         raise typer.Exit()
 

@@ -174,7 +174,7 @@ def run(
     # Input validation
     if timeout < -1:
         console.print(
-            "❌ [red]Timeout must be -1 (default) or a positive integer[/red]"
+            "❌ [red]Timeout must be -1 (default), 0 (no timeout), or a positive integer[/red]"
         )
         raise typer.Exit(ExitCode.INVALID_ARGS)
 
@@ -194,6 +194,11 @@ def run(
     # Convert -1 (default) to actual default timeout value (7200 seconds = 2 hours)
     if timeout == -1:
         timeout = 7200
+    # 0 means "no timeout" per the help text — map to None so subprocess never expires
+    elif timeout == 0:
+        timeout = None
+
+    timeout_display = "disabled" if timeout is None else f"{timeout}s"
 
     try:
         # Check if we're doing execution-only or full workflow
@@ -211,7 +216,7 @@ def run(
                     f"🚀 [bold cyan]Running Models (Execution Only)[/bold cyan]\n"
                     f"Manifest: [yellow]{manifest_file}[/yellow]\n"
                     f"Registry: [yellow]{registry or 'Auto-detected'}[/yellow]\n"
-                    f"Timeout: [yellow]{timeout if timeout != -1 else 'Default'}[/yellow]s",
+                    f"Timeout: [yellow]{timeout_display}[/yellow]",
                     title="Execution Configuration",
                     border_style="green",
                 )
@@ -314,7 +319,7 @@ def run(
                     f"🔨🚀 [bold cyan]Complete Workflow (Build + Run)[/bold cyan]\n"
                     f"Tags: [yellow]{', '.join(processed_tags) if processed_tags else 'All models'}[/yellow]\n"
                     f"Registry: [yellow]{registry or 'Local only'}[/yellow]\n"
-                    f"Timeout: [yellow]{timeout if timeout != -1 else 'Default'}[/yellow]s"
+                    f"Timeout: [yellow]{timeout_display}[/yellow]"
                     f"{skip_note}",
                     title="Workflow Configuration",
                     border_style="magenta",

@@ -5,11 +5,13 @@
 Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
 """
 
+from enum import IntEnum
+
 
 # Exit codes
-class ExitCode:
+class ExitCode(IntEnum):
     """Exit codes for CLI commands."""
-    
+
     SUCCESS = 0
     FAILURE = 1
     BUILD_FAILURE = 2

@@ -425,8 +425,10 @@ def format_performance(perf):
                     return f"{val:,.0f}"
                 elif val >= 10:
                     return f"{val:.1f}"
+                elif val >= 0.01:
+                    return f"{val:.4f}"
                 else:
-                    return f"{val:.2f}"
+                    return f"{val:.4g}"
             except (ValueError, TypeError):
                 return str(perf)
 

@@ -395,6 +395,8 @@ def process_batch_manifest_entries(
 
         # If the model was not built (build_new=false), create an entry for it
         if not build_new:
+            # Initialize with a safe fallback so the except block can always reference it
+            dockerfile_matched = "unknown"
             # Find the model configuration by discovering models with this tag
             try:
                 # Create a temporary args object to discover the model