
Expand pre-flight static checks for hf_jobs approval #203

@Suvradippaul


Summary

agent/utils/reliability_checks.py runs one static check on hf_jobs training scripts at the CLI approval prompt: it warns when from_pretrained appears without push_to_hub. The system prompt at agent/prompts/system_prompt_v3.yaml:29-47 enumerates eight specific training-job failure modes the agent will repeatedly hit. Only one of those eight is statically caught at approval time — the moment the user is about to spend real GPU credits on a run.

This issue proposes expanding the pre-flight check to cover four additional documented pitfalls, returning structured findings (severity + message) so each can be rendered with appropriate emphasis. The check stays advisory (never blocking), keeps the legacy entry point for backward compatibility, and ships with a parametrized test file (currently zero coverage on this module — confirmed via grep -r reliability_checks tests/).

Current behavior

check_training_script_save_pattern at agent/utils/reliability_checks.py:4-14 is the entire safety net. It is called from exactly one place — the CLI approval render at agent/main.py:443 — and only fires for the from_pretrained + no-push_to_hub pattern. A user approving an hf_jobs call with a 30-minute default timeout, missing hub_model_id, hardcoded flash_attention_2 without a flash-attn dependency, and no Trackio configured currently sees zero warnings about any of those — even though the system prompt explicitly flags each as a recurring failure mode.
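
For reference, a behavioral sketch of that check as described (not the actual source; the real function returns a colored string through the CLI's display helpers):

def check_training_script_save_pattern(script: str) -> str | None:
    # Sketch: warn when a model is loaded but never pushed to the Hub.
    if "from_pretrained" in script and "push_to_hub" not in script:
        return "WARNING: script calls from_pretrained but never push_to_hub"
    return None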

The web UI sees even less: backend/routes/agent.py:564 (submit_approval) does not import or run the check at all, so web users get no advisory output.

Documented failure modes (system prompt evidence)

Each item below cites the line where system_prompt_v3.yaml already acknowledges the failure mode the agent will repeatedly produce:

  1. prompt:37 — DEFAULT TIMEOUT KILLS JOBS: "leave timeout at the default 30m for training jobs. Training takes hours. The job gets killed and all progress is lost."
  2. prompt:39 — LOST MODELS: forgetting push_to_hub=True and hub_model_id in training config. Job storage is ephemeral.
  3. prompt:45 — HARDCODED UNAVAILABLE PACKAGES: forgetting to install packages like flash-attn for flash_attention_2.
  4. prompt:65-70 — Trackio is treated as the core monitoring path (PR #129, "Treat Trackio as core for training and prefer public Space dashboards"). Training without report_to="trackio" is missed observability.

The fifth check (model save pattern) is the existing one, retained.

Proposed change

Replace the single-string return contract with a structured-findings list, in agent/utils/reliability_checks.py:

from dataclasses import dataclass
from typing import Literal

@dataclass(frozen=True)
class Finding:
    severity: Literal["warn", "info"]
    message: str

def run_preflight_checks(arguments: dict) -> list[Finding]: ...

The function inspects the same arguments dict already in scope at agent/main.py:417 (it carries script, command, dependencies, hardware_flavor, timeout, env, schedule). All five checks are pure substring/regex inspection — no network calls.
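
A sketch of the body, reusing the Finding contract above. The argument keys match the list in the previous paragraph; the trainer-detection regex and the assumption that timeout arrives as a duration string like "30m" are placeholders to settle in the PR:

import re

_TRAINER_CALL = re.compile(r"\b(Trainer|SFTTrainer|GRPOTrainer)\s*\(")

def run_preflight_checks(arguments: dict) -> list[Finding]:
    findings: list[Finding] = []
    script = arguments.get("script") or ""  # absent in Docker mode
    deps = " ".join(arguments.get("dependencies") or [])

    # Check 2: clear training call, but timeout missing or left at the 30m default.
    if _TRAINER_CALL.search(script) and arguments.get("timeout") in (None, "30m"):
        findings.append(Finding("warn",
            "Trainer call detected but timeout is missing or the 30m default (prompt:37)"))

    # Check 4: flash_attention_2 hardcoded without flash-attn in either the
    # dependency list or an inline pip install line.
    if "flash_attention_2" in script and "flash-attn" not in deps and "flash-attn" not in script:
        findings.append(Finding("warn",
            "flash_attention_2 requested but flash-attn is not installed (prompt:45)"))

    # Checks 1, 3, and 5 follow the same substring/regex pattern.
    return findings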

check_training_script_save_pattern(script) is kept as a thin wrapper around the save-pattern helper, returning the same colored string it does today, so any caller importing the legacy function keeps working.
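
The wrapper can stay a few lines (sketch; the save-pattern helper and color function are assumed names, to be taken from the existing module):

def check_training_script_save_pattern(script: str) -> str | None:
    # Legacy contract: same colored string as today, now backed by the
    # shared save-pattern helper (both helper names hypothetical).
    finding = _check_save_pattern(script)
    return _colorize_warning(finding.message) if finding else None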

The CLI render at agent/main.py:443 becomes a small loop:

findings = run_preflight_checks(arguments)
for f in findings:
    # use existing terminal_display color helpers
    ...
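
One plausible loop body, with plain ANSI codes standing in for the real terminal_display helpers:

# Placeholder styling; substitute the module's terminal_display helpers.
STYLES = {"warn": "\033[33m", "info": "\033[2m"}  # yellow for warn, dim for info
RESET = "\033[0m"

for f in run_preflight_checks(arguments):
    print(f"{STYLES[f.severity]}[{f.severity}] {f.message}{RESET}")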

The five checks:

| # | Trigger | Severity | Cite |
|---|---------|----------|------|
| 1 | Save pattern (kept; expanded to also recognize trainer.save_model) | warn | prompt:39 |
| 2 | timeout is 30m/missing AND script contains a Trainer/SFT/GRPO call | warn | prompt:37 |
| 3 | push_to_hub=True matches but no hub_model_id token | warn | prompt:39 |
| 4 | flash_attention_2 in script AND flash-attn not in deps/pip lines | warn | prompt:45 |
| 5 | Training entry point detected AND report_to doesn't include trackio | info | prompt:65 |

Adjacent: not the same as #155 (tool-boundary observability — different layer, different owner). This proposal does not change tool router contracts.

Happy to send a PR if this direction is wanted.

Edge cases

  • Docker mode (command instead of script): script-parsing checks skip; timeout check still applies.
  • YOLO/headless mode: approval is bypassed; findings aren't printed. Same behavior as the existing single check today.
  • Web UI / backend: out of scope here. Closing that gap requires a frontend change and is a separate conversation.
  • All checks are conservative: the timeout check, for example, requires both a default timeout and a clear Trainer call, so it won't fire on inference-only scripts (the test sketch below pins this down).
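
That conservatism is exactly what the parametrized test file should pin down. A sketch of its shape, assuming the timeout-as-string convention from the implementation sketch above:

import pytest
from agent.utils.reliability_checks import run_preflight_checks

TRAINING = "trainer = SFTTrainer(model=model)\ntrainer.train()"
INFERENCE = "model = AutoModel.from_pretrained('gpt2')\nout = model.generate(ids)"

@pytest.mark.parametrize(
    ("arguments", "expect_timeout_warn"),
    [
        ({"script": TRAINING, "timeout": "30m"}, True),    # default timeout + trainer call
        ({"script": TRAINING, "timeout": "4h"}, False),    # raised timeout: no warning
        ({"script": INFERENCE, "timeout": "30m"}, False),  # inference-only: stays quiet
    ],
)
def test_timeout_check_is_conservative(arguments, expect_timeout_warn):
    messages = [f.message for f in run_preflight_checks(arguments)]
    assert any("timeout" in m for m in messages) == expect_timeout_warn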
