
Expand pre-flight static checks for hf_jobs approval #203

@Suvradippaul


Summary

agent/utils/reliability_checks.py runs one static check on hf_jobs training scripts at the CLI approval prompt: it warns when from_pretrained appears without push_to_hub. The system prompt at agent/prompts/system_prompt_v3.yaml:29-47 enumerates eight specific training-job failure modes the agent will repeatedly hit. Only one of those eight is statically caught at approval time — the moment the user is about to spend real GPU credits on a run.

This issue proposes expanding the pre-flight check to cover four additional documented pitfalls, returning structured findings (severity + message) so each can be rendered with appropriate emphasis. The check stays advisory (never blocking), keeps the legacy entry point for backward compatibility, and ships with a parametrized test file (currently zero coverage on this module — confirmed via grep -r reliability_checks tests/).

Current behavior

check_training_script_save_pattern at agent/utils/reliability_checks.py:4-14 is the entire safety net. It is called from exactly one place — the CLI approval render at agent/main.py:443 — and only fires for the from_pretrained + no-push_to_hub pattern. A user approving an hf_jobs call with a 30-minute default timeout, missing hub_model_id, hardcoded flash_attention_2 without a flash-attn dependency, and no Trackio configured currently sees zero warnings about any of those — even though the system prompt explicitly flags each as a recurring failure mode.
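
For reference, a behavioral sketch of that check as described (not the actual source; the real function returns a colored string through the CLI's display helpers):

def check_training_script_save_pattern(script: str) -> str | None:
    # Sketch: warn when a model is loaded but never pushed to the Hub.
    if "from_pretrained" in script and "push_to_hub" not in script:
        return "WARNING: script calls from_pretrained but never push_to_hub"
    return None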

The web UI sees even less: backend/routes/agent.py:564 (submit_approval) does not import or run the check at all, so web users get no advisory output.

Documented failure modes (system prompt evidence)

Each item below cites the line where system_prompt_v3.yaml already acknowledges the failure mode the agent will repeatedly produce:

  1. prompt:37 — DEFAULT TIMEOUT KILLS JOBS: "leave timeout at the default 30m for training jobs. Training takes hours. The job gets killed and all progress is lost."
  2. prompt:39 — LOST MODELS: forgetting push_to_hub=True and hub_model_id in training config. Job storage is ephemeral.
  3. prompt:45 — HARDCODED UNAVAILABLE PACKAGES: forgetting to install packages like flash-attn for flash_attention_2.
  4. prompt:65-70 — Trackio is treated as the core monitoring path (PR #129, "Treat Trackio as core for training and prefer public Space dashboards"). Training without report_to="trackio" is missed observability.

The fifth check (model save pattern) is the existing one, retained.

Proposed change

Replace the single-string return contract with a structured-findings list, in agent/utils/reliability_checks.py:

from dataclasses import dataclass
from typing import Literal

@dataclass(frozen=True)
class Finding:
    severity: Literal["warn", "info"]
    message: str

def run_preflight_checks(arguments: dict) -> list[Finding]: ...

The function inspects the same arguments dict already in scope at agent/main.py:417 (it carries script, command, dependencies, hardware_flavor, timeout, env, schedule). All five checks are pure substring/regex inspection — no network calls.
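
A sketch of the body, reusing the Finding contract above. The argument keys match the list in the previous paragraph; the trainer-detection regex and the assumption that timeout arrives as a duration string like "30m" are placeholders to settle in the PR:

import re

_TRAINER_CALL = re.compile(r"\b(Trainer|SFTTrainer|GRPOTrainer)\s*\(")

def run_preflight_checks(arguments: dict) -> list[Finding]:
    findings: list[Finding] = []
    script = arguments.get("script") or ""  # absent in Docker mode
    deps = " ".join(arguments.get("dependencies") or [])

    # Check 2: clear training call, but timeout missing or left at the 30m default.
    if _TRAINER_CALL.search(script) and arguments.get("timeout") in (None, "30m"):
        findings.append(Finding("warn",
            "Trainer call detected but timeout is missing or the 30m default (prompt:37)"))

    # Check 4: flash_attention_2 hardcoded without flash-attn in either the
    # dependency list or an inline pip install line.
    if "flash_attention_2" in script and "flash-attn" not in deps and "flash-attn" not in script:
        findings.append(Finding("warn",
            "flash_attention_2 requested but flash-attn is not installed (prompt:45)"))

    # Checks 1, 3, and 5 follow the same substring/regex pattern.
    return findings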

check_training_script_save_pattern(script) is kept as a thin wrapper around the save-pattern helper, returning the same colored string it does today, so any caller importing the legacy function keeps working.
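
The wrapper can stay a few lines (sketch; the save-pattern helper and color function are assumed names, to be taken from the existing module):

def check_training_script_save_pattern(script: str) -> str | None:
    # Legacy contract: same colored string as today, now backed by the
    # shared save-pattern helper (both helper names hypothetical).
    finding = _check_save_pattern(script)
    return _colorize_warning(finding.message) if finding else None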

The CLI render at agent/main.py:443 becomes a small loop:

findings = run_preflight_checks(arguments)
for f in findings:
    # use existing terminal_display color helpers
    ...
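
One plausible loop body, with plain ANSI codes standing in for the real terminal_display helpers:

# Placeholder styling; substitute the module's terminal_display helpers.
STYLES = {"warn": "\033[33m", "info": "\033[2m"}  # yellow for warn, dim for info
RESET = "\033[0m"

for f in run_preflight_checks(arguments):
    print(f"{STYLES[f.severity]}[{f.severity}] {f.message}{RESET}")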

The five checks:

| # | Trigger | Severity | Cite |
|---|---------|----------|------|
| 1 | Save pattern (kept; expanded to also recognize trainer.save_model) | warn | prompt:39 |
| 2 | timeout is 30m/missing AND script contains a Trainer/SFT/GRPO call | warn | prompt:37 |
| 3 | push_to_hub=True matches but no hub_model_id token | warn | prompt:39 |
| 4 | flash_attention_2 in script AND flash-attn not in deps/pip lines | warn | prompt:45 |
| 5 | Training entry point detected AND report_to doesn't include trackio | info | prompt:65 |

Adjacent: not the same as #155 (tool-boundary observability — different layer, different owner). This proposal does not change tool router contracts.

Happy to send a PR if this direction is wanted.

Edge cases

  • Docker mode (command instead of script): script-parsing checks skip; timeout check still applies.
  • YOLO/headless mode: approval is bypassed; findings aren't printed. Same behavior as the existing single check today.
  • Web UI / backend: out of scope here. Closing that gap requires a frontend change and is a separate conversation.
  • All checks are conservative: the timeout check, for example, requires both a default timeout and a clear Trainer call, so it won't fire on inference-only scripts (the test sketch below pins this down).
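
That conservatism is exactly what the parametrized test file should pin down. A sketch of its shape, assuming the timeout-as-string convention from the implementation sketch above:

import pytest
from agent.utils.reliability_checks import run_preflight_checks

TRAINING = "trainer = SFTTrainer(model=model)\ntrainer.train()"
INFERENCE = "model = AutoModel.from_pretrained('gpt2')\nout = model.generate(ids)"

@pytest.mark.parametrize(
    ("arguments", "expect_timeout_warn"),
    [
        ({"script": TRAINING, "timeout": "30m"}, True),    # default timeout + trainer call
        ({"script": TRAINING, "timeout": "4h"}, False),    # raised timeout: no warning
        ({"script": INFERENCE, "timeout": "30m"}, False),  # inference-only: stays quiet
    ],
)
def test_timeout_check_is_conservative(arguments, expect_timeout_warn):
    messages = [f.message for f in run_preflight_checks(arguments)]
    assert any("timeout" in m for m in messages) == expect_timeout_warn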
