`agent/utils/reliability_checks.py` runs one static check on `hf_jobs` training scripts at the CLI approval prompt: it warns when `from_pretrained` appears without `push_to_hub`. The system prompt at `agent/prompts/system_prompt_v3.yaml:29-47` enumerates eight specific training-job failure modes the agent will repeatedly hit. Only one of those eight is statically caught at approval time — the moment the user is about to spend real GPU credits on a run.

This issue proposes expanding the pre-flight check to cover four additional documented pitfalls, returning structured findings (severity + message) so each can be rendered with appropriate emphasis. The check stays advisory (never blocking), keeps the legacy entry point for backward compatibility, and ships with a parametrized test file (currently zero coverage on this module — confirmed via `grep -r reliability_checks tests/`).
## Current behavior

`check_training_script_save_pattern` at `agent/utils/reliability_checks.py:4-14` is the entire safety net. It is called from exactly one place — the CLI approval render at `agent/main.py:443` — and only fires for the `from_pretrained` + no-`push_to_hub` pattern. A user approving an `hf_jobs` call with a 30-minute default timeout, missing `hub_model_id`, hardcoded `flash_attention_2` without a `flash-attn` dependency, and no Trackio configured currently sees zero warnings about any of those — even though the system prompt explicitly flags each as a recurring failure mode.

The web UI sees even less: `backend/routes/agent.py:564` (`submit_approval`) does not import or run the check at all, so web users get no advisory output.
## Documented failure modes (system prompt evidence)

Each item below cites the line where `system_prompt_v3.yaml` already acknowledges the failure mode the agent will repeatedly produce:

- `prompt:37` — DEFAULT TIMEOUT KILLS JOBS: "leave timeout at the default 30m for training jobs. Training takes hours. The job gets killed and all progress is lost."
- `prompt:39` — LOST MODELS: forgetting `push_to_hub=True` and `hub_model_id` in training config. Job storage is ephemeral.
- `prompt:45` — HARDCODED UNAVAILABLE PACKAGES: forgetting to install packages like `flash-attn` for `flash_attention_2`.
- `prompt:65-70` — Trackio is treated as the core monitoring path (PR #129, "Treat Trackio as core for training and prefer public Space dashboards"). Training without `report_to="trackio"` is missed observability.

The fifth check (model save pattern) is the existing one, retained.

## Proposed change

Replace the single-string return contract with a structured-findings list, in `agent/utils/reliability_checks.py`.
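A minimal sketch of what that contract could look like; the `Finding` shape and the exact heuristics below are illustrative assumptions, not final code:

```python
# Sketch only: dataclass shape, messages, and heuristics are illustrative.
from dataclasses import dataclass


@dataclass
class Finding:
    severity: str  # "warn" | "info"
    message: str   # advisory text citing the system-prompt line


def run_preflight_checks(arguments: dict) -> list[Finding]:
    """Static pre-flight inspection of an hf_jobs arguments dict.

    Pure substring checks on fields already in scope at the approval
    prompt; no network calls, and never blocks approval.
    """
    script = arguments.get("script") or ""
    deps = " ".join(arguments.get("dependencies") or [])
    timeout = arguments.get("timeout")  # assuming "30m"-style strings; real field format may differ
    findings: list[Finding] = []

    # Treat any Trainer/SFT/GRPO entry point as a training script.
    is_training = any(t in script for t in ("Trainer(", "SFTTrainer(", "GRPOTrainer("))

    # 1. Save pattern (existing check, expanded to recognize trainer.save_model)
    if "from_pretrained" in script and "push_to_hub" not in script \
            and ".save_model" not in script:
        findings.append(Finding("warn", "model loaded but never pushed/saved (prompt:39)"))

    # 2. Default/missing timeout on a training script
    #    (docker-mode timeout handling elided in this sketch)
    if is_training and timeout in (None, "30m"):
        findings.append(Finding("warn", "default 30m timeout will kill this training job (prompt:37)"))

    # 3. push_to_hub=True without hub_model_id
    if "push_to_hub=True" in script and "hub_model_id" not in script:
        findings.append(Finding("warn", "push_to_hub=True but no hub_model_id set (prompt:39)"))

    # 4. flash_attention_2 without a flash-attn dependency or pip line
    if "flash_attention_2" in script and "flash-attn" not in deps and "flash-attn" not in script:
        findings.append(Finding("warn", "flash_attention_2 without flash-attn installed (prompt:45)"))

    # 5. Training entry point without Trackio reporting
    if is_training and "trackio" not in script:
        findings.append(Finding("info", 'no report_to="trackio"; training runs without a dashboard (prompt:65)'))

    return findings
```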
The function inspects the same `arguments` dict already in scope at `agent/main.py:417` (it carries `script`, `command`, `dependencies`, `hardware_flavor`, `timeout`, `env`, `schedule`). All five checks are pure substring/regex inspection — no network calls.
`check_training_script_save_pattern(script)` is kept as a thin wrapper around the save-pattern helper, returning the same colored string it does today, so any caller importing the legacy function keeps working.
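One way the shim could look, building on the sketch above (the ANSI yellow and the message filter are assumptions about the current colored-string output):

```python
def check_training_script_save_pattern(script: str) -> str:
    """Legacy entry point: unchanged signature, unchanged colored-string return."""
    for f in run_preflight_checks({"script": script}):
        if "never pushed" in f.message:  # pick out just the save-pattern finding
            return f"\033[33m{f.message}\033[0m"  # yellow, matching today's output (assumed)
    return ""
```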
The CLI render at `agent/main.py:443` becomes a small loop:

```python
findings = run_preflight_checks(arguments)
for f in findings:
    # use existing terminal_display color helpers
    ...
```
The five checks:

| # | Trigger | Severity | Cite |
|---|---------|----------|------|
| 1 | Save pattern (kept, expanded to also recognize `trainer.save_model`) | warn | prompt:39 |
| 2 | `timeout` is `30m`/missing AND script contains a Trainer/SFT/GRPO call | warn | prompt:37 |
| 3 | `push_to_hub=True` matches but no `hub_model_id` token | warn | prompt:39 |
| 4 | `flash_attention_2` in script AND `flash-attn` not in deps/pip lines | warn | prompt:45 |
| 5 | Training entry point detected AND `report_to` doesn't include trackio | info | prompt:65 |
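Rendered at the approval prompt, findings might look like this (illustrative output; actual rendering would use the existing `terminal_display` helpers):

```
⚠ default 30m timeout will kill this training job (prompt:37)
⚠ push_to_hub=True but no hub_model_id set (prompt:39)
ℹ no report_to="trackio"; training runs without a dashboard (prompt:65)
```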
Adjacent: not the same as #155 (tool-boundary observability — different layer, different owner). This proposal does not change tool router contracts.
Happy to send a PR if this direction is wanted.
## Edge cases

- Docker mode (`command` instead of `script`): script-parsing checks skip; the timeout check still applies.
- YOLO/headless mode: approval is bypassed, so findings aren't printed. Same behavior as the existing single check today.
- Web UI / backend: out of scope here. Closing that gap requires a frontend change and is a separate conversation.

All checks are conservative — e.g. the timeout check requires both a default timeout and a clear Trainer call, so it won't fire on inference-only scripts.
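A sketch of the parametrized tests (file name and cases are illustrative):

```python
# tests/test_reliability_checks.py (illustrative)
import pytest

from agent.utils.reliability_checks import run_preflight_checks

TRAIN = 'trainer = SFTTrainer(model=model, args=SFTConfig(report_to="trackio"))'

@pytest.mark.parametrize(
    ("arguments", "expected_severities"),
    [
        # inference-only script on the default timeout: no findings
        ({"script": "out = model.generate(**inputs)", "timeout": None}, []),
        # training script left on the default timeout: warn (prompt:37)
        ({"script": TRAIN, "timeout": "30m"}, ["warn"]),
        # docker mode (command, no script): script-parsing checks skip
        ({"command": "python train.py", "timeout": "4h"}, []),
    ],
)
def test_preflight(arguments, expected_severities):
    assert [f.severity for f in run_preflight_checks(arguments)] == expected_severities
```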