feat(reliability): expand pre-flight static checks for hf_jobs approval #238

Open
Suvradippaul wants to merge 2 commits into huggingface:main from Suvradippaul:reliability-checks-expansion

Suvradippaul commented May 7, 2026

Expands the static pre-flight check at the CLI approval prompt from one substring rule to five checks, covering four additional documented training-job failure modes from agent/prompts/system_prompt_v3.yaml. Checks now return a structured Finding(severity, message) so each can render with appropriate emphasis. Checks remain advisory (never blocking); the legacy check_training_script_save_pattern entry point is preserved for back-compat.
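For readers unfamiliar with the pattern, here is a minimal sketch of how a structured finding and a back-compat wrapper could fit together. Only the names Finding, severity, message, and check_training_script_save_pattern come from this description; the severity values, the wrapper's signature and return type, the helper names check_save_pattern and render, and the rule that a warning fires when no save call is present are all assumptions for illustration, not the code in this PR.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Finding:
    severity: str  # assumed values: "info" or "warning"
    message: str


# Substrings treated as evidence that the script saves the model (assumed list).
SAVE_MARKERS = ("save_model", "save_pretrained")


def check_save_pattern(script: str) -> List[Finding]:
    """Hypothetical structured check: warn when no save call appears."""
    if any(marker in script for marker in SAVE_MARKERS):
        return []
    return [Finding("warning", "No model save call found; job storage is ephemeral.")]


def check_training_script_save_pattern(script: str) -> Optional[str]:
    """Legacy-style wrapper (signature assumed): returns a plain message or None."""
    findings = check_save_pattern(script)
    return findings[0].message if findings else None


def render(finding: Finding) -> str:
    """Hypothetical renderer: emphasis depends on severity."""
    prefix = "WARNING" if finding.severity == "warning" else "note"
    return f"[{prefix}] {finding.message}"
```

Keeping the wrapper as a thin adapter over the structured check means existing callers keep receiving a plain message while new callers can branch on severity.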

The five checks (and the system-prompt lines they cite):

  1. Save pattern (prompt:39): kept; recognises trainer.save_model / save_pretrained for the ephemeral-storage warning.
  2. Default 30m timeout combined with a Trainer/SFT/GRPO call (prompt:37).
  3. push_to_hub=True without hub_model_id (prompt:39).
  4. attn_implementation="flash_attention_2" as a legacy literal (prompt:45): steers to kernels-community/flash-attn2, per the guidance following #204 ("Steer agent to HF kernels instead of pip install flash-attn").
  5. Training entry point without report_to="trackio" (prompt:65-70).

Docker-mode jobs get the timeout check; script-parsing checks self-skip when no script is present.
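As a rough illustration of the dispatch described above, the sketch below runs the timeout check for every job and skips the script-parsing checks when no script is present. The function names, regexes, message wording, the run_checks signature, and the idea of matching the Docker command text for a training call are all assumptions; the save-pattern check is omitted for brevity. This is not the implementation in agent/utils/reliability_checks.py.

```python
import re
from typing import List, NamedTuple, Optional


class Finding(NamedTuple):
    severity: str
    message: str


DEFAULT_TIMEOUT = "30m"
# Assumed heuristic for "a Trainer/SFT/GRPO call".
TRAINING_CALL = re.compile(r"\b(SFTTrainer|GRPOTrainer|Trainer)\s*\(")


def check_timeout(timeout: Optional[str], text: str) -> List[Finding]:
    uses_default = timeout is None or timeout == DEFAULT_TIMEOUT
    if uses_default and TRAINING_CALL.search(text):
        return [Finding("warning", "Training call with the default 30m timeout.")]
    return []


def check_hub_model_id(script: str) -> List[Finding]:
    if re.search(r"push_to_hub\s*=\s*True", script) and "hub_model_id" not in script:
        return [Finding("warning", "push_to_hub=True without hub_model_id.")]
    return []


def check_flash_attention(script: str) -> List[Finding]:
    if re.search(r"attn_implementation\s*=\s*[\"']flash_attention_2[\"']", script):
        return [Finding("info", "Consider kernels-community/flash-attn2.")]
    return []


def check_trackio(script: str) -> List[Finding]:
    has_trackio = re.search(r"report_to\s*=\s*[\"']trackio[\"']", script)
    if TRAINING_CALL.search(script) and not has_trackio:
        return [Finding("info", "Training entry point without report_to=\"trackio\".")]
    return []


def run_checks(script: Optional[str], timeout: Optional[str], command: str = "") -> List[Finding]:
    # The timeout check always runs; in Docker mode it sees the command text.
    findings = check_timeout(timeout, script if script is not None else command)
    if script is None:
        return findings  # script-parsing checks self-skip
    for check in (check_hub_model_id, check_flash_attention, check_trackio):
        findings.extend(check(script))
    return findings
```

Because every check returns a list, the caller can concatenate results without special-casing checks that found nothing, and the findings stay advisory: nothing here raises or blocks approval.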

Test plan

  • uv run pytest tests/unit/test_reliability_checks.py — 33 passing
  • uv run ruff check agent/utils/reliability_checks.py agent/main.py tests/unit/test_reliability_checks.py — clean
  • No regressions in existing unit tests vs main

Closes #203

Suvradippaul and others added 2 commits May 7, 2026 14:37
system_prompt_v3.yaml documents eight recurring training-job failure
modes but only one (missing model save) was caught at the CLI approval
prompt. Expand the pre-flight check to cover four additional documented
pitfalls — default 30m timeout on a training call, push_to_hub=True
without hub_model_id, flash_attention_2 without flash-attn in deps, and
missing report_to="trackio" — and switch to a structured
Finding(severity, message) return so each can render with appropriate
emphasis. Checks stay advisory (never blocking); the legacy entry point
is preserved for back-compat. Adds tests/unit/test_reliability_checks.py
with parametrized coverage for all five checks plus the legacy wrapper.

Closes huggingface#203

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Expand pre-flight static checks for hf_jobs approval

1 participant