feat(reliability): expand pre-flight static checks for hf_jobs approval #238
Open
Suvradippaul wants to merge 2 commits into huggingface:main from
system_prompt_v3.yaml documents eight recurring training-job failure modes, but only one (missing model save) was caught at the CLI approval prompt.

Expand the pre-flight check to cover four additional documented pitfalls:
- default 30m timeout on a training call
- push_to_hub=True without hub_model_id
- flash_attention_2 without flash-attn in deps
- missing report_to="trackio"

Switch to a structured Finding(severity, message) return so each finding can render with appropriate emphasis. Checks stay advisory (never blocking); the legacy entry point is preserved for back-compat.

Adds tests/unit/test_reliability_checks.py with parametrized coverage for all five checks plus the legacy wrapper.

Closes huggingface#203

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
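For orientation, here is a minimal sketch of what a structured, advisory check could look like. It is not the code from this PR: the function names `check_timeout`, `check_push_to_hub` and `run_checks`, the severity labels, the timeout string format, and the substring heuristics are all assumptions made for illustration. Only the `Finding(severity, message)` shape and the pitfalls themselves come from the commit message.

```python
from typing import List, NamedTuple, Optional


class Finding(NamedTuple):
    """Advisory result of one pre-flight check (shape taken from the commit message)."""
    severity: str  # assumed labels, e.g. "warning" or "info"
    message: str


def check_timeout(script: str, timeout: Optional[str]) -> Optional[Finding]:
    # Hypothetical heuristic: a script "looks like training" if it calls .train(
    looks_like_training = ".train(" in script
    # Assumption: an unset timeout means the 30m default applies.
    if looks_like_training and timeout in (None, "30m"):
        return Finding("warning", "training call with the default 30m timeout")
    return None


def check_push_to_hub(script: str) -> Optional[Finding]:
    # Strip spaces so "push_to_hub = True" and "push_to_hub=True" both match.
    compact = script.replace(" ", "")
    if "push_to_hub=True" in compact and "hub_model_id" not in compact:
        return Finding("warning", "push_to_hub=True without hub_model_id")
    return None


def run_checks(script: str, timeout: Optional[str] = None) -> List[Finding]:
    # Every check is advisory: findings are collected, nothing raises or blocks.
    results = [check_timeout(script, timeout), check_push_to_hub(script)]
    return [f for f in results if f is not None]
```

Returning a list of findings rather than a single string lets the caller decide how to render each severity, while an empty list means there is nothing to show at the approval prompt.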
Expands the static pre-flight check at the CLI approval prompt from one substring rule to five, covering four additional documented training-job failure modes from agent/prompts/system_prompt_v3.yaml. Returns structured Finding(severity, message) so each finding can render with appropriate emphasis. Checks remain advisory (never blocking); the legacy check_training_script_save_pattern entry point is preserved for back-compat.

The five checks (and the system-prompt lines they cite):
1. Missing model save (prompt:39): kept; recognises trainer.save_model / save_pretrained for the ephemeral-storage warning.
2. Default 30m timeout on a training call (prompt:37).
3. push_to_hub=True without hub_model_id (prompt:39).
4. attn_implementation="flash_attention_2", legacy literal (prompt:45): steers to kernels-community/flash-attn2 per the post-#204 guidance ("Steer agent to HF kernels instead of pip install flash-attn").
5. Missing report_to="trackio" (prompt:65-70).

Docker-mode jobs get the timeout check; script-parsing checks self-skip when no script is present.

Test plan
- uv run pytest tests/unit/test_reliability_checks.py (33 passing)
- uv run ruff check agent/utils/reliability_checks.py agent/main.py tests/unit/test_reliability_checks.py (clean)
- main

Closes #203
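The self-skip behaviour for script-less jobs and the back-compat wrapper can be pictured with a short sketch. This is an illustration under assumptions, not the PR's implementation: `run_preflight`, `check_save`, `check_timeout`, the shape of the job-arguments dict, and `legacy_save_check` (a stand-in for the legacy entry point, whose real signature is not shown here) are all hypothetical. Only the `trainer.save_model` / `save_pretrained` patterns and the `Finding(severity, message)` shape come from the description.

```python
from typing import Dict, List, NamedTuple, Optional


class Finding(NamedTuple):
    severity: str
    message: str


# Patterns named in the description; matching by substring is an assumption.
SAVE_PATTERNS = ("trainer.save_model", "save_pretrained")


def check_save(script: str) -> Optional[Finding]:
    # Simplification: warn on any script that never saves the model.
    if any(pattern in script for pattern in SAVE_PATTERNS):
        return None
    return Finding("warning", "no model save found; job storage is ephemeral")


def check_timeout(args: Dict[str, str]) -> Optional[Finding]:
    # Needs only the job arguments, so it also applies to Docker-mode jobs.
    if "timeout" not in args:
        return Finding("info", "no timeout set; the default applies")
    return None


def run_preflight(args: Dict[str, str]) -> List[Finding]:
    findings = [check_timeout(args)]
    script = args.get("script")
    if script:  # no script (Docker mode): script-parsing checks self-skip
        findings.append(check_save(script))
    return [f for f in findings if f is not None]


def legacy_save_check(script: str) -> Optional[str]:
    # Stand-in for the legacy entry point: same check, old string-or-None shape.
    finding = check_save(script)
    return finding.message if finding else None
```

Keeping the old function as a thin wrapper over the new check means existing callers continue to work while the structured path gains the extra rules.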