Why
From the /spike 148 operator UX review: when a shard fails overnight and pages someone, today's logs answer "which shard" but not "which class of failure" (toktx vs HTTP vs disk-full vs something else). The `notify-on-failure` issue body only says "at least one shard failed", so operators always have to click through to the raw log.
What
Three small improvements across `hf_derive.py` + `derive.yml` + the `notify-on-failure` action:
- Error class tagging. When `_stream_transform_into_tar` trips the terminal gate, tag the error with a class derived from the type of the first exception:
  - `toktx`: `RuntimeError("toktx exit N: ...")`
  - `http`: `requests.HTTPError` / `Timeout`
  - `disk`: `OSError` with `errno.ENOSPC`
  - `unknown`: anything else

  Write the class into `$GITHUB_STEP_SUMMARY` and as a job output (`echo "error_class=toktx" >>"$GITHUB_OUTPUT"`). The report job reads the outputs across matrix legs and includes the class in the issue title (e.g. `[derive-failed] ktx2 ambientcg/2k (toktx, 2 shards)`).
- Source tar URL at shard start. Emit `log.info("source: %s", tar_url)` so the log has the resolve URL (not just the SHA) for one-line reproduction.
- Aggregated failed-channel list. When the terminal gate trips, write up to 10 `mid/ch: err` lines to the step summary, not just `first_error`.
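
A minimal sketch of the tagging and summary pieces, not the actual `hf_derive.py` code. The names `classify_error`, `report_failure`, and the `(mid, ch, err)` tuple shape are hypothetical; the `"toktx exit"` prefix check assumes the `RuntimeError` message format shown in the list of classes. Both env vars are only set inside Actions, so the writes are skipped when running locally.

```python
import errno
import os

try:
    import requests
    HTTP_ERRORS = (requests.HTTPError, requests.Timeout)
except ImportError:  # keeps the sketch importable without requests installed
    HTTP_ERRORS = ()


def classify_error(exc):
    """Map the first exception to one of: toktx, http, disk, unknown."""
    # Assumption: toktx failures surface as RuntimeError("toktx exit N: ...").
    if isinstance(exc, RuntimeError) and str(exc).startswith("toktx exit"):
        return "toktx"
    if isinstance(exc, HTTP_ERRORS):
        return "http"
    if isinstance(exc, OSError) and exc.errno == errno.ENOSPC:
        return "disk"
    return "unknown"


def _append(path, text):
    """Append text to path; do nothing when the env var was unset."""
    if path:
        with open(path, "a", encoding="utf-8") as fh:
            fh.write(text)


def report_failure(first_exc, failed, limit=10):
    """Write the error class and up to `limit` failed channels.

    `failed` is a list of (mid, ch, err) tuples (hypothetical shape).
    """
    error_class = classify_error(first_exc)
    lines = [f"Error class: `{error_class}`", ""]
    for mid, ch, err in failed[:limit]:
        # Keep only the first line of each error so the summary stays compact.
        first_line = (str(err).splitlines() or [""])[0]
        lines.append(f"- `{mid}/{ch}`: {first_line}")
    if len(failed) > limit:
        lines.append(f"- ... and {len(failed) - limit} more")
    _append(os.environ.get("GITHUB_STEP_SUMMARY"), "\n".join(lines) + "\n")
    _append(os.environ.get("GITHUB_OUTPUT"), f"error_class={error_class}\n")
    return error_class
```

`requests.HTTPError` subclasses `OSError`, so the `disk` branch checks `errno` explicitly rather than relying on the exception type alone.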
Pitfalls
- Matrix job outputs are awkward to aggregate. If the plumbing gets gnarly, fall back to having each shard write its error class to a shared artifact that the report job reads.
- Don't leak HF_TOKEN in error URLs.
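
A rough sketch of both pitfalls, under stated assumptions. `redact` and `issue_title` are hypothetical helpers; the artifact layout (one `*.txt` file per failed shard containing just the class) and the `toktx/http` format for mixed classes are assumptions, not something the workflow defines today.

```python
import os
import re
from collections import Counter
from pathlib import Path
from typing import Optional


def redact(text: str, token: Optional[str] = None) -> str:
    """Remove the HF token and URL-embedded credentials from a log line."""
    if token is None:
        token = os.environ.get("HF_TOKEN")
    if token:
        text = text.replace(token, "***")
    # Also strip credentials embedded as https://user:secret@host/...
    return re.sub(r"(://)[^/@\s]+@", r"\1***@", text)


def issue_title(target: str, class_dir: Path) -> str:
    """Build the issue title from per-shard error class files."""
    classes = [
        p.read_text(encoding="utf-8").strip()
        for p in sorted(class_dir.glob("*.txt"))
    ]
    counts = Counter(c for c in classes if c)
    if not counts:
        return f"[derive-failed] {target}"
    # Most frequent class first; ties broken alphabetically for stable titles.
    names = [c for c, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]
    n = sum(counts.values())
    plural = "s" if n != 1 else ""
    return f"[derive-failed] {target} ({'/'.join(names)}, {n} shard{plural})"
```

Running every logged URL and error string through `redact` before it reaches the log, the step summary, or the issue body keeps the token out of all three.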
Acceptance criteria
Refs /spike 148 review, PR #148.