baker+ci: richer failure reporting — error class + source URL + failed-channel list #157

@gerchowl

Why

/spike 148 operator UX review: when a shard fails overnight and pages someone, today's logs answer "which shard" but not "which class of failure" (toktx vs HTTP vs disk-full vs something else). The `notify-on-failure` issue body says only "at least one shard failed", so operators always have to click through to the raw log.

What

Three small improvements across `hf_derive.py` + `derive.yml` + the `notify-on-failure` action:

  1. Error class tagging. When `_stream_transform_into_tar` trips the terminal gate, tag the error with a class derived from the type of the first exception:

    • `toktx` — `RuntimeError("toktx exit N: ...")`
    • `http` — `requests.HTTPError` / `Timeout`
    • `disk` — `OSError` with `errno.ENOSPC`
    • `unknown` — anything else

    Write the class into `$GITHUB_STEP_SUMMARY` and as a job output (`echo "error_class=toktx" >>"$GITHUB_OUTPUT"`). The report job reads the outputs across matrix legs and includes the class and failed-shard count in the issue title (e.g. `[derive-failed] ktx2 ambientcg/2k (toktx, 2 shards)`).

  2. Source tar URL at shard start. Emit `log.info("source: %s", tar_url)` when a shard starts, so the log has the resolve URL (not just the SHA) for one-line reproduction.

  3. Aggregated failed-channel list. When the terminal gate trips, write up to 10 `mid/ch: err` lines to the step summary — not just `first_error`.
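
A minimal sketch of items 1 and 3, under stated assumptions: `classify_error` and `write_failure_report` are hypothetical helper names, failures are assumed to be collected as `(mid, ch, err)` tuples, and the `toktx` class is assumed to be recognisable by the `"toktx exit"` message prefix quoted above. The real hook point inside `_stream_transform_into_tar` may look different.

```python
import errno
import os


def classify_error(exc):
    """Map an exception to one of: toktx, http, disk, unknown."""
    # Imported lazily so this sketch also runs where requests is not installed.
    try:
        import requests
    except ImportError:
        requests = None

    if isinstance(exc, RuntimeError) and str(exc).startswith("toktx exit"):
        return "toktx"
    # Check http before disk: requests exceptions are themselves OSError subclasses.
    if requests is not None and isinstance(exc, (requests.HTTPError, requests.Timeout)):
        return "http"
    if isinstance(exc, OSError) and exc.errno == errno.ENOSPC:
        return "disk"
    return "unknown"


def write_failure_report(error_class, failures, env=None, limit=10):
    """Append the class to $GITHUB_OUTPUT and up to `limit` failures to the step summary.

    `failures` is an assumed list of (mid, ch, err) tuples.
    """
    env = os.environ if env is None else env

    out_path = env.get("GITHUB_OUTPUT")
    if out_path:
        with open(out_path, "a", encoding="utf-8") as fh:
            fh.write(f"error_class={error_class}\n")

    summary_path = env.get("GITHUB_STEP_SUMMARY")
    if summary_path:
        with open(summary_path, "a", encoding="utf-8") as fh:
            fh.write(f"error class: `{error_class}`\n\n")
            for mid, ch, err in failures[:limit]:
                fh.write(f"- `{mid}/{ch}`: {err}\n")
```

Both files are opened in append mode because other steps in the same job may already have written to them. Outside GitHub Actions the two environment variables are unset and the function does nothing.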

Pitfalls

  • Matrix job outputs are awkward to aggregate. If the plumbing gets gnarly, fall back to having each shard write its error class to a shared artifact and having the report job read them.
  • Don't leak HF_TOKEN in error URLs.
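
One way to guard the `source:` log line against the token pitfall is to scrub the URL before logging it. The sketch below is illustrative only: `redact_url`, `log_source`, and the `SENSITIVE_PARAMS` set are hypothetical names, and it covers just two cases, credentials in the userinfo part of the URL and token-like query parameters. If the token only ever travels in an `Authorization` header, this is a cheap safety net rather than a required fix.

```python
import logging
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

log = logging.getLogger("hf_derive")

# Assumed parameter names; adjust to whatever the fetcher actually uses.
SENSITIVE_PARAMS = {"token", "access_token"}


def redact_url(url):
    """Return `url` without userinfo credentials and with token params masked."""
    parts = urlsplit(url)

    # Rebuild the host without any "user:password@" prefix.
    host = parts.hostname or ""
    if parts.port:
        host = f"{host}:{parts.port}"

    # Mask sensitive query values but keep the keys so the URL shape stays readable.
    query = urlencode(
        [
            (key, "REDACTED" if key.lower() in SENSITIVE_PARAMS else value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
        ]
    )
    return urlunsplit((parts.scheme, host, parts.path, query, parts.fragment))


def log_source(tar_url):
    """Log the source tar URL at shard start, scrubbed of credentials."""
    log.info("source: %s", redact_url(tar_url))
```

The same `redact_url` call could be applied to any URL that ends up in an error message or in the step summary.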

Acceptance criteria

  • Failed run's notify issue title/body names the error class and how many shards failed.
  • `source:` URL appears in the first 10 log lines of every shard.
  • Failed-channel list (top-10) appears in the step summary when the terminal gate fires.
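
For the first criterion, the report job needs to fold the per-shard classes into one title. A rough sketch, assuming the classes have already been collected into a list (from matrix outputs or from the shared-artifact fallback); `build_issue_title` is a hypothetical helper, and joining mixed classes with `+` is an assumption, not something the issue specifies.

```python
from collections import Counter


def build_issue_title(target, shard_classes):
    """Build a notify-issue title from one error class per failed shard."""
    classes = list(shard_classes)
    if not classes:
        raise ValueError("no failed shards to report")

    counts = Counter(classes)
    # Most frequent class first; ties broken by name so titles are stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    joined = "+".join(name for name, _ in ranked)

    noun = "shard" if len(classes) == 1 else "shards"
    return f"[derive-failed] {target} ({joined}, {len(classes)} {noun})"


if __name__ == "__main__":
    print(build_issue_title("ktx2 ambientcg/2k", ["toktx", "toktx"]))
    # prints: [derive-failed] ktx2 ambientcg/2k (toktx, 2 shards)
```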

Refs /spike 148 review, PR #148.

Labels: area:baker (Baker pipeline, Dagger, data fetchers), area:ci (CI/CD, GitHub Actions, workflows), priority:medium (Important but not urgent)
