ci: migrate 16 of 21 ci.yml jobs to smithy self-hosted runners #262

Open
avrabe wants to merge 4 commits into main from smithy-migration

avrabe commented May 3, 2026

Summary

Second pulseengine repo onto the smithy self-hosted fleet, after the
spar pilot (pulseengine/spar#201). Same runner-class mapping, same
workarounds, same expected ~470x end-to-end win driven by queue
elimination on the org-free Actions tier.

Coverage

Class      Jobs
rust-cpu   clippy, docs-check, test, semver-checks, coverage, proptest, fuzz, msrv
lean-mem   miri, mutants, verus
light      fmt, yaml-lint, deny, supply-chain, release-results
(hosted)   playwright, vscode-extension, audit, kani, rocq

Why these stay on hosted (each commented in-place)

  • playwright: `npx playwright install --with-deps` calls
    `apt-get install` via sudo; smithy runners can't sudo.
  • vscode-extension: `xvfb-run` + the VS Code Test environment
    expects `sudo apt-get` for system libs.
  • audit: smithy ships cargo-audit v0.21.2; its bundled rustsec
    parser rejects RUSTSEC-2026-0037 ("unsupported CVSS version: 4.0").
    v0.22.1 fixes it but the install trips on smithy's sccache-on-cc
    setup. Move back once smithy bumps cargo-audit.
  • kani: kani-verifier bundles CBMC (~100 MB); not pre-installed
    on smithy. Migrate when smithy's toolchains role ships kani.
  • rocq: Coq install is heavy; not on smithy yet.
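
The audit bullet turns on a version boundary (0.21.2 rejects CVSS 4.0, 0.22.1 accepts it). A guard along these lines could decide when the job is safe to move; this is only a sketch. `version_ge` is a hypothetical helper, it relies on GNU `sort -V`, and how the installed version string is obtained on a runner is deliberately left out.

```shell
# version_ge A B: succeeds when version A >= version B (GNU sort -V ordering).
# Hypothetical helper, not part of this PR.
version_ge() {
  # If B sorts first (or A equals B), then A >= B.
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

installed="0.21.2"   # the cargo-audit version this PR reports on smithy
required="0.22.1"    # the version this PR says fixes the CVSS 4.0 parse error

if version_ge "$installed" "$required"; then
  echo "audit can move to smithy"
else
  echo "audit stays on hosted"
fi
```

With the versions quoted above, the second branch is taken.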

Two non-trivial fixes inside migrated jobs

  1. test: actionlint install moved from
    `sudo mv /tmp/actionlint /usr/local/bin` to
    `mv /tmp/actionlint $HOME/.local/bin` plus a GITHUB_PATH
    update. Smithy runners have no sudo; same binary, different
    writable location.

  2. deny: dropped `cargo deny check` (which would fail loading
    advisory-db with CVSS 4.0) for
    `cargo deny check bans licenses sources`. The audit job (still
    on hosted) covers vulnerability matching meanwhile. Same
    workaround spar landed in pulseengine/spar#201
    (ci: pilot-migrate clippy job to smithy self-hosted runners).
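
For reference, the sudo-free install in fix 1 can be sketched roughly as below. `install_user_bin` is a hypothetical name and the actual step in ci.yml may be inline rather than a function; the only things taken from this PR are the `$HOME/.local/bin` destination and the GITHUB_PATH append.

```shell
# install_user_bin SRC: move a downloaded binary into $HOME/.local/bin
# without sudo. Hypothetical helper for illustration.
install_user_bin() {
  src=$1
  dest_dir="$HOME/.local/bin"
  mkdir -p "$dest_dir"        # the directory may not exist on a fresh runner
  mv "$src" "$dest_dir/"
  # GITHUB_PATH is a file provided by the Actions runner; each line appended
  # to it is added to PATH for subsequent steps of the job.
  if [ -n "${GITHUB_PATH:-}" ]; then
    echo "$dest_dir" >> "$GITHUB_PATH"
  fi
}
```

In the test job this would be called as `install_user_bin /tmp/actionlint` after the download step. The PATH change only takes effect in later steps, so a step that needs the binary immediately would have to call it by full path.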

Test plan

  • CI run completes — 16 migrated jobs land on the right smithy classes
  • No EACCES events in smithy's journalctl -u smithy-trace-eacces.service
  • Hosted jobs (playwright, vscode-extension, audit, kani, rocq) continue to pass as before
  • Repeat-run shows much shorter total wall time vs the most recent
    green main-branch CI run
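
The EACCES item in the test plan could be checked on the smithy host with something like the following. It is a sketch that assumes the trace service logs lines containing the literal string `EACCES`; the real log format is not shown in this PR, and `count_eacces` is a hypothetical helper.

```shell
# count_eacces: read log lines on stdin and print how many mention EACCES.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the
# function usable under "set -e".
count_eacces() {
  grep -c 'EACCES' || true
}

# Intended use on the runner host (not executed here):
#   journalctl -u smithy-trace-eacces.service --since "1 hour ago" --no-pager | count_eacces
```

A result of 0 over the window covering the CI run would satisfy that checklist item.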

Rollback

Revert this commit. Every job goes back to ubuntu-latest.

Builds on the spar pilot (pulseengine/spar#201) — same runner-class
mapping, same workarounds for the rustsec parser CVSS 4.0 issue,
same direct-cargo-deny pattern.

Migrated to smithy:

  rust-cpu      clippy, docs-check, test, semver-checks, coverage,
                proptest, fuzz, msrv
  lean-mem      miri, mutants, verus
  light         fmt, yaml-lint, deny, supply-chain, release-results

Stay on ubuntu-latest (each with explanatory comment in-place):

  - playwright       (--with-deps does sudo apt-get; smithy runners no sudo)
  - vscode-extension (xvfb-run + downloaded VS Code Test setup)
  - audit            (cargo-audit 0.21 rustsec parser rejects CVSS 4.0)
  - kani             (kani-verifier bundles CBMC, ~100 MB install)
  - rocq             (Coq install, not on smithy yet)

Two non-trivial fixes inside migrated jobs:

  - test: actionlint install moved from `sudo mv /tmp/actionlint
    /usr/local/bin` to `mv /tmp/actionlint $HOME/.local/bin` plus
    GITHUB_PATH update. Smithy runners have no sudo; same binary,
    different writable location.
  - deny: dropped the `cargo deny check` (which would fail loading
    advisory-db with CVSS 4.0) for `cargo deny check bans licenses
    sources`. The audit job (still on hosted) covers vulnerability
    matching meanwhile.

Expected improvement: spar's broad migration showed ~470x end-to-end
speedup on clippy (~470 min → 1 min) thanks to queue elimination.
Rivet should see a similar gain; its recent runs showed 600+ min total.

codecov Bot commented May 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


avrabe added 3 commits May 3, 2026 17:05
First migration run timed out exactly at 15:00 with tests still
progressing (last printed test at ~11:00). Smithy's lean-mem class
appears to run the slow tail tests slower than the previous hosted
runner did. The cause could be cgroup memory pressure (24G MemoryHigh
under Miri's shadow allocations) or simply slower tail-test
performance. Bumping
the budget conservatively; revisit once we have a few green runs
to dial it back closer to actual.

Semver Checks is also failing on this PR — upstream issue
('unsupported rustdoc format v57', the action ships a too-old
cargo-semver-checks). NOT a smithy-migration issue; would fail on
hosted too. Tracked as a separate follow-up; doesn't block this PR.

Smithy main now points TMPDIR / TMP / TEMP at the per-runner
/var/lib/runners/runnerN/_tmp on lv_runners (500 G), instead of
the host's /tmp on lv_root (80 G). Previous run hit 'no space
left on device' when the rivet HTML-export test ran out of root
FS budget. Runners restarted; this commit triggers a fresh CI.
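
The effect of that smithy change can be pictured with the sketch below. How smithy actually sets these variables (service unit, wrapper script, or otherwise) is not stated here, so both function names are hypothetical; only the path layout is taken from the commit message.

```shell
# runner_tmp_dir N: per-runner temp path, following the layout named in the
# commit message (/var/lib/runners/runnerN/_tmp).
runner_tmp_dir() {
  printf '/var/lib/runners/runner%s/_tmp\n' "$1"
}

# use_runner_tmp N: point TMPDIR / TMP / TEMP at that directory for the
# current shell and its children. Illustration only.
use_runner_tmp() {
  dir=$(runner_tmp_dir "$1")
  export TMPDIR="$dir" TMP="$dir" TEMP="$dir"
}
```

On a runner, `df -h "$TMPDIR"` is a quick way to confirm the directory sits on the larger volume rather than the root filesystem.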
…tall

obi1kenobi/cargo-semver-checks-action@v2 bundles an older
cargo-semver-checks that doesn't recognise rustdoc JSON v57
(the format current stable rustdoc emits). Every PR run failed
with 'unsupported rustdoc format v57 for file: rivet_core.json'.

Going direct: install the latest cargo-semver-checks at job time
and invoke it. Slightly slower on cold cache but tracks the
upstream rustdoc format. Same end-effect as the wrapper.

Caught during the rivet broad-CI smithy migration (PR #262); not
related to self-hosted vs hosted.
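
A minimal sketch of the "going direct" approach, assuming the job only needs the default release check. `run_semver_checks` is a hypothetical wrapper; the exact flags used in ci.yml are not shown in this conversation.

```shell
# run_semver_checks: install cargo-semver-checks at job time and invoke it
# directly instead of going through the wrapper action.
run_semver_checks() {
  if ! command -v cargo >/dev/null 2>&1; then
    echo "cargo not found on PATH" >&2
    return 127
  fi
  # --locked builds with the dependency versions pinned by the published crate.
  cargo install cargo-semver-checks --locked &&
    cargo semver-checks
}
```

Installing at job time is what makes cold-cache runs slower, since the tool is compiled from source each time the cache misses.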

@github-actions github-actions Bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Rivet Criterion Benchmarks'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.20.

Benchmark suite          Current: 62ebeac               Previous: 9b45c86              Ratio
link_graph_build/10000   37548089 ns/iter (± 2433137)   29210248 ns/iter (± 1823498)   1.29

This comment was automatically generated by workflow using github-action-benchmark.
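
The reported ratio can be reproduced from the two ns/iter figures. The sketch below assumes the alert compares current divided by previous against the threshold, which matches the numbers in this comment but is not taken from the tool's source; both helpers are hypothetical.

```shell
# ratio CURRENT PREVIOUS: print CURRENT/PREVIOUS rounded to two decimals.
ratio() {
  awk -v cur="$1" -v prev="$2" 'BEGIN { printf "%.2f\n", cur / prev }'
}

# exceeds_threshold CURRENT PREVIOUS THRESHOLD: succeeds when the ratio is
# strictly greater than THRESHOLD.
exceeds_threshold() {
  awk -v cur="$1" -v prev="$2" -v thr="$3" 'BEGIN { exit !(cur / prev > thr) }'
}

ratio 37548089 29210248    # prints 1.29
```

Both commits' error bars are in the millions of ns, so a rerun on the same runner class would help tell a real regression from runner-to-runner noise.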
