perf: speed up RTS testing on macOS — split #rts build/check + parallelise test variants#6029

Open
ggreif wants to merge 16 commits into master from gabor/rts-parallel-tests

@ggreif
Contributor

@ggreif ggreif commented Apr 18, 2026

Summary

Two complementary attacks on RTS test wall-clock — parallelisation makes the test suite finish faster when it does run, and the new build/check split lets nix build .#rts skip running it altogether on Mac (where it has no cache hit).

Build-only .#rts and check-running .#rts-checked (latest)

nix/rts.nix factored into two derivations sharing all build attrs:

  • rts-build: doCheck = false, produces the cross-compiled wasm artifacts only.
  • rts-checked: doCheck = true, additionally runs make -j8 test.

Wired into flake.nix so:

  • .#rts-checked is always the test-running variant.
  • .#rts aliases rts-checked on Linux (Hydra-cached, fast — no behaviour change) and rts-build on darwin (skips the slow path).
  • nightly-macos-test gains an explicit rts-checked job that builds .#rts-checked directly, so the darwin-side cargo test suite still runs daily on master.

Why: on macos-latest (aarch64-darwin) the host-side cargo test is a from-source build with no Hydra cache hit (combo not in cachix), adding ~10+ min to every artifact build. On Linux it's a fast cache hit. The split is platform-asymmetric on purpose: PR-blocking on Linux, scheduled-only on darwin.

Net effect: nix build .#rts terminates fast on the Mac everybody develops on; PR CI coverage on Linux is unchanged; darwin RTS test coverage moves from "every PR (slow)" to "nightly".

Phase A: Variant-level parallelism

  • Separate CARGO_TARGET_DIR per variant (target-<name>) to avoid cargo lock contention.
  • define/eval Makefile template generates build + per-module run targets.
  • make -j8 test in nix checkPhase.
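
The naming rule from the first bullet can be sketched in Rust, purely for illustration. The variant names ni, inc and 64 are taken from the commit message in this PR; the helpers target_dir_for and cargo_build_for are hypothetical, and the real logic lives in the Makefile rather than in Rust.

```rust
use std::process::Command;

/// Variant names as they appear in the target directories of this PR.
pub const VARIANTS: [&str; 3] = ["ni", "inc", "64"];

/// Hypothetical helper: one cargo target directory per variant, so that
/// concurrent cargo invocations do not wait on the same build lock.
pub fn target_dir_for(variant: &str) -> String {
    format!("target-{}", variant)
}

/// Hypothetical helper: prepares (but does not run) a cargo invocation
/// with CARGO_TARGET_DIR pointing at the variant's own directory.
/// The real build passes more flags; they are omitted here.
pub fn cargo_build_for(variant: &str) -> Command {
    let mut cmd = Command::new("cargo");
    cmd.arg("build")
        .env("CARGO_TARGET_DIR", target_dir_for(variant));
    cmd
}
```

With distinct directories, the three variant builds share no cargo lock file, which is what lets make run them concurrently.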

Phase B: Per-module parallelism via wasmtime --invoke

  • Each test module gets a #[no_mangle] pub extern "C" fn test_<mod>() entry point.
  • Makefile runs wasmtime --invoke test_<mod> per module — works on wasm64-unknown-unknown without WASI.
  • 3 variants × 24 modules = 72 parallel targets.
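
A minimal sketch of what such an entry point might look like. The module example_mod and its test body are placeholders, not code from this PR; only the #[no_mangle] pub extern "C" fn test_<mod>() shape is taken from the description above.

```rust
/// Placeholder for a real RTS test module (hypothetical name and body).
mod example_mod {
    pub fn test() {
        // A real module would exercise RTS functionality here.
        assert_eq!(2 + 2, 4);
    }
}

/// Exported under an unmangled name so that the runner can select it
/// with `--invoke test_example_mod`. A panic inside the test surfaces
/// as a trap, which is what makes the make target fail.
#[no_mangle]
pub extern "C" fn test_example_mod() {
    example_mod::test();
}
```

Because every module is exported as its own function, each one can run in a separate wasmtime process. Note that the Rust 2024 edition spells the attribute #[unsafe(no_mangle)].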

Phase C: GC seed chunking

  • Split 100 GC random seeds into 10 chunks of 10 (test_gc_chunk_0..9).
  • Separate gc_predefined (hand-crafted heaps) from gc_components (incremental/compacting internals).
  • Split persistence into persistence_small (up to 10k objects) and persistence_20k (20k objects).
  • Heavy tests ordered first in TEST_MODULES so make -j starts long poles early.
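
The chunking in the first bullet amounts to simple range arithmetic. The sketch below assumes the seeds are numbered 0..100, which the PR does not state; seeds_for_chunk and run_gc_chunk are hypothetical names, and run_one stands in for the real GC test on a single seed.

```rust
use std::ops::Range;

pub const TOTAL_SEEDS: u64 = 100;
pub const CHUNKS: u64 = 10;
pub const SEEDS_PER_CHUNK: u64 = TOTAL_SEEDS / CHUNKS;

/// Half-open range of seeds covered by chunk `chunk` (0 to 9).
/// Assumes seeds are numbered from 0; the real numbering may differ.
pub fn seeds_for_chunk(chunk: u64) -> Range<u64> {
    assert!(chunk < CHUNKS, "chunk index out of range");
    let start = chunk * SEEDS_PER_CHUNK;
    start..start + SEEDS_PER_CHUNK
}

/// Hypothetical driver for one chunk: calls `run_one` for each seed.
pub fn run_gc_chunk(chunk: u64, mut run_one: impl FnMut(u64)) {
    for seed in seeds_for_chunk(chunk) {
        run_one(seed);
    }
}
```

Ten entry points (test_gc_chunk_0 through test_gc_chunk_9) that each call such a driver with their own index cover all 100 seeds exactly once, while letting make schedule them independently.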

Dynamic test seeds

  • Stabilization small tests use a seed derived from git rev-parse HEAD at build time.
  • Each commit exercises different random heap configurations.
  • Fallback to fixed seed 4711 when not in a git repo (nix sandbox).
  • The heavy persistence_20k test uses fixed seed 4711 for predictable CI runtime.
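
One way the seed selection could work, as a sketch only. The PR does not say how the commit hash is turned into a number, so taking the first eight hex digits is an assumption, as is the function name seed_from_rev; only the fallback value 4711 comes from the description above.

```rust
/// Fixed seed used when no commit hash is available (from the PR text).
pub const FALLBACK_SEED: u64 = 4711;

/// Hypothetical derivation: interpret the first 8 hex digits of the
/// commit hash as a number. Falls back to the fixed seed when the hash
/// is missing, too short, or not valid hex (e.g. in the nix sandbox).
pub fn seed_from_rev(rev: Option<&str>) -> u64 {
    rev.map(str::trim)
        .and_then(|r| r.get(..8))
        .and_then(|prefix| u64::from_str_radix(prefix, 16).ok())
        .unwrap_or(FALLBACK_SEED)
}
```

Whatever the actual derivation is, the important property is the same: the seed is a pure function of the commit, so a failure seen in CI can be reproduced by checking out that commit.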

Bug fix: heap size scaling

  • heap_size_for_gc for incremental GC ignored total_heap_size_bytes, always returning 3 * PARTITION_SIZE (192 MB).
  • For seeds generating large object graphs, the dynamic heap exceeded this fixed size.
  • Fix: max(3 * PARTITION_SIZE, 2 * total_heap_size_bytes).
  • Discovered via seed 20_000 which generates a dense 20k-object graph.
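
The before/after behaviour can be sketched as follows. PARTITION_SIZE = 64 MiB is inferred from "3 * PARTITION_SIZE (192 MB)", and the function names and signatures are simplified stand-ins for the real heap_size_for_gc, which takes more context.

```rust
use std::cmp::max;

/// Inferred from the PR text: 3 * PARTITION_SIZE is 192 MB.
pub const PARTITION_SIZE: u64 = 64 * 1024 * 1024;

/// Behaviour before the fix (simplified): the requested size is ignored.
pub fn heap_size_before_fix(_total_heap_size_bytes: u64) -> u64 {
    3 * PARTITION_SIZE
}

/// Behaviour after the fix (simplified): keep the old minimum, but grow
/// to twice the content size when the object graph is large.
pub fn heap_size_after_fix(total_heap_size_bytes: u64) -> u64 {
    max(3 * PARTITION_SIZE, 2 * total_heap_size_bytes)
}
```

For a 200 MiB object graph the old formula still yields 192 MiB, which is too small to hold the content, while the new one yields 400 MiB. Small graphs keep the previous 192 MiB heap.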

Other improvements

  • test -f guard before wasmtime to fail fast if cargo didn't produce the binary.
  • WASMTIME_BACKTRACE_DETAILS=1 for better crash diagnostics.
  • Removed unnecessary unsafe blocks in test entry points.

Observed speedup (parallelisation alone): from 2+ hours sequential to ~90 minutes parallel on macOS, limited by persistence_20k (Amdahl's law). The .#rts build-skip on top of that cuts the Mac path further whenever tests aren't needed (artifacts, dev shells, anything not gating on RTS unit tests).

Test plan

  • CI green
  • nix build .#rts on darwin skips cargo test (timing check)
  • nix build .#rts-checked on darwin runs and passes cargo test
  • nix build .#rts on Linux still runs cargo test (regression guard)
  • Next nightly nightly-macos-test run shows the new rts-checked job — it does

🤖 Generated with Claude Code

@ggreif ggreif requested a review from a team as a code owner April 18, 2026 08:34
@ggreif ggreif self-assigned this Apr 18, 2026
@ggreif ggreif added the testing Related to test suite label Apr 18, 2026
Use CARGO_TARGET_DIR per test variant (target-ni, target-inc, target-64)
to avoid cargo lock contention, enabling `make -j3 test` to build and
run all three RTS test variants in parallel.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ggreif ggreif force-pushed the gabor/rts-parallel-tests branch from 6479860 to 4509ad2 Compare April 18, 2026 09:04
Factor out the repeated test build/run pattern into a reusable
test_variant macro. The cargo target dir is derived from the make
target name (target-<name>).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ggreif ggreif force-pushed the gabor/rts-parallel-tests branch from 4509ad2 to 61dee13 Compare April 18, 2026 09:15
@ggreif ggreif changed the title from "perf: parallelise RTS test variants" to "chore: parallelise RTS test variants" Apr 18, 2026
@github-actions
Contributor

github-actions Bot commented Apr 18, 2026

Comparing from 6a81fc5 to d8c9c87:
The produced WebAssembly code seems to be completely unchanged.
In terms of gas, no changes are observed in 5 tests.
In terms of size, no changes are observed in 5 tests.

ggreif and others added 10 commits April 18, 2026 12:41
- make -j3 → make -j: the number of test variants is the natural limit
- test -f on the wasm binary before wasmtime: fail fast if cargo didn't
  produce the binary (wasmtime may return 0 on missing file)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add RTS_TEST_FILTER via wasmtime --invoke for per-module entry points
- Split 100 GC random seeds into 10 chunks of 10 (test_gc_chunk_0..9)
- Separate gc_predefined (hand-crafted heaps + components) from random seeds
- 3 variants × 21 modules = 63 parallel wasmtime targets
- Trace markers (>>> <<<) for build diagnostics

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Separate gc_predefined (3 hand-crafted heaps) from gc_components
  (incremental/compacting/generational internal tests)
- Split persistence into persistence_small (up to 10k objects) and
  persistence_20k (the heavy 20k serialization test)
- Order TEST_MODULES heaviest-first so make -j starts long poles early
- Make incremental GC sub-modules public for per-component entry points

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Seed 20_000 caused a slice_index_fail in heap construction.
Use the same seed as the other stabilization tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Unlimited -j with 72 parallel wasmtime targets can exhaust memory
on CI runners. Cap at 8 concurrent processes as a safe default.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The CI failure was likely OOM from unbounded parallelism, not the seed.
With -j8 cap, seed 20_000 should work. Remove >>> <<< debug traces.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
heap_size_for_gc ignored total_heap_size_bytes, always returning
3*PARTITION_SIZE (192 MB). For seeds that generate large object graphs
(e.g. seed 20_000 with 20k objects), the dynamic heap exceeds this
fixed size, causing slice_index_fail in create_dynamic_heap.

Fix: use max(3*PARTITION_SIZE, 2*total_heap_size_bytes) so the heap
grows to fit the actual content.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Vary the RNG seed across commits so different random heaps are tested
over time. Seed is derived from git rev-parse HEAD at build time,
with fallback to "4711" when not in a git repo (nix sandbox).

Also enable WASMTIME_BACKTRACE_DETAILS=1 for better crash diagnostics.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
gc::*, stable_option::test(), and stabilization sub-tests are safe
functions — no unsafe block needed. Also run cargo fmt.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… only

The 20k test is too expensive to risk worst-case seeds. Keep it
deterministic with a known-good seed. Small tests vary per commit
to explore different heap shapes over time.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ggreif ggreif force-pushed the gabor/rts-parallel-tests branch from 63e23f9 to a6f03f7 Compare April 18, 2026 16:20
@alexandru-uta
Contributor

What are the added benefits? Is it really faster, how much?

@ggreif
Contributor Author

ggreif commented Apr 20, 2026

> What are the added benefits? Is it really faster, how much?

It is still dominated by the 20000-tree, but at least this one now runs in parallel with the others. I was fed up with the slowness on the Mac, so this might help. But I haven't done A/B testing yet.

The other thing is that this introduces a different random seed per commit for the 10000-tree tests. The fixed seed is kept for the big one, for fewer surprises in run time.

@ggreif ggreif marked this pull request as draft April 20, 2026 08:10
@ggreif
Contributor Author

ggreif commented Apr 20, 2026

Keeping this as draft, as I am brainstorming how the bottleneck can be improved.

EDIT: The bottleneck is resolved by disabling the checkPhase on macOS and shifting it to a nightly CI task. A proper night-shift, so to say :-)

ggreif and others added 2 commits April 26, 2026 14:31
…test)

The host-side `cargo test` suite for the RTS dominates wall-clock on
macos-latest CI: ~10+ min from-source build with no Hydra cache hit
(aarch64-darwin combo not in cachix). On Linux it's a fast Hydra cache
hit. So the cost is platform-asymmetric.

Solution: factor `nix/rts.nix` into two derivations sharing all build
attrs:
- `rts-build` — `doCheck = false`, produces the cross-compiled wasm
  artifacts only.
- `rts-checked` — `doCheck = true`, additionally runs `make -j8 test`.

Wire into `flake.nix` such that:
- `.#rts-checked` is always the test-running variant.
- `.#rts` aliases `rts-checked` on Linux (keeps PR CI coverage as is —
  Hydra-cached, fast) and `rts-build` on darwin (skips the slow path).

The `nightly-macos-test` workflow gains an explicit `rts-checked`
job that builds `.#rts-checked` directly, so the darwin-side cargo
test suite still runs daily on master, just not on every PR.

Net effect:
- Mac PR/artifacts CI: skips 10+ min of cargo test, fast cache hits.
- Linux PR CI: unchanged.
- Darwin RTS test coverage: moves from "every PR (slow)" to "nightly".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ggreif ggreif force-pushed the gabor/rts-parallel-tests branch from 4c85ca3 to 41f380f Compare April 26, 2026 12:39
@ggreif ggreif changed the title from "chore: parallelise RTS test variants" to "perf: speed up RTS testing on macOS — split #rts build/check + parallelise test variants" Apr 26, 2026
@ggreif ggreif marked this pull request as ready for review April 26, 2026 12:51
@ggreif ggreif requested a review from alexandru-uta April 27, 2026 14:27
Labels

testing Related to test suite

2 participants