Fix GL renderer thread-safety hazards and a STREAM upload CCE by onefuncman · Pull Request #22 · dolda2000/hafen-client

onefuncman · 2026-04-14T16:55:22Z

Summary

Addresses a cluster of thread-safety hazards in the GL renderer (GLEnvironment, StreamBuffer) surfaced while
chasing intermittent crashes and seeing corrupted render output. Splits the work into small, reviewable commits and
ships them with a JUnit test suite covering the non-GL parts (pool, fill-proxy, render queue, seq-ring) plus an
architecture writeup under doc/.

What changed

Eliminate shared GLEnvironment.prep GLRender — each prepare(...) call now builds its own private GLRender
and enqueues it via a new prepq, drained by process() ahead of submitted. Removes a multi-writer hazard where
concurrent prepare calls could interleave on one shared render.
Decouple STREAM prepare from buf.ro publication — fixes a ClassCastException where fillbuf was deciding
StreamBuffer.Fill vs FillBuffers.Array based on buf.ro, which was published after prep enqueue. New
runStreamFill pre-allocates the Fill and runs the Filler against a proxy Environment so fillbuf becomes a
pure factory. Zero-copy preserved for standard fillers; non-standard fillers fall back to pull(push().flip()).
Filler interface and all in-repo implementors unchanged.
Extract StreamBuffer.Pool with an injectable IntFunction<SysBuffer> allocator so the pool lifecycle is
testable outside a live GL context. Same locking semantics and external API.
Extract StreamFiller helper for the proxy/copy path to lock the CCE regression behind a test.
Extract RenderQueue from GLEnvironment and lock the prep/submit invariants behind tests.
Route GLRender.update STREAM writes through runStreamFill so the refactored path is the only one.
Size the dispose ring to observed steady-state instead of an unbounded grow-only.
Rebind EBO at draw time + on same-VAO/EBO-only state transition — fixes a case where a reused VAO kept a stale
EBO binding.
Plug Sequence leaks in submit/prepare and add a seq-ring for diagnostics when a leak is suspected.

Tests

New src/test/java/haven/render/gl/ suite: StreamBufferPoolTest, StreamBufferFillTest, StreamFillerTest,
RenderQueueTest. Run with ant test — 26/26 pass. Test targets live alongside the existing bin target and don't
touch the main jar's classpath.

Docs

doc/gl-rendering.md — architecture and invariants of the GL layer as they stand after this PR.
doc/gl-test-coverage.md — what's covered, remaining gaps (cheap/medium/hard tiers).

Test plan

ant bin builds clean.
ant test passes.
Run the client through the scenarios that originally triggered the CCE and the prep-race crashes.

Removed the shared GLRender prep field that all prepare(...) callers wrote into. Each prepare call now builds its own private GLRender and enqueues it into a new prepq queue, drained by process() ahead of submitted (preserving the prep-before-submit ordering). - Hazard 1 (multi-writer prep): prepare(GLObject), prepare(BGL.Request), and prepare(Consumer<GLRender>) each construct a fresh per-call GLRender via enqprep(). No more shared writer. - Hazard 2 (buf.ro published before data store written): in prepare(Model.Indices) and prepare(VertexArray.Buffer) STREAM/STATIC cases, the upload work is now enqueued into prepq before buf.ro is assigned. The happens-before chain (writer enqueues prep -> publishes ro under buf monitor -> reader reads ro -> reader submits) combined with process()'s snapshot order (submitted then prepq) guarantees any reader-submitted render referencing the new ro lands behind our prep in execution order. - Hazard 3 (disposal interleaves with in-flight prep): dispose() sets invalid (now volatile) up front, then drains submitted and prepq, aborting and disposing each. enqprep checks invalid under prepmon, so any prep submitted after dispose is rejected and aborted instead of leaking.

Fixes a ClassCastException introduced by f8ef8d6 (GLEnvironment: eliminate this.prep shared render). That commit reordered the STREAM prepare paths to assign buf.ro after enqueueing the upload prep, but fillbuf(DataBuffer, int, int) was deciding whether to return a StreamBuffer.Fill or a FillBuffers.Array based on the current buf.ro, so the cast `(StreamBuffer.Fill)buf.init.fill(buf, this)` started hitting FillBuffers.Array. Rather than re-couple buf.ro ordering to the Fill-subtype decision, remove the coupling entirely: - New runStreamFill(StreamBuffer, T, Filler) helper pre-allocates the StreamBuffer.Fill and runs the Filler against a proxy Environment whose fillbuf returns that pre-allocated Fill for the target buffer. Standard Fillers call env.fillbuf(tgt) and write into the result, so they end up writing straight into the Fill (zero-copy preserved). A non-standard Filler that bypasses env.fillbuf falls back to a pull(push().flip()) copy. - fillbuf(DataBuffer, int, int) becomes a pure factory that always returns a FillBuffers.Array. The buf.ro inspection is gone, so the cast at the prepare callsite is structurally guaranteed to receive the Fill type runStreamFill constructed. - Hazard-2 ordering from f8ef8d6 is restored: buf.ro is published after the prep is enqueued in both STREAM branches. Filler interface and its 31-ish in-repo implementors are unchanged. Testability + regression coverage: - StreamBuffer's get/put/dispose pool moves into a package-private StreamBuffer.Pool with an injectable IntFunction<SysBuffer> allocator. Same locking semantics, same external API on StreamBuffer. - New StreamBufferPoolTest covers reuse-after-put, two-concurrent-gets allocate two, freed-slot reuse, rewind on reuse, dispose releases all, harmless put of unknown buffer, NPE on null put. - doc/gl-test-coverage.md catalogues remaining GLEnvironment-side coverage gaps (cheap, medium, hard tiers) and recommends the next round of investment.

Pin the lowest-index-free-slot reuse policy and stress get/put from 8 threads to guard the synchronized-monitor invariants (mutual exclusion, bounded growth). Demote the Fill-lifecycle test from cheap to medium in the coverage doc -- it needs a test-only ctor since StreamBuffer's real ctor builds a GLBuffer(env) and calls env.prepare.

Move the proxy-Environment factory and copy fallback out of GLEnvironment.runStreamFill into a static StreamFiller.runWithPreallocated so the whole-range / partial-range / other-target / bypass branches can be exercised against fakes. The whole-range case is the direct guard for the ClassCastException fixed in fa07814.

Decouple Fill.compatible from rbuf.env via a dedicated compatEnv field and add a package-private StreamBuffer(int, Pool, Environment) ctor that skips the GLBuffer/env wiring. StreamBufferFillTest then drives Fill through compatible / dispose-idempotent / get-then-dispose-no- auto-put / pull / get-rewinds without standing up a GL context.

The submitted/prepq/invalid trio gets pulled into RenderQueue<R>, with drain() preserving the "snapshot submitted before prep" ordering that guarantees every render in a drain has its prep in the same drain (or a prior one). RenderQueueTest stresses that invariant from a producer thread across 200 trials and pins enqueue rejection after invalidate(), which the GLEnvironment.submit / enqprep wrappers now lean on.

The two STREAM branches in update() were sibling sites to the bug fixed in fa07814 -- they cast (StreamBuffer.Fill) on the result of fill.fill(buf, env), which only worked while GLEnvironment.fillbuf inspected buf.ro to choose between Fill and FillBuffers.Array. After fa07814, fillbuf is a pure Array factory, so these casts blew up on the first STREAM index/vertex update. Promote runStreamFill from private to package-private and reuse it here so the same proxy-based fast path covers GLRender.update too.

Initial sequse size jumps from 16 to 32k to match observed in-flight Sequence count during normal rendering, eliminating the warm-up resize churn (16 -> 32k goes through 11 doublings, each one used to fire a warning past 16384). Bump the warning watermark to 4x of steady-state (0x20000) so a real leak still surfaces but transient spikes don't spam warnings.

Fixes recurring `EXCEPTION_ACCESS_VIOLATION` in `glDrawElements` seen across NVIDIA (nvoglv64.dll) and AMD (atio6axx.dll) drivers, with varying `count` arguments. In every crash, `indices=0` and the driver vector-loaded from NULL -- meaning no `GL_ELEMENT_ARRAY_BUFFER` was bound to the active VAO at draw time, so the driver fell back to treating `indices` as a client pointer. Two independent layers, both kept: 1. `VaoBindState.applyto`: previously only rebound the EBO when the VAO itself changed. A `(vao=V, ebo=A) -> (vao=V, ebo=B)` transition silently dropped the EBO bind, leaving the VAO's internal EBO slot pointing at whichever buffer was current when the VAO was first populated. If that buffer was later deleted, the next draw read NULL. Now also rebinds when only the EBO changed, mirroring the unconditional `apply()` semantics. 2. `GLDrawList.SlotRender.draw`: emit `glBindBuffer(EBO)` directly into the per-draw `BufferBGL` immediately before `glDrawElements`. This makes the draw correct regardless of any state-tracker or VAO-internal-slot confusion. Bumps initial `BufferBGL` capacity to 2 (it grows anyway, just a hint).

…agnostics Three small changes addressing the observed ~32k-span seqhead/seqtail gap. In this ring, `sequnreg` only advances `seqtail` when the freed slot is exactly at the tail, so a single never-disposed Sequence blocks all subsequent tail advancement even if thousands of younger sequences have been properly freed. 1. submit(): if the GLRender has no recorded commands (gl == null), the early return used to drop it silently. The caller already handed over ownership via submit, so the render's Sequence sat in the ring until finalization. Now dispose before returning. 2. prepare(GLObject|BGL.Request|Consumer<GLRender>): the GLRender was constructed before the user's work ran; any throw skipped enqprep and leaked the render. Each overload now uses try/finally with a success flag so a throw (including Error) disposes the partial render. 3. Diagnostics: added seqstats() returning {span, alive} = {seqhead - seqtail, count of sequse[]==true}. If span >> alive, the tail is stuck. process() logs the pair at a coarse cadence (every ~30k frames) when span > 1000.

Map of src/haven/render/gl/: GLEnvironment/GLRender/GLDrawList, the BGL buffered-GL indirection, frame lifecycle (prep -> submitted -> disposeall), RenderQueue's submitted-first drain invariant, GLObject refcounting with deferred dispose via the seqhead/seqtail ring, VAO/EBO binding (including the historical crash bug and both fix layers), STREAM buffer upload ordering, state diffing via Applier/curstate, the thread model with a submit call-site table, cache lifetimes, and a chronological list of bugs/fixes applied on this branch plus instrumentation points.

Add `compile-tests` and `test` ant targets so the GL refactor test sources under src/test/java can build and run. Main `javac` now excludes `test/**` since JUnit only belongs on the test classpath. Drops the bundled junit-platform-console-standalone jar in lib/.

dolda2000 · 2026-04-15T18:03:19Z

Wow, that's certainly the pull-request of the decade for the Haven client. ;)

It will probably take me a while to go through it and recognize the fundamentals of what you're doing, but I'll get on it. Thanks for the effort in the meantime!

…weird crashes in the mines. Credits go to onefuncman.

onefuncman · 2026-04-15T20:37:55Z

The fixes themselves are quite small number of lines impacted, a lot of the refactor is just for testability, Kami's commit linked here is a nicely boiled down version if you prefer that instead.

onefuncman added 12 commits April 14, 2026 11:48

DerKamii pushed a commit to DerKamii/KamiClient that referenced this pull request Apr 15, 2026

distilled down a bigger PR from dolda2000#22 which should solve some …

feb5af3

…weird crashes in the mines. Credits go to onefuncman.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix GL renderer thread-safety hazards and a STREAM upload CCE#22

Fix GL renderer thread-safety hazards and a STREAM upload CCE#22
onefuncman wants to merge 12 commits intodolda2000:masterfrom
onefuncman:glrender-crash-fix

onefuncman commented Apr 14, 2026

Uh oh!

dolda2000 commented Apr 15, 2026

Uh oh!

onefuncman commented Apr 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

onefuncman commented Apr 14, 2026

Summary

What changed

Tests

Docs

Test plan

Uh oh!

dolda2000 commented Apr 15, 2026

Uh oh!

onefuncman commented Apr 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants