Skip to content

Fix GL renderer thread-safety hazards and a STREAM upload CCE#22

Open
onefuncman wants to merge 12 commits intodolda2000:masterfrom
onefuncman:glrender-crash-fix
Open

Fix GL renderer thread-safety hazards and a STREAM upload CCE#22
onefuncman wants to merge 12 commits intodolda2000:masterfrom
onefuncman:glrender-crash-fix

Conversation

@onefuncman
Copy link
Copy Markdown

Summary

Addresses a cluster of thread-safety hazards in the GL renderer (GLEnvironment, StreamBuffer) surfaced while
chasing intermittent crashes and seeing corrupted render output. Splits the work into small, reviewable commits and
ships them with a JUnit test suite covering the non-GL parts (pool, fill-proxy, render queue, seq-ring) plus an
architecture writeup under doc/.

What changed

  • Eliminate shared GLEnvironment.prep GLRender — each prepare(...) call now builds its own private GLRender
    and enqueues it via a new prepq, drained by process() ahead of submitted. Removes a multi-writer hazard where
    concurrent prepare calls could interleave on one shared render.
  • Decouple STREAM prepare from buf.ro publication — fixes a ClassCastException where fillbuf was deciding
    StreamBuffer.Fill vs FillBuffers.Array based on buf.ro, which was published after prep enqueue. New
    runStreamFill pre-allocates the Fill and runs the Filler against a proxy Environment so fillbuf becomes a
    pure factory. Zero-copy preserved for standard fillers; non-standard fillers fall back to pull(push().flip()).
    Filler interface and all in-repo implementors unchanged.
  • Extract StreamBuffer.Pool with an injectable IntFunction<SysBuffer> allocator so the pool lifecycle is
    testable outside a live GL context. Same locking semantics and external API.
  • Extract StreamFiller helper for the proxy/copy path to lock the CCE regression behind a test.
  • Extract RenderQueue from GLEnvironment and lock the prep/submit invariants behind tests.
  • Route GLRender.update STREAM writes through runStreamFill so the refactored path is the only one.
  • Size the dispose ring to observed steady-state instead of an unbounded grow-only.
  • Rebind EBO at draw time + on same-VAO/EBO-only state transition — fixes a case where a reused VAO kept a stale
    EBO binding.
  • Plug Sequence leaks in submit/prepare and add a seq-ring for diagnostics when a leak is suspected.

Tests

New src/test/java/haven/render/gl/ suite: StreamBufferPoolTest, StreamBufferFillTest, StreamFillerTest,
RenderQueueTest. Run with ant test — 26/26 pass. Test targets live alongside the existing bin target and don't
touch the main jar's classpath.

Docs

  • doc/gl-rendering.md — architecture and invariants of the GL layer as they stand after this PR.
  • doc/gl-test-coverage.md — what's covered, remaining gaps (cheap/medium/hard tiers).

Test plan

  • ant bin builds clean.
  • ant test passes.
  • Run the client through the scenarios that originally triggered the CCE and the prep-race crashes.

Removed the shared GLRender prep field that all prepare(...) callers
wrote into. Each prepare call now builds its own private GLRender and
enqueues it into a new prepq queue, drained by process() ahead of
submitted (preserving the prep-before-submit ordering).

- Hazard 1 (multi-writer prep): prepare(GLObject), prepare(BGL.Request),
  and prepare(Consumer<GLRender>) each construct a fresh per-call
  GLRender via enqprep(). No more shared writer.

- Hazard 2 (buf.ro published before data store written): in
  prepare(Model.Indices) and prepare(VertexArray.Buffer) STREAM/STATIC
  cases, the upload work is now enqueued into prepq before buf.ro is
  assigned. The happens-before chain (writer enqueues prep -> publishes
  ro under buf monitor -> reader reads ro -> reader submits) combined
  with process()'s snapshot order (submitted then prepq) guarantees any
  reader-submitted render referencing the new ro lands behind our prep
  in execution order.

- Hazard 3 (disposal interleaves with in-flight prep): dispose() sets
  invalid (now volatile) up front, then drains submitted and prepq,
  aborting and disposing each. enqprep checks invalid under prepmon, so
  any prep submitted after dispose is rejected and aborted instead of
  leaking.
Fixes a ClassCastException introduced by f8ef8d6 (GLEnvironment:
eliminate this.prep shared render). That commit reordered the STREAM
prepare paths to assign buf.ro after enqueueing the upload prep, but
fillbuf(DataBuffer, int, int) was deciding whether to return a
StreamBuffer.Fill or a FillBuffers.Array based on the current buf.ro,
so the cast `(StreamBuffer.Fill)buf.init.fill(buf, this)` started
hitting FillBuffers.Array.

Rather than re-couple buf.ro ordering to the Fill-subtype decision,
remove the coupling entirely:

- New runStreamFill(StreamBuffer, T, Filler) helper pre-allocates the
  StreamBuffer.Fill and runs the Filler against a proxy Environment
  whose fillbuf returns that pre-allocated Fill for the target buffer.
  Standard Fillers call env.fillbuf(tgt) and write into the result, so
  they end up writing straight into the Fill (zero-copy preserved). A
  non-standard Filler that bypasses env.fillbuf falls back to a
  pull(push().flip()) copy.

- fillbuf(DataBuffer, int, int) becomes a pure factory that always
  returns a FillBuffers.Array. The buf.ro inspection is gone, so the
  cast at the prepare callsite is structurally guaranteed to receive
  the Fill type runStreamFill constructed.

- Hazard-2 ordering from f8ef8d6 is restored: buf.ro is published
  after the prep is enqueued in both STREAM branches.

Filler interface and its 31-ish in-repo implementors are unchanged.

Testability + regression coverage:

- StreamBuffer's get/put/dispose pool moves into a package-private
  StreamBuffer.Pool with an injectable IntFunction<SysBuffer>
  allocator. Same locking semantics, same external API on StreamBuffer.

- New StreamBufferPoolTest covers reuse-after-put, two-concurrent-gets
  allocate two, freed-slot reuse, rewind on reuse, dispose releases
  all, harmless put of unknown buffer, NPE on null put.

- doc/gl-test-coverage.md catalogues remaining GLEnvironment-side
  coverage gaps (cheap, medium, hard tiers) and recommends the next
  round of investment.
Pin the lowest-index-free-slot reuse policy and stress get/put from 8
threads to guard the synchronized-monitor invariants (mutual exclusion,
bounded growth). Demote the Fill-lifecycle test from cheap to medium
in the coverage doc -- it needs a test-only ctor since StreamBuffer's
real ctor builds a GLBuffer(env) and calls env.prepare.
Move the proxy-Environment factory and copy fallback out of
GLEnvironment.runStreamFill into a static StreamFiller.runWithPreallocated
so the whole-range / partial-range / other-target / bypass branches can
be exercised against fakes. The whole-range case is the direct guard for
the ClassCastException fixed in fa07814.
Decouple Fill.compatible from rbuf.env via a dedicated compatEnv field
and add a package-private StreamBuffer(int, Pool, Environment) ctor
that skips the GLBuffer/env wiring. StreamBufferFillTest then drives
Fill through compatible / dispose-idempotent / get-then-dispose-no-
auto-put / pull / get-rewinds without standing up a GL context.
The submitted/prepq/invalid trio gets pulled into RenderQueue<R>, with
drain() preserving the "snapshot submitted before prep" ordering that
guarantees every render in a drain has its prep in the same drain (or
a prior one). RenderQueueTest stresses that invariant from a producer
thread across 200 trials and pins enqueue rejection after invalidate(),
which the GLEnvironment.submit / enqprep wrappers now lean on.
The two STREAM branches in update() were sibling sites to the bug
fixed in fa07814 -- they cast (StreamBuffer.Fill) on the result of
fill.fill(buf, env), which only worked while GLEnvironment.fillbuf
inspected buf.ro to choose between Fill and FillBuffers.Array. After
fa07814, fillbuf is a pure Array factory, so these casts blew up
on the first STREAM index/vertex update. Promote runStreamFill from
private to package-private and reuse it here so the same proxy-based
fast path covers GLRender.update too.
Initial sequse size jumps from 16 to 32k to match observed in-flight
Sequence count during normal rendering, eliminating the warm-up
resize churn (16 -> 32k goes through 11 doublings, each one used
to fire a warning past 16384). Bump the warning watermark to 4x
of steady-state (0x20000) so a real leak still surfaces but
transient spikes don't spam warnings.
Fixes recurring `EXCEPTION_ACCESS_VIOLATION` in `glDrawElements` seen
across NVIDIA (nvoglv64.dll) and AMD (atio6axx.dll) drivers, with
varying `count` arguments. In every crash, `indices=0` and the driver
vector-loaded from NULL -- meaning no `GL_ELEMENT_ARRAY_BUFFER` was
bound to the active VAO at draw time, so the driver fell back to
treating `indices` as a client pointer.

Two independent layers, both kept:

1. `VaoBindState.applyto`: previously only rebound the EBO when the
   VAO itself changed. A `(vao=V, ebo=A) -> (vao=V, ebo=B)` transition
   silently dropped the EBO bind, leaving the VAO's internal EBO slot
   pointing at whichever buffer was current when the VAO was first
   populated. If that buffer was later deleted, the next draw read
   NULL. Now also rebinds when only the EBO changed, mirroring the
   unconditional `apply()` semantics.

2. `GLDrawList.SlotRender.draw`: emit `glBindBuffer(EBO)` directly into
   the per-draw `BufferBGL` immediately before `glDrawElements`. This
   makes the draw correct regardless of any state-tracker or
   VAO-internal-slot confusion. Bumps initial `BufferBGL` capacity to
   2 (it grows anyway, just a hint).
…agnostics

Three small changes addressing the observed ~32k-span seqhead/seqtail
gap. In this ring, `sequnreg` only advances `seqtail` when the freed
slot is exactly at the tail, so a single never-disposed Sequence blocks
all subsequent tail advancement even if thousands of younger sequences
have been properly freed.

1. submit(): if the GLRender has no recorded commands (gl == null), the
   early return used to drop it silently. The caller already handed
   over ownership via submit, so the render's Sequence sat in the ring
   until finalization. Now dispose before returning.

2. prepare(GLObject|BGL.Request|Consumer<GLRender>): the GLRender was
   constructed before the user's work ran; any throw skipped enqprep
   and leaked the render. Each overload now uses try/finally with a
   success flag so a throw (including Error) disposes the partial
   render.

3. Diagnostics: added seqstats() returning {span, alive} = {seqhead -
   seqtail, count of sequse[]==true}. If span >> alive, the tail is
   stuck. process() logs the pair at a coarse cadence (every ~30k
   frames) when span > 1000.
Map of src/haven/render/gl/: GLEnvironment/GLRender/GLDrawList, the BGL
buffered-GL indirection, frame lifecycle (prep -> submitted ->
disposeall), RenderQueue's submitted-first drain invariant, GLObject
refcounting with deferred dispose via the seqhead/seqtail ring, VAO/EBO
binding (including the historical crash bug and both fix layers),
STREAM buffer upload ordering, state diffing via Applier/curstate, the
thread model with a submit call-site table, cache lifetimes, and a
chronological list of bugs/fixes applied on this branch plus
instrumentation points.
Add `compile-tests` and `test` ant targets so the GL refactor test
sources under src/test/java can build and run. Main `javac` now excludes
`test/**` since JUnit only belongs on the test classpath. Drops the
bundled junit-platform-console-standalone jar in lib/.
@dolda2000
Copy link
Copy Markdown
Owner

Wow, that's certainly the pull-request of the decade for the Haven client. ;)

It will probably take me a while to go through it and recognize the fundamentals of what you're doing, but I'll get on it. Thanks for the effort in the meantime!

DerKamii pushed a commit to DerKamii/KamiClient that referenced this pull request Apr 15, 2026
…weird crashes in the mines. Credits go to onefuncman.
@onefuncman
Copy link
Copy Markdown
Author

The fixes themselves are quite small number of lines impacted, a lot of the refactor is just for testability, Kami's commit linked here is a nicely boiled down version if you prefer that instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants