
fix(poster): eliminate freeze/hang on large file batches #211

Merged
javi11 merged 1 commit into main from claude/optimistic-shockley-2b811e
Apr 25, 2026

@javi11 javi11 commented Apr 25, 2026

Summary

Fixes the long-standing freeze reported in #168 where uploads hang overnight under large file counts and only recover after a Docker restart. The reporter notes v0.0.29-rc5 was the last stable release; every release since has the bug.

A code-level audit of internal/poster/poster.go revealed several concurrency and resource-management defects that match the symptoms (silent freeze under volume + time, no panic surfaced). All are fixed in this PR with minimal-diff edits — no architectural rework.

Root causes & fixes

| # | Bug | Fix |
|---|-----|-----|
| 1 | `postQueue` closed unconditionally right after the initial enqueue loop. Later, `checkLoop`'s retry send `select { case postQueue <- failedPost: ; default: }` could not catch a closed-channel send and panicked. With no `recover()`, the goroutine died and `wg.Wait()` blocked forever. | Gate the close on `postsInFlight.Wait()` (the WaitGroup was already wired but never waited). Rebalance the retry path so the original post is Done'd when its retry is queued. |
| 2 | `post.file` was not closed on the `postLoop` pool-error or verify-exhausted-no-deferred paths. Every failed file leaked an fd; long-running daemons hit the ulimit and stalled on the next `os.Open`. | Close `post.file` on every terminal path. |
| 3 | The deferred-check `errChan` send was the only one without a `ctx.Done()` companion; with a buffer of 1 and multiple writers, the sending goroutine could deadlock. | ctx-guarded `select` and bump buffer to 4. |
| 4 | Per-post read-ahead goroutine observed only the parent ctx. On the deferred-check non-fatal path the parent stays alive, so stragglers could linger. | Per-post `context.WithCancel` cancelled after `pool.Wait()`. |
| 5 | `failedPosts` was an `int` incremented from `checkLoop` and read from the main goroutine without synchronization. | `atomic.Int64` (and `Post.failed` accordingly). |
| 6 | Misleading `default:` branch in the retry select claimed to handle a closed channel, but a send on a closed channel always panics and `default` cannot intercept it. | Removed; with the gating fix the branch is unreachable. |
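
To make bugs 1 and 6 concrete, here is a minimal standalone sketch (not code from `poster.go`) showing that a `default:` branch does not protect a send on a closed channel. The helper name `trySend` is hypothetical, and the `recover()` is there only so the demo can report the panic instead of crashing.

```go
package main

import "fmt"

// trySend mimics the shape of the old retry send: a non-blocking select
// with a default branch. It reports whether the send panicked.
// (Hypothetical helper for illustration; not part of the poster package.)
func trySend(ch chan int, v int) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	select {
	case ch <- v:
	default:
		// Taken only when the channel is open and the send would block.
		// It is never taken for a closed channel: that send panics.
	}
	return false
}

func main() {
	open := make(chan int) // unbuffered with no receiver: the send would block
	fmt.Println("open channel panicked:", trySend(open, 1))

	closed := make(chan int, 1)
	close(closed)
	fmt.Println("closed channel panicked:", trySend(closed, 1))
}
```

Running it prints `open channel panicked: false` followed by `closed channel panicked: true`, which is why the fix gates the close rather than trying to guard the send.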

Test plan

  • go build ./...
  • go vet ./internal/poster/...
  • go test -race -count=1 ./internal/poster/... ./internal/processor/...
  • Manual soak: large directory upload overnight; verify completion and flat fd count.

Notes for the reviewer

  • The nntppool v2 → v4 upgrade (feat: update nntppool dependency to v4.3.0 and refactor related code #130) and other post-rc5 changes are explicitly out of scope. If hangs persist after this lands, the pool layer is the next place to audit.
  • The retry path now performs postsInFlight.Done() for the original post on successful retry-send. Without this, postsInFlight.Wait() would never reach zero and postQueue would never close.
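
The accounting described in the last note can be sketched roughly as follows. This is a simplified single-consumer model, not the actual `postLoop`/`checkLoop` code: the `post` type, the `run` function, and the assumption that each retry registers its own `Add(1)` before the original is released are all illustrative.

```go
package main

import (
	"fmt"
	"sync"
)

// post is a hypothetical stand-in for the poster's unit of work.
type post struct {
	id       int
	attempts int
}

// run processes every id; each post fails its first failFirst attempts and
// is re-queued. It returns how many posts completed and the total attempts.
func run(ids []int, failFirst int) (completed, attempts int) {
	// Capacity len(ids): each in-flight post holds at most one slot, so the
	// retry send below never blocks in this sketch.
	queue := make(chan post, len(ids))
	var inFlight sync.WaitGroup

	for _, id := range ids {
		inFlight.Add(1)
		queue <- post{id: id}
	}

	// Gate the close: the queue is closed only after every post, retries
	// included, has reached a terminal state.
	go func() {
		inFlight.Wait()
		close(queue)
	}()

	for p := range queue {
		attempts++
		if p.attempts < failFirst {
			// Register the retry before releasing the original, so the
			// counter cannot reach zero while a retry is still pending.
			inFlight.Add(1)
			queue <- post{id: p.id, attempts: p.attempts + 1}
			inFlight.Done() // the original attempt is finished
			continue
		}
		completed++
		inFlight.Done()
	}
	return completed, attempts
}

func main() {
	completed, attempts := run([]int{1, 2, 3}, 2)
	fmt.Println("completed:", completed, "attempts:", attempts)
}
```

Because the counter only drops to zero after the final `Done()` of the last post, the closing goroutine cannot run while a retry send is still possible, and a missing `Done()` on the retry path would leave `Wait()` blocked forever, which is the imbalance the note warns about.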

Several long-standing concurrency and resource-management issues caused
uploads to silently freeze when handling many files (issue #168 — user
reports rc5 was the last stable release, all subsequent versions hang
overnight requiring a Docker restart):

- postQueue was closed unconditionally right after enqueuing initial
  files (poster.go:266). When checkLoop later tried to re-enqueue a
  failed verification, the `select { case postQueue <- ...; default: }`
  could not catch the closed-channel send and panicked. With no recover
  in the package, the goroutine died and wg.Wait() blocked forever.
  postsInFlight was already wired with Add/Done but never Wait()ed —
  fix gates the close on postsInFlight.Wait() and rebalances the retry
  path so the WaitGroup actually reaches zero.

- post.file was never closed on the postLoop pool-error path or on the
  verify-exhausted (deferred disabled) path. Each failed file leaked an
  fd; long-running daemons hit the process ulimit and stalled on the
  next os.Open.

- The deferred-check errChan send was the only such send without a
  ctx.Done() companion. Replaced with a guarded select; bumped errChan
  buffer to 4 so concurrent writers cannot block each other.

- Read-ahead goroutine only observed the parent ctx. On the deferred-
  check non-fatal path the parent stays alive, so per-post stragglers
  could linger. Now driven by a per-post ctx cancelled after pool.Wait.

- failedPosts was an int incremented from checkLoop and read from the
  main goroutine without synchronization. Switched to atomic.Int64 (and
  Post.failed accordingly).

- Removed the misleading `default:` branch from the retry select; with
  the gating fix it is unreachable and was always wrong.

Verified with `go test -race ./internal/poster/... ./internal/processor/...`.
@javi11 javi11 merged commit e16313c into main Apr 25, 2026
3 checks passed
@javi11 javi11 deleted the claude/optimistic-shockley-2b811e branch April 25, 2026 15:19