fix(poster): eliminate freeze/hang on large file batches#211
Merged
Conversation
Several long-standing concurrency and resource-management issues caused uploads to silently freeze when handling many files (issue #168 — user reports rc5 was the last stable release, all subsequent versions hang overnight requiring a Docker restart): - postQueue was closed unconditionally right after enqueuing initial files (poster.go:266). When checkLoop later tried to re-enqueue a failed verification, the `select { case postQueue <- ...; default: }` could not catch the closed-channel send and panicked. With no recover in the package, the goroutine died and wg.Wait() blocked forever. postsInFlight was already wired with Add/Done but never Wait()ed — fix gates the close on postsInFlight.Wait() and rebalances the retry path so the WaitGroup actually reaches zero. - post.file was never closed on the postLoop pool-error path or on the verify-exhausted (deferred disabled) path. Each failed file leaked an fd; long-running daemons hit the process ulimit and stalled on the next os.Open. - The deferred-check errChan send was the only such send without a ctx.Done() companion. Replaced with a guarded select; bumped errChan buffer to 4 so concurrent writers cannot block each other. - Read-ahead goroutine only observed the parent ctx. On the deferred- check non-fatal path the parent stays alive, so per-post stragglers could linger. Now driven by a per-post ctx cancelled after pool.Wait. - failedPosts was an int incremented from checkLoop and read from the main goroutine without synchronization. Switched to atomic.Int64 (and Post.failed accordingly). - Removed the misleading `default:` branch from the retry select; with the gating fix it is unreachable and was always wrong. Verified with `go test -race ./internal/poster/... ./internal/processor/...`.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes the long-standing freeze reported in #168 where uploads hang overnight under large file counts and only recover after a Docker restart. The reporter notes
v0.0.29-rc5was the last stable release; every release since has the bug.A code-level audit of
internal/poster/poster.gorevealed several concurrency and resource-management defects that match the symptoms (silent freeze under volume + time, no panic surfaced). All are fixed in this PR with minimal-diff edits — no architectural rework.Root causes & fixes
postQueueclosed unconditionally right after the initial enqueue loop. Later,checkLoop's retry sendselect { case postQueue <- failedPost: ; default: }could not catch a closed-channel send and panicked. With norecover(), the goroutine died andwg.Wait()blocked forever.postsInFlight.Wait()(the WaitGroup was already wired but never waited). Rebalance the retry path so the original post isDone'd when its retry is queued.post.filewas not closed onpostLooppool-error or verify-exhausted-no-deferred paths. Every failed file leaked an fd; long-running daemons hit the ulimit and stalled on the nextos.Open.post.fileon every terminal path.errChansend was the only one without actx.Done()companion — buffer 1 with multiple writers could deadlock the goroutine.selectand bump buffer to 4.context.WithCancelcancelled afterpool.Wait().failedPostswas anintincremented fromcheckLoopand read from the main goroutine without synchronization.atomic.Int64(andPost.failedaccordingly).default:branch in the retry select claimed to handle a closed channel — sends on closed channels always panic,defaultcannot intercept.Test plan
go build ./...go vet ./internal/poster/...go test -race -count=1 ./internal/poster/... ./internal/processor/...Notes for the reviewer
nntppool v2 → v4upgrade (feat: update nntppool dependency to v4.3.0 and refactor related code #130) and other post-rc5 changes are explicitly out of scope. If hangs persist after this lands, the pool layer is the next place to audit.postsInFlight.Done()for the original post on successful retry-send. Without this,postsInFlight.Wait()would never reach zero andpostQueuewould never close.