Skip to content

cxo/node: guard fillHead nil-deref races (closeFiller + handleDelConn)#2423

Merged
0pcom merged 1 commit intoskycoin:developfrom
0pcom:fix/cxo-fillhead-nil-deref
May 4, 2026
Merged

cxo/node: guard fillHead nil-deref races (closeFiller + handleDelConn)#2423
0pcom merged 1 commit intoskycoin:developfrom
0pcom:fix/cxo-fillhead-nil-deref

Conversation

@0pcom
Copy link
Copy Markdown
Collaborator

@0pcom 0pcom commented May 4, 2026

Summary

Follow-up to #2422. After the Finc-clamp + short-write fixes deployed, prod TPD's RestartCount dropped from ~556/6h to 2/10min. The two remaining panics are nil-derefs in pkg/cxo/node/head.go:

1. createFiller line 338 — cr.c.String() on nil *Conn

panic: runtime error: invalid memory address or nil pointer dereference
github.com/skycoin/skywire/pkg/cxo/node.(*Conn).String(0x0)        conn.go:225
github.com/skycoin/skywire/pkg/cxo/node.(*fillHead).createFiller   head.go:338
github.com/skycoin/skywire/pkg/cxo/node.(*fillHead).handleFillingResult head.go:416

handleDelConn (head.go:517) sets f.p.c = nil when the source connection drops. f.p.r remains non-zero, so the f.p == (connRoot{}) zero-value gate at handleFillingResult does not catch the nil-c case, and createFiller(f.p) runs with a nil source — cr.c.String() panics on the first line.

Fix: nil-c short-circuit at the top of createFiller. The fill can't proceed without a source connection — log and return.

2. handleSuccess line 228 (and three parallel sites) — PushBack on nil *list.List

panic: runtime error: invalid memory address or nil pointer dereference
container/list.(*List).PushBack(...)
github.com/skycoin/skywire/pkg/cxo/node.(*fillHead).handleSuccess  head.go:228

closeFiller (head.go:384) sets f.rqo, f.fc, f.rq = nil, nil, nil. Any in-flight Filler-goroutine messages still draining through f.successq / f.requestq / f.failureq after the close land on a nil list and panic.

Fix: nil-guard each PushBack/PushFront site (4 sites: handleRequest, handleSuccess, handleRequestFailure, handleReceivedRoot's same-Seq branch). Same recovery as #2422's panic→drop pattern: the closed filler's work is no longer relevant.

What this isn't

The "real" fix is to drain or signal-shutdown the channels before nilling state — that's a larger structural change to the fillHead's close protocol. These guards stop the process kill while preserving observability so a follow-up PR can address the underlying race without prod restarting every few minutes.

Test plan

  • go build ./... clean.
  • go vet ./pkg/cxo/... clean.
  • go test ./pkg/cxo/... (including pkg/cxo/node's own 8s suite) all pass.
  • gofmt clean.
  • Post-deploy: docker logs transport-discovery --since 10m | grep -c panic: reaches 0 and RestartCount stops climbing.

Sequencing

Sits cleanly on top of #2422 (already merged into develop at 80fef7159).

skycoin#2422 silenced the two dominant TPD panics ('Finc to negative',
'short write: i/o deadline reached'). RestartCount went from
~556/6h to 2/10min. The remaining two panics are nil-derefs in
pkg/cxo/node/head.go from torn-down fillHead state:

1. createFiller line 338: cr.c.String() panics when cr.c is nil.
   handleDelConn (line 517) sets f.p.c = nil when the source
   connection is removed. f.p.r remains non-zero, so the
   f.p == (connRoot{}) gate at handleFillingResult does NOT catch
   the nil-c case, and createFiller(f.p) runs with a nil source.
   Fix: nil-c short-circuit at the top of createFiller. The fill
   can't proceed without a source connection — log and return.

2. handleSuccess line 228 (and the parallel sites in handleRequest,
   handleRequestFailure, handleReceivedRoot): f.fc.PushBack on a
   nil *list.List. closeFiller (line 384) sets
   `f.rqo, f.fc, f.rq = nil, nil, nil`, so any in-flight Filler
   goroutine messages that drain after close land on a nil list.
   Fix: nil-guard each PushBack/PushFront site (4 sites total).
   Same recovery as skycoin#2422's panic→drop pattern: the closed filler's
   work is no longer relevant.

These are the actual races; the right "real" fix would be to drain
or signal-shutdown the channels before nilling state, but that's a
larger structural change. Guards stop the process kill while
preserving observability.
@0pcom 0pcom merged commit 36cf860 into skycoin:develop May 4, 2026
3 of 4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant