transport: drop outlier RTT samples above MaxReasonableRTTMs (30s) #2421
Merged
0pcom merged 1 commit into skycoin:develop on May 3, 2026
Conversation
Production observation: three sudph transports persisted with
min/avg ~190ms and max ~33-35 SECONDS in /metrics. The min/avg are
correct; the max is a single straggler pong arriving long after its
ping (delayed packet, host suspended, NIC queue stall). Once it
lands, Max only ever grows, so a single bad sample pins the
transport's Max for the life of the lat:<id> key (35 days).
handleTransportPong correlates pongs to pings only by timestamp —
there's no in-flight tracking that would discard a pong arriving N
seconds after timeout. Real intercontinental RTT plus heavy
queueing is comfortably under 5s, so 30s is a generous upper bound
for "obviously bogus".
Mirror the existing zero-clobber lower-bound guard with an upper bound
at every layer (a minimal sketch of the shared check follows the list):
- transport.MaxReasonableRTTMs (exported, 30_000.0).
- SetLatency drops latencyMs > MaxReasonableRTTMs (alongside <=0).
- SetLatencyStats rejects the whole snapshot if any of min/max/avg
exceeds the bound.
- handleTransportPong short-circuits before SetLatency, with a
Trace log of the dropped value.
- RouteGroup.MeasureLatency drops samples >MaxReasonableRTTMs from
`measurements` so a stale pong from the periodic pingLoop
leaking into the buffered pongCh can't poison the active
measurement's min/max/avg.
- RouteGroup.handlePongPacket rejects up front so neither
networkStats, the synchronous MeasureLatency consumer, nor the
underlying transport see the bogus value.
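A minimal sketch of that shared check, assuming the comparison is identical at every layer; the helper functions below are illustrative, and the real change applies the guard inline in the functions named above. Only the constant's name and value come from this PR.

```go
// Sketch of the guard pattern, not the exact diff.
package transport

// MaxReasonableRTTMs is the exported upper bound: RTT samples above it
// are treated as obviously bogus and dropped.
const MaxReasonableRTTMs = 30_000.0 // milliseconds

// validRTT mirrors the existing <= 0 (zero-clobber) guard with the new
// upper bound; SetLatency- and handleTransportPong-style call sites
// would short-circuit when this is false.
func validRTT(latencyMs float64) bool {
	return latencyMs > 0 && latencyMs <= MaxReasonableRTTMs
}

// statsInBounds shows the whole-snapshot rejection used by
// SetLatencyStats: if any of min/max/avg is out of range, the entire
// snapshot is dropped rather than partially applied.
func statsInBounds(min, max, avg float64) bool {
	return validRTT(min) && validRTT(max) && validRTT(avg)
}
```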
TestSetLatencyDropsOutliers and TestSetLatencyStatsDropsOutliers
pin both the rejection and the boundary-exact acceptance.
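For reference, a boundary-focused table test in the style those tests describe; it exercises the illustrative validRTT helper above rather than the real SetLatency / SetLatencyStats entry points.

```go
package transport

import "testing"

// Sketch only: the boundary cases TestSetLatencyDropsOutliers and
// TestSetLatencyStatsDropsOutliers pin, expressed against the
// illustrative validRTT helper.
func TestRTTBoundsSketch(t *testing.T) {
	cases := []struct {
		name string
		ms   float64
		ok   bool
	}{
		{"production outlier", 35_000, false},
		{"exactly at the bound", MaxReasonableRTTMs, true},
		{"just over the bound", MaxReasonableRTTMs + 1, false},
		{"zero clobber", 0, false},
		{"negative", -1, false},
		{"plausible sample", 190, true},
	}
	for _, c := range cases {
		if got := validRTT(c.ms); got != c.ok {
			t.Errorf("%s: validRTT(%v) = %v, want %v", c.name, c.ms, got, c.ok)
		}
	}
}
```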
0pcom added a commit that referenced this pull request on May 4, 2026
…iminator (#2422)

Production TPD was restarting every 30-40s (RestartCount 556 over 6h on the prod host) because two distinct panics tear down the process:

1. pkg/cxo/skyobject/cache.go (*Cache).Finc:1189,1216 panic: "Finc to negative for: <hash>". The filling-refcount went below zero — likely a duplicate Finc on a Filler.incs map, or an Inc/Finc mismatch across overlapping fillers. Hard process kill via panic. Filler.apply / Filler.reject already consume Finc's error return and just log; surface the inconsistency through that path instead. Clamp fc to 0, log the condition with the key and the offending inc, and continue. Worst case is a leaked filling-item slot — orders of magnitude better than killing the service.

2. pkg/httputil/httputil.go WriteJSON:50 panic: "short write: i/o deadline reached". isIOError checks errors.Is(err, io.ErrShortWrite), but net/http's timeoutWriter returns its own error value with the same message string when a write deadline expires mid-response. errors.Is misses it (different sentinel), and the fallback string match didn't include "short write", so getAllTransports' ~1MB JSON write to a slow client panics on every deadline hit. Added "short write", "i/o timeout", and "deadline exceeded" to the string-match fallback. New TestIsIOErrorShortWriteVariants pins all sentinel and string-match cases.

Neither bug is caused by #2415/#2418/#2421; they were just made visible because deploys cycling at 30-40s aren't subtle. Together these stop the panic loop without changing any data semantics.
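A sketch of the isIOError shape the second fix describes, not the actual source; the pre-existing fallback entries are assumptions, and only the three added strings ("short write", "i/o timeout", "deadline exceeded") come from the commit.

```go
package httputil

import (
	"errors"
	"io"
	"strings"
)

// isIOError sketch: sentinel check first, then the string-match fallback
// that now also covers net/http's timeoutWriter error, which carries the
// io.ErrShortWrite message but a different sentinel value.
func isIOError(err error) bool {
	if err == nil {
		return false
	}
	if errors.Is(err, io.ErrShortWrite) {
		return true
	}
	msg := err.Error()
	for _, s := range []string{
		"broken pipe",       // assumed pre-existing entry
		"connection reset",  // assumed pre-existing entry
		"short write",       // added: net/http timeoutWriter after a write deadline
		"i/o timeout",       // added
		"deadline exceeded", // added
	} {
		if strings.Contains(msg, s) {
			return true
		}
	}
	return false
}
```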
0pcom added a commit that referenced this pull request on May 4, 2026
Visors running pre-#2421 binaries push outlier max samples (e.g. a 35s RTT from a straggler pong) via CXO. TPD's UpdateLatency only had a lower-bound guard (avg <= 0 or any field <= 0 → drop), so those reach the lat:<id> store and pin Max for the 35-day retention regardless of how many later good samples land.

Production right now: 5 transports show max > 30s, all written ~10h ago, with an age-out date 35 days from each write. This won't go away on its own.

Reject the same way the visor side does (transport.MaxReasonableRTTMs, 30s) so TPD's defense-in-depth matches the visor's. Old visors keep pushing bogus values; TPD now silently drops them, and the next good sample from any peer overwrites the stale stored max. The package already imports pkg/transport (via redis_transport.go), so MaxReasonableRTTMs comes free with no new dependency.
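A sketch of that TPD-side rejection under the shared bound; the package name, type, and helper are placeholders, and only the reuse of transport.MaxReasonableRTTMs and the drop-the-whole-sample behaviour come from the commit.

```go
// Placeholder package and types; the import path is assumed from the
// repo layout mentioned above (pkg/transport).
package store

import "github.com/skycoin/skywire/pkg/transport"

type latencySample struct {
	Min, Max, Avg float64 // milliseconds
}

// inBounds mirrors the visor-side guard: the existing any-field <= 0
// check plus the new upper bound, so an UpdateLatency-style caller
// drops the whole sample instead of persisting a pinned Max.
func (s latencySample) inBounds() bool {
	for _, v := range []float64{s.Min, s.Max, s.Avg} {
		if v <= 0 || v > transport.MaxReasonableRTTMs {
			return false
		}
	}
	return true
}
```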
Summary
Production observation post-#2415/#2418: three sudph transports show min/avg ~190ms but max ~33-35 seconds in TPD's /metrics. The min/avg are correct; the max is a single straggler pong arriving long after its ping — handleTransportPong correlates pong to ping only by timestamp, with no in-flight tracking that would discard a pong arriving after the implied timeout. Once it lands, Max only ever grows, so a single bad sample pins the transport's Max for the life of the lat:<id> key (35 days). Real intercontinental RTT plus heavy queueing is comfortably under 5s, so 30s is a generous upper bound for "obviously bogus".
Fix
Mirror the existing zero-clobber lower-bound guard with an upper-bound at every layer:
- transport.MaxReasonableRTTMs — exported constant, 30_000.0 ms.
- SetLatency drops latencyMs > MaxReasonableRTTMs (alongside the existing <= 0).
- SetLatencyStats rejects the whole snapshot if any of min/max/avg exceeds the bound.
- handleTransportPong short-circuits before SetLatency and Trace-logs the dropped value.
- RouteGroup.MeasureLatency drops outlier samples from measurements so a stale pong from the periodic pingLoop landing in the buffered pongCh can't poison an active measurement's min/max/avg.
- RouteGroup.handlePongPacket rejects up front so neither networkStats, the synchronous pendingPongCh consumer, nor the underlying transport see the bogus value.
Test plan
- go build ./... clean. go vet ./pkg/transport/... ./pkg/router/... clean. go test ./pkg/transport/ ./pkg/router/ ./pkg/transport-discovery/... all pass. gofmt clean.
- TestSetLatencyDropsOutliers — rejects the 35s production case, accepts exactly at the bound, valid samples after rejection still update the running stats.
- TestSetLatencyStatsDropsOutliers — table test rejects partial-outlier shapes (over-bound min/max/avg in any combination), accepts a fully-in-bound snapshot.
- skywire cli tp metrics --by-transport: no transport's latency.max exceeds 30,000 ms; production cases like 51eced2f, f4c0e8c7, f95d6019 (currently min/avg ~190ms, max ~32-35s) settle to a plausible max within ~5 minutes of a CXO push.
Followup to
#2415 (zero-clobber) and #2418 (persistent latency). The 35s outliers were the first thing visible once latency started persisting, since transient bad samples used to disappear with the registration TTL — they now stay around for 35 days.