transport-discovery: cap UpdateLatency at MaxReasonableRTTMs #2425
Merged
0pcom merged 1 commit into skycoin:develop on May 4, 2026
Visors running pre-skycoin#2421 binaries push outlier max samples (e.g. 35s RTT from a straggler pong) via CXO. TPD's UpdateLatency only had a lower-bound guard (avg<=0 or any field <=0 → drop), so those samples reach the lat:<id> store and pin Max for the 35-day retention regardless of how many later good samples land.

Production right now: 5 transports show max>30s, all written ~10h ago, each aging out 35d after its write. They won't go away on their own.

Reject the same way the visor side does (transport.MaxReasonableRTTMs, 30s) so TPD's defense-in-depth matches the visor's. Old visors keep pushing bogus values; TPD now silently drops them, and the next good sample from any peer overwrites the stale stored max.

The package already imports pkg/transport (via redis_transport.go), so MaxReasonableRTTMs comes free with no new dep.
Summary
Defense-in-depth follow-up to #2421 (visor-side outlier guard) and #2418 (`lat:` persistence). Production observation: 5 transports show max > 30s, 124,156 ms worst, all written ~10h ago. They'll persist 35 days because the visor pushing them isn't running #2421's outlier filter (transport: drop outlier RTT samples above MaxReasonableRTTMs (30s)), and TPD has no upper-bound check to drop them at ingest.
Fix
Extend `UpdateLatency`'s existing `min/max/avg <= 0` guard with `> transport.MaxReasonableRTTMs` (the same 30s threshold the visor side uses post-#2421). The store package already imports `pkg/transport` via `redis_transport.go`, so the constant is free.
After this lands, old visors keep pushing bogus values, TPD silently drops them, and the next good sample from any peer overwrites the stale stored max, clearing the production outliers without waiting for visor rollout.
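The guard described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `latencySample`, `latencyStore`, `plausible`, and `updateLatency` are hypothetical names, the local `maxReasonableRTTMs` constant stands in for `transport.MaxReasonableRTTMs`, and the values are assumed to be milliseconds. The overwrite-on-good-sample behaviour follows the description in this PR rather than the real store implementation.

```go
package main

import "fmt"

// maxReasonableRTTMs is a local stand-in for transport.MaxReasonableRTTMs
// (30s per the PR). The real constant's type and definition are assumptions.
const maxReasonableRTTMs = 30_000

// latencySample is a hypothetical record shape; fields are assumed to be ms.
type latencySample struct {
	Min, Max, Avg int64
}

// plausible reports whether every field lies in (0, maxReasonableRTTMs].
// The "<= 0" half mirrors the existing lower-bound guard; the
// "> maxReasonableRTTMs" half is the upper bound this PR adds.
func plausible(s latencySample) bool {
	for _, v := range []int64{s.Min, s.Max, s.Avg} {
		if v <= 0 || v > maxReasonableRTTMs {
			return false
		}
	}
	return true
}

// latencyStore is a hypothetical in-memory stand-in for the lat:<id> store.
type latencyStore map[string]latencySample

// updateLatency silently drops implausible samples and overwrites the stored
// record otherwise. It returns true when the sample was stored.
func (st latencyStore) updateLatency(id string, s latencySample) bool {
	if !plausible(s) {
		return false
	}
	st[id] = s
	return true
}

func main() {
	// A stale outlier is already in the store, as in the production case.
	st := latencyStore{"tp-1": {Min: 40, Max: 124_156, Avg: 900}}

	// An old visor pushes another bogus max: dropped, store unchanged.
	fmt.Println(st.updateLatency("tp-1", latencySample{Min: 40, Max: 35_000, Avg: 800}))

	// A good sample arrives: stored, replacing the stale max.
	fmt.Println(st.updateLatency("tp-1", latencySample{Min: 42, Max: 310, Avg: 95}))
	fmt.Println(st["tp-1"].Max)
}
```

In this sketch the check is strict (`>`), so a sample exactly at the threshold is still accepted, matching the `> transport.MaxReasonableRTTMs` wording above.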
Test plan
- `go build ./...` clean.
- `go vet ./pkg/transport-discovery/...` clean.
- `go test ./pkg/transport-discovery/...` all pass.
- `gofmt` clean.
- `skywire cli tp metrics --by-transport --json | jq '[.[] | select(.latency.max > 30000000)] | length'` reaches 0 within ~5 min of any peer pushing a fresh sample for the affected transports.
Notes
This is purely the symmetric ingest-side cap. The aggregator's `dispatchLeaf` already gates on `> 0` for all three fields; it could grow the same upper-bound check for consistency, but that needs a new `pkg/transport` import in `cxoaggregator`, and the store-side guard alone is sufficient since `dispatchLeaf` is the only writer that calls `UpdateLatency`.