Skip to content

transport-discovery: cap UpdateLatency at MaxReasonableRTTMs#2425

Merged
0pcom merged 1 commit intoskycoin:developfrom
0pcom:fix/tpd-latency-upper-bound-ingest
May 4, 2026
Merged

transport-discovery: cap UpdateLatency at MaxReasonableRTTMs#2425
0pcom merged 1 commit intoskycoin:developfrom
0pcom:fix/tpd-latency-upper-bound-ingest

Conversation

@0pcom
Copy link
Copy Markdown
Collaborator

@0pcom 0pcom commented May 4, 2026

Summary

Defense-in-depth follow-up to #2421 (visor-side outlier guard) and #2418 (lat: persistence). Production observation:

Fix

Extend UpdateLatency's existing min/max/avg <= 0 guard with > transport.MaxReasonableRTTMs (the same 30s threshold the visor side uses post-#2421). The store package already imports pkg/transport via redis_transport.go, so the constant is free.

if minMS <= 0 || maxMS <= 0 || avgMS <= 0 ||
    minMS > transport.MaxReasonableRTTMs ||
    maxMS > transport.MaxReasonableRTTMs ||
    avgMS > transport.MaxReasonableRTTMs {
    return nil
}

After this lands, old visors keep pushing bogus values, TPD silently drops them, and the next good sample from any peer overwrites the stale stored max — clearing the production outliers without waiting for visor rollout.

Test plan

  • go build ./... clean.
  • go vet ./pkg/transport-discovery/... clean.
  • go test ./pkg/transport-discovery/... all pass.
  • gofmt clean.
  • Post-deploy: skywire cli tp metrics --by-transport --json | jq '[.[] | select(.latency.max > 30000000)] | length' reaches 0 within ~5 min of any peer pushing a fresh sample for the affected transports.

Notes

This is purely the symmetric ingest-side cap. The aggregator's dispatchLeaf already gates on > 0 for all three fields; could grow the same upper-bound check there for consistency, but that needs a new pkg/transport import in cxoaggregator and the store-side guard alone is sufficient since dispatchLeaf is the only writer that calls UpdateLatency.

Visors running pre-skycoin#2421 binaries push outlier max samples (e.g. 35s
RTT from a straggler pong) via CXO. TPD's UpdateLatency only had a
lower-bound guard (avg<=0 or any field <=0 → drop), so those reach
the lat:<id> store and pin Max for the 35-day retention regardless
of how many later good samples land.

Production right now: 5 transports show max>30s, all written ~10h
ago, ages-out date 35d from each write. Won't go away on its own.

Reject the same way the visor side does (transport.MaxReasonableRTTMs,
30s) so TPD's defense-in-depth matches the visor's. Old visors keep
pushing bogus values; TPD now silently drops them, and the next good
sample from any peer overwrites the stale stored max.

The package already imports pkg/transport (via redis_transport.go)
so MaxReasonableRTTMs comes free with no new dep.
@0pcom 0pcom merged commit 969ce9c into skycoin:develop May 4, 2026
3 of 4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant