transport-discovery: persist latency in a dedicated key, decoupled from registration TTL #2418
Merged
0pcom merged 2 commits into skycoin:develop on May 3, 2026
Conversation
transport-discovery: persist latency in a dedicated key, decoupled from registration TTL
Latency lived inside the tp:<id> registration blob and inherited its
5-minute entry-timeout. Bandwidth has always been stored separately
at bw:daily:<id>:<date> with a 35-day TTL, so any visor that paused
re-registering (TPD restart, network blip, normal churn) silently lost
its latency from /metrics until the next CXO push, while bandwidth-
today survived intact.
Move latency to a peer of bandwidth:
- New key: transport-discovery:lat:<id>, JSON {min, max, avg,
updated_at} in microseconds, 35-day TTL.
- UpdateLatency writes only to that key (a minimal sketch follows
  this list). The "must be registered" coupling and the
  TTL-inheriting Set on tp:<id> are gone; those were exactly the
  reasons latency disappeared with registration churn. avg <= 0
  still drops the update.
- GetTransportMetrics reads the new key in its existing pipeline.
- getAllTransportsWithQoS, GetTransportsByEdge, GetTransportByID
hydrate entry.Latency from the durable record so the aggregate
paths (GetNetworkMetrics, GetVisorAggregateMetrics) and the
/transports/id:, /transports/edge:, /transports/edges API
endpoints all see the persisted value.
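A minimal Go sketch of the write path described in the list above, assuming a go-redis-style client; latencyRecord, latencyKey, and latencyTTL are illustrative names, not the store's actual identifiers:

```go
package store

import (
	"context"
	"encoding/json"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// 35 days, matching bw:daily:* retention (assumed constant name).
const latencyTTL = 35 * 24 * time.Hour

// latencyRecord mirrors the JSON stored at transport-discovery:lat:<id>.
type latencyRecord struct {
	Min       int64 `json:"min"` // microseconds
	Max       int64 `json:"max"` // microseconds
	Avg       int64 `json:"avg"` // microseconds
	UpdatedAt int64 `json:"updated_at"`
}

func latencyKey(id string) string {
	return fmt.Sprintf("transport-discovery:lat:%s", id)
}

// UpdateLatency writes only the dedicated key: no read of tp:<id>,
// no "must be registered" check, no TTL inherited from registration.
func UpdateLatency(ctx context.Context, rdb *redis.Client, id string, rec latencyRecord) error {
	if rec.Avg <= 0 {
		return nil // drop the update, as before
	}
	rec.UpdatedAt = time.Now().Unix()
	b, err := json.Marshal(rec)
	if err != nil {
		return err
	}
	return rdb.Set(ctx, latencyKey(id), b, latencyTTL).Err()
}
```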
TransportData.Latency{Min,Max,Avg} stays in the schema so older
payloads decode cleanly, but no write touches them anymore. Reads
that go through the QoS hydration step end up with the durable
value overlaid on top of the (now always 0) blob field.
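Continuing the sketch above (same assumed package and helpers), the read-side overlay could look like the following; Entry is a stand-in for the repo's transport entry type, and hydrateDurableLatency is the batched helper named in the PR summary, pipelining one GET per entry and replacing the zeroed blob field only when a durable record exists:

```go
// Entry is a stand-in for the transport entry type; only the fields the
// overlay touches are shown.
type Entry struct {
	ID      string
	Latency latencyRecord // the (now always 0) blob field, overwritten below
}

// hydrateDurableLatency overlays persisted latency onto entries in one
// pipelined round trip, so GetTransportsByEdge, GetTransportByID, and the
// QoS scan all see the 35-day record instead of the registration-blob value.
func hydrateDurableLatency(ctx context.Context, rdb *redis.Client, entries []*Entry) error {
	pipe := rdb.Pipeline()
	cmds := make([]*redis.StringCmd, len(entries))
	for i, e := range entries {
		cmds[i] = pipe.Get(ctx, latencyKey(e.ID))
	}
	if _, err := pipe.Exec(ctx); err != nil && err != redis.Nil {
		return err
	}
	for i, cmd := range cmds {
		b, err := cmd.Bytes()
		if err != nil {
			continue // redis.Nil: no durable record yet, keep the blob value
		}
		var rec latencyRecord
		if json.Unmarshal(b, &rec) == nil {
			entries[i].Latency = rec
		}
	}
	return nil
}
```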
No semantic change to the value itself: still last-writer-wins,
still per-transport (round-trip is symmetric, no per-edge
tracking), still no daily aggregation. Only the storage location
and lifetime change.
Verified post-deploy by:
- redis-cli TTL transport-discovery:lat:<id> returns ~3024000s,
never the registration TTL.
- After a TPD bounce, /metrics carries latency for transports
whose visors haven't yet re-pushed a CXO sample.
This was referenced May 3, 2026
0pcom added a commit that referenced this pull request on May 4, 2026
…iminator (#2422)

Production TPD was restarting every 30-40s (RestartCount 556 over 6h on the prod host) because two distinct panics tear down the process:

1. pkg/cxo/skyobject/cache.go (*Cache).Finc:1189,1216
   panic: "Finc to negative for: <hash>"
   The filling-refcount went below zero, likely a duplicate Finc on a Filler.incs map or an Inc/Finc mismatch across overlapping fillers. Hard process kill via panic. Filler.apply / Filler.reject already consume Finc's error return and just log; surface the inconsistency through that path instead. Clamp fc to 0, log the condition with the key and the offending inc, and continue. Worst case is a leaked filling-item slot, which is orders of magnitude better than killing the service.

2. pkg/httputil/httputil.go WriteJSON:50
   panic: "short write: i/o deadline reached"
   isIOError checks errors.Is(err, io.ErrShortWrite), but net/http's timeoutWriter returns its own error value with the same message string when a write deadline expires mid-response. errors.Is misses it (different sentinel), and the fallback string match didn't include "short write", so getAllTransports' ~1MB JSON write to a slow client panics on every deadline hit. Added "short write", "i/o timeout", and "deadline exceeded" to the string-match fallback. New TestIsIOErrorShortWriteVariants pins all sentinel and string-match cases.

Neither bug is caused by #2415/#2418/#2421; they were just made visible because deploys cycling at 30-40s aren't subtle. Together these stop the panic loop without changing any data semantics.
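A sketch of the string-match fallback described in item 2; isIOError and TestIsIOErrorShortWriteVariants are the names from the commit message, but the surrounding checks and exact signature are assumptions about pkg/httputil, not its actual code:

```go
package httputil

import (
	"errors"
	"io"
	"net"
	"strings"
)

// isIOError reports whether err looks like a client/network I/O failure
// that should be logged and dropped instead of panicking the handler.
// Sentinel checks run first, then a string-match fallback catches error
// values (like net/http's write-deadline "short write") that share a
// message but not a sentinel.
func isIOError(err error) bool {
	if err == nil {
		return false
	}
	if errors.Is(err, io.ErrShortWrite) || errors.Is(err, io.ErrClosedPipe) {
		return true
	}
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return true
	}
	msg := err.Error()
	for _, s := range []string{
		"broken pipe",
		"connection reset by peer",
		"short write", // net/http timeoutWriter: "short write: i/o deadline reached"
		"i/o timeout",
		"deadline exceeded",
	} {
		if strings.Contains(msg, s) {
			return true
		}
	}
	return false
}
```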
Summary
Latency lived inside the tp:<id> registration blob and inherited its 5-minute entry-timeout. Bandwidth has always been stored at bw:daily:<id>:<date> with a 35-day TTL, so any visor that paused re-registering (TPD restart, network blip, normal churn) silently lost its latency from /metrics until the next CXO push, while bandwidth-today survived intact. After a TPD bounce in production this is observable: bandwidth recovers as visors re-register, latency does not. This change moves latency to a peer of bandwidth.
What's stored
- transport-discovery:lat:<id>, JSON {min, max, avg, updated_at} in microseconds, 35-day TTL (latencyTTL matches bw:daily:* retention).
- TransportData.Latency{Min,Max,Avg} stays in the schema so older payloads decode cleanly, but no write touches it anymore.

Write path
UpdateLatency writes only to the new key. The "must be registered" coupling and the TTL-inheriting Set on tp:<id> are gone; those were exactly the reasons latency disappeared with registration churn. avg <= 0 still drops the update. Last-writer-wins, no per-edge tracking, no daily aggregation: only the storage location and lifetime change.

Read paths
getLatencyRecord(ctx, id) plus a batched hydrateDurableLatency([]*Entry) overlay durable values onto the entry/entries:

- GetTransportMetrics (/metrics endpoint) reads the new key directly in its existing pipeline.
- getAllTransportsWithQoS → GetNetworkMetrics, GetVisorAggregateMetrics hydrates after scanAllTransports.
- GetTransportsByEdge (/transports/edge:, /transports/edges) hydrates the returned slice.
- GetTransportByID (/transports/id:) does single-entry hydration.

Test plan
- go build ./... clean.
- go vet ./pkg/transport-discovery/... clean.
- go test ./pkg/transport-discovery/... ./pkg/transport/ ./pkg/router/ ./pkg/visor/stats/ all pass.
- No Redis-backed unit test for the new key: the store tests run against newMemoryStore() (where UpdateLatency is a no-op) and there's no miniredis fixture in-repo.

Verified post-deploy:
- redis-cli TTL transport-discovery:lat:<id> returns ~3024000s (35d), never the 5-min registration TTL.
- /metrics carries latency for transports whose visors haven't yet re-pushed a CXO sample.
- /metrics/visor, /metrics/visors, /transports/edge:, /transports/id: reflect the persisted latency.

Sequencing
Independent of #2415 (zero-clobber guards), but the two interact: #2415 prevents partial-zero snapshots from being persisted in the first place; this PR ensures whatever does land is durable.