transport-discovery: persist latency in a dedicated key, decoupled from registration TTL #2418
Merged
0pcom merged 2 commits into skycoin:develop on May 3, 2026
Conversation
transport-discovery: persist latency in a dedicated key, decoupled from registration TTL
Latency lived inside the tp:<id> registration blob and inherited its
5-minute entry-timeout. Bandwidth has always been stored separately
at bw:daily:<id>:<date> with a 35-day TTL, so any visor that paused
re-registering (TPD restart, network blip, normal churn) silently lost
its latency from /metrics until the next CXO push, while bandwidth-
today survived intact.
Move latency to a peer of bandwidth:
- New key: transport-discovery:lat:<id>, JSON {min, max, avg,
updated_at} in microseconds, 35-day TTL.
- UpdateLatency writes only to that key (a minimal sketch follows
  this list). The "must be registered" coupling and the
  TTL-inheriting Set on tp:<id> are gone; those were exactly the
  reasons latency disappeared with registration churn. avg <= 0
  still drops the update.
- GetTransportMetrics reads the new key in its existing pipeline.
- getAllTransportsWithQoS, GetTransportsByEdge, GetTransportByID
hydrate entry.Latency from the durable record so the aggregate
paths (GetNetworkMetrics, GetVisorAggregateMetrics) and the
/transports/id:, /transports/edge:, /transports/edges API
endpoints all see the persisted value.
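A minimal Go sketch of the write path described in the list above, assuming a go-redis-style client; latencyRecord, latencyKey, and latencyTTL are illustrative names, not the store's actual identifiers:

```go
package store

import (
	"context"
	"encoding/json"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// 35 days, matching bw:daily:* retention (assumed constant name).
const latencyTTL = 35 * 24 * time.Hour

// latencyRecord mirrors the JSON stored at transport-discovery:lat:<id>.
type latencyRecord struct {
	Min       int64 `json:"min"` // microseconds
	Max       int64 `json:"max"` // microseconds
	Avg       int64 `json:"avg"` // microseconds
	UpdatedAt int64 `json:"updated_at"`
}

func latencyKey(id string) string {
	return fmt.Sprintf("transport-discovery:lat:%s", id)
}

// UpdateLatency writes only the dedicated key: no read of tp:<id>,
// no "must be registered" check, no TTL inherited from registration.
func UpdateLatency(ctx context.Context, rdb *redis.Client, id string, rec latencyRecord) error {
	if rec.Avg <= 0 {
		return nil // drop the update, as before
	}
	rec.UpdatedAt = time.Now().Unix()
	b, err := json.Marshal(rec)
	if err != nil {
		return err
	}
	return rdb.Set(ctx, latencyKey(id), b, latencyTTL).Err()
}
```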
TransportData.Latency{Min,Max,Avg} stays in the schema so older
payloads decode cleanly, but no write touches them anymore. Reads
that go through the QoS hydration step end up with the durable
value overlaid on top of the (now always 0) blob field.
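Continuing the sketch above (same assumed package and helpers), the read-side overlay could look like the following; Entry is a stand-in for the repo's transport entry type, and hydrateDurableLatency is the batched helper named in the PR summary, pipelining one GET per entry and replacing the zeroed blob field only when a durable record exists:

```go
// Entry is a stand-in for the transport entry type; only the fields the
// overlay touches are shown.
type Entry struct {
	ID      string
	Latency latencyRecord // the (now always 0) blob field, overwritten below
}

// hydrateDurableLatency overlays persisted latency onto entries in one
// pipelined round trip, so GetTransportsByEdge, GetTransportByID, and the
// QoS scan all see the 35-day record instead of the registration-blob value.
func hydrateDurableLatency(ctx context.Context, rdb *redis.Client, entries []*Entry) error {
	pipe := rdb.Pipeline()
	cmds := make([]*redis.StringCmd, len(entries))
	for i, e := range entries {
		cmds[i] = pipe.Get(ctx, latencyKey(e.ID))
	}
	if _, err := pipe.Exec(ctx); err != nil && err != redis.Nil {
		return err
	}
	for i, cmd := range cmds {
		b, err := cmd.Bytes()
		if err != nil {
			continue // redis.Nil: no durable record yet, keep the blob value
		}
		var rec latencyRecord
		if json.Unmarshal(b, &rec) == nil {
			entries[i].Latency = rec
		}
	}
	return nil
}
```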
No semantic change to the value itself: still last-writer-wins,
still per-transport (round-trip is symmetric, no per-edge
tracking), still no daily aggregation. Only the storage location
and lifetime change.
Verified post-deploy by:
- redis-cli TTL transport-discovery:lat:<id> returns ~3024000s,
never the registration TTL.
- After a TPD bounce, /metrics carries latency for transports
whose visors haven't yet re-pushed a CXO sample.
This was referenced May 3, 2026
0pcom added a commit that referenced this pull request on May 4, 2026
…iminator (#2422)

Production TPD was restarting every 30-40s (RestartCount 556 over 6h on the prod host) because two distinct panics tear down the process:

1. pkg/cxo/skyobject/cache.go (*Cache).Finc:1189,1216
   panic: "Finc to negative for: <hash>"
   The filling-refcount went below zero, likely a duplicate Finc on a Filler.incs map or an Inc/Finc mismatch across overlapping fillers. Hard process kill via panic. Filler.apply / Filler.reject already consume Finc's error return and just log; surface the inconsistency through that path instead. Clamp fc to 0, log the condition with the key and the offending inc, and continue. Worst case is a leaked filling-item slot, which is orders of magnitude better than killing the service.

2. pkg/httputil/httputil.go WriteJSON:50
   panic: "short write: i/o deadline reached"
   isIOError checks errors.Is(err, io.ErrShortWrite), but net/http's timeoutWriter returns its own error value with the same message string when a write deadline expires mid-response. errors.Is misses it (different sentinel), and the fallback string match didn't include "short write", so getAllTransports' ~1MB JSON write to a slow client panics on every deadline hit. Added "short write", "i/o timeout", and "deadline exceeded" to the string-match fallback. New TestIsIOErrorShortWriteVariants pins all sentinel and string-match cases.

Neither bug is caused by #2415/#2418/#2421; they were just made visible because deploys cycling at 30-40s aren't subtle. Together these stop the panic loop without changing any data semantics.
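A sketch of the string-match fallback described in item 2; isIOError and TestIsIOErrorShortWriteVariants are the names from the commit message, but the surrounding checks and exact signature are assumptions about pkg/httputil, not its actual code:

```go
package httputil

import (
	"errors"
	"io"
	"net"
	"strings"
)

// isIOError reports whether err looks like a client/network I/O failure
// that should be logged and dropped instead of panicking the handler.
// Sentinel checks run first, then a string-match fallback catches error
// values (like net/http's write-deadline "short write") that share a
// message but not a sentinel.
func isIOError(err error) bool {
	if err == nil {
		return false
	}
	if errors.Is(err, io.ErrShortWrite) || errors.Is(err, io.ErrClosedPipe) {
		return true
	}
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return true
	}
	msg := err.Error()
	for _, s := range []string{
		"broken pipe",
		"connection reset by peer",
		"short write", // net/http timeoutWriter: "short write: i/o deadline reached"
		"i/o timeout",
		"deadline exceeded",
	} {
		if strings.Contains(msg, s) {
			return true
		}
	}
	return false
}
```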
Summary
Latency lived inside the tp:<id> registration blob and inherited its 5-minute entry-timeout. Bandwidth has always been stored at bw:daily:<id>:<date> with a 35-day TTL, so any visor that paused re-registering (TPD restart, network blip, normal churn) silently lost its latency from /metrics until the next CXO push, while bandwidth-today survived intact. After a TPD bounce in production this is observable: bandwidth recovers as visors re-register, latency does not. This change moves latency to a peer of bandwidth.
What's stored
- transport-discovery:lat:<id>, JSON {min, max, avg, updated_at} in microseconds, 35-day TTL (latencyTTL matches bw:daily:* retention).
- TransportData.Latency{Min,Max,Avg} stays in the schema so older payloads decode cleanly, but no write touches it anymore.

Write path
UpdateLatency writes only to the new key. The "must be registered" coupling and the TTL-inheriting Set on tp:<id> are gone; those were exactly the reasons latency disappeared with registration churn. avg <= 0 still drops the update. Last-writer-wins, no per-edge tracking, no daily aggregation: only the storage location and lifetime change.

Read paths
getLatencyRecord(ctx, id) plus a batched hydrateDurableLatency([]*Entry) overlay durable values onto the entry/entries:

- GetTransportMetrics (/metrics endpoint) reads the new key directly in its existing pipeline.
- getAllTransportsWithQoS → GetNetworkMetrics, GetVisorAggregateMetrics hydrates after scanAllTransports.
- GetTransportsByEdge (/transports/edge:, /transports/edges) hydrates the returned slice.
- GetTransportByID (/transports/id:) does single-entry hydration.

Test plan
- go build ./... clean.
- go vet ./pkg/transport-discovery/... clean.
- go test ./pkg/transport-discovery/... ./pkg/transport/ ./pkg/router/ ./pkg/visor/stats/ all pass.
- No Redis-backed unit test for the new key: the store tests run against newMemoryStore() (where UpdateLatency is a no-op) and there's no miniredis fixture in-repo.

Verified post-deploy:
- redis-cli TTL transport-discovery:lat:<id> returns ~3024000s (35d), never the 5-min registration TTL.
- /metrics carries latency for transports whose visors haven't yet re-pushed a CXO sample.
- /metrics/visor, /metrics/visors, /transports/edge:, /transports/id: reflect the persisted latency.

Sequencing
Independent of #2415 (zero-clobber guards), but the two interact: #2415 prevents partial-zero snapshots from being persisted in the first place; this PR ensures whatever does land is durable.