cxo: fix Cache.cleanDown eviction bug and skip re-encoding unchanged sub-trees (#2420)
Merged
0pcom merged 2 commits into skycoin:develop (May 3, 2026)
Conversation
…lling item

The filter loop's "skip wanted" / "skip filling" comments said skip, but the code `return`ed, aborting the entire ranking on the first non-evictable entry encountered during map iteration. On a busy visor there is virtually always something filling or being awaited, so the cache never evicted anything: `putItem` invoked `cleanDown` once the budget was exceeded, `cleanDown` built a `len(c.is)`-sized rank slice and immediately discarded it, and `putItem` then inserted the new item on top of an already-over-budget cache. The discarded slice is what the alloc profile pinned at 95% of total bytes allocated under `cleanDown`, and the un-evicted backlog is what pinned the live heap under `putItem`.

Fix: `continue`, not `return`. While here, switch the rank slice from `[]*rankItem` to `[]rankItem` so we don't allocate a tiny heap object per cache entry.
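The one-word fix is easy to see in reduced form. Below is a minimal Go sketch of the corrected ranking loop; the types and names (`item`, `rankEvictable`) are illustrative stand-ins, not the real `skyobject` cache types:

```go
package main

import (
	"fmt"
	"sort"
)

// item is a simplified stand-in for a cache entry, carrying only the
// fields the eviction filter cares about.
type item struct {
	hash    string
	wanted  bool // some goroutine is awaiting this object
	filling bool // object belongs to a Root still being filled
	points  int  // access rank; lower evicts first
}

// rankEvictable builds the eviction order. The original bug used
// `return` where `continue` appears below, so the first wanted or
// filling entry hit during map iteration aborted the whole ranking
// and nothing was ever evicted.
func rankEvictable(items map[string]item) []item {
	rank := make([]item, 0, len(items)) // value slice: no per-entry heap object
	for _, it := range items {
		if it.wanted {
			continue // skip wanted (the bug: this was `return`)
		}
		if it.filling {
			continue // skip filling (the bug: this was `return`)
		}
		rank = append(rank, it)
	}
	sort.Slice(rank, func(i, j int) bool { return rank[i].points < rank[j].points })
	return rank
}

func main() {
	items := map[string]item{
		"a": {hash: "a", filling: true},
		"b": {hash: "b", points: 2},
		"c": {hash: "c", wanted: true},
		"d": {hash: "d", points: 1},
	}
	// prints "d" then "b": lowest points first, wanted/filling skipped
	for _, it := range rankEvictable(items) {
		fmt.Println(it.hash)
	}
}
```

The value slice (`[]item` rather than `[]*item`) mirrors the `[]rankItem` change in the commit: entries are copied into the slice's backing array instead of allocating one small heap object per cache entry.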
`Publisher.publishRoot` re-walked the entire in-memory tree on every dirty batch and called `encoder.Serialize` + `entry.Sub.SetValue` at every level, even when only a single leaf had changed. With the visor stats publisher running at 1-min cadence this dominated CPU (via GC pressure from the Serialize churn) and live heap (the encoded `TreeNode` bytes piling up in `Cache.putItem`).

Add a per-`memNode` encode-cache (`pubHash` + `cached`) that is populated after each successful `Save` and invalidated by `putAt` / `deleteAt` / `pruneAt` along the touched path. `encodeNode` now places the cached hash directly into the parent's `TreeEntry` for any unchanged sub-node and skips the recursive walk plus the `encoder.Serialize` / `Cache.Set` chain entirely. `Container.Save`'s reference walk increments the rc on the cached hash without descending into it (`skyobject/unpack.go:156-167`), so the underlying CXO objects stay alive across publishes as long as each successive Root references the same hash — exactly the content-addressed behavior the package doc already promised. Cache promotion is deferred until after `Save` returns nil so an aborted publish doesn't leave the cache pointing at hashes that `up.Close` has rolled back.

Adds `TestEncodeCacheInvalidatesOnlyAffectedPath` to pin the sibling-preservation contract.
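The encode-cache idea can be sketched in a few lines. The types below are hypothetical simplifications — the real `memNode` carries CXO-specific fields, real encoding goes through `encoder.Serialize` and the `Cache.Set` chain rather than a bare SHA-256, and the real code defers cache promotion until `Save` returns nil:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// memNode is a simplified stand-in for the treestore node, showing
// only the encode-cache shape (pubHash + cached).
type memNode struct {
	value    []byte
	children map[string]*memNode

	pubHash [32]byte // hash placed into the parent's entry last publish
	cached  bool     // pubHash is valid; subtree unchanged since last Save
}

// encodeNode returns the node's content hash. For an unchanged
// sub-node it reuses pubHash and skips the recursive walk entirely.
// encoded counts how many nodes were actually re-encoded.
func encodeNode(n *memNode, encoded *int) [32]byte {
	if n.cached {
		return n.pubHash // fast path: reuse, no descent
	}
	*encoded++
	h := sha256.New()
	h.Write(n.value)
	for _, c := range n.children {
		ch := encodeNode(c, encoded)
		h.Write(ch[:])
	}
	copy(n.pubHash[:], h.Sum(nil))
	n.cached = true // the real code promotes only after Save returns nil
	return n.pubHash
}

// invalidatePath clears the cache along the touched path only, as
// putAt/deleteAt/pruneAt do; siblings keep their cached hashes.
func invalidatePath(root *memNode, path ...string) {
	n := root
	n.cached = false
	for _, k := range path {
		if n = n.children[k]; n == nil {
			return
		}
		n.cached = false
	}
}

func main() {
	root := &memNode{children: map[string]*memNode{
		"a": {value: []byte("a")},
		"b": {value: []byte("b")},
	}}
	var n int
	encodeNode(root, &n)
	fmt.Println("first publish encoded:", n) // 3: root, a, b

	invalidatePath(root, "a")
	root.children["a"].value = []byte("a2")
	n = 0
	encodeNode(root, &n)
	fmt.Println("second publish encoded:", n) // 2: root and a; b reused
}
```

The second publish re-encodes only the O(depth) chain from the touched leaf to the root, which is the guarantee the package doc promised and the new test pins.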
This was referenced May 4, 2026.

0pcom added a commit that referenced this pull request on May 4, 2026:
…y from stats bbolt (#2427)

The per-transport CSV log store wrote a row to a daily file on every packet via gocsv reflection-based marshalling. On a public visor that turned into the dominant alloc-rate symbol — pprof showed `transport.fileTransportLogStore.writeToCSV` at ~78% cumulative bytes allocated, dwarfing every other hot path once the CXO publisher's own allocation churn was fixed (#2420). The CSV files were also narrower than the bbolt-backed stats store that already exists (no latency, less precise timestamps), and the stats store was already plumbed into the CXO publisher feeding TPD's aggregator.

Drop the file-backed log stores entirely:

- `pkg/transport/log.go` — remove `fileTransportLogStore` and the parallel `fileLatencyLogStore` (the latter was wired into `ManagerConfig` but never read by anything — pure dead weight). The `CsvEntry` / `LatencyCsvEntry` gocsv structs go with them. The in-memory `LogStore` stays — `ManagedTransport` uses it to preserve cumulative byte counters across same-session reconnects of the same transport ID.
- `pkg/transport/manager.go` — drop the unused `LatencyLogStore` field on `ManagerConfig`.
- `pkg/visor/init_transport.go` — collapse the file/memory branch to always-memory; drop the latency-store init. `LogStore.Type=file` in existing visor configs is accepted but logged as deprecated.
- `pkg/visor/api_transport.go` — re-implement `GetTransportLogs` to walk `stats.Store.AllTransportRecords()` and emit one entry per (transport, day) within the requested window. The CSV path's `RecvBytes`/`SentBytes` was the within-day delta (CSVs were zero-anchored at each day rollover), which matches `DailyRollup` semantics exactly — `skywire-cli tp -b N` aggregation is unchanged. Drop the local `csvEntry` parser and `readTransportLogFile`.
- `pkg/visor/logserver/api.go` — drop the `/transport_logs/:file` HTTP route and its landing-page listing. The same data is served by `/stats/transports/history` (live snapshot at `/stats/transports`). The unused `tpLogPath` parameter on `logserver.New` stays in the signature for call-site compat (`init_dmsg.go` passes it).
- `pkg/skyenv/skyenv.go` — drop the unused `LatencyLogStore` directory constant; leave `TpLogStore` in place since `gen.go` still emits it as the `LogStore.Location` default for backward-compatible configs (the runtime ignores the path — nothing is written there).

Cumulative byte counters on per-transport status (`skywire-cli tp ls`, hypervisor UI) still reflect "bytes since transport became live in this visor session" — the same in-memory semantics that file-mode users also got within a single session. Across visor restarts everyone starts fresh, which is what memory-mode users were already experiencing.

Tests: package tests pass (transport, visor, visor/logserver, visor/stats, visor/visorconfig). The only remaining file-mode test (`TestFileTransportLogStore`) is removed along with the implementation.
Summary
Two CXO fixes that together drop a public visor's GC CPU share from ~60% to baseline and substantially reduce live heap. Found via pprof on a public visor running the user-feeds + visor-stats publishers; both fixes apply to every CXO node in the network, not just the visor.
1. cxo/skyobject: `Cache.cleanDown` never actually evicted anything

   `pkg/cxo/skyobject/cache.go:341,345` — the filter loop's "skip wanted" / "skip filling" comments said skip, but the code `return`ed, aborting the entire ranking on the first non-evictable entry encountered during map iteration. On a busy node there is virtually always something filling or being awaited, so the cache never evicted: `putItem` invoked `cleanDown` once the budget was exceeded, `cleanDown` built a `len(c.is)`-sized rank slice and immediately discarded it, and `putItem` then inserted the new item on top of an already-over-budget cache.

   The discarded slice was 95.6% of total bytes allocated under `cleanDown` in the alloc profile; the un-evicted backlog was the dominant share of live heap under `putItem`. Fix: `continue`, not `return`. Also switched the rank slice to `[]rankItem` (value-slice) to drop a per-entry heap allocation.

   This is a pure bug fix and benefits every `skyobject.Container` — every CXO publisher, subscriber, the TPD aggregator, and the standalone `cxo` daemon.

2. cxo/treestore: cache encoded sub-node hashes across publishes

   `Publisher.publishRoot` re-walked the entire in-memory tree on every dirty batch and ran `encoder.Serialize` + `entry.Sub.SetValue` at every level, even when only a single leaf had changed. The package doc already promised content-addressed re-use of unchanged sub-trees ("only the O(depth) chain of objects from the modified leaf up to the root gets re-encoded"); the implementation didn't match.

   Adds a per-`memNode` encode-cache (`pubHash` + `cached`) that's populated after each successful `Save` and invalidated by `putAt` / `deleteAt` / `pruneAt` along the touched path. `encodeNode` now places the cached hash directly into the parent's `TreeEntry` for any unchanged sub-node, skipping the recursive walk and the `encoder.Serialize` / `Cache.Set` chain entirely. `Container.Save`'s reference walk increments the rc on the cached hash without descending into it (`skyobject/unpack.go:156-167`), so the underlying CXO objects stay alive across publishes as long as each successive Root references the same hash.

   Cache promotion is deferred until after `Save` returns nil so an aborted publish doesn't leave the cache pointing at hashes that `up.Close` has rolled back.

   Benefits every `treestore.Publisher` — `pkg/visor/init_stats.go`, `pkg/visor/cxo_user_feeds.go`, `cmd/apps/skychat/pairing/pair.go`.

Test plan

- `go test ./pkg/cxo/...` (treestore, skyobject, node, data, idxdb, cxds — all green)
- `go build ./...`
- `TestEncodeCacheInvalidatesOnlyAffectedPath` pins the sibling-preservation contract: a Put under one branch must not change pubHashes on unrelated subtrees, and `cached` must clear only on the touched path.
- `TestPublisherUnchangedSubtreeKeepsHash` still passes — the doc-comment guarantee is now actively enforced by the cache fast path rather than only emergent from CXO content-addressing.
- `Cache.putItem` live heap shrinks.
go test ./pkg/cxo/...(treestore, skyobject, node, data, idxdb, cxds — all green)go build ./...TestEncodeCacheInvalidatesOnlyAffectedPathpins the sibling-preservation contract: a Put under one branch must not change pubHashes on unrelated subtrees, andcachedmust clear only on the touched path.TestPublisherUnchangedSubtreeKeepsHashstill passes — the doc-comment guarantee is now actively enforced by the cache fast path rather than only emergent from CXO content-addressing.Cache.putItemlive heap shrinks.