feat(shrex): add resource limiting to shrex server #4898
force-pushed from 90d6dba to d10fac7
Note for the release: this is a config-breaking change and requires operators to call |
force-pushed from d10fac7 to b2bed67
Introduce principled libp2p resource manager limits and exact per-request memory reservation for the shrex server.

Memory reservation:
- Add `ResponseSize(edsSize int)` to each shwap ID type so the server can compute the exact bytes it will read before touching the accessor.
- Add `Size(ctx)` to `shwap.Accessor` to look up the actual EDS square size from the open file.
- The handler now calls `file.Size()` → `ResponseSize(edsSize)` → `ReserveMemory()` before `ResponseReader()`, replacing a flat per-stream constant with an exact per-request budget.

Resource limits (new `limits.go`):
- All service and protocol limits are derived from a single source of truth: `maxResponseSize` at `MaxSquareSize` (e.g. 32 MiB for a 512-wide square).
- `streamIncrease = 1 GiB / maxResponseSize` keeps stream and memory budgets self-consistent and auto-adjusts if response sizes change.
- The service per-peer cap, `streamIncrease / minSimultaneousPeers`, ensures at least N peers can be fully served in parallel.
- Protocol per-peer caps are DAS-workload heuristics (256 for SampleID, 16 for namespace/row, 8 for EDS) and fire at stream creation, before the handler runs and before any memory is reserved.
- Memory is tracked only at the service scope to avoid double-counting with protocol-scope limits.

Other:
- Fix double-reset: `statusResourceExhausted` no longer calls `s.Reset()` after `handleDataRequest` already called `ResetWithError(StreamResourceLimitExceeded)`.
- Per-peer rate limiting is deferred to a follow-up; rcmgr stream limits provide primary backpressure for this change.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
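A minimal sketch of the reservation ordering described above. The scope and handler here are illustrative stand-ins, not the actual shrex/shwap APIs; the point is only that the exact response size is reserved before any response buffer is opened.

```go
package main

import (
	"errors"
	"fmt"
)

// memScope mimics a libp2p stream scope with a fixed memory budget.
type memScope struct{ used, budget int }

func (s *memScope) ReserveMemory(n int) error {
	if s.used+n > s.budget {
		return errors.New("resource limit exceeded")
	}
	s.used += n
	return nil
}

// rowSize stands in for ResponseSize(edsSize): an ODS half-row in bytes,
// assuming a 512-byte share.
func rowSize(edsSize int) int { return edsSize / 2 * 512 }

// handle reserves the exact response size before opening the response reader.
func handle(sc *memScope, edsSize int) error {
	memReserve := rowSize(edsSize) // file.Size() -> ResponseSize(edsSize)
	if err := sc.ReserveMemory(memReserve); err != nil {
		return err // rejected before any buffer is allocated
	}
	// ... ResponseReader() and the stream write would follow here ...
	return nil
}

func main() {
	// A 256 KiB budget fits exactly two 128 KiB half-rows; the third is rejected.
	sc := &memScope{budget: 256 << 10}
	fmt.Println(handle(sc, 512), handle(sc, 512), handle(sc, 512))
}
```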
force-pushed from 3a0d1c1 to d0fac9a
The defaults now autoscale with the available memory. The entire configuration is derived from protocol heuristics and parameters; nothing is exposed as user configuration. It aims to saturate as much RAM as is provisioned to the process: the more RAM, the more concurrent peers can be served.
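The scaling behavior can be illustrated with a small calculation. The halving fraction and the 32 MiB worst-case response size are assumptions for this sketch, not the actual defaults:

```go
package main

import "fmt"

// maxResponseSize is the assumed worst-case response (~32 MiB at MaxSquareSize).
const maxResponseSize = 32 << 20

// concurrentResponses derives how many worst-case responses fit in the
// resource manager budget, supposing half of provisioned RAM is granted to it.
func concurrentResponses(systemRAM uint64) uint64 {
	budget := systemRAM / 2
	return budget / maxResponseSize
}

func main() {
	fmt.Println(concurrentResponses(8 << 30))  // 8 GiB of RAM
	fmt.Println(concurrentResponses(32 << 30)) // 32 GiB of RAM
}
```

Doubling the provisioned RAM doubles the number of worst-case responses that can be in flight, which is the autoscaling property the change is after.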
When the remote peer's resource manager rejects a stream due to hitting a limit (stream count or memory budget), the libp2p stream reset carries ErrorCode StreamResourceLimitExceeded. Surface this as ErrResourceExhausted in the client and treat it as a temporary overload: cool down the peer and retry on a different one rather than blacklisting. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
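A sketch of that client-side classification. The reset-error type and numeric code here are placeholders; the real client inspects the libp2p stream-reset error code carried on the network error:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrResourceExhausted marks a temporary remote overload: cool the peer down
// and retry on another peer instead of blacklisting it.
var ErrResourceExhausted = errors.New("shrex: remote peer resource limit exceeded")

// streamResourceLimitExceeded is a placeholder for libp2p's
// StreamResourceLimitExceeded error code.
const streamResourceLimitExceeded = 2

// streamResetErr mimics a stream reset carrying an error code.
type streamResetErr struct{ code int }

func (e *streamResetErr) Error() string { return fmt.Sprintf("stream reset (code %d)", e.code) }

// classify maps a resource-limit reset to ErrResourceExhausted and passes
// every other error through unchanged.
func classify(err error) error {
	var reset *streamResetErr
	if errors.As(err, &reset) && reset.code == streamResourceLimitExceeded {
		return ErrResourceExhausted
	}
	return err
}

func main() {
	fmt.Println(classify(&streamResetErr{code: streamResourceLimitExceeded}))
}
```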
force-pushed from 8ffc50e to 554dc92
```go
// ResponseSize returns 1 share + NMT proof bytes for the given EDS square size.
func (sid SampleID) ResponseSize(edsSize int) int {
	return libshare.ShareSize + share.AxisRootSize*int(math.Log2(float64(edsSize)))
}
```
🟡 ResponseSize for SampleID underestimates memory for resource reservation
The SampleID.ResponseSize formula libshare.ShareSize + share.AxisRootSize*int(math.Log2(float64(edsSize))) estimates only the share data plus the NMT proof nodes along the path. However, the actual serialized Sample written via sample.WriteTo(buf) in sample_id.go:221 includes additional proof structure (sibling hashes at each proof level, proof type metadata, etc.), not just one AxisRootSize per tree level. This means the memory reserved via stream.Scope().ReserveMemory(memReserve, ...) at share/shwap/p2p/shrex/server.go:200 can be less than what the response actually buffers. While this is unlikely to cause an immediate crash (the underlying allocation still happens), it defeats the purpose of the resource manager memory tracking — the actual memory consumed by the response can exceed what was reserved, allowing more concurrent requests than the budget intends and potentially leading to OOM under load.
The underestimate is ~35 bytes, which is negligible on GiB scale memory budget.
```go
// ResponseSize returns the ODS half-row size in bytes for the given EDS square size.
func (rid RowID) ResponseSize(edsSize int) int {
	return edsSize / 2 * libshare.ShareSize
}
```
🟡 RowID.ResponseSize underestimates by not accounting for proof/metadata bytes
Similar to SampleID, RowID.ResponseSize returns edsSize / 2 * libshare.ShareSize which accounts only for the raw share bytes of the ODS half-row. However, the actual response serialized by row.WriteTo(buf) at row_id.go:156 includes additional metadata (e.g., the axis half type indicator, share proofs). The memory reserved at server.go:200 will be less than the actual buffer size, allowing the resource manager to admit more concurrent requests than the memory budget can handle.
The underestimate is ~35 bytes, which is negligible on GiB scale memory budget.
Adds libp2p resource manager integration to the shrex server: exact per-request memory reservation and principled, hardware-scaled stream limits.
**Exact memory reservation:** `ResponseSize(edsSize int)` is added to each shwap ID type and `Size(ctx)` to `shwap.Accessor`. The handler now calls `file.Size()` → `ResponseSize(edsSize)` → `ReserveMemory()` before `ResponseReader()`, so the resource manager budget reflects real allocations per request type rather than a generic estimate.

**Resource limits (`limits.go`):** all service and protocol limits derive from a single value: the worst-case response size across registered request types at `MaxSquareSize` (~32 MiB for a 512-wide square). Stream and memory budgets are kept self-consistent via `streamIncrease = 1 GiB / maxResponseSize` and scale with hardware through `BaseLimitIncrease`. Protocol per-peer caps (256 for `SampleID`, 16 for namespace/row, 8 for EDS) are DAS-workload heuristics that fire at stream creation, before the handler runs. Memory is tracked at the service scope only, since `ReserveMemory` is called inside the handler after `SetService`.

**Other:** fix a double-reset where `s.Reset()` was called after `handleDataRequest` had already called `ResetWithError(StreamResourceLimitExceeded)`, silently overwriting the error code. Per-peer rate limiting is deferred to a follow-up. Add client-side handling for limits, ensuring peers are put on cooldown rather than blocklisted.

Closes https://linear.app/celestia/issue/PROTOCO-1326/handle-celestia-246
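The stream-budget arithmetic works out as follows. `maxResponseSize` matches the ~32 MiB worst case quoted above, while `minSimultaneousPeers = 8` is a hypothetical value, not taken from the actual `limits.go`:

```go
package main

import "fmt"

const (
	gib                  = 1 << 30
	maxResponseSize      = 32 << 20 // worst-case response at MaxSquareSize
	minSimultaneousPeers = 8        // hypothetical: peers guaranteed full service
)

func main() {
	// streamIncrease: streams granted per GiB of memory budget, keeping
	// stream count and memory self-consistent.
	streamIncrease := gib / maxResponseSize
	// Service per-peer cap: ensure at least minSimultaneousPeers peers can
	// each be fully served in parallel out of one GiB's worth of streams.
	perPeerCap := streamIncrease / minSimultaneousPeers
	fmt.Println(streamIncrease, perPeerCap) // 32 4
}
```

If response sizes ever change, `streamIncrease` shifts automatically, which is why everything is derived from the single worst-case value rather than hard-coded.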