
feat(shrex): add resource limiting to shrex server #4898

Open
Wondertan wants to merge 2 commits into main from feat/shrex-rate-limiting

Conversation

Member

@Wondertan Wondertan commented Apr 2, 2026

Adds libp2p resource manager integration to the shrex server: exact per-request memory reservation and principled, hardware-scaled stream limits.

Exact memory reservation: ResponseSize(edsSize int) is added to each shwap ID type and Size(ctx) to shwap.Accessor. The handler now calls file.Size() → ResponseSize(edsSize) → ReserveMemory() before ResponseReader(), so the resource manager budget reflects real allocations per request type rather than a generic estimate.
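
As a concrete illustration of the per-type sizing, here is a self-contained sketch (not the actual code): shareSize stands in for libshare.ShareSize, and axisRootSize = 90 is an assumed value for share.AxisRootSize.

```go
package main

import (
	"fmt"
	"math"
)

const (
	shareSize    = 512 // stand-in for libshare.ShareSize
	axisRootSize = 90  // assumed value for share.AxisRootSize (two 29-byte namespaces + 32-byte hash)
)

// sampleResponseSize mirrors SampleID.ResponseSize: one share plus one NMT
// proof node per level of the tree over the row.
func sampleResponseSize(edsSize int) int {
	return shareSize + axisRootSize*int(math.Log2(float64(edsSize)))
}

// rowResponseSize mirrors RowID.ResponseSize: the ODS half of a single row.
func rowResponseSize(edsSize int) int {
	return edsSize / 2 * shareSize
}

func main() {
	fmt.Println(sampleResponseSize(512)) // 512 + 90*9 = 1322 bytes
	fmt.Println(rowResponseSize(512))    // 256 * 512 = 131072 bytes (128 KiB)
}
```

This makes the asymmetry visible: a sample response is on the order of a kilobyte, while a row response is two orders of magnitude larger, which is why a flat per-stream constant was a poor fit.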

Resource limits (limits.go) — all service and protocol limits derive from a single value: the worst-case response
size across registered request types at MaxSquareSize (~32 MiB for a 512-wide square). Stream and memory budgets are kept self-consistent via streamIncrease = 1 GiB / maxResponseSize and scale with hardware through BaseLimitIncrease. Protocol per-peer caps (256 for SampleID, 16 for namespace/row, 8 for EDS) are DAS-workload heuristics that fire at stream creation before the handler runs. Memory is tracked at the service scope only, since ReserveMemory is called inside the handler after SetService.

Other — fix a double-reset where s.Reset() was called after handleDataRequest already called ResetWithError(StreamResourceLimitExceeded), silently overwriting the error code. Per-peer rate limiting is deferred to a follow-up. Add client-side handling for limits, ensuring peers are set on cooldown, rather than blocklisted.
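
The client-side policy can be sketched as follows (hypothetical names and values; the real code reacts to libp2p's StreamResourceLimitExceeded reset code, for which the constant below is only a local stand-in):

```go
package main

import "fmt"

// streamResourceLimitExceeded stands in for libp2p's
// network.StreamResourceLimitExceeded reset code; the value is illustrative.
const streamResourceLimitExceeded = 2

type peerAction int

const (
	actionCooldown  peerAction = iota // temporary overload: retry another peer, revisit later
	actionBlocklist                   // misbehavior: drop the peer for good
)

// classifyReset treats a resource-limit reset as temporary overload rather
// than misbehavior, so the peer is put on cooldown instead of blocklisted.
func classifyReset(code int) peerAction {
	if code == streamResourceLimitExceeded {
		return actionCooldown
	}
	// other reset codes keep whatever handling the client already applies
	return actionBlocklist
}

func main() {
	fmt.Println(classifyReset(streamResourceLimitExceeded) == actionCooldown) // true
}
```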

Closes https://linear.app/celestia/issue/PROTOCO-1326/handle-celestia-246

@Wondertan Wondertan requested a review from a team as a code owner April 2, 2026 14:47
@Wondertan Wondertan requested a review from ninabarbakadze April 2, 2026 14:47
@Wondertan Wondertan self-assigned this Apr 2, 2026
@Wondertan Wondertan force-pushed the feat/shrex-rate-limiting branch from 90d6dba to d10fac7 on April 2, 2026 14:51
@Wondertan Wondertan changed the title feat(shrex): add rate limiting and resource management to shrex server feat(shrex)!: add rate limiting and resource management to shrex server Apr 2, 2026
@Wondertan
Member Author

Note for the release: this is a config-breaking change and requires operators to run config-update manually.

@Wondertan Wondertan force-pushed the feat/shrex-rate-limiting branch from d10fac7 to b2bed67 on April 2, 2026 14:56
Introduce principled libp2p resource manager limits and exact per-request
memory reservation for the shrex server.

Memory reservation:
- Add ResponseSize(edsSize int) to each shwap ID type so the server can
  compute the exact bytes it will read before touching the accessor.
- Add Size(ctx) to shwap.Accessor to look up the actual EDS square size
  from the open file.
- The handler now calls file.Size() → ResponseSize(edsSize) → ReserveMemory()
  before ResponseReader(), replacing a flat per-stream constant with an
  exact per-request budget.

Resource limits (new limits.go):
- All service and protocol limits are derived from a single source of
  truth: maxResponseSize at MaxSquareSize (e.g. 32 MiB for a 512-wide square).
- streamIncrease = 1 GiB / maxResponseSize keeps stream and memory budgets
  self-consistent and auto-adjusts if response sizes change.
- Service per-peer cap = streamIncrease / minSimultaneousPeers ensures at
  least N peers can be fully served in parallel.
- Protocol per-peer caps are DAS-workload heuristics (256 for SampleID,
  16 for namespace/row, 8 for EDS) and fire at stream creation before the
  handler runs and before any memory is reserved.
- Memory is tracked only at the service scope to avoid double-counting
  with protocol-scope limits.

Other:
- Fix double-reset: statusResourceExhausted no longer calls s.Reset() after
  handleDataRequest already called ResetWithError(StreamResourceLimitExceeded).
- Per-peer rate limiting is deferred to a follow-up; rcmgr stream limits
  provide primary backpressure for this change.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Wondertan Wondertan force-pushed the feat/shrex-rate-limiting branch from 3a0d1c1 to d0fac9a on April 3, 2026 18:25
@Wondertan
Member Author

  • Do the defaults meet the product expectations?
  • Do we want to expose all, a subset, or none of the rate tunables in the config? Currently, all are exposed.

The defaults now autoscale with the available memory. The entire configuration is derived from protocol heuristics and params, and nothing is exposed as configuration. It aims to saturate as much RAM as is provisioned to the process: the more RAM, the more concurrent peers can be served.
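
Roughly, capacity scales linearly with the budget. A back-of-the-envelope illustration (assuming the ~32 MiB worst-case response bounds per-request memory; not the actual scaling code):

```go
package main

import "fmt"

// worstCaseResponses estimates how many worst-case (~32 MiB) responses fit
// in a given memory budget; purely illustrative of the autoscaling claim.
func worstCaseResponses(budgetBytes int) int {
	const maxResponseSize = 32 << 20
	return budgetBytes / maxResponseSize
}

func main() {
	for _, gib := range []int{1, 4, 16} {
		fmt.Printf("%d GiB -> %d concurrent worst-case responses\n", gib, worstCaseResponses(gib<<30))
	}
}
```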

@Wondertan Wondertan changed the title feat(shrex)!: add rate limiting and resource management to shrex server feat(shrex): add rate limiting and resource management to shrex server Apr 3, 2026
@Wondertan Wondertan changed the title feat(shrex): add rate limiting and resource management to shrex server feat(shrex): add resource limiting to shrex server Apr 3, 2026
When the remote peer's resource manager rejects a stream due to hitting
a limit (stream count or memory budget), the libp2p stream reset carries
ErrorCode StreamResourceLimitExceeded. Surface this as ErrResourceExhausted
in the client and treat it as a temporary overload — cooldown the peer and
retry on a different one rather than blacklisting.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Wondertan Wondertan force-pushed the feat/shrex-rate-limiting branch from 8ffc50e to 554dc92 on April 3, 2026 19:45
Contributor

@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 2 new potential issues.


Comment on lines +209 to +212
// ResponseSize returns 1 share + NMT proof bytes for the given EDS square size.
func (sid SampleID) ResponseSize(edsSize int) int {
	return libshare.ShareSize + share.AxisRootSize*int(math.Log2(float64(edsSize)))
}
Contributor

🟡 ResponseSize for SampleID underestimates memory for resource reservation

The SampleID.ResponseSize formula libshare.ShareSize + share.AxisRootSize*int(math.Log2(float64(edsSize))) estimates only the share data plus the NMT proof nodes along the path. However, the actual serialized Sample written via sample.WriteTo(buf) in sample_id.go:221 includes additional proof structure (sibling hashes at each proof level, proof type metadata, etc.), not just one AxisRootSize per tree level. This means the memory reserved via stream.Scope().ReserveMemory(memReserve, ...) at share/shwap/p2p/shrex/server.go:200 can be less than what the response actually buffers. While this is unlikely to cause an immediate crash (the underlying allocation still happens), it defeats the purpose of the resource manager memory tracking — the actual memory consumed by the response can exceed what was reserved, allowing more concurrent requests than the budget intends and potentially leading to OOM under load.


Member Author

The underestimate is ~35 bytes, which is negligible on GiB scale memory budget.

Comment on lines +143 to +146
// ResponseSize returns the ODS half-row size in bytes for the given EDS square size.
func (rid RowID) ResponseSize(edsSize int) int {
	return edsSize / 2 * libshare.ShareSize
}
Contributor

🟡 RowID.ResponseSize underestimates by not accounting for proof/metadata bytes

Similar to SampleID, RowID.ResponseSize returns edsSize / 2 * libshare.ShareSize which accounts only for the raw share bytes of the ODS half-row. However, the actual response serialized by row.WriteTo(buf) at row_id.go:156 includes additional metadata (e.g., the axis half type indicator, share proofs). The memory reserved at server.go:200 will be less than the actual buffer size, allowing the resource manager to admit more concurrent requests than the memory budget can handle.


Member Author

The underestimate is ~35 bytes, which is negligible on GiB scale memory budget.
