Skip to content

Update seastar for v25.3.x#228

Merged
piyushredpanda merged 86 commits intoredpanda-data:v25.3.xfrom
BenPope:v25.3.x-update
Aug 11, 2025
Merged

Update seastar for v25.3.x#228
piyushredpanda merged 86 commits intoredpanda-data:v25.3.xfrom
BenPope:v25.3.x-update

Conversation

@BenPope
Copy link
Copy Markdown

@BenPope BenPope commented Aug 6, 2025

Notes for reviewer:

  • Many commits reordered compared to previous branches to colocate similar changes and fixups
  • All the "old" OpenSSL work was ripped out, and replaced with tls: Introduce OpenSSL scylladb/seastar#2569
    • Moved some stuff after these commits to make future rebases easier, and keep the changes more in sync

mmaslankaprv and others added 19 commits August 6, 2025 13:55
Seastar http server implementation supports multiple listeners. It may
be required for the handler logic to know which listener the connection
is coming from. Added listener_idx field to `httpd::request` to allow
handler recognize listener.

Signed-off-by: Michal Maslanka <michal@vectorized.io>
Since an exception carries some text for the response
body text, the raising site might like to specify
the content type if it's e.g. json.

Signed-off-by: John Spray <jcs@vectorized.io>
This enables throwing a base_exception from
a json request handler with a json payload
inside it.

Signed-off-by: John Spray <jcs@vectorized.io>
Signed-off-by: John Spray <jcs@vectorized.io>
Prior to this patch seastar only exposes one global metrics::impl::impl
object which holds all metric related data for one application.

This patch changes the implementation details such that multiple
metrics::impl::impl objects can exist for any given application.
Said objects are stored into a map on each shard and created
dinamically whenever requested. A metrics::impl::impl is identified
by an integer handle that acts as the key for the storage map.

Implementation note: in order to avoid issues caused by the ordering
of static thread_local objects I had to declare the storage in
reactor.cc.

(cherry picked from commit 585a8af)
This patch extends the metrics internal apis to use a specific
metrics::impl::impl object identified by its integer handle.

(cherry picked from commit 6ee4af7)
Add a public method to metric_groups_impl that exposes the handle
of the internal implementation it is using. This is required in order
for the metric_groups class to be able to reset itself to the configured
implementation handle.
This patch extends the metrics user facing apis to use a specific
metrics::impl::impl object identified by its integer handle.

Note that the constructor of 'metric_groups' is marked explicit
in this patch and updates two call sites where the constructor was used
implicitly.
This patch removes two subsequent calls to `get_local_impl` and reuses
the returned handle in that scope.
This patch extends the user facing prometheus apis allowing the user to
specify the internal metrics implementation to be used through a handle.
Additionally, 'add_prometheus_routes' now takes an argument that
specifies the route on which to advertise the metrics. This enables
different metrics "namespaces" to be served by different endpoints in
isolation.

(cherry picked from commit 6189522)
This patch extends the scollectd apis with the ability to select the
internal metrics implementation to be used by providing a handle.

(cherry picked from commit d4331d1)
This patch adds a 'get_skip_when_empy' getter to the 'registered_metric'
class. It is used by follow-up patches in order to replicate metrics.
This patch adds private methods to the 'metrics::impl' class that deal
with the creation of replicated metrics. They will be used to build the
public api in future commits.
This patch adds private helpers to 'metrics::impl' that deal with the
removal of replicated metric families from their destintation
implementation. These methods will be used in subsequent commits to
manage the lifetime of replicated metrics.
This patch adds a public method to the 'metrics::impl' class:
'set_metric_families_to_replicate'. When this method is called
the families that match any of the specifications will be replicated
on the specified destinations.
This patch extends the metric registration and unregistration processes
to make them aware of metric replication.

In the case of metric registration, if the new metric belongs to a
family that matches one of the replication specs, then a replicated
metric is created accordingly.

For unregistration of a metric, the replicated metric is unregistered
too if one exists.
This patch exposes a method in the public interface of the metrics
module ('replicate_metric_families'), which enables metric replication
internally for the requested metric families.
Extends the metrics api to allow changing the aggregation labels of a
metrics family.

Otherwise one had to un-register every single metric instance in a
metric family and then re-register with the changed aggregation labels.

For metric families with thousands of instances (e.g.: histograms with
lots of different labels) this is quite expensive.

With this change we avoid the full reconstruction of the metrics family
and all its metrics. Only the work associated with marking the metrics
`dirty()` is needed then.
@BenPope BenPope requested a review from Copilot August 6, 2025 13:05

This comment was marked as outdated.

@BenPope BenPope requested review from a team, Copilot, michael-redpanda and pgellert and removed request for a team August 6, 2025 13:07
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Updates the Seastar library to v25.3.x with significant cryptographic infrastructure changes and comprehensive test infrastructure improvements.

Key changes:

  • Replaces OpenSSL implementation with a new TLS provider abstraction supporting both OpenSSL and GnuTLS
  • Adds CPU profiler functionality with comprehensive testing and signal handling
  • Introduces metric family replication capabilities and Prometheus integration testing

Reviewed Changes

Copilot reviewed 71 out of 71 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
tests/unit/tls_test.cc Updates TLS tests to support dual crypto provider architecture with OpenSSL and GnuTLS branches
tests/unit/stall_detector_test_utilities.hh Extracts reusable stall detector testing utilities to shared header
tests/unit/stall_detector_test.cc Refactors to use shared utilities and removes code duplication
tests/unit/prometheus_test.cc Adds comprehensive testing for Prometheus metrics HTTP endpoint
tests/unit/metric_family_replication_test.cc Introduces testing for metrics replication across handles
tests/unit/cpu_profiler_test.cc Comprehensive CPU profiler testing with signal handling and timing validation
tests/unit/cpu_profiler_alloc_test.cc Tests CPU profiler signal handler memory allocation safety
tests/unit/CMakeLists.txt Updates build configuration for new tests and certificate generation
tests/perf/linux_perf_event.cc Removes assertion on perf event read result
src/websocket/common.cc Adds dual crypto provider support for WebSocket SHA1 and base64 operations
src/util/backtrace.cc Adds guarded backtrace function with signal safety
src/seastar.cc Updates includes for new TLS implementation structure
src/rpc/rpc.cc Fixes buffer construction to use explicit constructor
src/net/tls.cc Major refactor splitting TLS implementation with provider abstraction
src/net/tls-impl.hh New header defining TLS implementation interfaces and common structures
src/net/tls-impl.cc Implementation of shared TLS functionality and credential management
src/net/posix-stack.cc Fixes namespace qualification for format function calls
Comments suppressed due to low confidence (1)

tests/unit/stall_detector_test_utilities.hh:77

  • [nitpick] The function name 'spin_user_hires' is ambiguous. Consider renaming to 'spin_user_high_resolution' to clarify that it uses high resolution clock for user-space spinning.
[[maybe_unused]] void spin_user_hires(std::chrono::duration<double> how_much) {

Comment thread tests/unit/cpu_profiler_test.cc
Comment thread tests/unit/cpu_profiler_test.cc
Comment thread tests/unit/tls_test.cc
Comment thread src/net/tls-impl.cc
mmaslankaprv and others added 24 commits August 10, 2025 10:37
We simply ignore premature termination error as it is GnuTLS specific.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
sigev_notify_thread_id is defined as a macro by musl libc, FreeBSD, and the Linux kernel.
The macro is for accessing the corresponding field in sigevent.

see https://sourceware.org/bugzilla/show_bug.cgi?id=27417

(cherry picked from commit bbe1af3)

tdowns note: this is a re-pick of the above as the function moved so the original was not applied
Created tls-impl.cc and tls-impl.h which contains common structures and
definitions that are not dependent on the underlying TLS mechanism.

These changes set the stage for implementing other TLS providers.

Signed-off-by: Michael Boquard <michael@redpanda.com>
This commit adds support for using OpenSSL, instead of GnuTLS, as the
TLS provider within Seastar.  To support this change, the configure
script has been updated to allow users to select which cryptographic
provider should be used by supply `--crypto-provider` and specificying
either `OpenSSL` or `GnuTLS`.

The OpenSSL implementation mirrors the GnuTLS implementation.  Instead
of using callbacks, a custom BIO was created to handle moving data
on/off of the OpenSSL SSL session into the Seastar TLS session data
sinks.

When compiled for OpenSSL, the
`certificate_credentials::set_priority_string` method is compiled out and
replaced with the following:

* `set_cipher_string`
* `set_ciphersuites`
* `enable_server_precedence`
* `set_minimum_tls_version`
* `set_maximum_tls_version`

These methods are specific to OpenSSL.

The github actions have been updated to run the full suite of tests
against both cryptographic providers.

`src/net/tcp.hh` and `src/websocket/server.cc` have been updated to use
OpenSSL instead of GnuTLS, depending upon the build configuration.

Signed-off-by: Michael Boquard <michael@redpanda.com>
Added pretty-print capabilities to seastar::tls::session for OpenSSL and
added a number of log statements that may be helpful if debugging the
implementation.

Signed-off-by: Michael Boquard <michael@redpanda.com>
Prior to this commit, in certain situations, the put() and get() methods
would 'drop into' handshake() if they determined that a re-negotiation
was occurring.  This was unecessary as OpenSSL provides helpful error
codes like SSL_WANT_READ and SSL_WANT_WRITE which indicates if OpenSSL
needed to send/receive non-user data to handle re-negotiations.

This may have also led to a rare situation where the put() method would
assert because the _output_pending future had not resolved.  This change
properly handles the SSL_WANT_READ and SSL_WANT_WRITE conditions and
ensures that the _output_pending future resolves.

Signed-off-by: Michael Boquard <michael@redpanda.com>
More recent versions of OpenSSL requrire CA certificates to have CA:true

Signed-off-by: Michael Boquard <michael@redpanda.com>
Now handling situations where the get() call doesn't throw but does
return an empty buffer indicating EOF.

Signed-off-by: Michael Boquard <michael@redpanda.com>
TLS renegotiation was removed in TLSv1.3 due to issues around security.
This change makes that the default behavior, however a method is exposed
to re-enable it if desired.  This was only added to OpenSSL as GnuTLS
does not provide a means to fully disable renegotiation.

Signed-off-by: Michael Boquard <michael@redpanda.com>
Added a dn_format flag to switch between legacy and RFC2253 format.
Legacy format matches that originally returned by the GnuTLS api
`gnutls_x509_crt_get_dn` which returns the DN in the format of
`C=US,ST=London,L=London,O=Redpanda Data,OU=Core,CN=id`.  RFC2253
format, effectively reverses the order: `CN=id,OU=Core,O=Redpanda
Data,L=London,ST=London,C=US`.

Signed-off-by: Michael Boquard <michael@redpanda.com>
We need to build with OpenSSL for the TLS tests to work on our fork.
Seastar has a C++20 modules variation of the build in GHA, but our
changes have left it quite broken as we don't use modules.

Disable this build for now so we can get the build green and protect
the seastar branch.
This commit introduces an alternative reload callback prototype which
includes the credentials that were reloaded.

In service of that, also adds `reloadable_credentials_base::as_certificate_credentials`
to produce a const-ref to `certificate_credentials`, suitable for exposure
through the new callback.

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
This commit introduces `struct cert_info` which encapsulates useful
information about TLS certificates, currently:

- The certificate's serial number (as assigned by the signing CA)
- The certificate's expiration time

Additionally, this commit adds public API on `tls::certificate_credentials`
to access this information for loaded certificates AND the loaded trust
list.

The motivation for this change is to give applications the ability to
monitor and report on certificate and CA lifetimes on demand, the idea
being that an application may cache serial numbers and expiries when
credentials are reloaded, serving up a summary of those details until
the next reload time.

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
Signed-off-by: Ben Pope <ben@redpanda.com>
Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
GnuTLS provides serial numbers not as a legible string of
hex digits but as a byte array corresponding to an integer
value up to 20 octets wide. The cert_info interface should
reflect this.

To that end, this commit updates the type of `cert_info::serial`
to an non-null-terminated basic_sstring of uint8_t.

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
Passed up to application code as a tls::blob (basic_string_view<char>).

Also provide the ability to fetch this blob directly via
credentials_builder::get_trust_file_blob()

Updates unit tests to accommodate the new reload callback signature.

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
char_traits<unsigned> is deprecated in LLVM 18, slated for
removal in LLVM 19. Therefore remove this non-standard usage
from cert_info.

Extracting the x509 serial number is a relatively infrequent
operation and the result is always <= 20B, so the cost of using
a vector here should be marginal.

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
We disable interception of _Unwind_RaiseException when ASAN is enabled
since it can produce a large number of stack-related false positives
when it is enabled and exceptions are thrown.

Fixes scylladb#2892.
Redpanda builds seastar with depcrecations as errors. However, some of
the newly deprecated functions in Seastar are used heavily in RP. Hence
this commit temporarily comments those annotations out until their usage
is removed in RP.
Pure non-functional refactor to split the long line into three.

(cherry picked from commit a0155ba)
Switch the io-queue cost function over to use max(IOPS cost, TP cost)
away from sum(IOPS cost, TP cost).

In its form today the io-queue fails to reach the advertised
throughput/IOPS numbers from io-properties (see
[seastar#2771](scylladb#2771) for
details).

The reason for this is easy to understand when looking at a disk like the
ones from the m7gd AWS series. They (approximately) achieve full throughput as
well as full IOPS at 4k IOP size.

This means that for cost modelling for a single IO operation it would be
enough to just look at one dimension, either throughput or IOPS but not
both. The current cost function which is always additive in IOP and throughput
cost however costs operation way too high. This leads to the aforementioned
problem where the io-queue underperforms. Taking 4k IOPS on
m7gd as an example we only get around 60% of the disk performance.

From various testing the max cost function seems to model disks better
than the sum cost function and avoids the problem mentioned above. Note
this doesn't change the "outer" cost model of how the costs of
individual operations are summed together. That also isn't modelled
entirely right for more modern disks but is pessimistically fine and not
the focus of this patch.

The problem can be easily shown using the following io-tester config:

```
- name: highprio
  shards: [0]
  type: randwrite
  shard_info:
    parallelism: 32
    reqsize: 4kB|16kB|128kB
    shares: 1000
    think_time: 0

- name: lowprio
  shards: [0]
  type: randwrite
  shard_info:
    parallelism: 32
    reqsize: 4kB|16kB|128kB
    shares: 100
    think_time: 0
```

On m7gd.4xl (all numbers in MB/s):
 - NOIP: Not using io-properties
 - IP: Using io-properties
 - MCF: Using io-properties with the max cost function

|       | 4k        |           |       | 16k       |           |       | 128k      |           |       |
|       | high prio | low prio  | total | high prio | low prio  | total | high prio | low prio  | total |
| NOIP  | 303       | 287       | 591   | 299       | 292       | 591   | 295       | 295       | 590   |
| IP    | 262       | 28        | 290   | 432       | 44        | 476   | 528       | 53        | 581   |
| MCF   | 476       | 56        | 532   | 543       | 55        | 598   | 544       | 55        | 599   |

We see that at low IOP size using no io-properties we do get the full
throughput but no scheduling (bad), using io-properties we do get scheduling
but low throughput and with the max cost function we get both.

More data and extensive discussion can be found here:

 - scylladb#2801 (comment)
 - scylladb#2840 (comment)
 - https://docs.google.com/spreadsheets/d/1u_9tRxePkq-OYFiba8xz4JY54P2VJ-JQBvhuBv04tmo

These tests also include latency tests in priority scenarios and
different kind of disks like EBS. The linked PR also includes a
discussion about a different approach using a weighted cost function.

Note that in general latency is often a function of throughput as the
io-queue arbitrarily keeping things in the queue induces latency. This
is especially true for real world usecases like RP where write
amplification is generally fairly high but at the same time can be
absorbed at the cost of latency.

(cherry picked from commit 9bf2094)
@BenPope
Copy link
Copy Markdown
Author

BenPope commented Aug 10, 2025

Changes in force-push:

  • Added two most recent commits to v25.3.x-pre
  • Removed 3 commits that were already upstream
  • Deflake cpu_profiler test again (move from 5 allowed dropped_samples to 10)
  • Actually fixup the fixups

@BenPope BenPope requested a review from pgellert August 10, 2025 09:47
@piyushredpanda piyushredpanda merged commit 1fea24e into redpanda-data:v25.3.x Aug 11, 2025
31 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.