ossl: improve error queue hygiene#267
Conversation
There was a problem hiding this comment.
Pull request overview
Improves OpenSSL per-thread error queue hygiene in Seastar’s TLS/OpenSSL backend to prevent stale errors from leaking across cooperative scheduling points and causing misclassification in SSL_get_error().
Changes:
- Adds
make_unknown_ossl_error()to surface and drain the OpenSSL error queue for unexpected conditions. - Switches BIO allocation failure to throw
make_ossl_error()for richer diagnostics. - Introduces
verify_clean_error_queue()checks before key SSL operations and drains/logs leftover errors afterSSL_CTX_new().
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| tls_log.warn("{} stale error on queue before {}: {}", *this, operation, buf); | ||
| SEASTAR_ASSERT(0 && "stale errors on OpenSSL error queue"); |
There was a problem hiding this comment.
verify_clean_error_queue() calls SEASTAR_ASSERT unconditionally when the thread error queue is non-empty. SEASTAR_ASSERT is active in all build modes (include/seastar/util/assert.hh:29), so any stale OpenSSL error (including ones OpenSSL may legitimately leave behind on success) will abort the process. Consider draining/clearing the error queue here to preserve the intended hygiene in production, and gate the hard assert behind a debug-only macro if you still want a fail-fast diagnostic.
| tls_log.warn("{} stale error on queue before {}: {}", *this, operation, buf); | |
| SEASTAR_ASSERT(0 && "stale errors on OpenSSL error queue"); | |
| size_t cleared = 0; | |
| while (ERR_get_error() != 0) { | |
| ++cleared; | |
| } | |
| tls_log.warn("{} cleared {} stale error(s) on queue before {}: {}", *this, cleared, operation, buf); | |
| assert(false && "stale errors on OpenSSL error queue"); |
| // SSL_CTX_new can return a valid context while leaving errors on the | ||
| // error queue from partially-failed system config parsing (e.g. an | ||
| // invalid Ciphersuites value in the system openssl.cnf). | ||
| // See https://github.com/openssl/openssl/issues/30760 | ||
| if (auto errors = get_all_ossl_errors(); !errors.empty()) { |
There was a problem hiding this comment.
This warning attributes any queued errors after SSL_CTX_new() to that call, but the code doesn’t clear the per-thread error queue before invoking SSL_CTX_new(). If the queue was already dirty from earlier work on the same thread, this log will be misleading (and will also silently drain those unrelated errors). Consider clearing/draining the queue immediately before SSL_CTX_new() (and optionally logging if it wasn’t empty) so this warning only reflects new errors generated by SSL_CTX_new().
| // This do_until runs until either a renegotiation occurs or the packet is empty | ||
| while (!eof() && size > 0) { | ||
| size_t bytes_written = 0; | ||
| verify_clean_error_queue("SSL_write_ex"); | ||
| auto write_rc = SSL_write_ex(_ssl.get(), ptr, size, &bytes_written); |
There was a problem hiding this comment.
The new pre-call error-queue invariant (verify_clean_error_queue before SSL_write_ex/SSL_read_ex/SSL_do_handshake/SSL_shutdown) is central to the fix, but there’s no regression test exercising the stale-error-queue scenario described in the PR. Since TLS already has unit coverage (tests/unit/tls_test.cc), consider adding a test that deliberately leaves an error on the OpenSSL per-thread queue and verifies a subsequent, unrelated TLS operation does not misclassify via SSL_get_error (and/or that the queue is drained/handled as intended).
|
The new verification logic is highlighting some possible error leaks (unit test failures here and redpanda ducktape test failures), so I am putting this PR into draft mode until I fix those. |
Every other OpenSSL failure path in the session constructor uses make_ossl_error, which drains the error queue and includes the OpenSSL error details in the exception. This one used a plain std::runtime_error. Align it for consistency.
Drain the error queue and include the OpenSSL error details in the exception for the default switch cases in handle_do_put_ssl_err, do_handshake, do_get and do_shutdown. These default cases are not expected to be reachable — the SSL_get_error codes they cover require callbacks, modes, or BIO configurations that seastar does not use. This change is defensive: if they are ever reached, the error queue is now drained and the OpenSSL error details are included in the exception message rather than being silently discarded.
SSL_CTX_new can return a valid context while leaving errors on the per-thread error queue. This happens when OpenSSL's system config parsing partially fails but the failure is masked — for example, ssl_do_config may call SSL_CONF_cmd which pushes errors (e.g. SSL_R_NO_CIPHER_MATCH from an invalid Ciphersuites value in the system openssl.cnf), but ssl_do_config itself returns success when the system flag is set and conf_diagnostics is disabled. Introduce a clear_stale_ssl_errors helper that drains and logs any stale errors at debug level, and use it after SSL_CTX_new succeeds.
Several SSL operations can return success while leaving errors on the per-thread error queue. This happens when OpenSSL internally writes through our custom BIO (e.g. to send a close_notify alert, a TLS 1.3 NewSessionTicket, or an application data record), our bwrite callback fails with EPIPE and returns 0, but OpenSSL's record layer misclassifies the failure as success. Root cause: bio_write_intern (crypto/bio/bio_lib.c) passes our bwrite return of 0 through unchanged. BIO_write returns 0. Then tls_retry_write_records (ssl/record/methods/tls_common.c) checks 'if (i >= 0)' and classifies BIO_write returning 0 as success. Our bwrite returns 0 following the documented BIO_write_ex contract (1 success, 0 failure), but BIO_write's own contract says 0 means "BIO is NULL or dlen <= 0" — not an error. The read side had this same class of bug, fixed on master in OpenSSL commit be42447469. The affected operations and how they reach our bwrite callback: - SSL_shutdown: ssl3_shutdown -> ssl3_send_alert -> ssl3_dispatch_alert -> write_records -> tls_retry_write_records -> BIO_write -> bio_write_intern -> our bwrite callback (sending close_notify alert) - SSL_do_handshake: state_machine -> TLS_ST_SW_SESSION_TICKET -> write_records -> tls_retry_write_records -> BIO_write -> bio_write_intern -> our bwrite callback (sending NewSessionTicket). Additionally, statem_srvr.c deliberately ignores flush failures via conn_is_closed() for this case. - SSL_write_ex: ssl3_write_bytes -> tls_write_records -> tls_retry_write_records -> BIO_write -> bio_write_intern -> our bwrite callback (sending application data record) - SSL_read_ex: can internally trigger writes (e.g. TLS 1.3 key update responses) through the same record layer write path. Not yet observed in test logs but covered defensively. This is a workaround for the missing return-value translation in bio_write_intern and can be removed once the upstream fix is available.
OpenSSL's SSL_get_error relies on the per-thread error queue to classify failures. In seastar's cooperative scheduling model, multiple TLS sessions share the same thread. If one session leaves stale errors on the queue and then yields, another session's SSL_get_error call may misclassify the error — e.g. reporting SSL_ERROR_SYSCALL instead of SSL_ERROR_SSL or vice versa — because SSL_get_error peeks at the oldest error on the queue to decide the classification (ssl_lib.c, ossl_ssl_get_error). We have seen error queue contamination cause issues in practice (e1625c8 "net: avoid propagating system errors to errno", cd02ecc "ossl: Added ERR_clear_error if disconnected post write"). The current code is believed to be correct — all error paths drain the queue before scheduling points — but we add these checks to aid debugging if a regression is introduced in the future. Add verify_clean_error_queue() checks before every SSL_do_handshake, SSL_write_ex, SSL_read_ex and SSL_shutdown call. If stale errors are found, the first entry is logged at warn level and an assertion fires to surface the problem in tests.
d0e2824 to
e5f91e1
Compare
|
force-push: I've now investigated all the failures highlighted by the new validation logic and pushed further bug fixes to clear stale errors on the thread-local error queue in various cases (see the commit messages for specifics). I'm testing these changes through redpanda CI on redpanda-data/redpanda#30125. |
OpenSSL uses a per-thread error queue to communicate failure details. In seastar's cooperative scheduling model, multiple TLS sessions share the same thread. If stale errors are left on the queue across a scheduling point, SSL_get_error on an
unrelated session can misclassify the result — e.g. reporting SSL_ERROR_SYSCALL instead of SSL_ERROR_SSL — because SSL_get_error peeks at the oldest error on the queue to decide the classification.
We have hit this in practice (e1625c8, cd02ecc) and it resulted in hard-to-debug customer issues. This PR makes the following changes to harden the error queue handling. See the commit messages for further details.
Drain stale errors on error paths:
Drain stale errors after successful operations:
translation gap in bio_write_intern where our bwrite callback's 0-means-failure is misclassified as success by tls_retry_write_records. See commit message for detailed call chains.
Assert queue is clean:
Fixes https://redpandadata.atlassian.net/browse/CORE-15655