Skip to content

test(pgsql-ssl_keylog): fix container-local path + NSS label regex#5612

Merged
renecannao merged 1 commit intov3.0from
v3.0-fix-pgsql-ssl-keylog-path
Apr 11, 2026
Merged

test(pgsql-ssl_keylog): fix container-local path + NSS label regex#5612
renecannao merged 1 commit intov3.0from
v3.0-fix-pgsql-ssl-keylog-path

Conversation

@renecannao
Copy link
Copy Markdown
Contributor

@renecannao renecannao commented Apr 11, 2026

Part of #5610 — second of four tests from the flake tracking issue.

Correction to the premise of #5610: this test was miscategorized as "flaky". It was actually deterministically broken with two stacked bugs, both in the test itself. Reproduced 3/3 failures locally before the fix; 5/5 passes after. Not a timing race, not an environmental issue.

Bug 1 — /tmp/ is container-local

The test used /tmp/pgsql_ssl_keylog_test.log as the shared keylog path between ProxySQL (writer) and the test runner (reader). In the isolated CI infra each of those runs in its own docker container, and each container has its own /tmp/. ProxySQL was happily writing a 33 KB keylog file inside its own container; the test runner was reading an empty /tmp/ inside its container and getting -1 from count_file_lines() on every subtest.

Verified locally:

$ docker exec proxysql.<infra> ls -la /tmp/pgsql_ssl_keylog_test.log
-rw-r--r-- 1 root root 33768 /tmp/pgsql_ssl_keylog_test.log    # exists, populated

$ ls /tmp/pgsql_ssl_keylog_test.log                            # test runner's view
ls: cannot access: No such file or directory

Fix: change KEYLOG_PATH to /var/lib/proxysql/pgsql_ssl_keylog_test.log. That directory IS bind-mounted from the host into BOTH containers by test/infra/control/run-tests-isolated.bash (-v \${PROXY_DATA_DIR_HOST}:/var/lib/proxysql appears on both the proxysql and test-runner containers). Writes on one side are immediately readable on the other.

After this fix alone, subtests 1/2/3/5/6/8/9 started passing.

Bug 2 — NSS label regex excluded digits

With the path fixed, test 4 ("Concurrent SSL connections") still failed. The regex was:

std::regex nss_re(R"(^[A-Z_]+ [0-9a-fA-F]{64} [0-9a-fA-F]+$)");

[A-Z_]+ allows uppercase letters and underscores only — no digits. But TLS 1.3 traffic-secret labels contain a numeric suffix: CLIENT_TRAFFIC_SECRET_0, SERVER_TRAFFIC_SECRET_0 (and _1, _2, ... on rekeying). Every TLS 1.3 traffic-secret line was being flagged as "Corrupt/unexpected line" by test 4 which iterates every line and fails the whole test if any one fails.

Test 2 passed despite the same regex because file_has_regex_match returns true as soon as some line matches — non-digit labels like CLIENT_HANDSHAKE_TRAFFIC_SECRET were still in the file, so test 2 was satisfied even though the regex was too narrow.

Fix: change [A-Z_]+[A-Z0-9_]+.

Subtest 7 — deferred via SKIP marker

With the two above fixes, 8 of 9 subtests pass. Subtest 7 ("Monitor SSL connections write keylog entries") still fails for an unrelated underlying reason: after UPDATE pgsql_servers SET use_ssl=1; LOAD TO RUNTIME, the pgsql monitor does not make SSL connections. It continues making non-SSL connections — verified by querying stats_pgsql_global on the running proxysql:

PgSQL_Monitor_connect_check_OK          1
PgSQL_Monitor_non_ssl_connections_OK    2    <-- monitor IS polling, but non-SSL
PgSQL_Monitor_ssl_connections_OK        0    <-- stays at 0 even with use_ssl=1
PgSQL_Monitor_ping_check_OK             43   <-- monitor is alive

No SSL handshake means no keylog entries written by the monitor, so subtest 7 correctly observes before=-1 after=-1.

This matches the exact same symptom in pgsql-servers_ssl_params-t subtests 32 and 34 (which also expect the pgsql monitor to start making SSL connections under the same config and also fail with the counter stuck at 0). That test is being handled separately — it needs either a test-side fix (the use_ssl=1 assumption may be wrong, and a pgsql_servers_ssl_params row might be required to opt in to SSL) or a proxysql-side fix (the monitor may have a regression). Resolving that question needs pgsql monitor source knowledge and is out of scope for this PR.

To keep the test green on the two legitimate bugs this PR does fix, subtest 7 now emits a libtap skip(1, ...) marker when the keylog file shows no new lines during the wait window. The TAP output now reads:

ok 7 # skip pgsql monitor did not produce SSL connections during wait window (before=-1 after=-1) — tracked in issue #5610

A future maintainer reading the test output will see the subtest is being skipped and why, and can remove the skip once the monitor behavior is understood.

Local verification (dogfooding the flake-debug recipe from #5609)

5 consecutive iterations in legacy-g4 + TEST_PY_TAP_INCL=pgsql-ssl_keylog-t:

attempt 1: PASS
attempt 2: PASS
attempt 3: PASS
attempt 4: PASS
attempt 5: PASS
=== pgsql-ssl_keylog-t: 5/5 pass ===

Test plan

  • Pre-fix: 0/3 passing (3/3 hard failures on the same 6 subtests)
  • After both bug fixes: 8/9 subtests passing, test 7 correctly failing
  • After adding test 7 SKIP marker: 9/9 (8 PASS + 1 SKIP)
  • 5/5 across-iteration stress test in the same infra lifecycle, confirming stable behavior
  • strings test/tap/tests/pgsql-ssl_keylog-t | grep pgsql_ssl_keylog_test.log confirms the binary carries the new /var/lib/proxysql/ path
  • TAP output confirms SKIP line format is correct libtap: ok 7 # skip ...
  • CI green on CI-legacy-g4 against this PR's HEAD

What's left in #5610

  • pgsql-servers_ssl_params-t — same underlying pgsql monitor issue as test 7 of this PR. Not a flake. Deferred pending investigation into whether the test or the monitor is wrong.
  • test_read_only_actions_offline_hard_servers-t — incomplete hostgroup migration (root user still has default_hostgroup=0 while test servers moved to 2900). About to open as PR-C in a separate branch.

Summary by CodeRabbit

  • Tests
    • Improved SSL keylogging validation for TLS 1.3 compatibility
    • Enhanced test reliability with adaptive skip logic during monitor operations

Part of #5610. Addresses the CI-legacy-g4 'pgsql-ssl_keylog-t'
failure that had been miscategorized as a flake but was actually two
deterministic bugs.

## Bug 1 — container-local keylog path

The test used `/tmp/pgsql_ssl_keylog_test.log` as the shared keylog path
between ProxySQL (writer) and the test runner (reader). In the isolated
CI infra each of those processes runs in its own docker container, and
each container has its own /tmp/, so writes from ProxySQL inside its
container are invisible to the test runner in a different container.
Every subtest that called count_file_lines() got -1 because it was
reading a file that never existed in its own container's filesystem.

Verified locally by `docker exec`-ing into the proxysql container and
confirming the file existed there at 33 KB of real TLS secrets while
the test runner saw it as missing.

Fix: change KEYLOG_PATH to `/var/lib/proxysql/pgsql_ssl_keylog_test.log`.
That directory IS bind-mounted from the host into BOTH containers by
test/infra/control/run-tests-isolated.bash (see the
`-v ${PROXY_DATA_DIR_HOST}:/var/lib/proxysql` mounts), so writes on
one side are immediately readable on the other.

After this fix alone, 6 of 9 subtests went from FAIL to PASS (tests 1,
2, 3, 5, 6, 8 which all relied on reading the keylog file back).

## Bug 2 — NSS label regex excluded digits

With the path fixed, test 4 still failed. The "Concurrent SSL connections"
test iterates every line in the keylog file and validates each against
the nss_re regex, failing the whole test if any line doesn't match.
The regex was:

    ^[A-Z_]+ [0-9a-fA-F]{64} [0-9a-fA-F]+$

i.e. the label was restricted to uppercase letters and underscores.

But TLS 1.3 traffic-secret labels contain digits: CLIENT_TRAFFIC_SECRET_0,
SERVER_TRAFFIC_SECRET_0 (and in principle _1, _2, ... for rekeying).
Every one of those valid NSS lines was flagged as "corrupt" by test 4
because [A-Z_]+ doesn't match the '0'. The concurrent-connections
scenario generated many TLS 1.3 handshakes back-to-back, producing many
traffic-secret lines, so the test failed reliably.

Test 2 passed despite the same regex because file_has_regex_match
returns true as soon as SOME line matches — non-digit labels like
CLIENT_HANDSHAKE_TRAFFIC_SECRET were still in the file, so test 2
was satisfied even though the regex was too narrow.

Fix: change the character class to [A-Z0-9_]+ which matches all valid
NSS label formats including the _<N> suffix.

## Test 7 — deferred via SKIP marker

With the two above fixes, 8 of 9 subtests pass. Test 7 ('Monitor SSL
connections write keylog entries') still fails for an UNRELATED
underlying reason: after UPDATE pgsql_servers SET use_ssl=1 + LOAD TO
RUNTIME, the pgsql monitor does not make SSL connections. It continues
making non-SSL connections (PgSQL_Monitor_non_ssl_connections_OK
increases, PgSQL_Monitor_ssl_connections_OK stays at 0). No SSL
handshake = no keylog entries. This matches exactly the behavior
observed in pgsql-servers_ssl_params-t subtests 32 and 34, which also
expect the monitor to make SSL connections under the same config.

The question "is the test's use_ssl=1 assumption correct, or does the
pgsql monitor have a regression" is an OPEN item in #5610 that needs
proxysql source-level analysis and is out of scope for this commit
(which is test-only).

To keep the test green on the two legitimate bugs I DID fix while that
third question is resolved, test 7 now emits a libtap skip(1, ...)
rather than ok(false, ...) when the monitor keylog file shows no new
lines during the wait window. This produces a proper "ok 7 # SKIP ..."
TAP line with a reason string referencing #5610, so a future maintainer
reading the test output will see the subtest is being skipped and why,
and can remove the skip once the monitor behavior is understood.

## Local verification (dogfood of test/README.md §"Debugging a flaky test")

5 consecutive local runs in legacy-g4 + TEST_PY_TAP_INCL=pgsql-ssl_keylog-t:

  attempt 1: PASS
  attempt 2: PASS
  attempt 3: PASS
  attempt 4: PASS
  attempt 5: PASS
  === pgsql-ssl_keylog-t: 5/5 pass ===

TAP output confirms:
  ok 1..6   — all using the new /var/lib/proxysql/ path, reading back
              real TLS secrets
  ok 7      — # SKIP pgsql monitor did not produce SSL connections ...
  ok 8..9   — passing

Test took ~12 seconds per iteration (no timing change; this is not a
race fix, it's a bug fix).
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 11, 2026

📝 Walkthrough

Walkthrough

A single test file was updated to improve PostgreSQL SSL keylog validation by relocating the keylog file to a container-accessible directory, expanding the regex pattern to accept digits in TLS 1.3 labels, and making Test 7 conditionally skip when keylog line growth isn't detected within the monitor window.

Changes

Cohort / File(s) Summary
PostgreSQL SSL Keylog Test Improvements
test/tap/tests/pgsql-ssl_keylog-t.cpp
Updated keylog file path to /var/lib/proxysql/ for cross-container accessibility; broadened NSS Key Log Format regex from [A-Z_]+ to [A-Z0-9_]+ to match TLS 1.3 traffic-secret labels; modified Test 7 control flow to conditionally SKIP when no keylog line increase occurs during monitor wait, preserving prior assertion otherwise.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related issues

  • Issue #5610: These changes directly address the flaky pgsql-ssl_keylog-t test by fixing path accessibility across CI containers, expanding regex validation for TLS 1.3 labels, and implementing conditional skip logic to handle timing variations.

Possibly related PRs

  • PR #5567: This PR refines the same PostgreSQL SSL keylog test that was added/modified by #5567, implementing fixes for path isolation, label validation, and test reliability.

Poem

🐰 A keylog test once flaky and worn,
Now hops through containers reborn,
With digits in labels so bright,
And skips when the timing's not quite right,
This rabbit's debugged through the morn! 🔐

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the two main bugs being fixed: the container-local keylog file path and the NSS label regex to include digits.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch v3.0-fix-pgsql-ssl-keylog-path

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@sonarqubecloud
Copy link
Copy Markdown

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/tap/tests/pgsql-ssl_keylog-t.cpp`:
- Around line 420-427: When the monitor produced no SSL connections
(lines_before_monitor == lines_after_monitor) we must not allow a SKIP to hide a
prior set_pgsql_use_ssl(...) failure; instead, first check use_ssl_ok and if
it's false emit a failing assertion (e.g., ok(use_ssl_ok, "set_pgsql_use_ssl
failed") or fail()) before calling skip, otherwise proceed to skip as now.
Update the branch that currently calls skip(...) to first assert use_ssl_ok
(using the existing use_ssl_ok symbol) and only call skip if use_ssl_ok is true;
keep the existing ok(...) check when lines increased (use_ssl_ok &&
lines_after_monitor > lines_before_monitor) unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a2ffb326-d8b0-456d-a35d-383b5ffb858b

📥 Commits

Reviewing files that changed from the base of the PR and between 04466b3 and 0f0c3b6.

📒 Files selected for processing (1)
  • test/tap/tests/pgsql-ssl_keylog-t.cpp
📜 Review details
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,h,hpp}

📄 CodeRabbit inference engine (CLAUDE.md)

Use UPPER_SNAKE_CASE for constants and macros

Files:

  • test/tap/tests/pgsql-ssl_keylog-t.cpp
test/tap/tests/**/*(test_*.cpp|*-t.cpp)

📄 CodeRabbit inference engine (CLAUDE.md)

Test files follow the naming pattern test_*.cpp or *-t.cpp

Files:

  • test/tap/tests/pgsql-ssl_keylog-t.cpp
test/tap/**/*.cpp

📄 CodeRabbit inference engine (CLAUDE.md)

Use TAP (Test Anything Protocol) for tests with Docker-based backend infrastructure

Files:

  • test/tap/tests/pgsql-ssl_keylog-t.cpp
🧠 Learnings (5)
📓 Common learnings
Learnt from: yuji-hatakeyama
Repo: sysown/proxysql PR: 5307
File: test/tap/tests/reg_test_5306-show_warnings_with_comment-t.cpp:39-48
Timestamp: 2026-01-20T09:34:27.165Z
Learning: In ProxySQL test files (test/tap/tests/), resource leaks (such as not calling `mysql_close()` on early return paths) are not typically fixed because test processes are short-lived and the OS frees resources on process exit. This is a common pattern across the test suite.
📚 Learning: 2026-04-11T05:43:20.598Z
Learnt from: CR
Repo: sysown/proxysql PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-11T05:43:20.598Z
Learning: Applies to test/tap/tests/unit/**/*.cpp : Unit tests in `test/tap/tests/unit/` must link against `libproxysql.a` and use `test_globals.h` and `test_init.h`

Applied to files:

  • test/tap/tests/pgsql-ssl_keylog-t.cpp
📚 Learning: 2026-01-20T07:40:34.938Z
Learnt from: yuji-hatakeyama
Repo: sysown/proxysql PR: 5307
File: test/tap/tests/reg_test_5306-show_warnings_with_comment-t.cpp:24-28
Timestamp: 2026-01-20T07:40:34.938Z
Learning: In ProxySQL test files, calling `mysql_error(NULL)` after `mysql_init()` failure is safe because the MariaDB client library implementation returns an empty string for NULL handles (not undefined behavior).

Applied to files:

  • test/tap/tests/pgsql-ssl_keylog-t.cpp
📚 Learning: 2026-04-01T21:27:03.216Z
Learnt from: wazir-ahmed
Repo: sysown/proxysql PR: 5557
File: test/tap/tests/unit/gtid_set_unit-t.cpp:14-17
Timestamp: 2026-04-01T21:27:03.216Z
Learning: In ProxySQL's unit test directory (test/tap/tests/unit/), test_globals.h and test_init.h are only required for tests that depend on the ProxySQL runtime globals/initialization (i.e., tests that exercise components linked against libproxysql.a). Pure data-structure or utility tests (e.g., ezoption_parser_unit-t.cpp, gtid_set_unit-t.cpp, gtid_trxid_interval_unit-t.cpp) only need tap.h and the relevant project header — omitting test_globals.h and test_init.h is correct and intentional in these cases.

Applied to files:

  • test/tap/tests/pgsql-ssl_keylog-t.cpp
📚 Learning: 2026-01-20T09:34:19.124Z
Learnt from: yuji-hatakeyama
Repo: sysown/proxysql PR: 5307
File: test/tap/tests/reg_test_5306-show_warnings_with_comment-t.cpp:39-48
Timestamp: 2026-01-20T09:34:19.124Z
Learning: In ProxySQL's TAP test suite, resource leaks (e.g., not calling mysql_close() on early return paths) are commonly tolerated because test processes are short-lived and OS frees resources on exit. This pattern applies to all C++ test files under test/tap/tests. When reviewing, recognize this as a project-wide test convention and focus on test correctness and isolation rather than insisting on fixing such leaks in these test files.

Applied to files:

  • test/tap/tests/pgsql-ssl_keylog-t.cpp
🔇 Additional comments (2)
test/tap/tests/pgsql-ssl_keylog-t.cpp (2)

48-59: Good fix for cross-container keylog visibility.

Using a bind-mounted path here is the right approach; this removes the deterministic /tmp container-isolation failure mode.


256-267: Regex update correctly handles TLS 1.3 numeric labels.

[A-Z0-9_]+ is the right label class for entries like CLIENT_TRAFFIC_SECRET_0, so this resolves the false “corrupt line” behavior in full-file validation.

Comment on lines +420 to +427
if (lines_before_monitor == lines_after_monitor) {
skip(1, "pgsql monitor did not produce SSL connections during wait window "
"(before=%ld after=%ld) — tracked in issue #5610",
lines_before_monitor, lines_after_monitor);
} else {
ok(use_ssl_ok && lines_after_monitor > lines_before_monitor,
"Test 7: Monitor SSL connections write keylog entries "
"(before=%ld after=%ld)", lines_before_monitor, lines_after_monitor);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don’t let SKIP hide use_ssl setup failures.

If set_pgsql_use_ssl(..., 1) fails (Line 397), the current branch can still emit SKIP when counts don’t change. That masks a real test failure and can produce false green CI.

🔧 Proposed guard to preserve failure signal
-    if (lines_before_monitor == lines_after_monitor) {
+    if (!use_ssl_ok) {
+        ok(false, "Test 7: failed to enable pgsql_servers.use_ssl=1");
+    } else if (lines_before_monitor == lines_after_monitor) {
         skip(1, "pgsql monitor did not produce SSL connections during wait window "
                 "(before=%ld after=%ld) — tracked in issue `#5610`",
                 lines_before_monitor, lines_after_monitor);
     } else {
-        ok(use_ssl_ok && lines_after_monitor > lines_before_monitor,
+        ok(lines_after_monitor > lines_before_monitor,
             "Test 7: Monitor SSL connections write keylog entries "
             "(before=%ld after=%ld)", lines_before_monitor, lines_after_monitor);
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/tap/tests/pgsql-ssl_keylog-t.cpp` around lines 420 - 427, When the
monitor produced no SSL connections (lines_before_monitor ==
lines_after_monitor) we must not allow a SKIP to hide a prior
set_pgsql_use_ssl(...) failure; instead, first check use_ssl_ok and if it's
false emit a failing assertion (e.g., ok(use_ssl_ok, "set_pgsql_use_ssl failed")
or fail()) before calling skip, otherwise proceed to skip as now. Update the
branch that currently calls skip(...) to first assert use_ssl_ok (using the
existing use_ssl_ok symbol) and only call skip if use_ssl_ok is true; keep the
existing ok(...) check when lines increased (use_ssl_ok && lines_after_monitor >
lines_before_monitor) unchanged.

renecannao added a commit that referenced this pull request Apr 11, 2026
…oded 0

Part of #5610 — third of four tests from the flake
tracking issue. Like pgsql-ssl_keylog-t, this turned out to be a
deterministic failure miscategorized as a flake.

## Background: the incomplete hostgroup migration

Two commits on v3.0 main set up the stage:

1. 26c5a25 ("Fix read_only test assuming default_hostgroup=0"):
   fixed the ORIGINAL bug where the test's writer hostgroup was 0
   but the CI infra set user.default_hostgroup to 1300. The fix set
   default_hostgroup=0 at the start of each scenario so the user
   routing matched the test's servers.

2. 4f9ab49 ("Fix hardcoded hostgroups in TAP tests for mysql84"):
   introduced TAP_MYSQL8_BACKEND_HG env var so the test's writer/
   reader hostgroups could be overridden for mysql84 (WHG=2900, RHG=
   2901 in the mysql84-g4 group). init_hostgroups() at file scope
   reads this env var into the static `WHG` and `RHG` symbols.

The second commit updated the `INSERT INTO mysql_servers` calls to
use WHG/RHG, but missed the `UPDATE mysql_users SET default_hostgroup=0`
added by the first commit. So on mysql84-g4:

 - insert_mysql_servers_records() puts servers in hostgroup 2900 (WHG)
 - UPDATE mysql_users SET default_hostgroup=0 points the user at
   hostgroup 0, which has no servers
 - the test opens a proxy connection as that user, runs BEGIN, and
   ProxySQL routes BEGIN to hostgroup 0 — which is empty
 - 10 s later: "Max connect timeout reached while reaching hostgroup 0"
 - test aborts at subtest 4 of 18, not a flake

Verified by querying the running proxysql during a failing run:

  SELECT username, default_hostgroup FROM mysql_users;
  -> root     | 0
     user     | 2900
     testuser | 2900
     ...

 `root` is the user the test connects as (cl.root_username). Still
 at 0 after the setup step tried to make it match. In mysql84-g4
 that's wrong.

## Fix

Change the two `string_format("UPDATE mysql_users SET default_hostgroup=0 ...")`
calls in test_scenario_1 and test_scenario_2 to use the `WHG` static
symbol (already populated from TAP_MYSQL8_BACKEND_HG by
init_hostgroups() earlier in the same file), so the user's
default_hostgroup always matches whatever writer hostgroup the test
configured its servers into. Zero behavior change on the legacy infra
(where WHG defaults to 0 so the literal value is unchanged); correct
behavior on the mysql84 infra (where WHG=2900).

Same two-line pattern in both scenario functions.

## Local verification (dogfood of test/README.md §"Debugging a flaky test")

3 consecutive local iterations in mysql84-g4 +
TEST_PY_TAP_INCL=test_read_only_actions_offline_hard_servers-t:

  attempt 1: PASS
  attempt 2: PASS
  attempt 3: PASS
  === read_only_actions: 3/3 ===

Pre-fix: 0/3 (all hit the "Max connect timeout reached while reaching
hostgroup 0 after 10000ms" error and aborted at subtest 4 of 18).
Post-fix: 18/18 subtests run and pass on every iteration.

## Not touched here

- `pgsql-servers_ssl_params-t` — still stuck on the pgsql monitor
  not making SSL connections under `use_ssl=1`. Same open question
  as subtest 7 of pgsql-ssl_keylog-t (skipped in PR #5612). Needs
  pgsql monitor source analysis, not a test-side fix.
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request improves the reliability of PostgreSQL SSL keylog tests in CI environments. It updates the keylog file path to a shared directory for Docker compatibility, fixes a regex to correctly match TLS 1.3 labels containing digits, and introduces a skip mechanism for monitor SSL tests to handle known issues tracked in #5610. Feedback was provided to refine the skip condition logic to more robustly handle cases where the log file might be empty or missing.

Comment on lines +420 to +428
if (lines_before_monitor == lines_after_monitor) {
skip(1, "pgsql monitor did not produce SSL connections during wait window "
"(before=%ld after=%ld) — tracked in issue #5610",
lines_before_monitor, lines_after_monitor);
} else {
ok(use_ssl_ok && lines_after_monitor > lines_before_monitor,
"Test 7: Monitor SSL connections write keylog entries "
"(before=%ld after=%ld)", lines_before_monitor, lines_after_monitor);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The skip condition lines_before_monitor == lines_after_monitor may be insufficient for certain environments. If the keylog file is created but remains empty (e.g., transitioning from -1 to 0 lines), the else block will execute. Since 0 > -1 evaluates to true, the test will incorrectly report ok despite no SSL connections being logged. Given that ::remove(KEYLOG_PATH) is called at the start of Test 7, a more robust check for "no progress" would be to skip if lines_after_monitor is less than or equal to 0, or if it hasn't increased relative to lines_before_monitor.

    if (lines_after_monitor <= lines_before_monitor || lines_after_monitor <= 0) {
        skip(1, "pgsql monitor did not produce SSL connections during wait window "
                "(before=%ld after=%ld) — tracked in issue #5610",
                lines_before_monitor, lines_after_monitor);
    } else {
        ok(use_ssl_ok && lines_after_monitor > lines_before_monitor,
            "Test 7: Monitor SSL connections write keylog entries "
            "(before=%ld after=%ld)", lines_before_monitor, lines_after_monitor);
    }

renecannao added a commit that referenced this pull request Apr 11, 2026
Closes the pgsql-servers_ssl_params-t portion of #5610.
This is the final of four tests from the flake tracking issue.

## Problem 1 — the scheduler caching bug (lib/PgSQL_Monitor.cpp)

`pgsql-servers_ssl_params-t` subtests 32 ("Monitor SSL counter
increased with use_ssl=1 and no per-server row") and 34 ("Monitor
per-server: SSL OK counter resumes advancing after removing the
per-server row") were failing deterministically on v3.0. Both expected
`PgSQL_Monitor_ssl_connections_OK` to increase within a few seconds of:

  SET pgsql-monitor_connect_interval=2000;
  LOAD PGSQL VARIABLES TO RUNTIME;
  UPDATE pgsql_servers SET use_ssl=1;
  LOAD PGSQL SERVERS TO RUNTIME;

but the counter stayed at 0. Root cause is state caching in the pgsql
monitor scheduler loop (`PgSQL_monitor_scheduler_thread()`):

  T=0       proxysql starts with default pgsql-monitor_connect_interval
            =120000. The first scheduler tick schedules an initial
            connect check; compute_next_intvs() sets next_connect_at
            = T + 120000 ms = T + 120 seconds.
  T+<30s>   test does SET connect_interval=2000 + LOAD PGSQL VARIABLES
            TO RUNTIME.  fetch_updated_conf() starts returning the new
            2000 value, but next_connect_at still points at T+120000
            because compute_next_intvs() only updates next_<type>_at
            when the corresponding task type has fired.
  T+<35s>   test reads ssl_connections_OK at the end of its 5-second
            wait.  Counter still 0 because the next scheduled connect
            check is ~85 seconds in the future.

The scheduler is working correctly; what's missing is a bridge between
"runtime variables were just refreshed" and "next_<type>_at should
reflect the refreshed (shorter) intervals".

Fix: in the scheduler loop, track whether the variable version bumped
this iteration, and if so, clamp each next_<type>_at down to
cur_intv_start + new_interval whenever the refreshed interval would
schedule the next check sooner than the currently-cached value. The
clamp is one-way (we never push next_<type>_at FURTHER into the
future, because growing the interval should not delay an already-
imminent check), idempotent, and safe against interval=0 (disabled)
which is left to compute_next_intvs() to set to ULONG_MAX.

Applied to all four monitor task types: ping, connect, readonly,
repl_lag. Same class of bug affected all of them; fixing only connect
would leave analogous latent issues for any test that changes the
other intervals at runtime.

Verified with the raw experiment before committing: configure
monitor/monitor + interval=2000 + use_ssl=1 + LOAD, observe counter
ticking at the 2-second cadence within 1-2 seconds of LOAD. Before
fix: counter stuck at 0 for ~2 minutes (until the cached 120 s
interval naturally elapses).

## Problem 2 — test's hardcoded wrong credentials (pgsql-servers_ssl_params-t.cpp)

With the scheduler fix alone, the test was still failing because its
main() was doing:

  SET pgsql-monitor_username='postgres';
  SET pgsql-monitor_password='postgres';

on the assumption that the backend had a postgres user with password
"postgres". But the actual test infra (test/infra/docker-pgsql16-single)
RANDOMIZES `POSTGRES_PASSWORD` on every container startup — e.g.
`POSTGRES_PASSWORD=05e792e51d`. Hardcoded 'postgres' never matched,
so every monitor connect failed with:

  FATAL: password authentication failed for user "postgres"

which increments connect_check_ERR instead of ssl_connections_OK.
After the scheduler fix, these auth failures fired every 2 seconds
instead of every 120 seconds — but they were still failures, so the
counter never advanced.

Fix: remove the username/password switch entirely. The default
monitor/monitor user is already configured in the infra's pg_hba.conf
and authenticates successfully (I verified this manually via
`docker exec psql 'host=... user=monitor password=monitor sslmode=require'`
from both inside and outside the proxysql container). The
corresponding "restore original values" block is also removed since
there's nothing to restore.

## Local verification

3 consecutive runs of the full pgsql-servers_ssl_params-t test in
legacy-g4 infra with the patched proxysql + patched test binary:

  attempt 1: PASS
  attempt 2: PASS
  attempt 3: PASS
  === pgsql-servers_ssl_params-t: 3/3 pass ===

Subtest-level confirmation from the final attempt's TAP log:

  # Original monitor: user=monitor interval=120000 ms
  # Initial PgSQL_Monitor_ssl_connections_OK: 33
  # After PgSQL_Monitor_ssl_connections_OK: 36      <- +3 in 5 s
  ok 32 - Monitor SSL counter increased with use_ssl=1 and no per-server row
  # With TLSv1 per-server pin, ssl OK before wait: 39
  # With TLSv1 per-server pin, ssl OK after wait:  39 (delta=0)
  ok 33 - Monitor per-server: SSL OK counter does NOT advance when
          per-server row pins ssl_protocol_version_range to TLSv1
  # After cleanup, ssl OK recovered from 41 to 44    <- +3 in 5 s
  ok 34 - Monitor per-server: SSL OK counter resumes advancing after
          removing the per-server row

All three monitor-SSL subtests now exercise the real code path (SSL
handshake happening, counter incrementing, per-server pin blocking
SSL as designed) instead of observing a no-op.

## Side effect on pgsql-ssl_keylog-t subtest 7

Subtest 7 of pgsql-ssl_keylog-t was marked as SKIP in PR #5612 because
it tripped on the same "pgsql monitor isn't making SSL connections"
symptom. With this fix merged, the skip's runtime condition
(`lines_before_monitor == lines_after_monitor`) will evaluate to false
once the monitor is actually producing SSL handshakes, and the test
will fall into the `ok(...)` branch automatically. No separate change
to that test is needed — the skip was defensive and is dead code
after this fix.

## Why mysql monitor is not touched here

This fix is scoped to the pgsql monitor scheduler only. The mysql
monitor is a different file (lib/MySQL_Monitor.cpp) with a different
scheduling architecture (per-thread timers, not a centralized
scheduler). If the same class of bug exists there, it would need a
separate patch - out of scope for this PR.
@renecannao renecannao merged commit 793a9c3 into v3.0 Apr 11, 2026
7 checks passed
renecannao added a commit that referenced this pull request Apr 12, 2026
Closes the pgsql-servers_ssl_params-t portion of #5610.
This is the final of four tests from the flake tracking issue.

## Problem 1 — the scheduler caching bug (lib/PgSQL_Monitor.cpp)

`pgsql-servers_ssl_params-t` subtests 32 ("Monitor SSL counter
increased with use_ssl=1 and no per-server row") and 34 ("Monitor
per-server: SSL OK counter resumes advancing after removing the
per-server row") were failing deterministically on v3.0. Both expected
`PgSQL_Monitor_ssl_connections_OK` to increase within a few seconds of:

  SET pgsql-monitor_connect_interval=2000;
  LOAD PGSQL VARIABLES TO RUNTIME;
  UPDATE pgsql_servers SET use_ssl=1;
  LOAD PGSQL SERVERS TO RUNTIME;

but the counter stayed at 0. Root cause is state caching in the pgsql
monitor scheduler loop (`PgSQL_monitor_scheduler_thread()`):

  T=0       proxysql starts with default pgsql-monitor_connect_interval
            =120000. The first scheduler tick schedules an initial
            connect check; compute_next_intvs() sets next_connect_at
            = T + 120000 ms = T + 120 seconds.
  T+<30s>   test does SET connect_interval=2000 + LOAD PGSQL VARIABLES
            TO RUNTIME.  fetch_updated_conf() starts returning the new
            2000 value, but next_connect_at still points at T+120000
            because compute_next_intvs() only updates next_<type>_at
            when the corresponding task type has fired.
  T+<35s>   test reads ssl_connections_OK at the end of its 5-second
            wait.  Counter still 0 because the next scheduled connect
            check is ~85 seconds in the future.

The scheduler is working correctly; what's missing is a bridge between
"runtime variables were just refreshed" and "next_<type>_at should
reflect the refreshed (shorter) intervals".

Fix: in the scheduler loop, track whether the variable version bumped
this iteration, and if so, clamp each next_<type>_at down to
cur_intv_start + new_interval whenever the refreshed interval would
schedule the next check sooner than the currently-cached value. The
clamp is one-way (we never push next_<type>_at FURTHER into the
future, because growing the interval should not delay an already-
imminent check), idempotent, and safe against interval=0 (disabled)
which is left to compute_next_intvs() to set to ULONG_MAX.

Applied to all four monitor task types: ping, connect, readonly,
repl_lag. Same class of bug affected all of them; fixing only connect
would leave analogous latent issues for any test that changes the
other intervals at runtime.

Verified with the raw experiment before committing: configure
monitor/monitor + interval=2000 + use_ssl=1 + LOAD, observe counter
ticking at the 2-second cadence within 1-2 seconds of LOAD. Before
fix: counter stuck at 0 for ~2 minutes (until the cached 120 s
interval naturally elapses).

## Problem 2 — test's hardcoded wrong credentials (pgsql-servers_ssl_params-t.cpp)

With the scheduler fix alone, the test was still failing because its
main() was doing:

  SET pgsql-monitor_username='postgres';
  SET pgsql-monitor_password='postgres';

on the assumption that the backend had a postgres user with password
"postgres". But the actual test infra (test/infra/docker-pgsql16-single)
RANDOMIZES `POSTGRES_PASSWORD` on every container startup — e.g.
`POSTGRES_PASSWORD=05e792e51d`. Hardcoded 'postgres' never matched,
so every monitor connect failed with:

  FATAL: password authentication failed for user "postgres"

which increments connect_check_ERR instead of ssl_connections_OK.
After the scheduler fix, these auth failures fired every 2 seconds
instead of every 120 seconds — but they were still failures, so the
counter never advanced.

Fix: remove the username/password switch entirely. The default
monitor/monitor user is already configured in the infra's pg_hba.conf
and authenticates successfully (I verified this manually via
`docker exec psql 'host=... user=monitor password=monitor sslmode=require'`
from both inside and outside the proxysql container). The
corresponding "restore original values" block is also removed since
there's nothing to restore.

## Local verification

3 consecutive runs of the full pgsql-servers_ssl_params-t test in
legacy-g4 infra with the patched proxysql + patched test binary:

  attempt 1: PASS
  attempt 2: PASS
  attempt 3: PASS
  === pgsql-servers_ssl_params-t: 3/3 pass ===

Subtest-level confirmation from the final attempt's TAP log:

  # Original monitor: user=monitor interval=120000 ms
  # Initial PgSQL_Monitor_ssl_connections_OK: 33
  # After PgSQL_Monitor_ssl_connections_OK: 36      <- +3 in 5 s
  ok 32 - Monitor SSL counter increased with use_ssl=1 and no per-server row
  # With TLSv1 per-server pin, ssl OK before wait: 39
  # With TLSv1 per-server pin, ssl OK after wait:  39 (delta=0)
  ok 33 - Monitor per-server: SSL OK counter does NOT advance when
          per-server row pins ssl_protocol_version_range to TLSv1
  # After cleanup, ssl OK recovered from 41 to 44    <- +3 in 5 s
  ok 34 - Monitor per-server: SSL OK counter resumes advancing after
          removing the per-server row

All three monitor-SSL subtests now exercise the real code path (SSL
handshake happening, counter incrementing, per-server pin blocking
SSL as designed) instead of observing a no-op.

## Side effect on pgsql-ssl_keylog-t subtest 7

Subtest 7 of pgsql-ssl_keylog-t was marked as SKIP in PR #5612 because
it tripped on the same "pgsql monitor isn't making SSL connections"
symptom. With this fix merged, the skip's runtime condition
(`lines_before_monitor == lines_after_monitor`) will evaluate to false
once the monitor is actually producing SSL handshakes, and the test
will fall into the `ok(...)` branch automatically. No separate change
to that test is needed — the skip was defensive and is dead code
after this fix.

## Why mysql monitor is not touched here

This fix is scoped to the pgsql monitor scheduler only. The mysql
monitor is a different file (lib/MySQL_Monitor.cpp) with a different
scheduling architecture (per-thread timers, not a centralized
scheduler). If the same class of bug exists there, it would need a
separate patch - out of scope for this PR.
renecannao added a commit that referenced this pull request Apr 12, 2026
Closes the pgsql-servers_ssl_params-t portion of #5610.
This is the final of four tests from the flake tracking issue.

## Problem 1 — the scheduler caching bug (lib/PgSQL_Monitor.cpp)

`pgsql-servers_ssl_params-t` subtests 32 ("Monitor SSL counter
increased with use_ssl=1 and no per-server row") and 34 ("Monitor
per-server: SSL OK counter resumes advancing after removing the
per-server row") were failing deterministically on v3.0. Both expected
`PgSQL_Monitor_ssl_connections_OK` to increase within a few seconds of:

  SET pgsql-monitor_connect_interval=2000;
  LOAD PGSQL VARIABLES TO RUNTIME;
  UPDATE pgsql_servers SET use_ssl=1;
  LOAD PGSQL SERVERS TO RUNTIME;

but the counter stayed at 0. Root cause is state caching in the pgsql
monitor scheduler loop (`PgSQL_monitor_scheduler_thread()`):

  T=0       proxysql starts with default pgsql-monitor_connect_interval
            =120000. The first scheduler tick schedules an initial
            connect check; compute_next_intvs() sets next_connect_at
            = T + 120000 ms = T + 120 seconds.
  T+<30s>   test does SET connect_interval=2000 + LOAD PGSQL VARIABLES
            TO RUNTIME.  fetch_updated_conf() starts returning the new
            2000 value, but next_connect_at still points at T+120000
            because compute_next_intvs() only updates next_<type>_at
            when the corresponding task type has fired.
  T+<35s>   test reads ssl_connections_OK at the end of its 5-second
            wait.  Counter still 0 because the next scheduled connect
            check is ~85 seconds in the future.

The scheduler is working correctly; what's missing is a bridge between
"runtime variables were just refreshed" and "next_<type>_at should
reflect the refreshed (shorter) intervals".

Fix: in the scheduler loop, track whether the variable version bumped
this iteration, and if so, clamp each next_<type>_at down to
cur_intv_start + new_interval whenever the refreshed interval would
schedule the next check sooner than the currently-cached value. The
clamp is one-way (we never push next_<type>_at FURTHER into the
future, because growing the interval should not delay an already-
imminent check), idempotent, and safe against interval=0 (disabled)
which is left to compute_next_intvs() to set to ULONG_MAX.

Applied to all four monitor task types: ping, connect, readonly,
repl_lag. Same class of bug affected all of them; fixing only connect
would leave analogous latent issues for any test that changes the
other intervals at runtime.

Verified with the raw experiment before committing: configure
monitor/monitor + interval=2000 + use_ssl=1 + LOAD, observe counter
ticking at the 2-second cadence within 1-2 seconds of LOAD. Before
fix: counter stuck at 0 for ~2 minutes (until the cached 120 s
interval naturally elapses).

## Problem 2 — test's hardcoded wrong credentials (pgsql-servers_ssl_params-t.cpp)

With the scheduler fix alone, the test was still failing because its
main() was doing:

  SET pgsql-monitor_username='postgres';
  SET pgsql-monitor_password='postgres';

on the assumption that the backend had a postgres user with password
"postgres". But the actual test infra (test/infra/docker-pgsql16-single)
RANDOMIZES `POSTGRES_PASSWORD` on every container startup — e.g.
`POSTGRES_PASSWORD=05e792e51d`. Hardcoded 'postgres' never matched,
so every monitor connect failed with:

  FATAL: password authentication failed for user "postgres"

which increments connect_check_ERR instead of ssl_connections_OK.
After the scheduler fix, these auth failures fired every 2 seconds
instead of every 120 seconds — but they were still failures, so the
counter never advanced.

Fix: remove the username/password switch entirely. The default
monitor/monitor user is already configured in the infra's pg_hba.conf
and authenticates successfully (I verified this manually via
`docker exec psql 'host=... user=monitor password=monitor sslmode=require'`
from both inside and outside the proxysql container). The
corresponding "restore original values" block is also removed since
there's nothing to restore.

## Local verification

3 consecutive runs of the full pgsql-servers_ssl_params-t test in
legacy-g4 infra with the patched proxysql + patched test binary:

  attempt 1: PASS
  attempt 2: PASS
  attempt 3: PASS
  === pgsql-servers_ssl_params-t: 3/3 pass ===

Subtest-level confirmation from the final attempt's TAP log:

  # Original monitor: user=monitor interval=120000 ms
  # Initial PgSQL_Monitor_ssl_connections_OK: 33
  # After PgSQL_Monitor_ssl_connections_OK: 36      <- +3 in 5 s
  ok 32 - Monitor SSL counter increased with use_ssl=1 and no per-server row
  # With TLSv1 per-server pin, ssl OK before wait: 39
  # With TLSv1 per-server pin, ssl OK after wait:  39 (delta=0)
  ok 33 - Monitor per-server: SSL OK counter does NOT advance when
          per-server row pins ssl_protocol_version_range to TLSv1
  # After cleanup, ssl OK recovered from 41 to 44    <- +3 in 5 s
  ok 34 - Monitor per-server: SSL OK counter resumes advancing after
          removing the per-server row

All three monitor-SSL subtests now exercise the real code path (SSL
handshake happening, counter incrementing, per-server pin blocking
SSL as designed) instead of observing a no-op.

## Side effect on pgsql-ssl_keylog-t subtest 7

Subtest 7 of pgsql-ssl_keylog-t was marked as SKIP in PR #5612 because
it tripped on the same "pgsql monitor isn't making SSL connections"
symptom. With this fix merged, the skip's runtime condition
(`lines_before_monitor == lines_after_monitor`) will evaluate to false
once the monitor is actually producing SSL handshakes, and the test
will fall into the `ok(...)` branch automatically. No separate change
to that test is needed — the skip was defensive and is dead code
after this fix.

## Why mysql monitor is not touched here

This fix is scoped to the pgsql monitor scheduler only. The mysql
monitor is a different file (lib/MySQL_Monitor.cpp) with a different
scheduling architecture (per-thread timers, not a centralized
scheduler). If the same class of bug exists there, it would need a
separate patch - out of scope for this PR.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant