fix(pgsql-monitor): clamp next_*_at on LOAD PGSQL VARIABLES TO RUNTIME#5614
Conversation
Closes the pgsql-servers_ssl_params-t portion of #5610. This is the final of four tests from the flake tracking issue. ## Problem 1 — the scheduler caching bug (lib/PgSQL_Monitor.cpp) `pgsql-servers_ssl_params-t` subtests 32 ("Monitor SSL counter increased with use_ssl=1 and no per-server row") and 34 ("Monitor per-server: SSL OK counter resumes advancing after removing the per-server row") were failing deterministically on v3.0. Both expected `PgSQL_Monitor_ssl_connections_OK` to increase within a few seconds of: SET pgsql-monitor_connect_interval=2000; LOAD PGSQL VARIABLES TO RUNTIME; UPDATE pgsql_servers SET use_ssl=1; LOAD PGSQL SERVERS TO RUNTIME; but the counter stayed at 0. Root cause is state caching in the pgsql monitor scheduler loop (`PgSQL_monitor_scheduler_thread()`): T=0 proxysql starts with default pgsql-monitor_connect_interval =120000. The first scheduler tick schedules an initial connect check; compute_next_intvs() sets next_connect_at = T + 120000 ms = T + 120 seconds. T+<30s> test does SET connect_interval=2000 + LOAD PGSQL VARIABLES TO RUNTIME. fetch_updated_conf() starts returning the new 2000 value, but next_connect_at still points at T+120000 because compute_next_intvs() only updates next_<type>_at when the corresponding task type has fired. T+<35s> test reads ssl_connections_OK at the end of its 5-second wait. Counter still 0 because the next scheduled connect check is ~85 seconds in the future. The scheduler is working correctly; what's missing is a bridge between "runtime variables were just refreshed" and "next_<type>_at should reflect the refreshed (shorter) intervals". Fix: in the scheduler loop, track whether the variable version bumped this iteration, and if so, clamp each next_<type>_at down to cur_intv_start + new_interval whenever the refreshed interval would schedule the next check sooner than the currently-cached value. The clamp is one-way (we never push next_<type>_at FURTHER into the future, because growing the interval should not delay an already- imminent check), idempotent, and safe against interval=0 (disabled) which is left to compute_next_intvs() to set to ULONG_MAX. Applied to all four monitor task types: ping, connect, readonly, repl_lag. Same class of bug affected all of them; fixing only connect would leave analogous latent issues for any test that changes the other intervals at runtime. Verified with the raw experiment before committing: configure monitor/monitor + interval=2000 + use_ssl=1 + LOAD, observe counter ticking at the 2-second cadence within 1-2 seconds of LOAD. Before fix: counter stuck at 0 for ~2 minutes (until the cached 120 s interval naturally elapses). ## Problem 2 — test's hardcoded wrong credentials (pgsql-servers_ssl_params-t.cpp) With the scheduler fix alone, the test was still failing because its main() was doing: SET pgsql-monitor_username='postgres'; SET pgsql-monitor_password='postgres'; on the assumption that the backend had a postgres user with password "postgres". But the actual test infra (test/infra/docker-pgsql16-single) RANDOMIZES `POSTGRES_PASSWORD` on every container startup — e.g. `POSTGRES_PASSWORD=05e792e51d`. Hardcoded 'postgres' never matched, so every monitor connect failed with: FATAL: password authentication failed for user "postgres" which increments connect_check_ERR instead of ssl_connections_OK. After the scheduler fix, these auth failures fired every 2 seconds instead of every 120 seconds — but they were still failures, so the counter never advanced. Fix: remove the username/password switch entirely. The default monitor/monitor user is already configured in the infra's pg_hba.conf and authenticates successfully (I verified this manually via `docker exec psql 'host=... user=monitor password=monitor sslmode=require'` from both inside and outside the proxysql container). The corresponding "restore original values" block is also removed since there's nothing to restore. ## Local verification 3 consecutive runs of the full pgsql-servers_ssl_params-t test in legacy-g4 infra with the patched proxysql + patched test binary: attempt 1: PASS attempt 2: PASS attempt 3: PASS === pgsql-servers_ssl_params-t: 3/3 pass === Subtest-level confirmation from the final attempt's TAP log: # Original monitor: user=monitor interval=120000 ms # Initial PgSQL_Monitor_ssl_connections_OK: 33 # After PgSQL_Monitor_ssl_connections_OK: 36 <- +3 in 5 s ok 32 - Monitor SSL counter increased with use_ssl=1 and no per-server row # With TLSv1 per-server pin, ssl OK before wait: 39 # With TLSv1 per-server pin, ssl OK after wait: 39 (delta=0) ok 33 - Monitor per-server: SSL OK counter does NOT advance when per-server row pins ssl_protocol_version_range to TLSv1 # After cleanup, ssl OK recovered from 41 to 44 <- +3 in 5 s ok 34 - Monitor per-server: SSL OK counter resumes advancing after removing the per-server row All three monitor-SSL subtests now exercise the real code path (SSL handshake happening, counter incrementing, per-server pin blocking SSL as designed) instead of observing a no-op. ## Side effect on pgsql-ssl_keylog-t subtest 7 Subtest 7 of pgsql-ssl_keylog-t was marked as SKIP in PR #5612 because it tripped on the same "pgsql monitor isn't making SSL connections" symptom. With this fix merged, the skip's runtime condition (`lines_before_monitor == lines_after_monitor`) will evaluate to false once the monitor is actually producing SSL handshakes, and the test will fall into the `ok(...)` branch automatically. No separate change to that test is needed — the skip was defensive and is dead code after this fix. ## Why mysql monitor is not touched here This fix is scoped to the pgsql monitor scheduler only. The mysql monitor is a different file (lib/MySQL_Monitor.cpp) with a different scheduling architecture (per-thread timers, not a centralized scheduler). If the same class of bug exists there, it would need a separate patch - out of scope for this PR.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughScheduler now clamps next_*_at timestamps when runtime PgSQL thread variables are refreshed so reduced intervals take effect immediately; test setup was simplified to stop changing monitor credentials and only restore the connect interval after tests. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces logic to ensure that reductions in PostgreSQL monitor intervals take effect immediately by clamping the next scheduled execution times. It also updates the pgsql-servers_ssl_params-t.cpp test to use default credentials, avoiding failures in CI environments where passwords are randomized. Feedback indicates that the refresh check should be moved earlier in the scheduler loop to ensure prompt detection of configuration changes. Furthermore, the clamping logic should explicitly handle disabled tasks (interval set to 0) to prevent them from continuing to run under old schedules or entering high-frequency execution loops due to existing bugs in the interval calculation logic.
| if (PgSQL_Thread__variables_version < glover) { | ||
| PgSQL_Thread__variables_version = glover; | ||
| pgsql_thread->refresh_variables(); | ||
| vars_refreshed = true; | ||
| } |
There was a problem hiding this comment.
The variable refresh logic is currently placed inside the if (cur_intv_start >= closest_intv) block. This means that if all monitoring tasks have long intervals (e.g., several minutes), a change to a shorter interval via LOAD PGSQL VARIABLES TO RUNTIME will not be detected until the shortest of the old intervals expires.
To ensure that interval changes are picked up promptly (within the scheduler's 500ms sleep cycle), this refresh check and the subsequent clamping logic should be moved outside and before the closest_intv calculation (around line 2511).
| if (vars_refreshed) { | ||
| if (tasks_conf.ping.params.interval > 0) { | ||
| const uint64_t clamped = cur_intv_start + tasks_conf.ping.params.interval; | ||
| if (next_intvs.next_ping_at > clamped) { | ||
| next_intvs.next_ping_at = clamped; | ||
| } | ||
| } | ||
| if (tasks_conf.connect.params.interval > 0) { | ||
| const uint64_t clamped = cur_intv_start + tasks_conf.connect.params.interval; | ||
| if (next_intvs.next_connect_at > clamped) { | ||
| next_intvs.next_connect_at = clamped; | ||
| } | ||
| } | ||
| if (tasks_conf.readonly.params.interval > 0) { | ||
| const uint64_t clamped = cur_intv_start + tasks_conf.readonly.params.interval; | ||
| if (next_intvs.next_readonly_at > clamped) { | ||
| next_intvs.next_readonly_at = clamped; | ||
| } | ||
| } | ||
| if (tasks_conf.repl_lag.params.interval > 0) { | ||
| const uint64_t clamped = cur_intv_start + tasks_conf.repl_lag.params.interval; | ||
| if (next_intvs.next_repl_lag_at > clamped) { | ||
| next_intvs.next_repl_lag_at = clamped; | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
The clamping logic currently only handles cases where the new interval is greater than zero. If an interval is set to 0 (disabled) at runtime, the task will continue to fire at its old interval until it expires. Furthermore, due to a bug in compute_next_intvs (where the else block at line 2449 is unreachable), it may even continue to fire every loop iteration once the old interval expires.
It is recommended to explicitly handle the interval == 0 case here by setting the next execution time to ULONG_MAX to stop the task immediately. This also makes the claim in the comment at lines 2571-2573 accurate.
References
- Ensure that runtime configuration changes, especially those disabling features, take effect immediately and do not leave the system in an inconsistent or high-load state.
There was a problem hiding this comment.
🧹 Nitpick comments (2)
test/tap/tests/pgsql-servers_ssl_params-t.cpp (1)
663-668: Consider removing unusedoriginal_monitor_username.The variable
original_monitor_usernameis read and logged viadiag()but never used for restoration since credentials are no longer modified. While it provides useful diagnostic context, it could be removed to simplify the code if desired.This is a minor observation and the current code is perfectly acceptable for diagnostic purposes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/tap/tests/pgsql-servers_ssl_params-t.cpp` around lines 663 - 668, The test declares original_monitor_username via exec_scalar and passes it to diag but never uses it for restoration; remove the unused variable and its retrieval call (original_monitor_username and the exec_scalar(...) call) and update the diag call in the pgsql-servers_ssl_params-t.cpp test (remove the username %s placeholder and argument) or remove the entire diag line so only original_connect_interval from getConnectInterval(a) is logged; reference symbols: original_monitor_username, exec_scalar, diag, getConnectInterval.lib/PgSQL_Monitor.cpp (1)
2574-2599: Consider extracting the repetitive clamping pattern.The clamping logic is correct and accomplishes the one-way min() semantics as described. The implementation is clear but repetitive across all four task types.
A helper lambda or small function could reduce duplication:
auto clamp_next = [&](uint64_t& next_at, int32_t interval) { if (interval > 0) { const uint64_t clamped = cur_intv_start + interval; if (next_at > clamped) { next_at = clamped; } } }; clamp_next(next_intvs.next_ping_at, tasks_conf.ping.params.interval); clamp_next(next_intvs.next_connect_at, tasks_conf.connect.params.interval); clamp_next(next_intvs.next_readonly_at, tasks_conf.readonly.params.interval); clamp_next(next_intvs.next_repl_lag_at, tasks_conf.repl_lag.params.interval);That said, the current explicit form is easy to understand and maintain, so this is purely a stylistic suggestion.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/PgSQL_Monitor.cpp` around lines 2574 - 2599, Extract the repeated clamping into a small helper (lambda or static inline function) and call it from inside the vars_refreshed block: create e.g. auto clamp_next = [&](uint64_t& next_at, int32_t interval){ if (interval>0){ const uint64_t clamped = cur_intv_start + interval; if (next_at > clamped) next_at = clamped; }} and replace the four duplicated blocks with clamp_next(next_intvs.next_ping_at, tasks_conf.ping.params.interval); clamp_next(next_intvs.next_connect_at, tasks_conf.connect.params.interval); clamp_next(next_intvs.next_readonly_at, tasks_conf.readonly.params.interval); clamp_next(next_intvs.next_repl_lag_at, tasks_conf.repl_lag.params.interval); ensure the helper captures cur_intv_start or accepts it as a parameter as needed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@lib/PgSQL_Monitor.cpp`:
- Around line 2574-2599: Extract the repeated clamping into a small helper
(lambda or static inline function) and call it from inside the vars_refreshed
block: create e.g. auto clamp_next = [&](uint64_t& next_at, int32_t interval){
if (interval>0){ const uint64_t clamped = cur_intv_start + interval; if (next_at
> clamped) next_at = clamped; }} and replace the four duplicated blocks with
clamp_next(next_intvs.next_ping_at, tasks_conf.ping.params.interval);
clamp_next(next_intvs.next_connect_at, tasks_conf.connect.params.interval);
clamp_next(next_intvs.next_readonly_at, tasks_conf.readonly.params.interval);
clamp_next(next_intvs.next_repl_lag_at, tasks_conf.repl_lag.params.interval);
ensure the helper captures cur_intv_start or accepts it as a parameter as
needed.
In `@test/tap/tests/pgsql-servers_ssl_params-t.cpp`:
- Around line 663-668: The test declares original_monitor_username via
exec_scalar and passes it to diag but never uses it for restoration; remove the
unused variable and its retrieval call (original_monitor_username and the
exec_scalar(...) call) and update the diag call in the
pgsql-servers_ssl_params-t.cpp test (remove the username %s placeholder and
argument) or remove the entire diag line so only original_connect_interval from
getConnectInterval(a) is logged; reference symbols: original_monitor_username,
exec_scalar, diag, getConnectInterval.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 94b108d0-5d28-4b6d-9bc5-4c2276c361aa
📒 Files selected for processing (2)
lib/PgSQL_Monitor.cpptest/tap/tests/pgsql-servers_ssl_params-t.cpp
📜 Review details
🧰 Additional context used
📓 Path-based instructions (5)
lib/**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (CLAUDE.md)
lib/**/*.{cpp,h,hpp}: Write class names in PascalCase with protocol prefixes (MySQL_,PgSQL_,ProxySQL_)
Use snake_case for member variables
Require C++17 for all code
Use conditional compilation via#ifdef PROXYSQLGENAI,#ifdef PROXYSQL31, and similar feature flags for feature tiers
Use RAII for resource management
Use jemalloc for memory allocation
Use Pthread mutexes for synchronization
Usestd::atomic<>for counters
Files:
lib/PgSQL_Monitor.cpp
**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (CLAUDE.md)
Use UPPER_SNAKE_CASE for constants and macros
Files:
lib/PgSQL_Monitor.cpptest/tap/tests/pgsql-servers_ssl_params-t.cpp
lib/**/*.cpp
📄 CodeRabbit inference engine (CLAUDE.md)
One class per file typically in the library
Files:
lib/PgSQL_Monitor.cpp
test/tap/tests/**/*(test_*.cpp|*-t.cpp)
📄 CodeRabbit inference engine (CLAUDE.md)
Test files follow the naming pattern
test_*.cppor*-t.cpp
Files:
test/tap/tests/pgsql-servers_ssl_params-t.cpp
test/tap/**/*.cpp
📄 CodeRabbit inference engine (CLAUDE.md)
Use TAP (Test Anything Protocol) for tests with Docker-based backend infrastructure
Files:
test/tap/tests/pgsql-servers_ssl_params-t.cpp
🧠 Learnings (2)
📚 Learning: 2026-04-11T05:43:20.598Z
Learnt from: CR
Repo: sysown/proxysql PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-11T05:43:20.598Z
Learning: Applies to test/tap/**/*.cpp : Use TAP (Test Anything Protocol) for tests with Docker-based backend infrastructure
Applied to files:
test/tap/tests/pgsql-servers_ssl_params-t.cpp
📚 Learning: 2026-01-20T09:34:19.124Z
Learnt from: yuji-hatakeyama
Repo: sysown/proxysql PR: 5307
File: test/tap/tests/reg_test_5306-show_warnings_with_comment-t.cpp:39-48
Timestamp: 2026-01-20T09:34:19.124Z
Learning: In ProxySQL's TAP test suite, resource leaks (e.g., not calling mysql_close() on early return paths) are commonly tolerated because test processes are short-lived and OS frees resources on exit. This pattern applies to all C++ test files under test/tap/tests. When reviewing, recognize this as a project-wide test convention and focus on test correctness and isolation rather than insisting on fixing such leaks in these test files.
Applied to files:
test/tap/tests/pgsql-servers_ssl_params-t.cpp
🔇 Additional comments (4)
lib/PgSQL_Monitor.cpp (2)
2536-2540: LGTM! Clean flag tracking for variable refresh.The
vars_refreshedflag correctly captures whether runtime variables were reloaded, enabling the subsequent clamping logic to fire only when needed.
2546-2573: Excellent documentation of the subtle scheduling behavior.The comment thoroughly explains the root cause (cached
next_<type>_atvalues not being recomputed until the old cycle fires) and the fix rationale. The worked example with timestamps makes the problem concrete and easy to understand for future maintainers.test/tap/tests/pgsql-servers_ssl_params-t.cpp (2)
651-662: Good documentation of the CI credential issue.The comment clearly explains why the test no longer manipulates monitor credentials - the CI infrastructure randomizes
POSTGRES_PASSWORD, making hardcoded credentials unreliable. This prevents future maintainers from reintroducing the same bug.
677-680: LGTM! Restoration scope correctly narrowed.The restoration logic now only restores
pgsql-monitor_connect_intervalsince that's the only variable the test modifies. This aligns with the decision not to touch credentials.
- Add proxy_debug(PROXY_DEBUG_MONITOR, 5) calls when clamping fires, so the fix is observable in debug logs. Follows existing pattern from nearby scheduling debug output. - Rename original_monitor_username to monitor_username and clarify it is for diagnostic logging only, since the variable is no longer used for restore after the credential change was removed.
|



Closes the
pgsql-servers_ssl_params-tline item on #5610. Fourth and final of the four tests from that issue.This PR is code + test together because the scheduler fix alone wouldn't make the test pass, and the test fix alone is useless without the scheduler fix. Reviewable in two distinct blocks below.
1 / 2 —
lib/PgSQL_Monitor.cpp: scheduler clamp on interval changeProblem
pgsql-servers_ssl_params-tsubtests 32 and 34 expectPgSQL_Monitor_ssl_connections_OKto increase a few seconds after:but the counter stays at 0 for ~2 minutes. Root cause is state caching in
PgSQL_monitor_scheduler_thread():connect_interval=120000. Scheduler's first tick schedules initial connect check;compute_next_intvs()setsnext_connect_at = T + 120000 ms.SET connect_interval=2000; LOAD PGSQL VARIABLES TO RUNTIME;.fetch_updated_conf()now returnsinterval=2000on next iteration, BUTnext_connect_atstill points atT+120s.ssl_connections_OK— still 0 because the next scheduled connect check is ~85 seconds away.compute_next_intvs()only advancesnext_<type>_atwhen the type has just fired. It never recomputes in response to a variable change, so shortening an interval at runtime has zero effect until the already-scheduled (long) cycle naturally elapses.Fix
In the scheduler loop, after
fetch_updated_conf()when runtime variables were just refreshed (tracked by comparingPgSQL_Thread__variables_versionagainst the global version), clamp eachnext_<type>_atdown tocur_intv_start + new_intervalwhen that would schedule the next check sooner than the currently-cached value:Properties:
next_<type>_at, never pushes it out. Growing the interval at runtime does not delay an already-imminent check.interval == 0) are skipped and continue to be handled bycompute_next_intvs()which sets them toULONG_MAX.connectwould leave latent issues for tests that changeping_interval,read_only_interval, orreplication_lag_intervalat runtime.Raw verification (before the test fix)
Direct experiment against the patched binary, no test file involved:
Before the fix:
ssl_connections_OKstuck at 0 for ~2 minutes.After the fix:
ssl_connections_OKgoes0 → 2 → 6over 2 seconds, matching the 2000 ms interval × 2 monitor threads.2 / 2 —
test/tap/tests/pgsql-servers_ssl_params-t.cpp: drop hardcodedpostgres/postgrescredentialsProblem
With only the scheduler fix applied, the test was still failing because its
main()did:on the assumption that the backend had a postgres user with password
"postgres". But the actual CI infra (test/infra/docker-pgsql16-single) randomizesPOSTGRES_PASSWORDon every container startup:So the hardcoded
"postgres"password never authenticated, and every monitor connect failed with:These auth failures incremented
connect_check_ERRinstead ofssl_connections_OK. After the scheduler fix, the auth failures fire every 2 seconds (proof the scheduler fix works!) but they're still failures, so the counter the test is checking never advances.Fix
Remove the monitor username/password switch entirely. The default
monitor/monitoruser is already configured in the infra'spg_hba.confand authenticates successfully (verified manually viadocker exec psql 'host=... user=monitor password=monitor sslmode=require'from both inside and outside the proxysql container). Also remove the paired "restore original values" block since there's nothing to restore.Local verification (both fixes together)
3 consecutive iterations of the full test in fresh
legacy-g4infra using thetest/README.md§"Debugging a flaky test" recipe:Subtest-level confirmation from the final attempt's TAP log:
All three monitor-SSL subtests now exercise the real code path (SSL handshake, counter incrementing, per-server pin blocking SSL as designed). Pre-fix the counters were observing a no-op and the test was deterministically failing.
Side effect on PR #5612
Subtest 7 of
pgsql-ssl_keylog-twas marked as SKIP in #5612 because it tripped on the same "pgsql monitor isn't making SSL connections" symptom. With this PR merged, the skip's runtime condition (lines_before_monitor == lines_after_monitor) evaluates to false once the monitor is actually producing SSL handshakes, and the test falls into the realok(...)branch automatically. No separate change topgsql-ssl_keylog-tis needed — the skip was defensive and becomes dead code after this fix.Scope
lib/PgSQL_Monitor.cpp). The mysql monitor (lib/MySQL_Monitor.cpp) has a different scheduling architecture (per-thread timers, not a centralized scheduler) and may or may not have the same class of bug — out of scope for this PR. Anyone investigating similar symptoms on the mysql side should use this fix as a template but verify the scheduling model before porting.pgsql-servers_ssl_params-t.cpp.Test plan
SET + LOADdelta=0— the fix does not accidentally make failing cases "succeed"CI-legacy-g4against this PR's HEADpgsql-ssl_keylog-tsubtest 7 passing (no longer hitting its skip fallback), confirming the side-effect on PR test(pgsql-ssl_keylog): fix container-local path + NSS label regex #5612 is realCloses
pgsql-servers_ssl_params-t(4th and final test from the tracking issue; test(flush_logs): harden timing race — poll instead of fixed sleep #5611, test(pgsql-ssl_keylog): fix container-local path + NSS label regex #5612, test(read_only_offline_hard): set default_hostgroup to WHG, not hardcoded 0 #5613 handled the other three)Summary by CodeRabbit