test(flush_logs): harden timing race — poll instead of fixed sleep#5611
test(flush_logs): harden timing race — poll instead of fixed sleep#5611renecannao merged 1 commit intov3.0from
Conversation
Part of #5610 (flaky TAP tests revealed by cascade fix). Addresses the CI-legacy-g3 'test_flush_logs-t' flake that failed on some PR #5596 commits but passed on others despite identical code. ## Root cause The test measures "does command X cause proxysql.log to gain one new '[INFO] ProxySQL version' marker line". The old flow used fixed sleeps: BASELINE=$(fn_get_rotations) # (includes sleep 1) fn_padmin "PROXYSQL FLUSH LOGS;" RES=$(fn_get_rotations) # (sleep 1, then count) fn_check_res # strict equality check For the SIGUSR1 subtest the flow was worse because the signal had to travel through a "Scheduler Hack" (fn_signal inserts a scheduler row that runs `kill -USR1 $(pidof proxysql)` every 2000 ms, waits 4 s, then deletes the row). Total wall-clock budget from signal scheduling to log read: 5 s including the inner fn_get_rotations sleep. On a slow / contended CI runner any one of three asynchronous stages can slip past that 5 s window: 1. The scheduler's first interval fires slightly late. 2. The SIGUSR1 handler in proxysql queues the log rotation on a worker thread; the actual rotation + re-open + banner write happens after the signal is delivered, not synchronously. 3. The host-side read of proxysql.log via the docker volume bind-mount can lag the in-container write by up to a second on Azure-hosted runners under load. When any of those raced past 5 s, fn_get_rotations returned the pre-rotation count, fn_check_res saw `RES != BASELINE+1`, the test flagged a failure, and CI went red. Same code ran fine on the next retrigger because the race was environmental, not logical. A second, subtler issue: the strict `-ne` equality check in fn_check_res would ALSO fail if the scheduler fired twice before the script got around to deleting its row (interval_ms was 2000, sleep was 4 s - one extra fire was a coin flip on fast runners). One signal is the contract; two rotations from one signal is still the correct observable answer for "did this trigger at least one rotation". ## Fix Two changes, both in the .sh (the -t file is a gitignored copy regenerated by `make sh-test_flush_logs-t.sh`, which itself just `cp`s the .sh — same rule applies to every other `*-t.sh` test): 1. Replace fixed sleeps with bounded polling: - fn_get_rotations no longer has an internal `sleep 1` — it is now a pure "read the current count" function. - New fn_wait_for_rotation_at_least <target> <max_seconds> polls fn_get_rotations every 500 ms until the count reaches the target or the timeout expires. On success it prints the observed count and returns 0; on timeout it prints whatever it last saw and returns 1, so fn_check_res still reports a useful "BASELINE X expected Y got Z" error message. - Max budget raised to 30 s (was 5 s). On a fast machine the poll still terminates within one 500 ms cycle, so total test time on the local dev workstation dropped from ~9 s to ~1 s. 2. Accept "at least BASELINE+1" instead of "exactly BASELINE+1": - fn_check_res uses `-lt` instead of `-ne`. - Test semantics: "did the command trigger at least one rotation". Two rotations from one signal is still a pass (happens only in the scheduler-hack corner, when the scheduler fires twice before fn_signal_cleanup tears it down; harmless to proxysql behavior). - Also drops the scheduler interval from 2000 ms to 500 ms so the first fire happens quickly under normal conditions. 3. Split fn_signal into fn_signal (install only) and fn_signal_cleanup (tear down the scheduler row). The caller now: fn_signal SIGUSR1 RES=$(fn_wait_for_rotation_at_least $((BASELINE+1)) 30) fn_signal_cleanup fn_check_res ... which is race-free: the cleanup runs AFTER the rotation has been observed, so the scheduler row exists only for the bounded window between install and observation. ## Local verification Reproduced the original PR #5596 failure context (legacy-g3 group with only test_flush_logs-t enabled via TEST_PY_TAP_INCL), ran 10 back-to-back iterations against a single infra lifecycle following the recipe in test/README.md §"Debugging a flaky test": for i in 1..10: ./test/infra/control/run-tests-isolated.bash Results: - 10/10 pass - Per-attempt wall time dropped from ~9 s to ~1 s - Confirmed new code path active via the new "got N (>= BASELINE+1)" message format in the captured TAP output ## What's still TODO in #5610 The other three flaky tests from the issue are C++ and each needs its own targeted investigation. They are NOT touched here: - pgsql-servers_ssl_params-t - pgsql-ssl_keylog-t - test_read_only_actions_offline_hard_servers-t Deliberately keeping this PR small - one test per PR so each fix can be reviewed and validated independently.
📝 WalkthroughWalkthroughThis change refactors a test script to replace fixed timing delays with dynamic polling mechanisms. The scheduler interval for signal handling is reduced from 2000ms to 500ms, and new helper functions enable waiting for rotation count changes and cleaning up test state. Assertion criteria are relaxed from exact equality to allowing any count above baseline. Changes
Possibly related issues
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
There was a problem hiding this comment.
Code Review
This pull request improves the reliability of the test_flush_logs-t.sh test by replacing fixed sleeps with a polling mechanism and relaxing strict equality checks to prevent flakiness on slow CI runners. It also introduces a dedicated cleanup function for the scheduler hack. Review feedback suggests optimizing database interactions by combining INSERT/DELETE and LOAD commands into single calls and ensuring robust numeric parsing by stripping whitespace from command output.
| fn_padmin "INSERT OR REPLACE INTO scheduler (id, active, interval_ms, filename, arg1, arg2) VALUES (9999, 1, 500, '/bin/sh', '-c', 'kill -${sig#SIG} \$(pidof proxysql)');" | ||
| fn_padmin "LOAD SCHEDULER TO RUNTIME;" |
There was a problem hiding this comment.
Combining the INSERT and LOAD statements into a single fn_padmin call is more efficient as it avoids the overhead of establishing multiple database connections and invoking the mysql client process twice.
| fn_padmin "INSERT OR REPLACE INTO scheduler (id, active, interval_ms, filename, arg1, arg2) VALUES (9999, 1, 500, '/bin/sh', '-c', 'kill -${sig#SIG} \$(pidof proxysql)');" | |
| fn_padmin "LOAD SCHEDULER TO RUNTIME;" | |
| fn_padmin "INSERT OR REPLACE INTO scheduler (id, active, interval_ms, filename, arg1, arg2) VALUES (9999, 1, 500, '/bin/sh', '-c', 'kill -${sig#SIG} \$(pidof proxysql)'); LOAD SCHEDULER TO RUNTIME;" |
| fn_padmin "DELETE FROM scheduler WHERE id=9999;" >/dev/null 2>&1 || true | ||
| fn_padmin "LOAD SCHEDULER TO RUNTIME;" >/dev/null 2>&1 || true |
There was a problem hiding this comment.
Similar to the fn_signal function, combining the DELETE and LOAD statements here reduces process and connection overhead during the test cleanup phase.
| fn_padmin "DELETE FROM scheduler WHERE id=9999;" >/dev/null 2>&1 || true | |
| fn_padmin "LOAD SCHEDULER TO RUNTIME;" >/dev/null 2>&1 || true | |
| fn_padmin "DELETE FROM scheduler WHERE id=9999; LOAD SCHEDULER TO RUNTIME;" >/dev/null 2>&1 || true |
| local count=0 | ||
| local max_iterations=$(( max_seconds * 2 )) | ||
| while [ $i -lt $max_iterations ]; do | ||
| count=$(fn_get_rotations) |
There was a problem hiding this comment.
The output of wc -l (used inside fn_get_rotations) often contains leading or trailing whitespace depending on the platform's implementation. Stripping this whitespace ensures that the count variable contains a clean integer, which makes the subsequent numeric comparisons and TAP output more robust.
| count=$(fn_get_rotations) | |
| count=$(fn_get_rotations | tr -d '[:space:]') |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/tap/tests/test_flush_logs-t.sh`:
- Around line 92-98: The test installs a scheduler row id=9999 that can remain
active if the script exits or is interrupted; add a best-effort cleanup from
fn_exit so that the scheduler row and runtime are removed even on unexpected
exits. Modify or ensure the EXIT/INT/TERM trap calls fn_signal_cleanup (or add a
call to fn_signal_cleanup inside fn_exit) and make fn_signal_cleanup resilient
(silently ignore failures) when calling fn_padmin "DELETE FROM scheduler WHERE
id=9999;" and "LOAD SCHEDULER TO RUNTIME;" so the 500ms job won't continue
sending SIGUSR1 to ProxySQL after the test exits.
- Around line 135-149: The polling loop computes max_iterations as max_seconds *
2 which causes the last check to occur at max_seconds - 0.5, missing events that
happen right before the advertised timeout; update the max_iterations
calculation to allow one extra half-second iteration (e.g., add 1) so the final
fn_get_rotations check runs at max_seconds, ensuring the loop (variables
max_iterations, max_seconds, i and the fn_get_rotations call) will observe
rotations that land just before the timeout.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 267ebfd1-e1fd-48d4-b86b-395f410f585e
📒 Files selected for processing (1)
test/tap/tests/test_flush_logs-t.sh
📜 Review details
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: yuji-hatakeyama
Repo: sysown/proxysql PR: 5307
File: test/tap/tests/reg_test_5306-show_warnings_with_comment-t.cpp:39-48
Timestamp: 2026-01-20T09:34:27.165Z
Learning: In ProxySQL test files (test/tap/tests/), resource leaks (such as not calling `mysql_close()` on early return paths) are not typically fixed because test processes are short-lived and the OS frees resources on process exit. This is a common pattern across the test suite.
📚 Learning: 2026-04-01T21:27:03.216Z
Learnt from: wazir-ahmed
Repo: sysown/proxysql PR: 5557
File: test/tap/tests/unit/gtid_set_unit-t.cpp:14-17
Timestamp: 2026-04-01T21:27:03.216Z
Learning: In ProxySQL's unit test directory (test/tap/tests/unit/), test_globals.h and test_init.h are only required for tests that depend on the ProxySQL runtime globals/initialization (i.e., tests that exercise components linked against libproxysql.a). Pure data-structure or utility tests (e.g., ezoption_parser_unit-t.cpp, gtid_set_unit-t.cpp, gtid_trxid_interval_unit-t.cpp) only need tap.h and the relevant project header — omitting test_globals.h and test_init.h is correct and intentional in these cases.
Applied to files:
test/tap/tests/test_flush_logs-t.sh
📚 Learning: 2026-01-20T09:34:27.165Z
Learnt from: yuji-hatakeyama
Repo: sysown/proxysql PR: 5307
File: test/tap/tests/reg_test_5306-show_warnings_with_comment-t.cpp:39-48
Timestamp: 2026-01-20T09:34:27.165Z
Learning: In ProxySQL test files (test/tap/tests/), resource leaks (such as not calling `mysql_close()` on early return paths) are not typically fixed because test processes are short-lived and the OS frees resources on process exit. This is a common pattern across the test suite.
Applied to files:
test/tap/tests/test_flush_logs-t.sh
📚 Learning: 2026-01-20T07:40:34.938Z
Learnt from: yuji-hatakeyama
Repo: sysown/proxysql PR: 5307
File: test/tap/tests/reg_test_5306-show_warnings_with_comment-t.cpp:24-28
Timestamp: 2026-01-20T07:40:34.938Z
Learning: In ProxySQL test files, calling `mysql_error(NULL)` after `mysql_init()` failure is safe because the MariaDB client library implementation returns an empty string for NULL handles (not undefined behavior).
Applied to files:
test/tap/tests/test_flush_logs-t.sh
🪛 Shellcheck (0.11.0)
test/tap/tests/test_flush_logs-t.sh
[info] 139-139: Double quote to prevent globbing and word splitting.
(SC2086)
[info] 148-148: Double quote to prevent globbing and word splitting.
(SC2086)
[style] 153-153:
(SC2004)
[info] 154-154: Double quote to prevent globbing and word splitting.
(SC2086)
[style] 160-160:
(SC2004)
[style] 161-161:
(SC2004)
[style] 162-162:
(SC2004)
[style] 176-176:
(SC2004)
[style] 182-182:
(SC2004)
🔇 Additional comments (2)
test/tap/tests/test_flush_logs-t.sh (2)
77-90: Good call moving teardown out offn_signal.Keeping the scheduler row alive until the effect is observed removes the original “delete before first fire” race.
155-165:>= BASELINE + 1matches the new test semantics.Allowing extra rotations here is the right trade-off once the scheduler path can legitimately fire more than once before cleanup.
| fn_signal_cleanup () { | ||
| # Tear down any scheduler row installed by fn_signal. Safe to call | ||
| # unconditionally - if fn_signal used the docker path instead of the | ||
| # scheduler hack, this is a no-op on a non-existent row. | ||
| fn_padmin "DELETE FROM scheduler WHERE id=9999;" >/dev/null 2>&1 || true | ||
| fn_padmin "LOAD SCHEDULER TO RUNTIME;" >/dev/null 2>&1 || true | ||
| } |
There was a problem hiding this comment.
Add a best-effort cleanup from fn_exit.
If the script exits or is interrupted after installing scheduler row 9999 but before the explicit cleanup call, that 500 ms job keeps sending SIGUSR1 to the live ProxySQL instance. This can bleed into later TAP tests and make failures non-local.
Suggested guardrail
fn_exit () {
trap - EXIT
trap - SIGINT
+ fn_signal_cleanup >/dev/null 2>&1 || true
if [[ $DONE -eq $PLAN ]] && [[ $FAIL -eq 0 ]]; then
echo "msg: Test took $SECONDS sec"
exit 0🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/tap/tests/test_flush_logs-t.sh` around lines 92 - 98, The test installs
a scheduler row id=9999 that can remain active if the script exits or is
interrupted; add a best-effort cleanup from fn_exit so that the scheduler row
and runtime are removed even on unexpected exits. Modify or ensure the
EXIT/INT/TERM trap calls fn_signal_cleanup (or add a call to fn_signal_cleanup
inside fn_exit) and make fn_signal_cleanup resilient (silently ignore failures)
when calling fn_padmin "DELETE FROM scheduler WHERE id=9999;" and "LOAD
SCHEDULER TO RUNTIME;" so the 500ms job won't continue sending SIGUSR1 to
ProxySQL after the test exits.
| local max_iterations=$(( max_seconds * 2 )) | ||
| while [ $i -lt $max_iterations ]; do | ||
| count=$(fn_get_rotations) | ||
| if [ "$count" -ge "$target" ]; then | ||
| echo $count | ||
| return 0 | ||
| fi | ||
| sleep 0.5 | ||
| i=$(( i + 1 )) | ||
| done | ||
| # Timeout. Emit whatever we last observed so fn_check_res will | ||
| # report a useful "BASELINE X expected Y got Z" message rather than | ||
| # an empty RES. | ||
| echo $count | ||
| return 1 |
There was a problem hiding this comment.
The polling loop misses the last eligible sample.
With max_iterations=$(( max_seconds * 2 )), the final check happens at max_seconds - 0.5, not at max_seconds. A rotation that lands just before the advertised timeout still fails, so the flake window is smaller but not fully closed.
Minimal fix
- local max_iterations=$(( max_seconds * 2 ))
+ local max_iterations=$(( max_seconds * 2 + 1 ))
while [ $i -lt $max_iterations ]; do📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| local max_iterations=$(( max_seconds * 2 )) | |
| while [ $i -lt $max_iterations ]; do | |
| count=$(fn_get_rotations) | |
| if [ "$count" -ge "$target" ]; then | |
| echo $count | |
| return 0 | |
| fi | |
| sleep 0.5 | |
| i=$(( i + 1 )) | |
| done | |
| # Timeout. Emit whatever we last observed so fn_check_res will | |
| # report a useful "BASELINE X expected Y got Z" message rather than | |
| # an empty RES. | |
| echo $count | |
| return 1 | |
| local max_iterations=$(( max_seconds * 2 + 1 )) | |
| while [ $i -lt $max_iterations ]; do | |
| count=$(fn_get_rotations) | |
| if [ "$count" -ge "$target" ]; then | |
| echo $count | |
| return 0 | |
| fi | |
| sleep 0.5 | |
| i=$(( i + 1 )) | |
| done | |
| # Timeout. Emit whatever we last observed so fn_check_res will | |
| # report a useful "BASELINE X expected Y got Z" message rather than | |
| # an empty RES. | |
| echo $count | |
| return 1 |
🧰 Tools
🪛 Shellcheck (0.11.0)
[info] 139-139: Double quote to prevent globbing and word splitting.
(SC2086)
[info] 148-148: Double quote to prevent globbing and word splitting.
(SC2086)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/tap/tests/test_flush_logs-t.sh` around lines 135 - 149, The polling loop
computes max_iterations as max_seconds * 2 which causes the last check to occur
at max_seconds - 0.5, missing events that happen right before the advertised
timeout; update the max_iterations calculation to allow one extra half-second
iteration (e.g., add 1) so the final fn_get_rotations check runs at max_seconds,
ensuring the loop (variables max_iterations, max_seconds, i and the
fn_get_rotations call) will observe rotations that land just before the timeout.



Closes #5610 (partial — one of four tests).
Summary
Fixes
test_flush_logs-tby replacing fixed sleeps with bounded polling. The old code depended on specific wall-clock windows (sleep 1,sleep 4) that occasionally slipped past on slow / contended GH runners, producing false failures on CI while the same code passed cleanly on every local workstation and on commit-after-commit empty retriggers in PR #5596.Root cause
test_flush_logs-t.shmeasures "did this command rotateproxysql.log?" by counting the number of[INFO] ProxySQL versionmarker lines before and after. The old flow was timing-dependent:For the SIGUSR1 subtest it was worse because the signal was delivered via a "Scheduler Hack" (a 2000ms-interval scheduler row that
kill -USR1 $(pidof proxysql)), plus a fixedsleep 4, plus the innersleep 1— total 5 s budget from signal scheduling to reading the log. Three async stages could slip past that window: the scheduler's first interval firing slightly late, the SIGUSR1 handler queueing the rotation on a worker thread, and the host-side volume bind-mount read lag. Any one blew the budget →RES != BASELINE+1→ false failure.A secondary issue: strict
-neequality check would also fail if the scheduler hack fired twice before the cleanup tore its row down (interval 2000ms, sleep 4s → exactly one extra fire window). Two rotations from one signal is still a pass semantically.Fix
Three changes in
test/tap/tests/test_flush_logs-t.sh:Replace fixed sleeps with bounded polling. New helper
fn_wait_for_rotation_at_least <target> <max_seconds>pollsfn_get_rotationsevery 500 ms until the count reaches the target or the timeout expires.fn_get_rotationsis now sleepless (callers that need to race wait use the new helper). Max budget raised to 30 s (was effectively 5 s) — on fast machines the poll still terminates within one 500 ms cycle.Accept "at least BASELINE+1" instead of exact equality.
fn_check_resuses-ltinstead of-ne. The test asks "did it trigger at least one rotation?" — two rotations from one signal is still a pass.Split
fn_signalintofn_signal(install only) andfn_signal_cleanup(teardown). Caller now races cleanly:Scheduler row exists only for the bounded window between install and observation. Also dropped the scheduler interval from 2000 ms to 500 ms so the first fire happens quickly.
Local verification (dogfooding the new flake-debug recipe from #5609)
Reproduced the original PR #5596 failure context (
legacy-g3group with justtest_flush_logs-tviaTEST_PY_TAP_INCL), 10 back-to-back iterations against one infra lifecycle pertest/README.md§"Debugging a flaky test":Per-attempt wall time dropped from ~9 s to ~1 s because we're no longer burning fixed sleeps. Confirmed the new code path is active via the new TAP message format in the captured log:
Scope
Just this one test. The other three flaky tests listed in #5610 (
pgsql-servers_ssl_params-t,pgsql-ssl_keylog-t,test_read_only_actions_offline_hard_servers-t) are C++ and each warrants its own investigation + PR — not worth batching into one "fix all flakes" mega-patch.Test plan
bash -n test/tap/tests/test_flush_logs-t.sh— script syntax validlegacy-g3 + TEST_PY_TAP_INCL=test_flush_logs-tCI-legacy-g3against this PR's HEADCI-legacy-g3on v3.0 main — expect thetest_flush_logs-trow to disappear from the flake listNote on the
-tvs.shfile dualityFor anyone who finds this PR puzzling:
test/tap/tests/test_flush_logs-tis NOT the same file astest_flush_logs-t.sh. The-tfile is a copy produced by thesh-%rule intest/tap/tests/Makefile(literallycp test_flush_logs-t.sh test_flush_logs-t && chmod +x). The runner invokes*-t, not*.sh, and the-tfiles are gitignored. So the only tracked source of truth istest_flush_logs-t.sh, and this PR modifies only that file. The-tcopy is regenerated bymake tests-shor the top-level TAP build. I hit this the hard way during local verification — documented here so the next person doesn't.Summary by CodeRabbit