Skip to content

test(flush_logs): harden timing race — poll instead of fixed sleep#5611

Merged
renecannao merged 1 commit intov3.0from
v3.0-fix-flake-test-flush-logs
Apr 11, 2026
Merged

test(flush_logs): harden timing race — poll instead of fixed sleep#5611
renecannao merged 1 commit intov3.0from
v3.0-fix-flake-test-flush-logs

Conversation

@renecannao
Copy link
Copy Markdown
Contributor

@renecannao renecannao commented Apr 11, 2026

Closes #5610 (partial — one of four tests).

Summary

Fixes test_flush_logs-t by replacing fixed sleeps with bounded polling. The old code depended on specific wall-clock windows (sleep 1, sleep 4) that occasionally slipped past on slow / contended GH runners, producing false failures on CI while the same code passed cleanly on every local workstation and on commit-after-commit empty retriggers in PR #5596.

Root cause

test_flush_logs-t.sh measures "did this command rotate proxysql.log?" by counting the number of [INFO] ProxySQL version marker lines before and after. The old flow was timing-dependent:

BASELINE=$(fn_get_rotations)       # includes sleep 1
fn_padmin "PROXYSQL FLUSH LOGS;"
RES=$(fn_get_rotations)            # sleep 1, then count
fn_check_res                       # strict equality

For the SIGUSR1 subtest it was worse because the signal was delivered via a "Scheduler Hack" (a 2000ms-interval scheduler row that kill -USR1 $(pidof proxysql)), plus a fixed sleep 4, plus the inner sleep 1 — total 5 s budget from signal scheduling to reading the log. Three async stages could slip past that window: the scheduler's first interval firing slightly late, the SIGUSR1 handler queueing the rotation on a worker thread, and the host-side volume bind-mount read lag. Any one blew the budget → RES != BASELINE+1 → false failure.

A secondary issue: strict -ne equality check would also fail if the scheduler hack fired twice before the cleanup tore its row down (interval 2000ms, sleep 4s → exactly one extra fire window). Two rotations from one signal is still a pass semantically.

Fix

Three changes in test/tap/tests/test_flush_logs-t.sh:

  1. Replace fixed sleeps with bounded polling. New helper fn_wait_for_rotation_at_least <target> <max_seconds> polls fn_get_rotations every 500 ms until the count reaches the target or the timeout expires. fn_get_rotations is now sleepless (callers that need to race wait use the new helper). Max budget raised to 30 s (was effectively 5 s) — on fast machines the poll still terminates within one 500 ms cycle.

  2. Accept "at least BASELINE+1" instead of exact equality. fn_check_res uses -lt instead of -ne. The test asks "did it trigger at least one rotation?" — two rotations from one signal is still a pass.

  3. Split fn_signal into fn_signal (install only) and fn_signal_cleanup (teardown). Caller now races cleanly:

    fn_signal SIGUSR1
    RES=$(fn_wait_for_rotation_at_least $((BASELINE+1)) 30)
    fn_signal_cleanup                # tear down AFTER rotation observed
    fn_check_res

    Scheduler row exists only for the bounded window between install and observation. Also dropped the scheduler interval from 2000 ms to 500 ms so the first fire happens quickly.

Local verification (dogfooding the new flake-debug recipe from #5609)

Reproduced the original PR #5596 failure context (legacy-g3 group with just test_flush_logs-t via TEST_PY_TAP_INCL), 10 back-to-back iterations against one infra lifecycle per test/README.md §"Debugging a flaky test":

attempt 1:  PASS
attempt 2:  PASS
…
attempt 10: PASS

=== total: 10/10 pass ===

Per-attempt wall time dropped from ~9 s to ~1 s because we're no longer burning fixed sleeps. Confirmed the new code path is active via the new TAP message format in the captured log:

msg: ok 1 - command 'PROXYSQL FLUSH LOGS;' - BASELINE: 58 - got 59 (>= BASELINE + 1)
msg: ok 2 - command 'kill -s SIGUSR1 $PID' - BASELINE: 59 - got 60 (>= BASELINE + 1)
msg: Test took 1 sec

Scope

Just this one test. The other three flaky tests listed in #5610 (pgsql-servers_ssl_params-t, pgsql-ssl_keylog-t, test_read_only_actions_offline_hard_servers-t) are C++ and each warrants its own investigation + PR — not worth batching into one "fix all flakes" mega-patch.

Test plan

  • bash -n test/tap/tests/test_flush_logs-t.sh — script syntax valid
  • Local stress test: 10/10 passes in legacy-g3 + TEST_PY_TAP_INCL=test_flush_logs-t
  • TAP output shows new message format, confirming polling code is executing
  • Wall time dropped from 9 s to 1 s (evidence that fixed sleeps are gone, not just more tolerant)
  • CI green on CI-legacy-g3 against this PR's HEAD
  • After merge, watch 5-10 subsequent runs of CI-legacy-g3 on v3.0 main — expect the test_flush_logs-t row to disappear from the flake list

Note on the -t vs .sh file duality

For anyone who finds this PR puzzling: test/tap/tests/test_flush_logs-t is NOT the same file as test_flush_logs-t.sh. The -t file is a copy produced by the sh-% rule in test/tap/tests/Makefile (literally cp test_flush_logs-t.sh test_flush_logs-t && chmod +x). The runner invokes *-t, not *.sh, and the -t files are gitignored. So the only tracked source of truth is test_flush_logs-t.sh, and this PR modifies only that file. The -t copy is regenerated by make tests-sh or the top-level TAP build. I hit this the hard way during local verification — documented here so the next person doesn't.

Summary by CodeRabbit

  • Tests
    • Enhanced test infrastructure for log flushing operations with improved timing reliability and robustness
    • Replaced fixed timing assumptions with dynamic polling mechanisms for more consistent test execution
    • Added helper utilities for cleaner test setup and teardown procedures

Part of #5610 (flaky TAP tests revealed by cascade fix).
Addresses the CI-legacy-g3 'test_flush_logs-t' flake that failed on
some PR #5596 commits but passed on others despite identical code.

## Root cause

The test measures "does command X cause proxysql.log to gain one new
'[INFO] ProxySQL version' marker line". The old flow used fixed sleeps:

    BASELINE=$(fn_get_rotations)       # (includes sleep 1)
    fn_padmin "PROXYSQL FLUSH LOGS;"
    RES=$(fn_get_rotations)            # (sleep 1, then count)
    fn_check_res                       # strict equality check

For the SIGUSR1 subtest the flow was worse because the signal had to
travel through a "Scheduler Hack" (fn_signal inserts a scheduler row
that runs `kill -USR1 $(pidof proxysql)` every 2000 ms, waits 4 s,
then deletes the row). Total wall-clock budget from signal scheduling
to log read: 5 s including the inner fn_get_rotations sleep.

On a slow / contended CI runner any one of three asynchronous stages
can slip past that 5 s window:

  1. The scheduler's first interval fires slightly late.
  2. The SIGUSR1 handler in proxysql queues the log rotation on a
     worker thread; the actual rotation + re-open + banner write
     happens after the signal is delivered, not synchronously.
  3. The host-side read of proxysql.log via the docker volume
     bind-mount can lag the in-container write by up to a second
     on Azure-hosted runners under load.

When any of those raced past 5 s, fn_get_rotations returned the
pre-rotation count, fn_check_res saw `RES != BASELINE+1`, the test
flagged a failure, and CI went red. Same code ran fine on the next
retrigger because the race was environmental, not logical.

A second, subtler issue: the strict `-ne` equality check in
fn_check_res would ALSO fail if the scheduler fired twice before the
script got around to deleting its row (interval_ms was 2000, sleep
was 4 s - one extra fire was a coin flip on fast runners). One signal
is the contract; two rotations from one signal is still the correct
observable answer for "did this trigger at least one rotation".

## Fix

Two changes, both in the .sh (the -t file is a gitignored copy
regenerated by `make sh-test_flush_logs-t.sh`, which itself just
`cp`s the .sh — same rule applies to every other `*-t.sh` test):

1. Replace fixed sleeps with bounded polling:
   - fn_get_rotations no longer has an internal `sleep 1` — it is now
     a pure "read the current count" function.
   - New fn_wait_for_rotation_at_least <target> <max_seconds> polls
     fn_get_rotations every 500 ms until the count reaches the
     target or the timeout expires. On success it prints the observed
     count and returns 0; on timeout it prints whatever it last saw
     and returns 1, so fn_check_res still reports a useful
     "BASELINE X expected Y got Z" error message.
   - Max budget raised to 30 s (was 5 s). On a fast machine the poll
     still terminates within one 500 ms cycle, so total test time on
     the local dev workstation dropped from ~9 s to ~1 s.

2. Accept "at least BASELINE+1" instead of "exactly BASELINE+1":
   - fn_check_res uses `-lt` instead of `-ne`.
   - Test semantics: "did the command trigger at least one rotation".
     Two rotations from one signal is still a pass (happens only in
     the scheduler-hack corner, when the scheduler fires twice before
     fn_signal_cleanup tears it down; harmless to proxysql behavior).
   - Also drops the scheduler interval from 2000 ms to 500 ms so the
     first fire happens quickly under normal conditions.

3. Split fn_signal into fn_signal (install only) and fn_signal_cleanup
   (tear down the scheduler row). The caller now:
     fn_signal SIGUSR1
     RES=$(fn_wait_for_rotation_at_least $((BASELINE+1)) 30)
     fn_signal_cleanup
     fn_check_res ...
   which is race-free: the cleanup runs AFTER the rotation has been
   observed, so the scheduler row exists only for the bounded window
   between install and observation.

## Local verification

Reproduced the original PR #5596 failure context (legacy-g3 group
with only test_flush_logs-t enabled via TEST_PY_TAP_INCL), ran 10
back-to-back iterations against a single infra lifecycle following
the recipe in test/README.md §"Debugging a flaky test":

    for i in 1..10: ./test/infra/control/run-tests-isolated.bash

Results:
  - 10/10 pass
  - Per-attempt wall time dropped from ~9 s to ~1 s
  - Confirmed new code path active via the new "got N (>= BASELINE+1)"
    message format in the captured TAP output

## What's still TODO in #5610

The other three flaky tests from the issue are C++ and each needs
its own targeted investigation. They are NOT touched here:
  - pgsql-servers_ssl_params-t
  - pgsql-ssl_keylog-t
  - test_read_only_actions_offline_hard_servers-t

Deliberately keeping this PR small - one test per PR so each fix
can be reviewed and validated independently.
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 11, 2026

📝 Walkthrough

Walkthrough

This change refactors a test script to replace fixed timing delays with dynamic polling mechanisms. The scheduler interval for signal handling is reduced from 2000ms to 500ms, and new helper functions enable waiting for rotation count changes and cleaning up test state. Assertion criteria are relaxed from exact equality to allowing any count above baseline.

Changes

Cohort / File(s) Summary
Test Robustness Improvements
test/tap/tests/test_flush_logs-t.sh
Replaced fixed sleep statements with polling-based fn_wait_for_rotation_at_least function; reduced scheduler interval from 2000ms to 500ms; added fn_signal_cleanup helper for state cleanup; modified fn_check_res to accept rotation counts >= baseline instead of exact equality; adjusted test sections to use 30-second polling timeouts.

Possibly related issues

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 No more sleepy waits that race the clock,
With polling now, our tests stay on the block!
Five hundred mills, a quicker beat,
Rotation checks made safe and fleet! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely describes the main change: replacing fixed sleep timing with polling in the test_flush_logs test.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch v3.0-fix-flake-test-flush-logs

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@sonarqubecloud
Copy link
Copy Markdown

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request improves the reliability of the test_flush_logs-t.sh test by replacing fixed sleeps with a polling mechanism and relaxing strict equality checks to prevent flakiness on slow CI runners. It also introduces a dedicated cleanup function for the scheduler hack. Review feedback suggests optimizing database interactions by combining INSERT/DELETE and LOAD commands into single calls and ensuring robust numeric parsing by stripping whitespace from command output.

Comment on lines +88 to 89
fn_padmin "INSERT OR REPLACE INTO scheduler (id, active, interval_ms, filename, arg1, arg2) VALUES (9999, 1, 500, '/bin/sh', '-c', 'kill -${sig#SIG} \$(pidof proxysql)');"
fn_padmin "LOAD SCHEDULER TO RUNTIME;"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Combining the INSERT and LOAD statements into a single fn_padmin call is more efficient as it avoids the overhead of establishing multiple database connections and invoking the mysql client process twice.

Suggested change
fn_padmin "INSERT OR REPLACE INTO scheduler (id, active, interval_ms, filename, arg1, arg2) VALUES (9999, 1, 500, '/bin/sh', '-c', 'kill -${sig#SIG} \$(pidof proxysql)');"
fn_padmin "LOAD SCHEDULER TO RUNTIME;"
fn_padmin "INSERT OR REPLACE INTO scheduler (id, active, interval_ms, filename, arg1, arg2) VALUES (9999, 1, 500, '/bin/sh', '-c', 'kill -${sig#SIG} \$(pidof proxysql)'); LOAD SCHEDULER TO RUNTIME;"

Comment on lines +96 to +97
fn_padmin "DELETE FROM scheduler WHERE id=9999;" >/dev/null 2>&1 || true
fn_padmin "LOAD SCHEDULER TO RUNTIME;" >/dev/null 2>&1 || true
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similar to the fn_signal function, combining the DELETE and LOAD statements here reduces process and connection overhead during the test cleanup phase.

Suggested change
fn_padmin "DELETE FROM scheduler WHERE id=9999;" >/dev/null 2>&1 || true
fn_padmin "LOAD SCHEDULER TO RUNTIME;" >/dev/null 2>&1 || true
fn_padmin "DELETE FROM scheduler WHERE id=9999; LOAD SCHEDULER TO RUNTIME;" >/dev/null 2>&1 || true

local count=0
local max_iterations=$(( max_seconds * 2 ))
while [ $i -lt $max_iterations ]; do
count=$(fn_get_rotations)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The output of wc -l (used inside fn_get_rotations) often contains leading or trailing whitespace depending on the platform's implementation. Stripping this whitespace ensures that the count variable contains a clean integer, which makes the subsequent numeric comparisons and TAP output more robust.

Suggested change
count=$(fn_get_rotations)
count=$(fn_get_rotations | tr -d '[:space:]')

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/tap/tests/test_flush_logs-t.sh`:
- Around line 92-98: The test installs a scheduler row id=9999 that can remain
active if the script exits or is interrupted; add a best-effort cleanup from
fn_exit so that the scheduler row and runtime are removed even on unexpected
exits. Modify or ensure the EXIT/INT/TERM trap calls fn_signal_cleanup (or add a
call to fn_signal_cleanup inside fn_exit) and make fn_signal_cleanup resilient
(silently ignore failures) when calling fn_padmin "DELETE FROM scheduler WHERE
id=9999;" and "LOAD SCHEDULER TO RUNTIME;" so the 500ms job won't continue
sending SIGUSR1 to ProxySQL after the test exits.
- Around line 135-149: The polling loop computes max_iterations as max_seconds *
2 which causes the last check to occur at max_seconds - 0.5, missing events that
happen right before the advertised timeout; update the max_iterations
calculation to allow one extra half-second iteration (e.g., add 1) so the final
fn_get_rotations check runs at max_seconds, ensuring the loop (variables
max_iterations, max_seconds, i and the fn_get_rotations call) will observe
rotations that land just before the timeout.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 267ebfd1-e1fd-48d4-b86b-395f410f585e

📥 Commits

Reviewing files that changed from the base of the PR and between 04466b3 and 5444125.

📒 Files selected for processing (1)
  • test/tap/tests/test_flush_logs-t.sh
📜 Review details
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: yuji-hatakeyama
Repo: sysown/proxysql PR: 5307
File: test/tap/tests/reg_test_5306-show_warnings_with_comment-t.cpp:39-48
Timestamp: 2026-01-20T09:34:27.165Z
Learning: In ProxySQL test files (test/tap/tests/), resource leaks (such as not calling `mysql_close()` on early return paths) are not typically fixed because test processes are short-lived and the OS frees resources on process exit. This is a common pattern across the test suite.
📚 Learning: 2026-04-01T21:27:03.216Z
Learnt from: wazir-ahmed
Repo: sysown/proxysql PR: 5557
File: test/tap/tests/unit/gtid_set_unit-t.cpp:14-17
Timestamp: 2026-04-01T21:27:03.216Z
Learning: In ProxySQL's unit test directory (test/tap/tests/unit/), test_globals.h and test_init.h are only required for tests that depend on the ProxySQL runtime globals/initialization (i.e., tests that exercise components linked against libproxysql.a). Pure data-structure or utility tests (e.g., ezoption_parser_unit-t.cpp, gtid_set_unit-t.cpp, gtid_trxid_interval_unit-t.cpp) only need tap.h and the relevant project header — omitting test_globals.h and test_init.h is correct and intentional in these cases.

Applied to files:

  • test/tap/tests/test_flush_logs-t.sh
📚 Learning: 2026-01-20T09:34:27.165Z
Learnt from: yuji-hatakeyama
Repo: sysown/proxysql PR: 5307
File: test/tap/tests/reg_test_5306-show_warnings_with_comment-t.cpp:39-48
Timestamp: 2026-01-20T09:34:27.165Z
Learning: In ProxySQL test files (test/tap/tests/), resource leaks (such as not calling `mysql_close()` on early return paths) are not typically fixed because test processes are short-lived and the OS frees resources on process exit. This is a common pattern across the test suite.

Applied to files:

  • test/tap/tests/test_flush_logs-t.sh
📚 Learning: 2026-01-20T07:40:34.938Z
Learnt from: yuji-hatakeyama
Repo: sysown/proxysql PR: 5307
File: test/tap/tests/reg_test_5306-show_warnings_with_comment-t.cpp:24-28
Timestamp: 2026-01-20T07:40:34.938Z
Learning: In ProxySQL test files, calling `mysql_error(NULL)` after `mysql_init()` failure is safe because the MariaDB client library implementation returns an empty string for NULL handles (not undefined behavior).

Applied to files:

  • test/tap/tests/test_flush_logs-t.sh
🪛 Shellcheck (0.11.0)
test/tap/tests/test_flush_logs-t.sh

[info] 139-139: Double quote to prevent globbing and word splitting.

(SC2086)


[info] 148-148: Double quote to prevent globbing and word splitting.

(SC2086)


[style] 153-153: $/${} is unnecessary on arithmetic variables.

(SC2004)


[info] 154-154: Double quote to prevent globbing and word splitting.

(SC2086)


[style] 160-160: $/${} is unnecessary on arithmetic variables.

(SC2004)


[style] 161-161: $/${} is unnecessary on arithmetic variables.

(SC2004)


[style] 162-162: $/${} is unnecessary on arithmetic variables.

(SC2004)


[style] 176-176: $/${} is unnecessary on arithmetic variables.

(SC2004)


[style] 182-182: $/${} is unnecessary on arithmetic variables.

(SC2004)

🔇 Additional comments (2)
test/tap/tests/test_flush_logs-t.sh (2)

77-90: Good call moving teardown out of fn_signal.

Keeping the scheduler row alive until the effect is observed removes the original “delete before first fire” race.


155-165: >= BASELINE + 1 matches the new test semantics.

Allowing extra rotations here is the right trade-off once the scheduler path can legitimately fire more than once before cleanup.

Comment on lines +92 to 98
fn_signal_cleanup () {
# Tear down any scheduler row installed by fn_signal. Safe to call
# unconditionally - if fn_signal used the docker path instead of the
# scheduler hack, this is a no-op on a non-existent row.
fn_padmin "DELETE FROM scheduler WHERE id=9999;" >/dev/null 2>&1 || true
fn_padmin "LOAD SCHEDULER TO RUNTIME;" >/dev/null 2>&1 || true
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Add a best-effort cleanup from fn_exit.

If the script exits or is interrupted after installing scheduler row 9999 but before the explicit cleanup call, that 500 ms job keeps sending SIGUSR1 to the live ProxySQL instance. This can bleed into later TAP tests and make failures non-local.

Suggested guardrail
 fn_exit () {
 	trap - EXIT
 	trap - SIGINT
+	fn_signal_cleanup >/dev/null 2>&1 || true
 	if [[ $DONE -eq $PLAN ]] && [[ $FAIL -eq 0 ]]; then
 		echo "msg: Test took $SECONDS sec"
 		exit 0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/tap/tests/test_flush_logs-t.sh` around lines 92 - 98, The test installs
a scheduler row id=9999 that can remain active if the script exits or is
interrupted; add a best-effort cleanup from fn_exit so that the scheduler row
and runtime are removed even on unexpected exits. Modify or ensure the
EXIT/INT/TERM trap calls fn_signal_cleanup (or add a call to fn_signal_cleanup
inside fn_exit) and make fn_signal_cleanup resilient (silently ignore failures)
when calling fn_padmin "DELETE FROM scheduler WHERE id=9999;" and "LOAD
SCHEDULER TO RUNTIME;" so the 500ms job won't continue sending SIGUSR1 to
ProxySQL after the test exits.

Comment on lines +135 to +149
local max_iterations=$(( max_seconds * 2 ))
while [ $i -lt $max_iterations ]; do
count=$(fn_get_rotations)
if [ "$count" -ge "$target" ]; then
echo $count
return 0
fi
sleep 0.5
i=$(( i + 1 ))
done
# Timeout. Emit whatever we last observed so fn_check_res will
# report a useful "BASELINE X expected Y got Z" message rather than
# an empty RES.
echo $count
return 1
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

The polling loop misses the last eligible sample.

With max_iterations=$(( max_seconds * 2 )), the final check happens at max_seconds - 0.5, not at max_seconds. A rotation that lands just before the advertised timeout still fails, so the flake window is smaller but not fully closed.

Minimal fix
-	local max_iterations=$(( max_seconds * 2 ))
+	local max_iterations=$(( max_seconds * 2 + 1 ))
 	while [ $i -lt $max_iterations ]; do
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
local max_iterations=$(( max_seconds * 2 ))
while [ $i -lt $max_iterations ]; do
count=$(fn_get_rotations)
if [ "$count" -ge "$target" ]; then
echo $count
return 0
fi
sleep 0.5
i=$(( i + 1 ))
done
# Timeout. Emit whatever we last observed so fn_check_res will
# report a useful "BASELINE X expected Y got Z" message rather than
# an empty RES.
echo $count
return 1
local max_iterations=$(( max_seconds * 2 + 1 ))
while [ $i -lt $max_iterations ]; do
count=$(fn_get_rotations)
if [ "$count" -ge "$target" ]; then
echo $count
return 0
fi
sleep 0.5
i=$(( i + 1 ))
done
# Timeout. Emit whatever we last observed so fn_check_res will
# report a useful "BASELINE X expected Y got Z" message rather than
# an empty RES.
echo $count
return 1
🧰 Tools
🪛 Shellcheck (0.11.0)

[info] 139-139: Double quote to prevent globbing and word splitting.

(SC2086)


[info] 148-148: Double quote to prevent globbing and word splitting.

(SC2086)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/tap/tests/test_flush_logs-t.sh` around lines 135 - 149, The polling loop
computes max_iterations as max_seconds * 2 which causes the last check to occur
at max_seconds - 0.5, missing events that happen right before the advertised
timeout; update the max_iterations calculation to allow one extra half-second
iteration (e.g., add 1) so the final fn_get_rotations check runs at max_seconds,
ensuring the loop (variables max_iterations, max_seconds, i and the
fn_get_rotations call) will observe rotations that land just before the timeout.

@renecannao renecannao merged commit caa2129 into v3.0 Apr 11, 2026
13 of 25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky TAP tests: track and fix timing-sensitive tests revealed by cascade fix

1 participant