You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
With the CI cascade now actually running tests (since PR #5601, #5602, #5603, #5605 fixed the multi-week silent false-green), several latent flaky TAP tests have started surfacing. This issue tracks them so they can be investigated and hardened individually.
Intentionally filed as a bucket issue rather than four separate ones — the root-cause analysis for each is short and similar enough (timing race against a slow CI runner) that having them in one place makes bisecting easier.
Flaky tests observed
All four of these failed on commit 09b97547fd19ad86045783f63218fdcfa484a910 of PR #5596 (in runs 24281031531, 24281031522, 24281031524) while the exact same code passed on commit 920308a1e90fba8cced42128f70b45fe7c3e3dbd (also on PR #5596, ~hour earlier). Those two commits are both empty CI-retrigger commits with zero code diff between them, so the pass/fail outcomes are environmental, not a regression.
Test
Workflow
Group
Notes
test_flush_logs-t
CI-legacy-g3
legacy/mysql57
Shell-based. Uses a "Scheduler Hack" to send SIGUSR1 to ProxySQL, then reads the log file. ~5 s total wall time between signal-scheduling and log-reading. Reproduced PASSING locally in 9 seconds on the same PR commit.
pgsql-servers_ssl_params-t
CI-legacy-g4
legacy/pgsql
Source file was modified on 2026-04-09 (ca75d563e "test: add cluster query integration test") — recent addition may be racing against backend readiness.
Last touched by 26c5a2572 ("Fix read_only test assuming default_hostgroup=0") and 7ce2043b6.
How to reproduce locally
Once #5609 (doc update for test/README.md) merges, the canonical recipe is in the "Debugging a flaky test" section there. Quick version:
```bash
export WORKSPACE=$(pwd)
export INFRA_ID="flake-$(date +%s)"
export TAP_GROUP="legacy-g3" # or legacy-g4, or mysql84-g4
export TEST_PY_TAP_INCL="test_flush_logs-t" # narrow to the flaky test
export SKIP_CLUSTER_START=1
source test/infra/common/env.sh
./test/infra/control/ensure-infras.bash
for i in $(seq 1 20); do
./test/infra/control/run-tests-isolated.bash 2>&1 | tee /tmp/flake-$i.log
done
./test/infra/control/stop-proxysql-isolated.bash
grep -l 'FAIL [1-9]' /tmp/flake-*.log # which attempts failed
```
Related CI-artifact gap
When these tests fail on CI, `gh api repos/sysown/proxysql/actions/runs//artifacts` returns an empty array. The reason is that actions/upload-artifact@v4 dies with EACCES on docker-created root-owned files under ci_infra_logs/ — the same bug we fixed in ci-builds.yml via PR #5605. The downstream reusables were missed and are fixed in PR #5608.
With #5608 merged, future flaky failures will have their per-test `.log.gz` files captured as CI artifacts, making investigation much easier. Until then, local reproduction is the only path to a real log.
Evidence this is environmental, not a code regression
test_flush_logs-t hardened: replace the fixed sleep 4-after-scheduler-hack with a poll-until-rotation-observed loop, so slow runners don't lose the race.
pgsql-servers_ssl_params-t hardened: investigate the recently-added cluster-query integration assertion, confirm backend-readiness wait is sufficient.
pgsql-ssl_keylog-t hardened: investigate SSL handshake timing on slow runners.
test_read_only_actions_offline_hard_servers-t hardened: investigate read-only state propagation timing.
All four pass 10 consecutive CI runs on v3.0 main (workflow_dispatch empty-commit loop is fine for this).
Summary
With the CI cascade now actually running tests (since PR #5601, #5602, #5603, #5605 fixed the multi-week silent false-green), several latent flaky TAP tests have started surfacing. This issue tracks them so they can be investigated and hardened individually.
Intentionally filed as a bucket issue rather than four separate ones — the root-cause analysis for each is short and similar enough (timing race against a slow CI runner) that having them in one place makes bisecting easier.
Flaky tests observed
All four of these failed on commit
09b97547fd19ad86045783f63218fdcfa484a910of PR #5596 (in runs24281031531,24281031522,24281031524) while the exact same code passed on commit920308a1e90fba8cced42128f70b45fe7c3e3dbd(also on PR #5596, ~hour earlier). Those two commits are both empty CI-retrigger commits with zero code diff between them, so the pass/fail outcomes are environmental, not a regression.test_flush_logs-tCI-legacy-g3pgsql-servers_ssl_params-tCI-legacy-g4ca75d563e"test: add cluster query integration test") — recent addition may be racing against backend readiness.pgsql-ssl_keylog-tCI-legacy-g4test_read_only_actions_offline_hard_servers-tCI-mysql84-g426c5a2572("Fix read_only test assuming default_hostgroup=0") and7ce2043b6.How to reproduce locally
Once #5609 (doc update for
test/README.md) merges, the canonical recipe is in the "Debugging a flaky test" section there. Quick version:```bash
export WORKSPACE=$(pwd)
export INFRA_ID="flake-$(date +%s)"
export TAP_GROUP="legacy-g3" # or legacy-g4, or mysql84-g4
export TEST_PY_TAP_INCL="test_flush_logs-t" # narrow to the flaky test
export SKIP_CLUSTER_START=1
source test/infra/common/env.sh
./test/infra/control/ensure-infras.bash
for i in $(seq 1 20); do
./test/infra/control/run-tests-isolated.bash 2>&1 | tee /tmp/flake-$i.log
done
./test/infra/control/stop-proxysql-isolated.bash
grep -l 'FAIL [1-9]' /tmp/flake-*.log # which attempts failed
```
Related CI-artifact gap
When these tests fail on CI, `gh api repos/sysown/proxysql/actions/runs//artifacts` returns an empty array. The reason is that
actions/upload-artifact@v4dies with EACCES on docker-created root-owned files underci_infra_logs/— the same bug we fixed inci-builds.ymlvia PR #5605. The downstream reusables were missed and are fixed in PR #5608.With #5608 merged, future flaky failures will have their per-test `.log.gz` files captured as CI artifacts, making investigation much easier. Until then, local reproduction is the only path to a real log.
Evidence this is environmental, not a code regression
CopyCmdMatcher.cpp(PgSQL COPY parsing). None of the four failing tests touch COPY parsing, the PgSQL protocol path, or any code the PR diff reaches.920308a1e(empty CI-retrigger, same code as09b97547f): all 4 workflows passed.09b97547f(empty CI-retrigger, same code): 3 of 4 failed.CI-legacy-g3also failed on669ab9149which is a direct-on-v3.0 merge commit (PR lint + static analysis: clang-tidy and cppcheck fixes #5594), not on a feature branch, confirming it is pre-existing rather than PR-specific.Acceptance criteria
test_flush_logs-thardened: replace the fixedsleep 4-after-scheduler-hack with a poll-until-rotation-observed loop, so slow runners don't lose the race.pgsql-servers_ssl_params-thardened: investigate the recently-added cluster-query integration assertion, confirm backend-readiness wait is sufficient.pgsql-ssl_keylog-thardened: investigate SSL handshake timing on slow runners.test_read_only_actions_offline_hard_servers-thardened: investigate read-only state propagation timing.