Flaky TAP tests: track and fix timing-sensitive tests revealed by cascade fix #5610

@renecannao

Summary

With the CI cascade now actually running tests (since PRs #5601, #5602, #5603, and #5605 fixed the multi-week silent false-green), several latent flaky TAP tests have started surfacing. This issue tracks them so they can be investigated and hardened individually.

Intentionally filed as a bucket issue rather than four separate ones: the root-cause analysis for each is short and similar enough (a timing race against a slow CI runner) that having them in one place makes bisecting easier.

Flaky tests observed

All four of these failed on commit 09b97547fd19ad86045783f63218fdcfa484a910 of PR #5596 (across runs 24281031531, 24281031522, and 24281031524) while the exact same code passed on commit 920308a1e90fba8cced42128f70b45fe7c3e3dbd (also on PR #5596, about an hour earlier). Both are empty CI-retrigger commits with zero code diff between them, so the pass/fail outcomes are environmental, not a regression.
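
The zero-diff claim can be checked directly by comparing the tree objects the two commits point to. A minimal sketch, assuming both commits have been fetched into a local clone; `same_tree` is a hypothetical helper name, not something that exists in the repository:

```bash
# same_tree COMMIT_A COMMIT_B
# Succeeds (exit 0) if both commits point at the same tree object,
# i.e. there is no content difference between them.
# Returns 2 if either commit cannot be resolved.
# Hypothetical helper for illustration only.
same_tree() {
    local a b
    a=$(git rev-parse --verify --quiet "$1^{tree}") || return 2
    b=$(git rev-parse --verify --quiet "$2^{tree}") || return 2
    [ "$a" = "$b" ]
}

# Usage, inside a clone that has both commits:
#   same_tree 920308a1e90fba8cced42128f70b45fe7c3e3dbd \
#             09b97547fd19ad86045783f63218fdcfa484a910 && echo "identical trees"
```

Comparing tree hashes rather than running a full diff keeps the check cheap and makes the result independent of commit messages or timestamps.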

| Test | Workflow | Group | Notes |
| --- | --- | --- | --- |
| test_flush_logs-t | CI-legacy-g3 | legacy/mysql57 | Shell-based. Uses a "Scheduler Hack" to send SIGUSR1 to ProxySQL, then reads the log file. ~5 s total wall time between signal-scheduling and log-reading. Reproduced PASSING locally in 9 seconds on the same PR commit. |
| pgsql-servers_ssl_params-t | CI-legacy-g4 | legacy/pgsql | Source file was modified on 2026-04-09 (ca75d563e "test: add cluster query integration test"); the recent addition may be racing against backend readiness. |
| pgsql-ssl_keylog-t | CI-legacy-g4 | legacy/pgsql | Added in #5281 (SSL/TLS keylog support). |
| test_read_only_actions_offline_hard_servers-t | CI-mysql84-g4 | mysql84 | Last touched by 26c5a2572 ("Fix read_only test assuming default_hostgroup=0") and 7ce2043b6. |

How to reproduce locally

Once #5609 (doc update for test/README.md) merges, the canonical recipe is in the "Debugging a flaky test" section there. Quick version:

```bash
export WORKSPACE="$(pwd)"
export INFRA_ID="flake-$(date +%s)"
export TAP_GROUP="legacy-g3"                  # or legacy-g4, or mysql84-g4
export TEST_PY_TAP_INCL="test_flush_logs-t"   # narrow to the flaky test
export SKIP_CLUSTER_START=1
source test/infra/common/env.sh
./test/infra/control/ensure-infras.bash
# Run the same test 20 times, keeping one log per attempt
for i in $(seq 1 20); do
    ./test/infra/control/run-tests-isolated.bash 2>&1 | tee "/tmp/flake-$i.log"
done
./test/infra/control/stop-proxysql-isolated.bash
grep -l 'FAIL [1-9]' /tmp/flake-*.log         # which attempts failed
```
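
To turn the per-attempt logs into a failure rate, the same `grep` pattern can be wrapped in a small helper. This is only a sketch: `flake_rate` is a hypothetical name, and it assumes, as the last line of the recipe does, that a failed attempt leaves a line matching `FAIL [1-9]` in its log.

```bash
# flake_rate [DIR]
# Print "<failed>/<total>" for the flake-*.log files in DIR (default /tmp).
# Hypothetical helper; the failure pattern is taken from the recipe above.
flake_rate() {
    local dir=${1:-/tmp} total=0 failed=0 f
    for f in "$dir"/flake-*.log; do
        [ -e "$f" ] || continue        # glob matched nothing
        total=$((total + 1))
        if grep -q 'FAIL [1-9]' "$f"; then
            failed=$((failed + 1))
        fi
    done
    echo "$failed/$total"
}
```

Calling `flake_rate /tmp` after the loop gives a single number to compare before and after a hardening change.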

Related CI-artifact gap

When these tests fail on CI, `gh api repos/sysown/proxysql/actions/runs//artifacts` returns an empty array. The reason is that actions/upload-artifact@v4 dies with EACCES on docker-created root-owned files under ci_infra_logs/ — the same bug we fixed in ci-builds.yml via PR #5605. The downstream reusables were missed and are fixed in PR #5608.

With #5608 merged, future flaky failures will have their per-test `.log.gz` files captured as CI artifacts, making investigation much easier. Until then, local reproduction is the only path to a real log.

Evidence this is environmental, not a code regression

  • PR #5596 ("PgSQL: fast-reject non-COPY queries in CopyCmdMatcher") only modifies CopyCmdMatcher.cpp (PgSQL COPY parsing). None of the four failing tests touch COPY parsing, the PgSQL protocol path, or any code the PR diff reaches.
  • Commit 920308a1e (empty CI-retrigger, same code as 09b97547f): all 4 workflows passed.
  • Commit 09b97547f (empty CI-retrigger, same code): 3 of 4 failed.
  • CI-legacy-g3 also failed on 669ab9149, which is a direct-on-v3.0 merge commit (PR #5594, "lint + static analysis: clang-tidy and cppcheck fixes"), not on a feature branch, confirming the flakiness is pre-existing rather than PR-specific.

Acceptance criteria

  • test_flush_logs-t hardened: replace the fixed `sleep 4` after the scheduler hack with a loop that polls until the log rotation is observed, so slow runners don't lose the race.
  • pgsql-servers_ssl_params-t hardened: investigate the recently-added cluster-query integration assertion, confirm backend-readiness wait is sufficient.
  • pgsql-ssl_keylog-t hardened: investigate SSL handshake timing on slow runners.
  • test_read_only_actions_offline_hard_servers-t hardened: investigate read-only state propagation timing.
  • All four pass 10 consecutive CI runs on v3.0 main (workflow_dispatch empty-commit loop is fine for this).
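
The first criterion, replacing a fixed sleep with polling, might look like the sketch below. Everything in it is an assumption made for illustration: `wait_for_rotation`, the file and pattern arguments, the 30-second default timeout, and the 0.2-second poll interval are hypothetical, and the real test would need to match whatever ProxySQL actually writes when its logs are flushed.

```bash
# wait_for_rotation FILE PATTERN [TIMEOUT_SECONDS]
# Poll FILE until a line matching PATTERN appears or the timeout expires.
# Returns 0 once the pattern is observed, 1 on timeout.
# Hypothetical helper; not the test's current implementation.
wait_for_rotation() {
    local file=$1 pattern=$2 timeout=${3:-30}
    local deadline=$((SECONDS + timeout))
    while true; do
        if [ -f "$file" ] && grep -q -- "$pattern" "$file"; then
            return 0
        fi
        if [ "$SECONDS" -ge "$deadline" ]; then
            return 1
        fi
        sleep 0.2   # fractional sleep works with GNU and BSD sleep
    done
}
```

A fast runner returns as soon as the marker appears, while a slow runner gets the full timeout instead of losing a fixed 4-second race.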
