doc(test): document per-test log paths and flaky-test debugging recipe by renecannao · Pull Request #5609 · sysown/proxysql

renecannao · 2026-04-11T13:06:42Z

Summary

Fixes two documentation gaps in test/README.md that surfaced while investigating the flaky TAP test failures reported on PR #5596.

Gap 1: per-test log paths were undocumented

The README said "Check logs in ci_infra_logs/${INFRA_ID}/tests/" but did not document what is actually under that directory. Per-test TAP output lives three directories deep at:

ci_infra_logs/${INFRA_ID}/tests/proxysql-tester.py/tests/<test>-t.log.gz

and the files are gzipped. A maintainer opening one with cat or vim without knowing to use zless burns minutes figuring it out.

Fix: new section "Where logs actually live after a run" showing the full directory tree (per-backend, per-ProxySQL, per-test subdirs) and the concrete zless / zcat / zgrep / zdiff one-liners for reading the gzipped output.
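As a rough illustration of those one-liners, the sketch below builds two throwaway gzipped logs in a temporary directory and reads them with the z-tools. The file names `demo-t.log.gz` and `other-t.log.gz` and their contents are made up for the example; on a real run the files sit under the `tests/proxysql-tester.py/tests/` path shown above.

```bash
# Throwaway stand-in for ci_infra_logs/${INFRA_ID}/tests/proxysql-tester.py/tests/
log_dir="$(mktemp -d)"

# Hypothetical TAP output for two tests; names and contents are made up.
printf 'ok 1 - connect\nnot ok 2 - query returned 3 rows\n' | gzip > "$log_dir/demo-t.log.gz"
printf 'ok 1 - connect\nok 2 - query returned 3 rows\n' | gzip > "$log_dir/other-t.log.gz"

# zcat decompresses to stdout, so ordinary text tools work downstream.
first_line="$(zcat "$log_dir/demo-t.log.gz" | head -n 1)"

# zgrep -l lists the gzipped files that contain a match (the cross-test grep).
failing="$(zgrep -l 'not ok' "$log_dir"/*-t.log.gz)"

# zdiff compares two gzipped files and exits non-zero when they differ.
if zdiff "$log_dir/demo-t.log.gz" "$log_dir/other-t.log.gz" > /dev/null; then
  differs=no
else
  differs=yes
fi

echo "first line: $first_line"
echo "failing: $(basename "$failing")"
echo "differs: $differs"
```

For interactive paging, `zless "$log_dir/demo-t.log.gz"` opens the same file without a manual decompress step.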

Gap 2: no guidance for reproducing flaky tests locally

Until today there was no canonical answer to "this test passed locally and failed on CI — is it really flaky, or is it my environment?". The natural recipe (reuse one infra, loop N runs against the same running ProxySQL, snapshot per-run logs, diff) was nowhere in the repo.

Fix: new section "Debugging a flaky test" with a copy-pasteable loop that runs the same test 20 times against one infra lifecycle, stashes per-attempt per-test logs under /tmp/flake-runs/$i/, and shows the zdiff idiom for comparing a failing attempt's TAP output against a passing one.
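The shape of that loop can be sketched as follows. This is not the README's actual recipe: `run_test_once` is a hypothetical stand-in for "run one test against the already-running infra", the stash directory is a temp dir rather than /tmp/flake-runs, and the attempt count is reduced from 20 to 6 so the example finishes instantly.

```bash
# Hypothetical stand-in for running one test against the reused infra.
# It writes TAP output and fails on every third attempt to mimic a flake.
run_test_once() {
  local attempt="$1" out="$2"
  if [ $((attempt % 3)) -eq 0 ]; then
    printf 'ok 1 - setup\nnot ok 2 - timing check\n' > "$out"
    return 1
  fi
  printf 'ok 1 - setup\nok 2 - timing check\n' > "$out"
}

stash_root="$(mktemp -d)"   # the PR text uses /tmp/flake-runs
attempts=6                  # the README recipe uses 20
failed_attempts=""

for i in $(seq 1 "$attempts"); do
  mkdir -p "$stash_root/$i"
  # Keep going after a failure: the point is to collect every attempt.
  if ! run_test_once "$i" "$stash_root/$i/demo-t.log"; then
    failed_attempts="$failed_attempts $i"
  fi
  gzip "$stash_root/$i/demo-t.log"   # keep the stash gzipped, like the per-test logs
done

echo "failed attempts:$failed_attempts"

# Compare a passing attempt against a failing one; zdiff exits 1 on a difference.
zdiff "$stash_root/1/demo-t.log.gz" "$stash_root/3/demo-t.log.gz" || true
```

Stashing each attempt in its own numbered directory is what makes the later zdiff possible: the harness would otherwise overwrite the previous run's log.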

Other small improvements

Troubleshooting bullet list expanded with:
- How to recover from a lost INFRA_ID (via docker network ls — each *_backend network is a stuck infra).
- Where each flavor of log lives (per-backend under infra-*/, per-ProxySQL under proxysql/, per-test under tests/proxysql-tester.py/tests/).
- How to clean up dangling docker state (docker ps -a | grep, docker network prune).
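The lost-INFRA_ID recovery in the first bullet amounts to stripping the `_backend` suffix from network names. A minimal sketch, assuming the `${INFRA_ID}_backend` naming described in this PR; the helper name `infra_ids_from_networks` and the sample network names are made up, and the sample list stands in for real `docker network ls` output so the example runs without Docker.

```bash
# Read network names on stdin, print the INFRA_ID for each *_backend network.
# Names that do not end in _backend are dropped.
infra_ids_from_networks() {
  sed -n 's/_backend$//p'
}

# Sample input standing in for `docker network ls --format '{{.Name}}'`.
sample_networks='bridge
host
dev-alice-1_backend
none
ci-run-42_backend'

stuck_infras="$(printf '%s\n' "$sample_networks" | infra_ids_from_networks)"
echo "$stuck_infras"
```

On a host with Docker available, the same helper can be fed directly: `docker network ls --format '{{.Name}}' | infra_ids_from_networks`.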

Test plan

test/README.md parses as valid markdown, line count went from 73 → 159 (+89 net)
Visual review in GitHub's rendered markdown once the PR is open

No code or workflow changes. Doc-only.

Related

PR #5608, "ci: fix upload-artifact EACCES in all downstream test reusables" (GH-Actions branch), fixes the upload-artifact EACCES issue that was also blocking maintainers from seeing CI failure logs. With both PRs merged, a maintainer investigating a flaky test can now: (a) see the .log.gz in CI-uploaded artifacts (the #5608 fix), or (b) reproduce locally following this doc's new recipe.

Summary by CodeRabbit

Documentation
- Specifies exact on-disk log locations and layout (per-backend, ProxySQL, per-test gzipped artifacts).
- Adds concrete inspection commands for gzipped logs (zless, zcat, zgrep, zdiff).
- Expands flaky-test troubleshooting with a reusable infra lifecycle, per-attempt log capture/stash, and comparison workflow.
- Clarifies detection and cleanup of stale docker networks using docker network ls and docker network rm.

Two gaps in test/README.md surfaced while investigating flaky TAP test failures on PR #5596:

1. The README said "Check logs in ci_infra_logs/${INFRA_ID}/tests/" but did not document the actual layout below that. Per-test TAP output lives three directories deep at:

   ci_infra_logs/${INFRA_ID}/tests/proxysql-tester.py/tests/<test>-t.log.gz

   The files are gzipped, which makes `cat` and most editors do the wrong thing (binary garbage). A maintainer opening one with vim without knowing to use zless burns several minutes before figuring it out.

2. There was no guidance on how to reproduce a flaky test locally for stress-testing. The typical question "this test passed twice locally and fails on CI - is it really flaky, or is it my environment?" has no canonical answer anywhere in the repo. The natural recipe (reuse one infra, loop N runs, snapshot per-run logs, diff) is described in enough detail for a maintainer to copy/paste it without thinking.

This commit adds:

- A new "Where logs actually live after a run" section showing the full ci_infra_logs/ directory tree with per-backend, per-ProxySQL, and per-test subdirs.
- Concrete zless/zcat/zgrep/zdiff one-liners for reading the gzipped per-test logs (including the cross-test grep idiom).
- A new "Debugging a flaky test" section with a copy-pasteable loop that runs the same test 20 times against one infra, snapshotting per-attempt per-test logs for later diffing.
- An expanded Troubleshooting bullet list: how to recover from lost INFRA_ID (via `docker network ls`), where each flavor of log lives, and how to clean up stale docker state.

No code or workflow changes - doc-only update.

coderabbitai · 2026-04-11T13:06:57Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e58ef264-02b7-4122-9569-610ca4b2d3e7

📥 Commits

Reviewing files that changed from the base of the PR and between f4313be and 0617626.

📒 Files selected for processing (1)

test/README.md

✅ Files skipped from review due to trivial changes (1)

test/README.md

📝 Walkthrough

Walkthrough

Updated test README to document exact on-disk log locations after runs, include per-backend and per-test gzipped log layout, provide zless/zcat/zgrep/zdiff inspection commands, and expand flaky-test local reproduction and Docker troubleshooting steps (specific paths and targeted cleanup).

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation `test/README.md` | Added exact post-run log storage location (`ci_infra_logs/${INFRA_ID}/`), directory tree for infra/proxysql/tests logs, concrete gzipped-log inspection commands (`zless`, `zcat`, `zgrep`, `zdiff`), step-by-step flaky-test local reproduction workflow reusing infra and stashing per-attempt logs, and updated Docker troubleshooting guidance (identify stale infra via `docker network ls`, remove with `docker network rm`). |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Possibly related issues

Flaky TAP tests: track and fix timing-sensitive tests revealed by cascade fix #5610 — Adds flaky-test debugging workflow and exact log paths; this PR documents the same reproduction and log inspection steps referenced by that issue.

Poem

🐰 Hop, hop, I dug the trail of logs,

gzipped crumbs in tidy cogs.
Reproduce, compare, then mend —
one infra, many tries, no end.
Cheers from a rabbit, debugging friend!

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and accurately describes the main change: documentation updates to explain per-test log paths and add a flaky-test debugging workflow. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


gemini-code-assist

Code Review

This pull request significantly improves the testing documentation in test/README.md by adding a detailed guide on log locations, a workflow for debugging flaky tests locally, and enhanced troubleshooting instructions. The review feedback suggests refining the grep pattern used to identify test failures to ensure consistency with TAP output and recommending the removal of specific Docker networks instead of a global prune to avoid unintended side effects on other projects.

gemini-code-assist · 2026-04-11T13:08:27Z

+
+```bash
+# Which attempts had any FAIL?
+grep -l 'FAIL [1-9]' /tmp/flake-*.log


The grep pattern 'FAIL [1-9]' might not capture all failure scenarios, especially since TAP output (the standard for these tests) uses not ok to indicate a failed test. For consistency with the zgrep example provided in the "Where logs actually live" section, consider using a pattern that includes both not ok and FAIL.

Suggested change

grep -l 'FAIL [1-9]' /tmp/flake-*.log

grep -lE 'not ok|FAIL' /tmp/flake-*.log

gemini-code-assist · 2026-04-11T13:08:27Z

+- **"Directory Not Empty"**: Run `./test/infra/control/stop-proxysql-isolated.bash` with the same `INFRA_ID` that was used when you started the infra. If you lost the ID, `docker network ls` will show you active `*_backend` networks — each one is a stuck infra; the name before `_backend` is the `INFRA_ID`.
+- **Container issues**: Check logs in `ci_infra_logs/${INFRA_ID}/infra-*/` (per-backend) and `ci_infra_logs/${INFRA_ID}/proxysql/` (ProxySQL side).
+- **Test failures**: Read the per-test `.log.gz` files under `ci_infra_logs/${INFRA_ID}/tests/proxysql-tester.py/tests/` with `zless` or `zcat` — see the "Where logs actually live" section above for the full layout.
+- **Stale docker state**: `docker ps -a | grep "${INFRA_ID}"` shows any leftover containers; `docker network prune` after stopping infras cleans up dangling networks.


docker network prune is a global operation that removes all unused networks on the system, which might affect other Docker projects running on the same host. It is safer to suggest removing the specific network associated with the INFRA_ID first, while keeping prune as a secondary option for general cleanup.

Suggested change

- **Stale docker state**: `docker ps -a | grep "${INFRA_ID}"` shows any leftover containers; `docker network prune` after stopping infras cleans up dangling networks.

- **Stale docker state**: `docker ps -a | grep "${INFRA_ID}"` shows any leftover containers; `docker network rm ${INFRA_ID}_backend` cleans up the specific network, or use `docker network prune` to clean all dangling networks.
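The targeted cleanup suggested here could be wrapped in a small helper. This is a sketch only: `remove_infra_network` and its `DRY_RUN` switch are hypothetical names introduced for illustration, and the `${INFRA_ID}_backend` naming is taken from the review comment above. With `DRY_RUN=1` the helper prints the command instead of running it, so it can be tried without Docker.

```bash
# Remove only the network that belongs to one infra, never a global prune.
# With DRY_RUN=1 the command is printed instead of executed.
remove_infra_network() {
  local infra_id="${1:-}"
  if [ -z "$infra_id" ]; then
    echo "usage: remove_infra_network <INFRA_ID>" >&2
    return 2
  fi
  local network="${infra_id}_backend"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "docker network rm $network"
    return 0
  fi
  docker network rm "$network"
}

# Dry run against a made-up INFRA_ID.
DRY_RUN=1 remove_infra_network "dev-alice-1"
```

Refusing an empty INFRA_ID matters here: without the guard, an unset variable would turn the call into `docker network rm _backend`, which fails confusingly rather than explaining what was missing.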

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/README.md`:
- Around line 74-92: Add the missing language tag to the fenced code block that
begins with the directory tree starting "ci_infra_logs/${INFRA_ID}/": change the
opening triple-backtick fence from ``` to ```text so the tree block is marked as
plain text and markdownlint MD040 is satisfied; update the fence that wraps the
block containing lines like "infra-mysql57/" and "proxysql-tester.py/"
accordingly.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ec7f8c18-757f-4b92-880d-072c166d2a76

📥 Commits

Reviewing files that changed from the base of the PR and between e3e769f and f4313be.

📒 Files selected for processing (1)

test/README.md

📜 Review details

🧰 Additional context used

🧠 Learnings (4)

📓 Common learnings

Learnt from: CR
Repo: sysown/proxysql PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-11T05:43:20.598Z
Learning: Applies to test/tap/**/*.cpp : Use TAP (Test Anything Protocol) for tests with Docker-based backend infrastructure

📚 Learning: 2026-04-11T05:43:20.598Z

Learnt from: CR
Repo: sysown/proxysql PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-11T05:43:20.598Z
Learning: Applies to test/tap/**/*.cpp : Use TAP (Test Anything Protocol) for tests with Docker-based backend infrastructure

Applied to files:

test/README.md

📚 Learning: 2026-01-20T09:34:27.165Z

Learnt from: yuji-hatakeyama
Repo: sysown/proxysql PR: 5307
File: test/tap/tests/reg_test_5306-show_warnings_with_comment-t.cpp:39-48
Timestamp: 2026-01-20T09:34:27.165Z
Learning: In ProxySQL test files (test/tap/tests/), resource leaks (such as not calling `mysql_close()` on early return paths) are not typically fixed because test processes are short-lived and the OS frees resources on process exit. This is a common pattern across the test suite.

Applied to files:

test/README.md

📚 Learning: 2026-04-11T05:43:20.598Z

Learnt from: CR
Repo: sysown/proxysql PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-11T05:43:20.598Z
Learning: See `doc/agents/project-conventions.md` for ProxySQL-specific rules including directories, build, test harness, and git workflow

Applied to files:

test/README.md

🪛 markdownlint-cli2 (0.22.0)

test/README.md

[warning] 74-74: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (1)

test/README.md (1)
156-156: The README wording at line 156 is accurate and requires no changes. The Docker network name is constructed as ${INFRA_ID}_backend (not as a composite with INFRA prefix). Therefore, the instruction to extract INFRA_ID by taking the text before _backend is correct.
			> Likely an incorrect or invalid review comment.

…ction

Three minor corrections from the PR #5609 review:

1. (coderabbit, MD040) The directory-tree fenced block at L74 had no language tag. Added `text` so markdownlint and GitHub's renderer treat it as plain text rather than guessing a language that would then syntax-highlight the tree characters incorrectly.
2. (gemini, L142) The flake-debug loop's "which attempts failed" grep was too narrow: `grep 'FAIL [1-9]'` only matches the proxysql-tester.py summary line "SUMMARY: 'tests' PASS N/M : FAIL K/M" when K is non-zero - but misses tests that fail at the TAP level via "not ok" without incrementing the FAIL counter (e.g. when the binary crashes or times out). Broadened to `grep -lE 'not ok|FAIL [1-9]'` so both paths are caught, matching the idiom already used in the "Where logs actually live" section above.
3. (gemini, L159) The Troubleshooting bullet suggested `docker network prune` as the cleanup idiom for dangling networks. `network prune` is a GLOBAL operation - it would also wipe networks belonging to unrelated Docker projects on the same host, which is user-hostile if a maintainer has other test infras running in parallel. Changed to prefer the targeted form `docker network rm "${INFRA_ID}_backend"` and explicitly call out why `prune` is the wrong default.

No change to the "Where logs actually live" section, the "Debugging a flaky test" loop itself, or any of the other content. Doc-only update on top of the earlier commit.
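The difference between the three grep patterns discussed in this review can be checked on synthetic logs. In the sketch below the summary-line format follows the one quoted in the commit message above; the file names, test names, and counts are made up.

```bash
work="$(mktemp -d)"

# Attempt 1: clean pass, summary reports zero failures.
printf "ok 1 - a\nSUMMARY: 'tests' PASS 1/1 : FAIL 0/1\n" > "$work/flake-1.log"
# Attempt 2: TAP-level failure with no summary line (e.g. a crash or timeout).
printf "not ok 1 - a\n" > "$work/flake-2.log"
# Attempt 3: summary line reports one failure.
printf "ok 1 - a\nSUMMARY: 'tests' PASS 0/1 : FAIL 1/1\n" > "$work/flake-3.log"

# Original pattern: only sees a non-zero FAIL counter.
narrow="$(cd "$work" && grep -l 'FAIL [1-9]' flake-*.log | paste -sd ' ' -)"
# Reviewer's suggestion: also matches "FAIL 0/1" in a passing summary.
loose="$(cd "$work" && grep -lE 'not ok|FAIL' flake-*.log | paste -sd ' ' -)"
# Pattern adopted in the follow-up commit.
both="$(cd "$work" && grep -lE 'not ok|FAIL [1-9]' flake-*.log | paste -sd ' ' -)"

echo "narrow: $narrow"
echo "loose:  $loose"
echo "both:   $both"
```

On these samples the narrow pattern misses attempt 2, the bare `FAIL` alternative flags the passing attempt 1 as well, and the combined pattern reports attempts 2 and 3 only.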

sonarqubecloud · 2026-04-11T13:17:45Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

renecannao mentioned this pull request Apr 11, 2026

Flaky TAP tests: track and fix timing-sensitive tests revealed by cascade fix #5610

Closed

5 tasks

gemini-code-assist bot reviewed Apr 11, 2026

View reviewed changes

coderabbitai bot reviewed Apr 11, 2026

View reviewed changes

Comment thread test/README.md Outdated

renecannao merged commit ca1a7a4 into v3.0 Apr 11, 2026
3 checks passed

This was referenced Apr 11, 2026

test(flush_logs): harden timing race — poll instead of fixed sleep #5611

Merged

test(pgsql-ssl_keylog): fix container-local path + NSS label regex #5612

Merged

renecannao merged 2 commits into v3.0 from v3.0-doc-test-readme
