
CI test fixes: cluster sync, plan counts, test group assignments, and diagnostics#5538

Merged
renecannao merged 20 commits into v3.0 from v3.0-ci260323
Mar 24, 2026
Conversation

Contributor

@renecannao renecannao commented Mar 23, 2026

Summary

Fixes multiple TAP test failures and CI infrastructure issues discovered during CI migration testing.

Test fixes

  • test_tls_stats-t: Fix plan(29) → plan(27) — the test executes 27 ok() calls on the happy path; the plan was miscounted
  • test_query_cache_soft_ttl_pct-t: Add tolerance of +1 to cache timing assertions — the soft TTL refresh race allows two threads to both see "needs refresh" simultaneously
  • pgsql-transaction_variable_state_tracking-t: Replace superuser-only allow_in_place_tablespaces with user-settable enable_mergejoin (fixes pgsql-transaction_variable_state_tracking-t: permission denied on superuser-only variables #5536)
  • test_auth_methods-t: Call stop_internal_noise_threads() before early return on MySQL 5.7, preventing crash from joinable std::thread destructors (fixes test_auth_methods-t: crashes with 'terminate called without an active exception' on MySQL 5.7 #5537)
  • test_cluster_sync-t: Use 127.0.0.1 for locally-spawned replica instead of cl.host (Docker hostname of main container). Set cluster_mysql_servers_sync_algorithm=3 on replica so it uses v2 sync (config-only, ignores monitor-shunned status)

Test group reassignments

  • reg_test_4935-caching_sha2-t: Remove from legacy-g4 — caching_sha2_password requires MySQL 8.0+
  • test_auth_methods-t: Remove from legacy-g2 — requires MySQL 8.0+

Cluster sync fixes

  • Align the primary's admin variables (admin-debug, admin-restapi_enabled) with the cluster nodes so admin_variables checksums match and sync converges on satellite nodes (fixes #5535)
  • Start TAP_CLUSTER_NODES with the primary (proxysql:6032) so test_cluster1-t operates on the primary plus core nodes 1-3 instead of the unregistered node 4

Coverage infrastructure overhaul

Fixed machine stalls (load >1000, 5GB+ resident per gcov process) and 0% coverage reports in multi-group CI runs. Root causes:

  1. Shared workspace corruption: All test-runner containers wrote TAP test .gcda files to the same ${WORKSPACE} paths. Concurrent gcov processes on the same .gcda files caused memory bloat and data corruption.

    • Fix: Set GCOV_PREFIX=/gcov/tap on test-runner containers so TAP .gcda files go to per-INFRA_ID /gcov/tap/ directories.
  2. Per-container fastcov storm: Each container ran fastcov -j$(nproc) in its EXIT trap. With 5+ containers finishing simultaneously, this spawned 300+ gcov processes.

    • Fix: Pass MULTI_GROUP=1 from run-multi-group.bash to test-runners. When set, the EXIT trap copies .gcno adjacent to .gcda (prep work) but skips fastcov/genhtml. Coverage collection runs once centrally after all groups finish, sequentially per group with -j4.
  3. Path mismatch → 0% coverage: GCOV_PREFIX_STRIP=3 placed .gcda at proxysql/lib/obj/ but .gcno was copied to lib/obj/ — gcov couldn't correlate them. Additionally, the --include filter expected full workspace paths but gcov records bare filenames.

    • Fix: The .gcno copy now strips leading path components until it finds the match. Replaced the --include filter with --exclude filters for /usr and deps, since the project's source paths are recorded as bare filenames.

libparser.so loading fix

  • Add LD_LIBRARY_PATH for unit tests (SKIP_PROXYSQL path) and TAP_DEPS_PATH for Docker-based tests — libcpp_dotenv.so transitively needs libparser.so but carries a stale RUNPATH from the original build

Diagnostics

  • Add dump_hostgroup_debug() utility function — dumps runtime_mysql_servers and stats_mysql_connection_pool for a given hostgroup. Added to test_throttle, test_warnings, and test_unsupported_queries.

Fixes

#5535, #5536, #5537

Test plan

  • Run test_cluster_sync-t in isolation — no longer stalls, replica connects
  • Run run-multi-group.bash with all groups — no machine stalls during coverage collection
  • Verify coverage reports show >0% for lib/ and src/ (no /usr or deps)
  • Verify test_cluster1-t passes (primary in proxysql_servers, no duplicate INSERT)
  • Verify unit-tests-g1 passes (libparser.so loads correctly)
  • Verify standalone run-tests-isolated.bash still generates coverage

- test_tls_stats-t: plan(29) -> plan(27). The test executes 27 ok()
  calls on the happy path; plan was miscounted during authoring.
- reg_test_4935-caching_sha2-t: remove from legacy-g4 (MySQL 5.7
  doesn't support caching_sha2_password). Keep in mysql84-g4.
- Add dump_hostgroup_debug() utility: dumps runtime_mysql_servers and
  stats_mysql_connection_pool for a given hostgroup via diag() output.
  Helps diagnose "hostgroup has no servers available" CI failures.

- test_throttle_max_bytes_per_second_to_client-t: dump hostgroup 1300,
  0, and 1 when create_table_test_sbtest1() fails.

- test_warnings-t: dump target hostgroup state before running tests.

- test_unsupported_queries-t: dump target hostgroup state before
  running tests.

- test_query_cache_soft_ttl_pct-t: add tolerance of +1 to all three
  assertions. The soft TTL refresh race allows two threads to both
  see "needs refresh" simultaneously under load, producing one extra
  backend hit. This is a known property of the soft TTL mechanism.
- test_auth_methods-t: remove from legacy-g2. The test requires MySQL 8.0+
  (it detects 5.7 and skips, but crashes with 'terminate called without an
  active exception' during noise thread cleanup). The crash-on-skip is a
  separate test bug.
- pgsql-transaction_variable_state_tracking-t: replace superuser-only
  variable allow_in_place_tablespaces with user-settable enable_mergejoin.
  Fixes #5536.

- test_auth_methods-t: call stop_internal_noise_threads() before early
  return on MySQL 5.7, preventing 'terminate called without an active
  exception' crash from joinable std::thread destructors. Fixes #5537.
…nodes

The primary had admin-debug=false and admin-restapi_enabled=false
(defaults), while cluster nodes had them set to true. Since these
variables are NOT in the CLUSTER_SYNC_INTERFACES_ADMIN exclusion
list, they're included in the checksum. The checksum mismatch
between primary and nodes caused admin_variables sync to never
converge on satellite nodes.

Now the primary sets the same admin-restapi_enabled=true and
admin-debug=true as all nodes, ensuring matching checksums.

Fixes #5535.
test_cluster1-t expects conns[0] to be the primary ProxySQL instance.
TAP_CLUSTER_NODES only listed nodes 1-9 (ports 6042-6122), so the
test loop operated on nodes 1-4 instead of primary + nodes 1-3.

Node 4 (port 6072) is NOT in proxysql_servers (only core nodes 1-3
are registered), so changes on node 4 were invisible to the cluster —
sync could never detect them, causing the 30-second timeout.

Now TAP_CLUSTER_NODES starts with proxysql:6032 (primary), matching
the old cluster_ports layout where 6032 was the first entry.

Fixes #5535.
Copilot AI review requested due to automatic review settings March 23, 2026 16:59
@coderabbitai

coderabbitai bot commented Mar 23, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

📝 Walkthrough

Ensure ProxySQL primary endpoint is first in constructed cluster node lists when nodes exist; snapshot/restore GCOV env around proxysql startup; add a hostgroup debug dump helper and call it from several tests; relax some race assertions; adjust TAP plans and test group JSON; broaden coverage file collection.

Changes

  • Cluster setup & startup scripts (test/infra/control/env-isolated.bash, test/infra/control/start-proxysql-isolated.bash): prepend proxysql:6032 to the cluster node list when NUM_CLUSTER_NODES > 0; keep CLUSTER_NODES empty when zero; snapshot GCOV_PREFIX/GCOV_PREFIX_STRIP into saved vars before unsetting and re-export them before exec; add admin-cluster_mysql_servers_sync_algorithm=3 to admin config updates.
  • Coverage collection (test/infra/control/run-tests-isolated.bash): the collect_coverage trap now copies both *.gcno and *.gcda into /gcov (broadened find/copy logic).
  • Test helper: hostgroup debug (test/tap/tap/utils.h, test/tap/tap/utils.cpp): add void dump_hostgroup_debug(MYSQL* admin, int hostgroup), which queries and logs runtime_mysql_servers and stats_mysql_connection_pool; no-op if admin is NULL.
  • Tests invoking debug dump (test/tap/tests/test_throttle_max_bytes_per_second_to_client-t.cpp, test/tap/tests/test_unsupported_queries-t.cpp, test/tap/tests/test_warnings-t.cpp): call dump_hostgroup_debug() in diagnostic/failure paths (when admin access is available) to emit hostgroup runtime/stats state.
  • Replica targeting in cluster tests (test/tap/tests/test_cluster_sync-t.cpp): introduce R_HOST = "127.0.0.1" and redirect replica ProxySQL admin/debug connections to that host while keeping replica ports.
  • Thread cleanup on skip (test/tap/tests/test_auth_methods-t.cpp): on the backend-version skip path, call stop_internal_noise_threads() before returning to avoid joinable-thread destruction.
  • Race-tolerance adjustments (test/tap/tests/test_query_cache_soft_ttl_pct-t.cpp): relax assertions to accept a small tolerance (+1) for concurrent soft-TTL expiry races when counting slow clients, cache hits, and hostgroup hits.
  • TAP plan & test data changes (test/tap/tests/test_tls_stats-t.cpp, test/tap/groups/groups.json): reduce TAP plan count from 29 to 27; remove legacy-g4/legacy-g2 entries in two groups, narrowing lists to mysql84-* entries while keeping flags.
  • PostgreSQL tracked variable (test/tap/tests/pgsql-transaction_variable_state_tracking-t.cpp): replace superuser-only allow_in_place_tablespaces with enable_mergejoin to avoid permission failures.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰
I hopped into tests at the break of dawn,
Put the primary proxy proudly on the lawn,
Quieted threads that chattered in the night,
Dug up hostgroup whispers for a clearer light,
And left the logs a little more polite.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 10.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Title check — ✅ Passed: the title accurately summarizes the main changes: CI test fixes covering cluster sync, plan counts, test group assignments, and diagnostics.
  • Linked Issues check — ✅ Passed: all code changes address the specified objectives: #5536 via pgsql variable replacement, #5537 via the stop_internal_noise_threads() call, #5535 via TAP_CLUSTER_NODES primary inclusion and admin config alignment.
  • Out of Scope Changes check — ✅ Passed: all changes relate directly to the linked issues: test fixes, group reassignments, cluster sync configuration, GCOV handling, and diagnostic utilities align with PR objectives.
  • Description check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.


@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses multiple CI test failures and significantly enhances diagnostic capabilities within the test suite. It includes critical fixes for incorrect test plan counts, introduces necessary tolerance for race conditions in cache timing tests, and resolves issues with test group assignments to ensure compatibility with MySQL versions. Furthermore, the changes correct cluster synchronization logic by properly including and configuring the primary node, and restore accurate code coverage reporting for the primary process. A new debugging utility has been added and integrated into key tests to streamline troubleshooting of hostgroup-related failures.

Highlights

  • Test Fixes: Corrected the expected plan() count in test_tls_stats-t from 29 to 27 to accurately reflect the number of ok() calls. Added a +1 tolerance to cache timing assertions in test_query_cache_soft_ttl_pct-t to account for soft TTL refresh race conditions. Replaced the superuser-only allow_in_place_tablespaces variable with the user-settable enable_mergejoin in pgsql-transaction_variable_state_tracking-t. Ensured stop_internal_noise_threads() is called before early return in test_auth_methods-t to prevent crashes on MySQL 5.7.
  • Test Group Reassignments: Removed reg_test_4935-caching_sha2-t and test_auth_methods-t from legacy-gX test groups as they require MySQL 8.0+.
  • Cluster Sync Fixes: Updated TAP_CLUSTER_NODES to include the primary ProxySQL node (port 6032) as the first entry, addressing test_cluster1-t expectations. Configured admin-restapi_enabled and admin-debug on the primary ProxySQL node to ensure consistent admin variable checksums across the cluster.
  • Coverage Fix: Restored GCOV_PREFIX for the primary ProxySQL process after it was unset for background cluster nodes, fixing coverage report generation.
  • Diagnostics: Introduced a new dump_hostgroup_debug() utility function in utils.h/cpp to aid in diagnosing "hostgroup has no servers" failures by dumping runtime_mysql_servers and stats_mysql_connection_pool information. Integrated dump_hostgroup_debug() into test_throttle, test_warnings, and test_unsupported_queries for enhanced diagnostics.


Contributor

Copilot AI left a comment


Pull request overview

This PR addresses multiple TAP/CI failures surfaced during CI migration testing by adjusting flaky timing assertions, fixing test plans/cleanup, correcting cluster node environment/setup for single-container clusters, and adding targeted diagnostics to aid future debugging.

Changes:

  • Fix/relax several TAP tests (plan count correction, timing tolerance for soft TTL race, non-superuser PgSQL variable, and safe cleanup on early-skip paths).
  • Correct isolated cluster startup/env behavior (ensure primary is first in TAP_CLUSTER_NODES, align primary admin vars with nodes, and restore GCOV env for coverage).
  • Add dump_hostgroup_debug() test helper and invoke it in a few tests to print hostgroup/pool state when failures occur.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

Summary per file:

  • test/tap/tests/test_warnings-t.cpp: adds a hostgroup debug dump at test start to aid "hostgroup unreachable" diagnosis.
  • test/tap/tests/test_unsupported_queries-t.cpp: adds a hostgroup debug dump and attempts to auto-detect the test user's default hostgroup.
  • test/tap/tests/test_tls_stats-t.cpp: corrects the TAP plan() count to match executed assertions.
  • test/tap/tests/test_throttle_max_bytes_per_second_to_client-t.cpp: dumps hostgroup state when table setup fails, improving failure diagnostics.
  • test/tap/tests/test_query_cache_soft_ttl_pct-t.cpp: adds tolerance to assertions to accommodate a known soft-TTL refresh race.
  • test/tap/tests/test_auth_methods-t.cpp: stops internal noise threads on early skip to avoid std::thread destructor termination.
  • test/tap/tests/pgsql-transaction_variable_state_tracking-t.cpp: replaces a superuser-only variable with a user-settable alternative.
  • test/tap/tap/utils.h: declares the dump_hostgroup_debug() helper for TAP tests.
  • test/tap/tap/utils.cpp: implements dump_hostgroup_debug(), querying runtime servers and pool stats.
  • test/tap/groups/groups.json: reassigns MySQL 8.0+ dependent tests out of legacy groups.
  • test/infra/control/start-proxysql-isolated.bash: restores GCOV env for the primary after unsetting it for background nodes; aligns primary admin vars with nodes.
  • test/infra/control/env-isolated.bash: ensures the primary (6032) is first in TAP_CLUSTER_NODES for cluster tests.


Comment on lines +557 to +558
if (dbg_admin && mysql_real_connect(dbg_admin, cl.host, cl.admin_username, cl.admin_password, NULL, cl.admin_port, NULL, 0)) {
dump_hostgroup_debug(dbg_admin, target_hg);

Copilot AI Mar 23, 2026


dbg_admin is only closed on successful mysql_real_connect(). If the connect fails, the handle returned by mysql_init() is leaked. Close dbg_admin whenever it is non-null (even after a failed connect) to keep the test clean and avoid accumulating connections/handles in long CI runs.

Suggested change
if (dbg_admin && mysql_real_connect(dbg_admin, cl.host, cl.admin_username, cl.admin_password, NULL, cl.admin_port, NULL, 0)) {
dump_hostgroup_debug(dbg_admin, target_hg);
if (dbg_admin) {
if (mysql_real_connect(dbg_admin, cl.host, cl.admin_username, cl.admin_password, NULL, cl.admin_port, NULL, 0)) {
dump_hostgroup_debug(dbg_admin, target_hg);
}

Comment on lines +542 to +554
if (dbg_admin && mysql_real_connect(dbg_admin, cl.host, cl.admin_username, cl.admin_password, NULL, cl.admin_port, NULL, 0)) {
// Get default hostgroup for the test user
std::string hg_query = "SELECT default_hostgroup FROM mysql_users WHERE username='" + std::string(cl.username) + "' LIMIT 1";
int target_hg = 0;
if (mysql_query(dbg_admin, hg_query.c_str()) == 0) {
MYSQL_RES *res = mysql_store_result(dbg_admin);
if (res) {
MYSQL_ROW row = mysql_fetch_row(res);
if (row) target_hg = atoi(row[0]);
mysql_free_result(res);
}
}
dump_hostgroup_debug(dbg_admin, target_hg);

Copilot AI Mar 23, 2026


dbg_admin is only closed when mysql_real_connect() succeeds. If the connect fails, the handle created by mysql_init() is leaked. Ensure mysql_close(dbg_admin) is called whenever dbg_admin is non-null, regardless of connection success.

Suggested change
if (dbg_admin && mysql_real_connect(dbg_admin, cl.host, cl.admin_username, cl.admin_password, NULL, cl.admin_port, NULL, 0)) {
// Get default hostgroup for the test user
std::string hg_query = "SELECT default_hostgroup FROM mysql_users WHERE username='" + std::string(cl.username) + "' LIMIT 1";
int target_hg = 0;
if (mysql_query(dbg_admin, hg_query.c_str()) == 0) {
MYSQL_RES *res = mysql_store_result(dbg_admin);
if (res) {
MYSQL_ROW row = mysql_fetch_row(res);
if (row) target_hg = atoi(row[0]);
mysql_free_result(res);
}
}
dump_hostgroup_debug(dbg_admin, target_hg);
if (dbg_admin) {
if (mysql_real_connect(dbg_admin, cl.host, cl.admin_username, cl.admin_password, NULL, cl.admin_port, NULL, 0)) {
// Get default hostgroup for the test user
std::string hg_query = "SELECT default_hostgroup FROM mysql_users WHERE username='" + std::string(cl.username) + "' LIMIT 1";
int target_hg = 0;
if (mysql_query(dbg_admin, hg_query.c_str()) == 0) {
MYSQL_RES *res = mysql_store_result(dbg_admin);
if (res) {
MYSQL_ROW row = mysql_fetch_row(res);
if (row) target_hg = atoi(row[0]);
mysql_free_result(res);
}
}
dump_hostgroup_debug(dbg_admin, target_hg);
}

MYSQL* dbg_admin = mysql_init(NULL);
if (dbg_admin && mysql_real_connect(dbg_admin, cl.host, cl.admin_username, cl.admin_password, NULL, cl.admin_port, NULL, 0)) {
// Get default hostgroup for the test user
std::string hg_query = "SELECT default_hostgroup FROM mysql_users WHERE username='" + std::string(cl.username) + "' LIMIT 1";

Copilot AI Mar 23, 2026


The query that looks up default_hostgroup concatenates cl.username directly into SQL without escaping. If the username contains quotes (or other special chars), this breaks the query and can make the diagnostic dump misleading. Use mysql_real_escape_string() (or an existing helper) to escape the username before building the SQL string.

Suggested change
std::string hg_query = "SELECT default_hostgroup FROM mysql_users WHERE username='" + std::string(cl.username) + "' LIMIT 1";
std::string escaped_username;
if (cl.username) {
size_t uname_len = std::strlen(cl.username);
// mysql_real_escape_string can at most double the size of the string, plus one byte for terminator
escaped_username.resize(uname_len * 2 + 1);
unsigned long escaped_len = mysql_real_escape_string(dbg_admin, &escaped_username[0], cl.username, uname_len);
escaped_username.resize(escaped_len);
} else {
escaped_username = "";
}
std::string hg_query = "SELECT default_hostgroup FROM mysql_users WHERE username='" + escaped_username + "' LIMIT 1";

TAP test binaries are compiled with gcov but their .gcda files are
written to the source tree (next to the binaries), not to /gcov.
The coverage collection only scanned /gcov, missing all TAP test
coverage data.

Now copies both .gcno and .gcda files from the workspace into /gcov
before running fastcov, so TAP test coverage is included in reports.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
test/tap/tap/utils.cpp (1)

467-478: Consider wrapping MYSQL_RES* with a smart pointer and custom deleter for better resource management.

This new helper manually manages resultsets with mysql_free_result(). Per coding guidelines for **/*.cpp, RAII should be used for resource management. A smart pointer with a custom deleter wrapping mysql_free_result() would make this safer and align with project standards. This applies to both the runtime_mysql_servers query (lines 467-478) and stats_mysql_connection_pool query (lines 489-500).

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/tap/tap/utils.cpp`:
- Around line 467-479: In dump_hostgroup_debug, add diagnostics for when
mysql_store_result(admin) returns NULL: after the MYSQL_RES* res =
mysql_store_result(admin) call, detect if res == NULL and call diag to log a
clear message including mysql_error(admin) (and optionally mysql_errno(admin))
so callers can distinguish a failed result retrieval from an empty result set;
keep the existing branch that handles non-NULL results (including the
mysql_num_rows(res) == 0 check) unchanged.

In `@test/tap/tests/pgsql-transaction_variable_state_tracking-t.cpp`:
- Line 281: The test adds "enable_mergejoin" but ProxySQL doesn't track it;
either remove that line from the test or add support for the variable by adding
an entry for enable_mergejoin to the pgsql_variable_name enum (in
proxysql_structs.h), include "enable_mergejoin" in the pgsql_tracked_variables[]
table and the pgsql_other_variables set so ProxySQL will intercept SETs for it;
update all three places (enum, pgsql_tracked_variables[], pgsql_other_variables)
to ensure consistent tracking if you choose to add rather than delete the test
line.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 24bfee16-492c-4244-91f2-26cab3eb27cb

📥 Commits

Reviewing files that changed from the base of the PR and between d1c86b1 and 0f12f5d.

📒 Files selected for processing (12)
  • test/infra/control/env-isolated.bash
  • test/infra/control/start-proxysql-isolated.bash
  • test/tap/groups/groups.json
  • test/tap/tap/utils.cpp
  • test/tap/tap/utils.h
  • test/tap/tests/pgsql-transaction_variable_state_tracking-t.cpp
  • test/tap/tests/test_auth_methods-t.cpp
  • test/tap/tests/test_query_cache_soft_ttl_pct-t.cpp
  • test/tap/tests/test_throttle_max_bytes_per_second_to_client-t.cpp
  • test/tap/tests/test_tls_stats-t.cpp
  • test/tap/tests/test_unsupported_queries-t.cpp
  • test/tap/tests/test_warnings-t.cpp
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: CI-builds / builds (debian12,-dbg)
  • GitHub Check: CI-builds / builds (ubuntu24,-tap-genai-gcov)
  • GitHub Check: CI-builds / builds (ubuntu22,-tap)
  • GitHub Check: Agent
  • GitHub Check: run / trigger
  • GitHub Check: claude-review
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,h,hpp}

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.{cpp,h,hpp}: Class names must use PascalCase with protocol prefixes (MySQL_, PgSQL_, ProxySQL_)
Member variables must use snake_case
Constants and macros must use UPPER_SNAKE_CASE
C++17 is required; use conditional compilation via #ifdef PROXYSQLGENAI, #ifdef PROXYSQL31, etc. for feature flags
Use pthread mutexes for synchronization and std::atomic<> for counters

Files:

  • test/tap/tests/test_throttle_max_bytes_per_second_to_client-t.cpp
  • test/tap/tests/test_tls_stats-t.cpp
  • test/tap/tests/test_auth_methods-t.cpp
  • test/tap/tests/test_unsupported_queries-t.cpp
  • test/tap/tests/test_warnings-t.cpp
  • test/tap/tests/pgsql-transaction_variable_state_tracking-t.cpp
  • test/tap/tap/utils.h
  • test/tap/tests/test_query_cache_soft_ttl_pct-t.cpp
  • test/tap/tap/utils.cpp
**/*.cpp

📄 CodeRabbit inference engine (CLAUDE.md)

Use RAII for resource management and jemalloc for memory allocation

Files:

  • test/tap/tests/test_throttle_max_bytes_per_second_to_client-t.cpp
  • test/tap/tests/test_tls_stats-t.cpp
  • test/tap/tests/test_auth_methods-t.cpp
  • test/tap/tests/test_unsupported_queries-t.cpp
  • test/tap/tests/test_warnings-t.cpp
  • test/tap/tests/pgsql-transaction_variable_state_tracking-t.cpp
  • test/tap/tests/test_query_cache_soft_ttl_pct-t.cpp
  • test/tap/tap/utils.cpp
test/tap/tests/{test_*,*-t}.cpp

📄 CodeRabbit inference engine (CLAUDE.md)

Test files in test/tap/tests/ must follow the naming pattern test_*.cpp or *-t.cpp

Files:

  • test/tap/tests/test_throttle_max_bytes_per_second_to_client-t.cpp
  • test/tap/tests/test_tls_stats-t.cpp
  • test/tap/tests/test_auth_methods-t.cpp
  • test/tap/tests/test_unsupported_queries-t.cpp
  • test/tap/tests/test_warnings-t.cpp
  • test/tap/tests/pgsql-transaction_variable_state_tracking-t.cpp
  • test/tap/tests/test_query_cache_soft_ttl_pct-t.cpp
🧠 Learnings (8)
📓 Common learnings
Learnt from: yuji-hatakeyama
Repo: sysown/proxysql PR: 5307
File: test/tap/tests/reg_test_5306-show_warnings_with_comment-t.cpp:39-48
Timestamp: 2026-01-20T09:34:27.165Z
Learning: In ProxySQL test files (test/tap/tests/), resource leaks (such as not calling `mysql_close()` on early return paths) are not typically fixed because test processes are short-lived and the OS frees resources on process exit. This is a common pattern across the test suite.
Learnt from: yuji-hatakeyama
Repo: sysown/proxysql PR: 5307
File: test/tap/tests/reg_test_5306-show_warnings_with_comment-t.cpp:24-28
Timestamp: 2026-01-20T07:40:34.938Z
Learning: In ProxySQL test files, calling `mysql_error(NULL)` after `mysql_init()` failure is safe because the MariaDB client library implementation returns an empty string for NULL handles (not undefined behavior).
📚 Learning: 2026-03-22T14:38:16.082Z
Learnt from: CR
Repo: sysown/proxysql PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-22T14:38:16.082Z
Learning: Applies to test/tap/tests/unit/**/*.cpp : Unit tests in test/tap/tests/unit/ must use test_globals.h and test_init.h and link against libproxysql.a via the custom test harness

Applied to files:

  • test/tap/tests/test_throttle_max_bytes_per_second_to_client-t.cpp
  • test/tap/tests/test_tls_stats-t.cpp
  • test/tap/tests/test_auth_methods-t.cpp
  • test/tap/tests/test_unsupported_queries-t.cpp
  • test/tap/tests/test_warnings-t.cpp
  • test/tap/tap/utils.h
  • test/tap/groups/groups.json
  • test/tap/tests/test_query_cache_soft_ttl_pct-t.cpp
  • test/infra/control/env-isolated.bash
  • test/tap/tap/utils.cpp
📚 Learning: 2026-01-20T09:34:19.124Z
Learnt from: yuji-hatakeyama
Repo: sysown/proxysql PR: 5307
File: test/tap/tests/reg_test_5306-show_warnings_with_comment-t.cpp:39-48
Timestamp: 2026-01-20T09:34:19.124Z
Learning: In ProxySQL's TAP test suite, resource leaks (e.g., not calling mysql_close() on early return paths) are commonly tolerated because test processes are short-lived and OS frees resources on exit. This pattern applies to all C++ test files under test/tap/tests. When reviewing, recognize this as a project-wide test convention and focus on test correctness and isolation rather than insisting on fixing such leaks in these test files.

Applied to files:

  • test/tap/tests/test_throttle_max_bytes_per_second_to_client-t.cpp
  • test/tap/tests/test_tls_stats-t.cpp
  • test/tap/tests/test_auth_methods-t.cpp
  • test/tap/tests/test_unsupported_queries-t.cpp
  • test/tap/tests/test_warnings-t.cpp
  • test/tap/tests/pgsql-transaction_variable_state_tracking-t.cpp
  • test/tap/tests/test_query_cache_soft_ttl_pct-t.cpp
📚 Learning: 2026-01-20T07:40:34.938Z
Learnt from: yuji-hatakeyama
Repo: sysown/proxysql PR: 5307
File: test/tap/tests/reg_test_5306-show_warnings_with_comment-t.cpp:24-28
Timestamp: 2026-01-20T07:40:34.938Z
Learning: In ProxySQL test files, calling `mysql_error(NULL)` after `mysql_init()` failure is safe because the MariaDB client library implementation returns an empty string for NULL handles (not undefined behavior).

Applied to files:

  • test/tap/tests/test_throttle_max_bytes_per_second_to_client-t.cpp
  • test/tap/tests/test_tls_stats-t.cpp
  • test/tap/tests/test_auth_methods-t.cpp
  • test/tap/tests/test_unsupported_queries-t.cpp
  • test/tap/tests/test_warnings-t.cpp
  • test/infra/control/start-proxysql-isolated.bash
  • test/tap/tap/utils.cpp
📚 Learning: 2026-02-13T09:29:39.713Z
Learnt from: wazir-ahmed
Repo: sysown/proxysql PR: 5372
File: test/tap/tap/mcp_client.cpp:355-385
Timestamp: 2026-02-13T09:29:39.713Z
Learning: In ProxySQL MCP implementation (test/tap/tap/mcp_client.cpp), the `check_server()` method uses the ping endpoint which is designed to work without authentication. The `ping` method at the `config` endpoint should not require the `Authorization: Bearer` header, unlike tool invocation endpoints which do require authentication when `auth_token_` is set.

Applied to files:

  • test/tap/tests/test_auth_methods-t.cpp
📚 Learning: 2026-03-22T14:38:16.082Z
Learnt from: CR
Repo: sysown/proxysql PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-22T14:38:16.082Z
Learning: Applies to **/*.{cpp,h,hpp} : C++17 is required; use conditional compilation via `#ifdef` PROXYSQLGENAI, `#ifdef` PROXYSQL31, etc. for feature flags

Applied to files:

  • test/tap/tests/test_auth_methods-t.cpp
  • test/tap/tests/test_unsupported_queries-t.cpp
  • test/tap/tests/test_warnings-t.cpp
📚 Learning: 2026-01-20T09:34:27.165Z
Learnt from: yuji-hatakeyama
Repo: sysown/proxysql PR: 5307
File: test/tap/tests/reg_test_5306-show_warnings_with_comment-t.cpp:39-48
Timestamp: 2026-01-20T09:34:27.165Z
Learning: In ProxySQL test files (test/tap/tests/), resource leaks (such as not calling `mysql_close()` on early return paths) are not typically fixed because test processes are short-lived and the OS frees resources on process exit. This is a common pattern across the test suite.

Applied to files:

  • test/infra/control/start-proxysql-isolated.bash
  • test/tap/groups/groups.json
  • test/infra/control/env-isolated.bash
📚 Learning: 2026-03-22T14:38:16.082Z
Learnt from: CR
Repo: sysown/proxysql PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-22T14:38:16.082Z
Learning: Feature tiers are controlled by build flags: PROXYSQL31=1 for v3.1.x, PROXYSQLGENAI=1 for v4.0.x; PROXYSQLGENAI=1 implies PROXYSQL31=1

Applied to files:

  • test/infra/control/start-proxysql-isolated.bash
🔇 Additional comments (12)
test/tap/tests/test_query_cache_soft_ttl_pct-t.cpp (2)

182-194: LGTM! Tolerance logic correctly addresses soft-TTL race condition.

The range check [expected_num_slow_clients, expected_num_slow_clients + tolerance] properly handles the case where multiple threads may observe the cache entry as needing refresh before any single thread completes the backend query. The diagnostic message clearly shows the expected range and actual value.


198-213: LGTM! Cache and hostgroup tolerance correctly maintains invariant.

The tolerance relationship is sound: one additional hostgroup hit (backend query) corresponds to one fewer cache hit, preserving the total query count. The computed deltas (stats_after - stats_before) properly isolate this test run's contributions.

test/tap/tests/test_tls_stats-t.cpp (1)

123-125: LGTM! Plan count correction is accurate.

Verified the test count: 27 ok() calls on the happy path matches the corrected plan(27). The comment updates for Parts 3-4 (tests 23-27) are consistent with the code.

Minor observation: The earlier comments (lines 118-122) have some numbering inconsistencies with actual test counts (e.g., // 1-5: but there are 6 TLS variables), but since those weren't part of this change, it's fine to address separately if desired.
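Plan miscounts like the one fixed here can be sanity-checked with a quick grep over the test source. A minimal sketch using a hypothetical miniature test file (in practice you would point grep at the real file, e.g. test/tap/tests/test_tls_stats-t.cpp); note this only approximates the happy-path count when ok() calls sit inside loops or conditionals:

```shell
#!/bin/sh
# Hypothetical miniature TAP test source, written to a temp file.
cat > /tmp/sample-t.cpp <<'EOF'
int main() {
	plan(2);
	ok(1 == 1, "first check");
	ok(2 == 2, "second check");
	return exit_status();
}
EOF
# Count ok() invocations and compare against the declared plan() value.
grep -c 'ok(' /tmp/sample-t.cpp
# -> 2
```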

test/tap/groups/groups.json (1)

185-185: Good test-group compatibility alignment.

These updates correctly scope MySQL 8.0+-dependent tests away from legacy groups while keeping the intended variant groups.

Also applies to: 220-220

test/infra/control/env-isolated.bash (1)

62-77: Primary-first cluster ordering is correctly implemented.

This change makes TAP_CLUSTER_NODES deterministic and consistent with cluster tests that expect the primary endpoint first.

test/tap/tests/test_throttle_max_bytes_per_second_to_client-t.cpp (1)

48-50: Helpful failure-path diagnostics added.

Dumping hostgroup state here is a good CI-debug improvement and keeps behavior isolated to the error path.

test/tap/tests/test_auth_methods-t.cpp (1)

1144-1145: Early-return thread cleanup is correctly fixed.

Calling stop_internal_noise_threads() before returning on the skip path is the right fix to avoid thread-destructor crashes.

test/tap/tests/test_unsupported_queries-t.cpp (1)

539-557: Diagnostic preflight block is a good addition.

This improves observability for CI failures without altering the existing unsupported-query test flow.

test/tap/tests/test_warnings-t.cpp (1)

554-561: Nice observability improvement for warning-path failures.

The hostgroup dump hook is scoped well and should help diagnose intermittent CI issues.

test/infra/control/start-proxysql-isolated.bash (2)

57-60: GCOV environment preservation is correctly handled.

Saving before unsetting and restoring for the primary process is the right approach for reliable coverage collection.

Also applies to: 103-106


182-188: Primary admin-variable alignment looks correct.

Setting admin-restapi_enabled and admin-debug on primary to match nodes should help cluster sync converge consistently.

test/tap/tap/utils.h (1)

159-165: Utility API addition is clean and well-documented.

The declaration and comments are clear and fit the new diagnostics usage across TAP tests.

Comment thread test/tap/tap/utils.cpp
Comment on lines +467 to +479
MYSQL_RES* res = mysql_store_result(admin);
if (res) {
	MYSQL_ROW row;
	diag("runtime_mysql_servers (hg=%d):", hostgroup);
	while ((row = mysql_fetch_row(res))) {
		diag("  hg=%s host=%s port=%s status=%s max_conn=%s comment=%s",
			row[0], row[1], row[2], row[3], row[4], row[5] ? row[5] : "");
	}
	if (mysql_num_rows(res) == 0) {
		diag("  (no servers in hostgroup %d)", hostgroup);
	}
	mysql_free_result(res);
}
⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the two mysql_store_result() branches in dump_hostgroup_debug
# and whether explicit NULL-result diagnostics are present.
rg -n -C4 'dump_hostgroup_debug|mysql_store_result|Failed to store result' test/tap/tap/utils.cpp

Repository: sysown/proxysql

Length of output: 6720


Add explicit diagnostics when mysql_store_result() fails.

The dump_hostgroup_debug function silently ignores NULL returns from mysql_store_result(), making it difficult to distinguish between "query succeeded with empty result" and "failed to store result." Add diagnostics for both NULL cases using mysql_error(admin).

Suggested patch
 	if (mysql_query(admin, query) == 0) {
 		MYSQL_RES* res = mysql_store_result(admin);
 		if (res) {
 			MYSQL_ROW row;
 			diag("runtime_mysql_servers (hg=%d):", hostgroup);
 			while ((row = mysql_fetch_row(res))) {
 				diag("  hg=%s host=%s port=%s status=%s max_conn=%s comment=%s",
 					row[0], row[1], row[2], row[3], row[4], row[5] ? row[5] : "");
 			}
 			if (mysql_num_rows(res) == 0) {
 				diag("  (no servers in hostgroup %d)", hostgroup);
 			}
 			mysql_free_result(res);
+		} else {
+			diag("  Failed to store result for runtime_mysql_servers: %s", mysql_error(admin));
 		}
 	}
 	if (mysql_query(admin, query) == 0) {
 		MYSQL_RES* res = mysql_store_result(admin);
 		if (res) {
 			MYSQL_ROW row;
 			diag("stats_mysql_connection_pool (hg=%d):", hostgroup);
 			while ((row = mysql_fetch_row(res))) {
 				diag("  hg=%s host=%s port=%s status=%s ConnUsed=%s ConnFree=%s ConnOK=%s ConnERR=%s Queries=%s",
 					row[0], row[1], row[2], row[3], row[4], row[5], row[6], row[7], row[9]);
 			}
 			if (mysql_num_rows(res) == 0) {
 				diag("  (no pool entries for hostgroup %d)", hostgroup);
 			}
 			mysql_free_result(res);
+		} else {
+			diag("  Failed to store result for stats_mysql_connection_pool: %s", mysql_error(admin));
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/tap/tap/utils.cpp` around lines 467-479: in dump_hostgroup_debug, add
diagnostics for when mysql_store_result(admin) returns NULL: after the
MYSQL_RES* res = mysql_store_result(admin) call, detect if res == NULL and call
diag to log a clear message including mysql_error(admin) (and optionally
mysql_errno(admin)) so callers can distinguish a failed result retrieval from an
empty result set; keep the existing branch that handles non-NULL results
(including the mysql_num_rows(res) == 0 check) unchanged.

{"timezone", {"timezone", {"UTC", "PST8PDT"}}},
{"bytea_output", {"bytea_output", {"hex", "escape"}}},
{"allow_in_place_tablespaces", {"allow_in_place_tablespaces", {"on", "off"}}},
{"enable_mergejoin", {"enable_mergejoin", {"on", "off"}}},
⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify enable_mergejoin is NOT in tracked variables

echo "=== Checking pgsql_tracked_variables array ==="
rg -n "enable_mergejoin" include/proxysql_structs.h

echo ""
echo "=== Checking pgsql_other_variables set ==="
rg -n "enable_mergejoin" lib/PgSQL_Session.cpp

echo ""
echo "=== Checking pgsql_variable_name enum ==="
rg -n "ENABLE_MERGEJOIN" include/proxysql_structs.h

echo ""
echo "=== For comparison, showing tracked enable_* variables ==="
rg -n "enable_" include/proxysql_structs.h | head -20

Repository: sysown/proxysql

Length of output: 1515


enable_mergejoin is not in ProxySQL's tracked variables, making this test ineffective.

The variable replacement removes tracking coverage. enable_mergejoin is absent from pgsql_tracked_variables[], pgsql_other_variables, and the pgsql_variable_name enum, while other enable_* variables (enable_bitmapscan, enable_hashjoin, enable_indexscan, enable_nestloop, enable_seqscan, enable_sort) are tracked. ProxySQL will not intercept SET statements for this variable, causing the test to rely on PostgreSQL's native behavior rather than validate ProxySQL's session variable tracking mechanism.

Either add enable_mergejoin to the tracking infrastructure (enum in proxysql_structs.h, entry in pgsql_tracked_variables[], and pgsql_other_variables set) or remove this line since the other enable_* variables already provide test coverage.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/tap/tests/pgsql-transaction_variable_state_tracking-t.cpp` at line 281:
the test adds "enable_mergejoin" but ProxySQL doesn't track it; either remove
that line from the test or add support for the variable by adding an entry for
enable_mergejoin to the pgsql_variable_name enum (in proxysql_structs.h),
include "enable_mergejoin" in the pgsql_tracked_variables[] table and the
pgsql_other_variables set so ProxySQL will intercept SETs for it; update all
three places (enum, pgsql_tracked_variables[], pgsql_other_variables) to ensure
consistent tracking if you choose to add rather than delete the test line.

@gemini-code-assist gemini-code-assist bot left a comment
Code Review

This pull request introduces a new dump_hostgroup_debug utility function to aid in debugging CI failures by providing diagnostic information about hostgroups and connection pools. This function is integrated into several existing tests. The PR also refines the cluster node configuration, ensures proper handling of GCOV environment variables for the primary process, and adjusts test assertions for query cache soft TTL to include a tolerance for race conditions. A review comment highlights that the dump_hostgroup_debug function's output for stats_mysql_connection_pool is missing the MaxConnUsed field, which should be included for complete debugging information.

Comment thread test/tap/tap/utils.cpp
Comment on lines +494 to +495
				diag("  hg=%s host=%s port=%s status=%s ConnUsed=%s ConnFree=%s ConnOK=%s ConnERR=%s Queries=%s",
					row[0], row[1], row[2], row[3], row[4], row[5], row[6], row[7], row[9]);
medium

The diagnostic output for stats_mysql_connection_pool is missing the MaxConnUsed field, even though it's selected in the query on line 486. This can lead to incomplete debugging information. Please update the diag call to include this field.

                diag("  hg=%s host=%s port=%s status=%s ConnUsed=%s ConnFree=%s ConnOK=%s ConnERR=%s MaxConnUsed=%s Queries=%s",
                    row[0], row[1], row[2], row[3], row[4], row[5], row[6], row[7], row[8], row[9]);

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
test/infra/control/run-tests-isolated.bash (1)

272-282: LGTM! Coverage collection now correctly captures TAP test data.

The extension to copy both .gcno and .gcda files is correct—fastcov requires both to generate coverage reports. The comments clearly explain the dual-source pattern: ProxySQL writes .gcda via GCOV_PREFIX, while TAP tests write to the source tree.

Minor optional improvement: using read -r would prevent backslash interpretation in filenames, though unlikely to matter for auto-generated coverage paths.


🔧 Optional: more robust read loop
-                        cd \"\${WORKSPACE}\" && find . -path './ci_infra_logs' -prune -o \\( -name '*.gcno' -o -name '*.gcda' \\) -type f -print | while read gcfile; do
+                        cd \"\${WORKSPACE}\" && find . -path './ci_infra_logs' -prune -o \\( -name '*.gcno' -o -name '*.gcda' \\) -type f -print | while IFS= read -r gcfile; do
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/infra/control/run-tests-isolated.bash` around lines 272-282: the
while-loop that iterates over discovered coverage files in
run-tests-isolated.bash uses a plain read which can interpret backslashes in
filenames; update the loop to use read -r when reading gcfile to prevent
backslash interpretation (locate the find ... | while read gcfile; do loop and
change it to use read -r gcfile), ensuring the rest of the logic (target,
target_dir, mkdir -p, cp -f steps) remains unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@test/infra/control/run-tests-isolated.bash`:
- Around lines 272-282: the while-loop that iterates over discovered coverage
files in run-tests-isolated.bash uses a plain read which can interpret
backslashes in filenames; update the loop to use read -r when reading gcfile to
prevent backslash interpretation (locate the find ... | while read gcfile; do
loop and change it to use read -r gcfile), ensuring the rest of the logic
(target, target_dir, mkdir -p, cp -f steps) remains unchanged.
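The behavior the nitpick describes is easy to see in isolation. A standalone sketch with a hypothetical backslash-containing line (not a real path from this repo):

```shell
#!/bin/sh
# A line as find might emit it, containing a literal backslash (hypothetical).
line='./lib/obj\new_file.gcda'

# Plain read treats the backslash as an escape character and drops it.
printf '%s\n' "$line" | while read f; do printf 'plain: %s\n' "$f"; done

# IFS= read -r preserves the line byte-for-byte.
printf '%s\n' "$line" | while IFS= read -r f; do printf 'raw:   %s\n' "$f"; done
# plain: ./lib/objnew_file.gcda
# raw:   ./lib/obj\new_file.gcda
```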

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a32d0f7e-404f-4374-96c0-1afb8cddb451

📥 Commits

Reviewing files that changed from the base of the PR and between 0f12f5d and fb9ff44.

📒 Files selected for processing (1)
  • test/infra/control/run-tests-isolated.bash
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: CI-builds / builds (ubuntu24,-tap-genai-gcov)
  • GitHub Check: CI-builds / builds (ubuntu22,-tap)
  • GitHub Check: CI-builds / builds (debian12,-dbg)
  • GitHub Check: run / trigger
  • GitHub Check: claude-review
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: yuji-hatakeyama
Repo: sysown/proxysql PR: 5307
File: test/tap/tests/reg_test_5306-show_warnings_with_comment-t.cpp:39-48
Timestamp: 2026-01-20T09:34:27.165Z
Learning: In ProxySQL test files (test/tap/tests/), resource leaks (such as not calling `mysql_close()` on early return paths) are not typically fixed because test processes are short-lived and the OS frees resources on process exit. This is a common pattern across the test suite.
Learnt from: CR
Repo: sysown/proxysql PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-22T14:38:16.082Z
Learning: Applies to test/tap/tests/unit/**/*.cpp : Unit tests in test/tap/tests/unit/ must use test_globals.h and test_init.h and link against libproxysql.a via the custom test harness

The default value (1) syncs both configured AND runtime mysql_servers,
including changes made by the monitor (read_only checks moving servers
between writer/reader hostgroups). This caused circular sync: a
cluster node's monitor modified servers, creating a newer epoch, and
the primary synced this back — overwriting its correct config and
saving the corrupted version to disk.

With auto_select (3), the algorithm chooses mysql_servers_v2 when
the monitor is enabled, which only syncs user-configured changes
and ignores monitor-triggered runtime modifications.

Evidence: node 6042's monitor removed MariaDB servers from hostgroup
1701 ("Setting status OFFLINE HARD"), creating checksum 0x7C0A67D5
with epoch 1774288555. Primary (epoch 1774288549) synced from node
6042 and saved to disk, losing hostgroup 1300 permanently.

Fixes #5540.
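For reference, the algorithm switch described above maps to an admin variable. A hedged sketch of applying it on a node via the admin interface (default admin port and credentials assumed; adjust to your deployment, and note this is a config fragment requiring a running ProxySQL):

```shell
mysql -h127.0.0.1 -P6032 -uadmin -padmin -e "
  SET admin-cluster_mysql_servers_sync_algorithm = 3;
  LOAD ADMIN VARIABLES TO RUNTIME;
  SAVE ADMIN VARIABLES TO DISK;"
```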
The test spawns its own ProxySQL replica (port 16062) inside the
test-runner container, but connected to it via cl.host (TAP_ADMINHOST)
which resolves to the main ProxySQL container in the Docker setup.
Since port 16062 doesn't exist on the main container, the connection
hung indefinitely, causing the test to stall.

Now uses R_HOST="127.0.0.1" for all replica connections (R_PORT),
since the replica runs locally inside the same container as the test.
…ca config

The replica used the default algorithm (1) which syncs runtime
mysql_servers — including monitor-shunned status. The test checks
the config table (mysql_servers) for status='ONLINE', but the
runtime sync overwrites the config table with SHUNNED status.

With auto_select (3) and monitor enabled (-M flag), the replica
uses mysql_servers_v2 sync which preserves the configured status.
The debug filter for Query_Processor.cpp/process_query suppressed
the "Searching thread-local 'rules_fast_routing' hashmap" messages
that test_query_rules_fast_routing_algorithm-t relies on to verify
fast routing behavior (expects 20 log entries, gets 0).
When a cluster node syncs proxysql_servers from the primary and runs
LOAD PROXYSQL SERVERS TO RUNTIME, it rebuilds the node list from the
synced data. If the primary isn't in its own proxysql_servers table,
the synced list won't include it. The node then destroys the cluster
monitor thread for the primary and never recreates it.

Evidence: node 1 log shows "Destroyed Cluster Node Entry for host
proxysql:6032" followed by "closing thread for peer proxysql:6032"
after syncing proxysql_servers from the primary. After that, the
node never detected any further checksum changes from the primary,
causing test_cluster1-t and test_unshun_algorithm-t to fail.

Fixes #5542.
The coverage collection copied ALL .gcno and .gcda files from the
workspace to /gcov using cp -f. But ProxySQL's .gcda files (with
real execution data from the Docker container) were already in /gcov
via GCOV_PREFIX. The cp -f overwrote them with the host's stale
.gcda files (zero or outdated execution counts).

This caused lib/MySQLFFTO.cpp and lib/PgSQLFFTO.cpp (and likely
other ProxySQL library files) to show 0% coverage despite the FFTO
tests running successfully and the .gcda files having real data.

Fix: copy .gcno files with cp -f (always needed for fastcov), but
copy .gcda files with cp -n (no-clobber) so ProxySQL's existing
execution data in /gcov is preserved. TAP test .gcda files that
don't exist in /gcov yet are still copied.
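The cp -f vs cp -n distinction can be sketched standalone with stand-in files (contents illustrative, not real gcov data):

```shell
#!/bin/sh
set -eu
tmp=$(mktemp -d)
mkdir -p "$tmp/gcov" "$tmp/workspace"

# /gcov already holds the daemon's real counters; the workspace holds a
# stale copy of the same file plus a TAP-only file.
printf 'real-execution-counts' > "$tmp/gcov/MySQLFFTO.gcda"
printf 'stale-zero-counts'     > "$tmp/workspace/MySQLFFTO.gcda"
printf 'tap-only-counts'       > "$tmp/workspace/utils.gcda"

# cp -n (no-clobber): the existing daemon .gcda survives, while the
# TAP-only .gcda is still brought in. (Newer coreutils make cp -n exit
# non-zero when it skips a file, hence the || true.)
cp -n "$tmp/workspace/"*.gcda "$tmp/gcov/" || true

cat "$tmp/gcov/MySQLFFTO.gcda"; echo
cat "$tmp/gcov/utils.gcda"; echo
rm -rf "$tmp"
# real-execution-counts
# tap-only-counts
```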
Coverage data comes from two sources that may overlap:
  /gcov     — ProxySQL daemon execution (via GCOV_PREFIX)
  WORKSPACE — TAP test execution (.gcda next to binaries)

Both may cover the same code (e.g. libproxysql.a exercised by both
ProxySQL runtime and TAP test harness). Previously we copied .gcda
files between directories, which either overwrote (cp -f) or kept
only the first version (cp -n) instead of combining counters.

Now generates two separate .info files (proxysql.info + tap_tests.info)
and combines them with fastcov -C, which correctly sums execution
counters for overlapping source files.
- Remove redundant INSERT of primary into proxysql_servers on nodes.
  Commit b02147b added the primary to the shared PROXYSQL_SERVERS_SQL
  but didn't remove the pre-existing per-node INSERT, causing a UNIQUE
  constraint violation that broke all multi-node CI groups.

- Add cluster_mysql_servers_sync_algorithm=3 to test_cluster_sync config
  files and update hardcoded datadir paths to match current workspace.
TAP test binaries link libcpp_dotenv.so which transitively needs
libparser.so. The RUNPATH baked into libcpp_dotenv.so contains stale
paths from the original build, so the dynamic linker can't find it.

- Add LD_LIBRARY_PATH for the SKIP_PROXYSQL (unit test) host path
- Add TAP_DEPS_PATH to env-isolated.bash so proxysql-tester.py
  constructs the correct LD_LIBRARY_PATH inside Docker containers
Three problems caused machine stalls and 0% coverage in multi-group runs:

1. All test-runner containers wrote TAP test .gcda files to the shared
   ${WORKSPACE}, causing concurrent corruption and memory bloat when
   multiple gcov processes fought over the same files.

2. Each container independently ran fastcov -j$(nproc) in its EXIT trap,
   spawning hundreds of concurrent gcov processes across all containers.

3. GCOV_PREFIX_STRIP=3 placed .gcda at proxysql/lib/obj/ but .gcno was
   copied to lib/obj/ — path mismatch meant gcov couldn't correlate them.
   Additionally, the --include filter expected full workspace paths but
   gcov records bare filenames, so all results were filtered out.
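For context on problem 3: GCC rewrites the compiled-in absolute .gcda path by stripping the leading GCOV_PREFIX_STRIP components and prepending GCOV_PREFIX. A sketch of that path math with a hypothetical compiled-in path (the real build paths differ):

```shell
#!/bin/sh
# Hypothetical compiled-in absolute path of a .gcda file.
compiled_in=/opt/jenkins/workspace/proxysql/lib/obj/MySQL_Session.gcda
GCOV_PREFIX=/gcov
GCOV_PREFIX_STRIP=3

# Strip the first $GCOV_PREFIX_STRIP leading components, then prepend the prefix.
stripped=${compiled_in#/}
i=0
while [ "$i" -lt "$GCOV_PREFIX_STRIP" ]; do
    stripped=${stripped#*/}
    i=$((i + 1))
done
echo "$GCOV_PREFIX/$stripped"
# -> /gcov/proxysql/lib/obj/MySQL_Session.gcda
```

This is why STRIP=3 leaves a proxysql/lib/obj/ tail under the prefix while the .gcno files were copied to lib/obj/, producing the path mismatch described above.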

Changes:

- Set GCOV_PREFIX=/gcov/tap on test-runner containers so TAP .gcda files
  go to per-INFRA_ID /gcov/tap/ instead of shared workspace.

- Pass MULTI_GROUP=1 from run-multi-group.bash to test-runners. When set,
  the EXIT trap copies .gcno adjacent to .gcda (handling path offset from
  GCOV_PREFIX_STRIP) but skips fastcov/genhtml.

- Add centralized coverage collection in run-multi-group.bash: processes
  each group's /gcov directory sequentially with fastcov -j4 after all
  groups finish, then combines the .info files.

- Remove --include filter from fastcov (source paths are bare filenames,
  not full paths, so the filter rejected everything).

- Standalone mode (no MULTI_GROUP) is unchanged.
fastcov was including system headers (/usr/include/c++/...) and vendored
dependencies (deps/json/json.hpp) in coverage output. Use --exclude
instead of --include since project source files have bare filenames in
gcov output (no path prefix to match against), while system/deps paths
are always absolute.
- mysql84 pre-proxysql.bash: use docker exec for all ProxySQL admin
  commands (was using mysql -h127.0.0.1 which can't reach the container)

- basictests setup-infras.bash: use INSERT OR REPLACE when remapping
  hostgroups 1300/1301 -> 0/1 to avoid UNIQUE constraint violation
  when target hostgroups already have entries

- proxysql-tester.py: skip per-stage coverage generation when
  MULTI_GROUP=1 (was spawning 32 gcov processes on 35k .gcno files)

- Move test_cluster_sync-t and test_read_only_actions_offline_hard_servers-t
  to legacy-g5 pending investigation of cluster sync algorithm issues.
  Move reg_test_3765_ssl_pollout-t from legacy-g5 to legacy-g4.
@renecannao renecannao merged commit 2350632 into v3.0 Mar 24, 2026
3 checks passed