
Frequent Galera Health check timeouts #5624

@ThomasVerhoeven1998

Description


Issue Summary

ProxySQL intermittently marks healthy Galera nodes as OFFLINE due to health check timeouts. The behavior is most pronounced immediately after ProxySQL startup, where we frequently see more than 5 timeouts within the first few seconds, and it continues to occur under normal traffic far more often than we would expect.

The mysql_server_ping_log and mysql_server_connect_log confirm the network and MySQL listeners are highly responsive (latencies in microseconds). However, the mysql_server_galera_log shows success times spiking into hundreds of milliseconds or failing entirely. This, combined with the Watchdog error, points to internal thread starvation rather than actual backend database issues.
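
For reference, the ping and connect observations above come from the monitor schema on the ProxySQL admin interface (port 6032 by default); a minimal sketch of the queries used, assuming the standard monitor table layouts:

```sql
-- Run against the ProxySQL admin interface (default port 6032).
-- Recent MySQL-level pings (ping_success_time_us is in microseconds):
SELECT hostname, port, time_start_us, ping_success_time_us, ping_error
FROM monitor.mysql_server_ping_log
ORDER BY time_start_us DESC LIMIT 20;

-- Recent connection attempts to the backends:
SELECT hostname, port, time_start_us, connect_success_time_us, connect_error
FROM monitor.mysql_server_connect_log
ORDER BY time_start_us DESC LIMIT 20;
```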

Observed Errors

2026-04-15 04:00:46 MySQL_Monitor.cpp:2633:monitor_galera_thread(): [ERROR] Got error. mmsd 0x7fd94c210200 , MYSQL 0x7fd94b3f2d00 , FD 75 : timeout check

Server sql01:3306 missed 3 Galera checks. Assuming offline

Timeout on Galera health check for svevert8sql01:3306 after 1507ms. 

Watchdog: 4 MySQL threads missed a heartbeat

Environment

ProxySQL Version: 3.0.5-centos
Replicas: 3
CPU requests: 4 cores per instance
Memory allocation: 3 GiB per instance

(resource usage is very stable and well within bounds)

Deployment Architecture & Traffic Flow

  • Traffic Pattern: We operate a single-writer topology; 100% of client traffic is directed to sql01.
  • Replication: Nodes sql02 and sql03 are maintained strictly for synchronous replication and failover purposes; they do not handle active client queries under normal conditions.
  • Impact: Because all traffic is localized to Node 1, any false-positive "OFFLINE" status triggered by the monitor causes an unnecessary and disruptive cluster failover.

Connection Usage Context

Based on our current metrics:

  • Client Connections: ~1,300 active connections per ProxySQL instance.
  • Backend Connections (In-Use): Peaks of 300 to 700 connections.
  • Multiplexing: disabled (due to medium-to-long-running transactions).
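
For completeness, the backend connection figures above can be pulled from ProxySQL's stats schema; a minimal sketch:

```sql
-- Per-backend connection usage from the ProxySQL stats interface.
SELECT hostgroup, srv_host, srv_port, status, ConnUsed, ConnFree, ConnERR
FROM stats.stats_mysql_connection_pool
ORDER BY ConnUsed DESC;
```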

Relevant Configuration

We have adjusted several monitor and connection timeouts to mitigate this, but the issue persists on startup:

| Variable                                   | Value     | Description                                        |
|:-------------------------------------------|:----------|:---------------------------------------------------|
| `mysql-threads`                            | **4**     | Number of worker threads.                          |
| `mysql-monitor_threads_min`                | **4**     | Min threads for monitoring tasks.                  |
| `mysql-monitor_threads_max`                | **128**   | Max threads for monitoring tasks.                  |
| `mysql-monitor_galera_healthcheck_timeout` | **1500**  | Max time for a Galera health check (ms).           |
| `mysql-monitor_connect_timeout`            | **1000**  | Timeout for monitor connection establishment (ms). |
| `mysql-monitor_ping_timeout`               | **1500**  | Timeout for monitor MySQL ping (ms).               |
| `mysql-monitor_read_only_timeout`          | **1000**  | Timeout for read-only checks (ms).                 |
| `mysql-monitor_query_timeout`              | **1000**  | Timeout for monitor queries (ms).                  |
| `mysql-connect_timeout_server`             | **1000**  | Per-attempt connection timeout to backends (ms).   |
| `mysql-connect_timeout_server_max`         | **10000** | Max total connection timeout to backends (ms).     |
| `mysql-poll_timeout`                       | **500**   | Timeout for internal poll() calls (ms).            |
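
These values were applied through the admin interface in the usual way; a sketch for one of them (the others follow the same pattern):

```sql
-- Example: set the Galera health check timeout, then activate and persist.
UPDATE global_variables
SET variable_value = '1500'
WHERE variable_name = 'mysql-monitor_galera_healthcheck_timeout';

LOAD MYSQL VARIABLES TO RUNTIME;
SAVE MYSQL VARIABLES TO DISK;
```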

Health Log Comparison

1. mysql_server_ping_log (Network Responsiveness)

The network is clear; backends respond in microseconds.

| Hostname      | success_time_us | Latency (ms) | Status    |
|:--------------|:----------------|:-------------|:----------|
| sql02 | 133             | 0.13ms       | Excellent |
| sql01 | 142             | 0.14ms       | Excellent |
| sql01 | 215             | 0.21ms       | Good      |
| sql02 | 233             | 0.23ms       | Good      |
| sql01 | 413             | 0.41ms       | Good      |
| sql01 | 449             | 0.44ms       | Good      |
| sql03 | 1381            | 1.38ms       | Average   |
| sql03 | 1470            | 1.47ms       | Average   |
| sql03 | 1513            | 1.51ms       | Average   |
| sql03 | 1761            | 1.76ms       | Worse     |
| sql03 | 2002            | 2.00ms       | Worse     |
| sql03 | 2672            | 2.67ms       | Worst     |

2. mysql_server_connect_log (TCP Connection)

TCP handshake remains stable (13ms - 35ms).

| Hostname      | connect_success_time_us | Latency (ms) | Status  |
|:--------------|:------------------------|:-------------|:--------|
| sql01 | 13060                   | 13.0ms       | Good    |
| sql02 | 13116                   | 13.1ms       | Good    |
| sql01 | 13256                   | 13.2ms       | Good    |
| sql02 | 13388                   | 13.3ms       | Good    |
| sql01 | 14669                   | 14.6ms       | Good    |
| sql02 | 16837                   | 16.8ms       | Average |
| sql03 | 20358                   | 20.3ms       | Average |
| sql03 | 21564                   | 21.5ms       | Average |
| sql01 | 22278                   | 22.2ms       | Average |
| sql01 | 25691                   | 25.6ms       | Worse   |
| sql03 | 29117                   | 29.1ms       | Worse   |
| sql03 | 35098                   | 35.0ms       | Worst   |

3. mysql_server_galera_log (Monitor Internal Stalling)

Massive discrepancy: Galera check success times spike past 100ms, with some checks timing out outright, orders of magnitude above the sub-millisecond ping latencies.

| Hostname      | success_time_us | Latency (ms) | Status         |
|:--------------|:----------------|:-------------|:---------------|
| sql02 | 2464            | 2.4ms        | Good           |
| sql02 | 2946            | 2.9ms        | Good           |
| sql01 | 4823            | 4.8ms        | Good           |
| sql01 | 7195            | 7.1ms        | Good           |
| sql03 | 7637            | 7.6ms        | Good           |
| sql01 | 11422           | 11.4ms       | Average        |
| sql03 | 21440           | 21.4ms       | Average        |
| sql01 | 38058           | 38.0ms       | Worse          |
| sql03 | 47764           | 47.7ms       | Worse          |
| sql02 | 102062          | 102.0ms      | Critical Spike |
| sql02 | 106698          | 106.6ms      | Critical Spike |
| sql01 | **(Timeout)**   | **>1500ms**  | **Failed**     |
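
The failed checks can be isolated directly from the same table; a sketch of the query used, assuming the standard `mysql_server_galera_log` schema:

```sql
-- Galera checks that errored or timed out (error is non-NULL on failure).
SELECT hostname, port, time_start_us, success_time_us, error
FROM monitor.mysql_server_galera_log
WHERE error IS NOT NULL
ORDER BY time_start_us DESC
LIMIT 10;
```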

Analysis & Conclusion

  1. Thread Starvation: Is there thread starvation? The immediate failures on startup suggest that ProxySQL threads are saturated.
  2. Internal Bottleneck: The discrepancy between pings (0.1ms) and Galera checks (>1500ms) indicates ProxySQL is failing to process results because worker threads are blocked.
  3. Potential Contention: With mysql-threads set to 4, the initialization phase likely triggers mutex contention, causing the Watchdog to fire and health checks to time out before the internal scheduler can even register the backend's response. Notably, we do not see any CPU usage spikes at that time, which points to lock contention rather than CPU exhaustion.

Request: Please investigate potential lock contention in MySQL_Monitor.cpp during periods of high connection establishment, especially at startup and under higher load.
