
Frequent Galera Health check timeouts #5624

@ThomasVerhoeven1998

Description


Issue Summary

ProxySQL intermittently marks healthy Galera nodes as OFFLINE due to health check timeouts. The behavior is most pronounced immediately after ProxySQL startup, where we frequently see more than 5 timeouts within the first few seconds, and it continues to occur under normal traffic far more often than we would expect.

The mysql_server_ping_log and mysql_server_connect_log confirm the network and MySQL listeners are highly responsive (latencies in microseconds). However, the mysql_server_galera_log shows success times spiking into hundreds of milliseconds or failing entirely. This, combined with the Watchdog error, points to internal thread starvation rather than actual backend database issues.
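
For reference, the ping and connect observations above come from the monitor schema on the ProxySQL admin interface (port 6032 by default); a minimal sketch of the queries used, assuming the standard monitor table layouts:

```sql
-- Run against the ProxySQL admin interface (default port 6032).
-- Recent MySQL-level pings (ping_success_time_us is in microseconds):
SELECT hostname, port, time_start_us, ping_success_time_us, ping_error
FROM monitor.mysql_server_ping_log
ORDER BY time_start_us DESC LIMIT 20;

-- Recent connection attempts to the backends:
SELECT hostname, port, time_start_us, connect_success_time_us, connect_error
FROM monitor.mysql_server_connect_log
ORDER BY time_start_us DESC LIMIT 20;
```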

Observed Errors

2026-04-15 04:00:46 MySQL_Monitor.cpp:2633:monitor_galera_thread(): [ERROR] Got error. mmsd 0x7fd94c210200 , MYSQL 0x7fd94b3f2d00 , FD 75 : timeout check

Server sql01:3306 missed 3 Galera checks. Assuming offline

Timeout on Galera health check for svevert8sql01:3306 after 1507ms. 

Watchdog: 4 MySQL threads missed a heartbeat

Environment

ProxySQL Version: 3.0.5-centos
Replicas: 3
CPU requests: 4 cores per instance
Memory allocation: 3 GiB per instance

(resource usage is very stable and well within bounds)

Deployment Architecture & Traffic Flow

  • Traffic Pattern: We operate a single-writer topology; 100% of client traffic is directed to sql01.
  • Replication: Nodes sql02 and sql03 are maintained strictly for synchronous replication and failover purposes; they do not handle active client queries under normal conditions.
  • Impact: Because all traffic is localized to Node 1, any false-positive "OFFLINE" status triggered by the monitor causes an unnecessary and disruptive cluster failover.

Connection Usage Context

Based on our current metrics:

  • Client Connections: ~1,300 active connections per ProxySQL instance.
  • Backend Connections (In-Use): Peaks of 300 to 700 connections.
  • Multiplexing: disabled (due to medium-to-long-running transactions).
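
For completeness, the backend connection figures above can be pulled from ProxySQL's stats schema; a minimal sketch:

```sql
-- Per-backend connection usage from the ProxySQL stats interface.
SELECT hostgroup, srv_host, srv_port, status, ConnUsed, ConnFree, ConnERR
FROM stats.stats_mysql_connection_pool
ORDER BY ConnUsed DESC;
```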

Relevant Configuration

We have adjusted several monitor and connection timeouts to mitigate this, but the issue persists on startup:

| Variable                                   | Value     | Description                                        |
|:-------------------------------------------|:----------|:---------------------------------------------------|
| `mysql-threads`                            | **4**     | Number of worker threads.                          |
| `mysql-monitor_threads_min`                | **4**     | Min threads for monitoring tasks.                  |
| `mysql-monitor_threads_max`                | **128**   | Max threads for monitoring tasks.                  |
| `mysql-monitor_galera_healthcheck_timeout` | **1500**  | Max time for a Galera health check (ms).           |
| `mysql-monitor_connect_timeout`            | **1000**  | Timeout for monitor connection establishment (ms). |
| `mysql-monitor_ping_timeout`               | **1500**  | Timeout for monitor MySQL ping (ms).               |
| `mysql-monitor_read_only_timeout`          | **1000**  | Timeout for read-only checks (ms).                 |
| `mysql-monitor_query_timeout`              | **1000**  | Timeout for monitor queries (ms).                  |
| `mysql-connect_timeout_server`             | **1000**  | Per-attempt connection timeout to backends (ms).   |
| `mysql-connect_timeout_server_max`         | **10000** | Max total connection timeout to backends (ms).     |
| `mysql-poll_timeout`                       | **500**   | Timeout for internal poll() calls (ms).            |
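
These values were applied through the admin interface in the usual way; a sketch for one of them (the others follow the same pattern):

```sql
-- Example: set the Galera health check timeout, then activate and persist.
UPDATE global_variables
SET variable_value = '1500'
WHERE variable_name = 'mysql-monitor_galera_healthcheck_timeout';

LOAD MYSQL VARIABLES TO RUNTIME;
SAVE MYSQL VARIABLES TO DISK;
```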

Health Log Comparison

1. mysql_server_ping_log (Network Responsiveness)

The network is clear; backends respond in microseconds.

| Hostname      | success_time_us | Latency (ms) | Status    |
|:--------------|:----------------|:-------------|:----------|
| sql02 | 133             | 0.13ms       | Excellent |
| sql01 | 142             | 0.14ms       | Excellent |
| sql01 | 215             | 0.21ms       | Good      |
| sql02 | 233             | 0.23ms       | Good      |
| sql01 | 413             | 0.41ms       | Good      |
| sql01 | 449             | 0.44ms       | Good      |
| sql03 | 1381            | 1.38ms       | Average   |
| sql03 | 1470            | 1.47ms       | Average   |
| sql03 | 1513            | 1.51ms       | Average   |
| sql03 | 1761            | 1.76ms       | Worse     |
| sql03 | 2002            | 2.00ms       | Worse     |
| sql03 | 2672            | 2.67ms       | Worst     |

2. mysql_server_connect_log (TCP Connection)

TCP handshake remains stable (13ms - 35ms).

| Hostname      | connect_success_time_us | Latency (ms) | Status  |
|:--------------|:------------------------|:-------------|:--------|
| sql01 | 13060                   | 13.0ms       | Good    |
| sql02 | 13116                   | 13.1ms       | Good    |
| sql01 | 13256                   | 13.2ms       | Good    |
| sql02 | 13388                   | 13.3ms       | Good    |
| sql01 | 14669                   | 14.6ms       | Good    |
| sql02 | 16837                   | 16.8ms       | Average |
| sql03 | 20358                   | 20.3ms       | Average |
| sql03 | 21564                   | 21.5ms       | Average |
| sql01 | 22278                   | 22.2ms       | Average |
| sql01 | 25691                   | 25.6ms       | Worse   |
| sql03 | 29117                   | 29.1ms       | Worse   |
| sql03 | 35098                   | 35.0ms       | Worst   |

3. mysql_server_galera_log (Monitor Internal Stalling)

Massive discrepancy: Galera check success times spike past 100ms, with some checks timing out outright, orders of magnitude above the sub-millisecond ping latencies.

| Hostname      | success_time_us | Latency (ms) | Status         |
|:--------------|:----------------|:-------------|:---------------|
| sql02 | 2464            | 2.4ms        | Good           |
| sql02 | 2946            | 2.9ms        | Good           |
| sql01 | 4823            | 4.8ms        | Good           |
| sql01 | 7195            | 7.1ms        | Good           |
| sql03 | 7637            | 7.6ms        | Good           |
| sql01 | 11422           | 11.4ms       | Average        |
| sql03 | 21440           | 21.4ms       | Average        |
| sql01 | 38058           | 38.0ms       | Worse          |
| sql03 | 47764           | 47.7ms       | Worse          |
| sql02 | 102062          | 102.0ms      | Critical Spike |
| sql02 | 106698          | 106.6ms      | Critical Spike |
| sql01 | **(Timeout)**   | **>1500ms**  | **Failed**     |
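
The failed checks can be isolated directly from the same table; a sketch of the query used, assuming the standard `mysql_server_galera_log` schema:

```sql
-- Galera checks that errored or timed out (error is non-NULL on failure).
SELECT hostname, port, time_start_us, success_time_us, error
FROM monitor.mysql_server_galera_log
WHERE error IS NOT NULL
ORDER BY time_start_us DESC
LIMIT 10;
```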

Analysis & Conclusion

  1. Thread Starvation: Is there thread starvation? The immediate failures on startup suggest that ProxySQL threads are saturated.
  2. Internal Bottleneck: The discrepancy between pings (0.1ms) and Galera checks (>1500ms) indicates ProxySQL is failing to process results because worker threads are blocked.
  3. Potential Contention: With mysql-threads set to 4, the initialization phase likely triggers mutex contention, causing the Watchdog to fire and health checks to time out before the internal scheduler can even register the backend's response. Notably, we do not see any CPU usage spikes at that time, which points to lock contention rather than CPU exhaustion.

Request: Please investigate potential lock contention in MySQL_Monitor.cpp during periods of high connection establishment, especially at startup and under higher load.
