Issue Summary
ProxySQL intermittently marks healthy Galera nodes as OFFLINE due to health check timeouts. The behavior is most consistent immediately after ProxySQL startup, where we frequently see more than 5 timeouts within the first few seconds, and it also recurs under normal traffic far more often than we would expect.
The `mysql_server_ping_log` and `mysql_server_connect_log` confirm that the network and MySQL listeners are highly responsive (latencies in microseconds). However, the `mysql_server_galera_log` shows success times spiking into hundreds of milliseconds or failing entirely. This, combined with the Watchdog error, points to internal thread starvation rather than actual backend database issues.
Observed Errors
```
2026-04-15 04:00:46 MySQL_Monitor.cpp:2633:monitor_galera_thread(): [ERROR] Got error. mmsd 0x7fd94c210200 , MYSQL 0x7fd94b3f2d00 , FD 75 : timeout check
Server sql01:3306 missed 3 Galera checks. Assuming offline
Timeout on Galera health check for svevert8sql01:3306 after 1507ms.
Watchdog: 4 MySQL threads missed a heartbeat
```
Environment
- ProxySQL Version: 3.0.5-centos
- Replicas: 3
- CPU requests: 4 cores per instance
- Memory allocation: 3 Gi per instance
- Resource usage is very stable and well within bounds.
Deployment Architecture & Traffic Flow
- Traffic Pattern: We operate a single-writer topology; 100% of client traffic is directed to `sql01`.
- Replication: Nodes `sql02` and `sql03` are maintained strictly for synchronous replication and failover purposes; they do not handle active client queries under normal conditions.
- Impact: Because all traffic is localized to Node 1, any false-positive "OFFLINE" status triggered by the monitor causes an unnecessary and disruptive cluster failover.
Connection Usage Context
Based on our current metrics:
- Client Connections: ~1,300 active connections per ProxySQL instance.
- Backend Connections (In-Use): Peaks of 300 to 700 connections.
- Multiplexing: disabled (our workload relies on medium-to-long transactions, which prevents multiplexing).
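The connection figures above come from the ProxySQL admin interface. A sketch of the kind of query we use to collect them (table and column names per the standard `stats` schema):

```sql
-- Per-backend connection usage, queried on the admin interface (default port 6032).
SELECT srv_host, srv_port, status, ConnUsed, ConnFree, ConnOK, ConnERR
FROM stats.stats_mysql_connection_pool
ORDER BY ConnUsed DESC;
```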
Relevant Configuration
We have adjusted several monitor and connection timeouts to mitigate this, but the issue persists on startup:
| Variable | Value | Description |
|:-------------------------------------------|:----------|:---------------------------------------------------|
| `mysql-threads` | **4** | Number of worker threads. |
| `mysql-monitor_threads_min` | **4** | Min threads for monitoring tasks. |
| `mysql-monitor_threads_max` | **128** | Max threads for monitoring tasks. |
| `mysql-monitor_galera_healthcheck_timeout` | **1500** | Max time for a Galera health check (ms). |
| `mysql-monitor_connect_timeout` | **1000** | Timeout for monitor connection establishment (ms). |
| `mysql-monitor_ping_timeout` | **1500** | Timeout for the monitor's MySQL ping (ms). |
| `mysql-monitor_read_only_timeout` | **1000** | Timeout for read-only checks (ms). |
| `mysql-monitor_query_timeout` | **1000** | Timeout for monitor queries (ms). |
| `mysql-connect_timeout_server` | **1000** | Timeout for a single TCP connection attempt to a backend (ms). |
| `mysql-connect_timeout_server_max` | **10000** | Maximum cumulative connection timeout across retries (ms). |
| `mysql-poll_timeout` | **500** | Timeout for internal poll() calls (ms). |
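For reference, these values were applied through the admin interface in the usual way (a sketch for one variable; the others follow the same pattern):

```sql
UPDATE global_variables SET variable_value='1500'
WHERE variable_name='mysql-monitor_galera_healthcheck_timeout';
LOAD MYSQL VARIABLES TO RUNTIME;
SAVE MYSQL VARIABLES TO DISK;
```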
Health Log Comparison
1. mysql_server_ping_log (Network Responsiveness)
The network is clear; backends respond in microseconds.
| Hostname | success_time_us | Latency (ms) | Status |
|:--------------|:----------------|:-------------|:----------|
| sql02 | 133 | 0.13ms | Excellent |
| sql01 | 142 | 0.14ms | Excellent |
| sql01 | 215 | 0.21ms | Good |
| sql02 | 233 | 0.23ms | Good |
| sql01 | 413 | 0.41ms | Good |
| sql01 | 449 | 0.44ms | Good |
| sql03 | 1381 | 1.38ms | Average |
| sql03 | 1470 | 1.47ms | Average |
| sql03 | 1513 | 1.51ms | Average |
| sql03 | 1761 | 1.76ms | Worse |
| sql03 | 2002 | 2.00ms | Worse |
| sql03 | 2672 | 2.67ms | Worst |
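For reproducibility, the ping figures were pulled with a query along these lines (column names per the `monitor.mysql_server_ping_log` schema; exact names may vary by version):

```sql
SELECT hostname, port, ping_success_time_us, ping_error
FROM monitor.mysql_server_ping_log
ORDER BY time_start_us DESC
LIMIT 12;
```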
2. mysql_server_connect_log (TCP Connection)
TCP handshake remains stable (13ms - 35ms).
| Hostname | connect_success_time_us | Latency (ms) | Status |
|:--------------|:------------------------|:-------------|:--------|
| sql01 | 13060 | 13.0ms | Good |
| sql02 | 13116 | 13.1ms | Good |
| sql01 | 13256 | 13.2ms | Good |
| sql02 | 13388 | 13.3ms | Good |
| sql01 | 14669 | 14.6ms | Good |
| sql02 | 16837 | 16.8ms | Average |
| sql03 | 20358 | 20.3ms | Average |
| sql03 | 21564 | 21.5ms | Average |
| sql01 | 22278 | 22.2ms | Average |
| sql01 | 25691 | 25.6ms | Worse |
| sql03 | 29117 | 29.1ms | Worse |
| sql03 | 35098 | 35.0ms | Worst |
3. mysql_server_galera_log (Monitor Internal Stalling)
Massive discrepancy: Galera check success times spike to 100+ ms, orders of magnitude slower than pings, and in the worst case exceed the 1500 ms timeout entirely.
| Hostname | success_time_us | Latency (ms) | Status |
|:--------------|:----------------|:-------------|:---------------|
| sql02 | 2464 | 2.4ms | Good |
| sql02 | 2946 | 2.9ms | Good |
| sql01 | 4823 | 4.8ms | Good |
| sql01 | 7195 | 7.1ms | Good |
| sql03 | 7637 | 7.6ms | Good |
| sql01 | 11422 | 11.4ms | Average |
| sql03 | 21440 | 21.4ms | Average |
| sql01 | 38058 | 38.0ms | Worse |
| sql03 | 47764 | 47.7ms | Worse |
| sql02 | 102062 | 102.0ms | Critical Spike |
| sql02 | 106698 | 106.6ms | Critical Spike |
| sql01 | **(Timeout)** | **>1500ms** | **Failed** |
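The spikes and timeouts can be isolated directly from the monitor schema. A sketch (the 100 ms threshold is our own choice; column names per `monitor.mysql_server_galera_log`):

```sql
-- Galera checks that were slow (>100 ms) or failed outright.
SELECT hostname, port, success_time_us, error
FROM monitor.mysql_server_galera_log
WHERE error IS NOT NULL OR success_time_us > 100000
ORDER BY time_start_us DESC;
```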
Analysis & Conclusion
- Thread starvation: The immediate failures at startup suggest ProxySQL's worker threads are saturated.
- Internal bottleneck: The discrepancy between pings (~0.1 ms) and Galera checks (>1500 ms) indicates ProxySQL is failing to process the check results in time because worker threads are blocked, not because the backends are slow.
- Potential contention: With `mysql-threads` set to 4, the initialization phase likely triggers mutex contention, causing the Watchdog to fire and health checks to time out before the internal scheduler can even register the database's response. Notably, we see no corresponding CPU spikes at that time, which points to lock contention rather than CPU exhaustion.
Request: Please investigate potential lock contention in `MySQL_Monitor.cpp` during periods of high connection establishment, especially at startup and under elevated load.