Helix Reconnect Issue #3064

@jasondhoyt

Describe the bug

We have a service that uses Helix (1.4.0), and it runs into trouble when the network connection is interrupted for the local Helix agent on individual nodes within a cluster. Our Helix agent processes do not always reconnect, which has caused problems for the service as a whole and forced us to take more drastic steps to remediate. Ideally the local Helix agent would reconnect on its own, but it does not. The exceptions below appear to originate inside the Helix Java library. We also monitor the HelixManager.isConnected() method, but it does not seem to return false in these circumstances. If it did, our local agent would restart automatically, which would have resolved the issue. Is there any insight into why the Helix library is not reconnecting properly, or why it is not reporting the bad connection? Is this a potential bug in the Helix library?

2025-07-22 13:08:37,350 [main-SendThread(10.0.0.1:2181)] WARN org.apache.zookeeper.ClientCnxn:1257 - Client session timed out, have not heard from server in 40003ms for session id 0x1d0000039b0f05e7
2025-07-22 13:08:37,353 [main-SendThread(10.0.0.1:2181)] WARN org.apache.zookeeper.ClientCnxn:1300 - Session 0x1d0000039b0f05e7 for sever 10.0.0.1/10.0.0.1:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 40003ms for session id 0x1d0000039b0f05e7
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1258) [?:?]
2025-07-22 13:08:37,472 [main-EventThread] INFO org.apache.helix.zookeeper.zkclient.ZkClient:1561 - zkclient 2, zookeeper state changed ( Disconnected )
2025-07-22 13:08:37,483 [ZkClient-EventThread-47-10.0.0.3:2181,10.0.0.2:2181,10.0.0.4:2181,10.0.0.5:2181,10.0.0.1:2181] WARN org.apache.helix.manager.zk.ZKHelixManager:1272 - KeeperState:Disconnected, SessionId: 1d0000039b0f05e7, instance: 10.0.0.6_1, type: PARTICIPANT
2025-07-22 13:08:53,160 [main-SendThread(10.0.0.2:2181)] INFO org.apache.zookeeper.ClientCnxn:1181 - Opening socket connection to server 10.0.0.2/10.0.0.2:2181.
2025-07-22 13:08:53,160 [main-SendThread(10.0.0.2:2181)] INFO org.apache.zookeeper.ClientCnxn:1183 - SASL config status: Will not attempt to authenticate using SASL (unknown error)
2025-07-22 13:09:05,176 [main-SendThread(10.0.0.2:2181)] WARN org.apache.zookeeper.ClientCnxn:1257 - Client session timed out, have not heard from server in 27709ms for session id 0x1d0000039b0f05e7
2025-07-22 13:09:05,176 [main-SendThread(10.0.0.2:2181)] WARN org.apache.zookeeper.ClientCnxn:1300 - Session 0x1d0000039b0f05e7 for sever 10.0.0.2/10.0.0.2:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 27709ms for session id 0x1d0000039b0f05e7
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1258) [?:?]
2025-07-22 13:09:10,520 [main-SendThread(10.0.0.3:2181)] INFO org.apache.zookeeper.ClientCnxn:1181 - Opening socket connection to server 10.0.0.3/10.0.0.3:2181.
2025-07-22 13:09:10,522 [main-SendThread(10.0.0.3:2181)] INFO org.apache.zookeeper.ClientCnxn:1183 - SASL config status: Will not attempt to authenticate using SASL (unknown error)
2025-07-22 13:09:10,523 [main-SendThread(10.0.0.3:2181)] INFO org.apache.zookeeper.ClientCnxn:1013 - Socket connection established, initiating session, client: /10.0.0.6:43388, server: 10.0.0.3/10.0.0.3:2181
2025-07-22 13:09:10,546 [main-EventThread] INFO org.apache.helix.zookeeper.zkclient.ZkClient:1561 - zkclient 2, zookeeper state changed ( Expired )
2025-07-22 13:09:10,546 [ZkClient-EventThread-47-10.0.0.3:2181,10.0.0.2:2181,10.0.0.4:2181,10.0.0.5:2181,10.0.0.1:2181] WARN org.apache.helix.manager.zk.ZKHelixManager:1272 - KeeperState:Expired, SessionId: 1d0000039b0f05e7, instance: 10.0.0.6_1, type: PARTICIPANT
2025-07-22 13:09:10,546 [main-SendThread(10.0.0.3:2181)] WARN org.apache.zookeeper.ClientCnxn:1433 - Unable to reconnect to ZooKeeper service, session 0x1d0000039b0f05e7 has expired
2025-07-22 13:09:10,546 [main-SendThread(10.0.0.3:2181)] WARN org.apache.zookeeper.ClientCnxn:1300 - Session 0x1d0000039b0f05e7 for sever 10.0.0.3/10.0.0.3:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$SessionExpiredException: Unable to reconnect to ZooKeeper service, session 0x1d0000039b0f05e7 has expired
        at org.apache.zookeeper.ClientCnxn$SendThread.onConnected(ClientCnxn.java:1434) ~[?:?]
        at org.apache.zookeeper.ClientCnxnSocket.readConnectResult(ClientCnxnSocket.java:154) ~[?:?]
        at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:86) ~[?:?]
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350) ~[?:?]
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1290) [?:?]
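
To make the timeline in the log above easier to read, here is a minimal sketch that pulls the `KeeperState` transitions out of the `ZKHelixManager` lines and measures the time between them. The class name `KeeperStateTimeline` and its methods are hypothetical helpers written for this illustration; they are not part of Helix or ZooKeeper, and the regular expression assumes the exact log layout shown above.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for reading the log excerpt; not a Helix API.
class KeeperStateTimeline {
    // Captures the leading timestamp and the state name after "KeeperState:".
    private static final Pattern LINE = Pattern.compile(
            "^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3}) .*KeeperState:(\\w+),");
    private static final DateTimeFormatter TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // Returns the first time each KeeperState appears, in log order.
    static Map<String, LocalDateTime> firstSeen(List<String> lines) {
        Map<String, LocalDateTime> seen = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (m.find()) {
                seen.putIfAbsent(m.group(2), LocalDateTime.parse(m.group(1), TS));
            }
        }
        return seen;
    }

    // Time elapsed between the first occurrence of two states.
    static Duration between(Map<String, LocalDateTime> seen, String from, String to) {
        return Duration.between(seen.get(from), seen.get(to));
    }
}
```

Fed the two `ZKHelixManager` lines from the excerpt, this reports about 33 seconds between `Disconnected` (13:08:37,483) and `Expired` (13:09:10,546).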

To Reproduce

Cause a network interruption between the Helix agent and the ZooKeeper service. Helix attempts to reconnect but does not report the connection failure back up through HelixManager.

Expected behavior

The HelixManager.isConnected() method should return false if there is a connection issue with the ZooKeeper service.
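
For context, the restart logic described in this report could look roughly like the sketch below. It is only an illustration of the intended behavior, not the reporter's actual agent code: `ConnectionWatchdog` is a hypothetical class, and the `connected` probe is a stand-in that would presumably wrap `HelixManager.isConnected()`. Because the report says that method does not flip to false after the session expires, a watchdog like this would never fire in the failing scenario, which is the core of the problem.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

// Hypothetical watchdog; not part of Helix.
class ConnectionWatchdog {
    private final BooleanSupplier connected;  // assumed to wrap helixManager::isConnected
    private final LongSupplier clockMillis;   // injectable clock, e.g. System::currentTimeMillis
    private final long graceMillis;           // how long a disconnect is tolerated
    private long disconnectedSince = -1;      // -1 means the last poll saw a live connection

    ConnectionWatchdog(BooleanSupplier connected, LongSupplier clockMillis, Duration grace) {
        this.connected = connected;
        this.clockMillis = clockMillis;
        this.graceMillis = grace.toMillis();
    }

    // Polls the probe once; returns true when the disconnect has outlasted the grace period.
    boolean shouldRestart() {
        if (connected.getAsBoolean()) {
            disconnectedSince = -1;
            return false;
        }
        long now = clockMillis.getAsLong();
        if (disconnectedSince < 0) {
            disconnectedSince = now;  // first poll that observed the disconnect
        }
        return now - disconnectedSince >= graceMillis;
    }
}
```

A caller would poll `shouldRestart()` on a timer and restart the agent when it returns true. The grace period keeps brief `Disconnected` blips, which ZooKeeper clients normally recover from, from triggering a restart.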

Labels: bug
