We have observed an issue within our internal Kubernetes clusters (which contain approximately 5000 pods and 400 nodes) where the Cassandra pods are continuously restarting and failing to come up. Specifically, the Cassandra pods are unable to start, although the management API process is running.
Upon further investigation, we found that when a pod starts, the cass-operator attempts to make a remote call to initiate the Cassandra process on the specific pod. However, the call fails, so the Cassandra process never starts.
To investigate further, I modified the code to log any errors before the pod is deleted. Here is the log that was captured:
2025-02-12T21:57:39.491Z DEBUG events Starting Cassandra for pod cs-6a03e53f4b-cs-6a03e53f4b-default-sts-2 {"type": "Normal", "object": {"kind":"CassandraDatacenter","namespace":"apps","name":"cs-6a03e53f4b","uid":"8fa6708e-5dee-4c89-82c4-70293420b0f2","apiVersion":"cassandra.datastax.com/v1beta1","resourceVersion":"463825506"}, "reason": "StartingCassandra"}
2025-02-12T21:57:39.492Z ERROR Call to start Cassandra remotely failed. {"controller": "cassandradatacenter_controller", "controllerGroup": "cassandra.datastax.com", "controllerKind": "CassandraDatacenter", "CassandraDatacenter": {"name":"cs-6a03e53f4b","namespace":"apps"}, "namespace": "apps", "name": "cs-6a03e53f4b", "reconcileID": "b66fb680-15cd-4b47-b2e9-edd1bd20e232", "namespace": "apps", "datacenterName": "cs-6a03e53f4b", "clusterName": "cs-6a03e53f4b", "error": "Post \"http://172.20.68.35:8080/api/v0/lifecycle/start\": dial tcp 172.20.68.35:8080: connect: connection refused"}
github.com/k8ssandra/cass-operator/pkg/reconciliation.(*ReconciliationContext).startCassandra.func1
/workspace/pkg/reconciliation/reconcile_racks.go:1965
The issue seems to be related to a timing problem that shows up under heavy load on the Kubernetes cluster. Even though the pod itself is up, the cass-operator is unable to connect to the management API and make the required request (the connection is refused). You can see further details about the implementation here: ReconciliationContext.
To temporarily mitigate this issue, I added a retry mechanism with exponential backoff in the cass-operator code. After several attempts, the cass-operator was eventually able to start the Cassandra process successfully on the pod.
2025-02-12T22:55:50.698Z INFO calling Management API start node - POST /api/v0/lifecycle/start {"controller": "cassandradatacenter_controller", "controllerGroup": "cassandra.datastax.com", "controllerKind": "CassandraDatacenter", "CassandraDatacenter": {"name":"cs-6a03e53f4b","namespace":"apps"}, "namespace": "apps", "name": "cs-6a03e53f4b", "reconcileID": "8c789763-9d9a-4316-8e8e-3a1166315b48", "pod": "cs-6a03e53f4b-cs-6a03e53f4b-default-sts-3", "podIP": "172.21.70.11", "replaceIP": ""}
2025-02-12T22:55:50.698Z DEBUG events Starting Cassandra for pod cs-6a03e53f4b-cs-6a03e53f4b-default-sts-3 {"type": "Normal", "object": {"kind":"CassandraDatacenter","namespace":"apps","name":"cs-6a03e53f4b","uid":"8fa6708e-5dee-4c89-82c4-70293420b0f2","apiVersion":"cassandra.datastax.com/v1beta1","resourceVersion":"463938256"}, "reason": "StartingCassandra"}
2025-02-12T22:55:51.296Z INFO Starting Cassandra for pod cs-6a03e53f4b-cs-6a03e53f4b-default-sts-3 {"controller": "cassandradatacenter_controller", "controllerGroup": "cassandra.datastax.com", "controllerKind": "CassandraDatacenter", "CassandraDatacenter": {"name":"cs-6a03e53f4b","namespace":"apps"}, "namespace": "apps", "name": "cs-6a03e53f4b", "reconcileID": "8c789763-9d9a-4316-8e8e-3a1166315b48", "reason": "StartingCassandra", "eventType": "Normal"}
2025-02-12T22:55:51.296Z INFO calling Management API start node - POST /api/v0/lifecycle/start {"controller": "cassandradatacenter_controller", "controllerGroup": "cassandra.datastax.com", "controllerKind": "CassandraDatacenter", "CassandraDatacenter": {"name":"cs-6a03e53f4b","namespace":"apps"}, "namespace": "apps", "name": "cs-6a03e53f4b", "reconcileID": "8c789763-9d9a-4316-8e8e-3a1166315b48", "pod": "cs-6a03e53f4b-cs-6a03e53f4b-default-sts-3", "podIP": "172.21.70.11", "replaceIP": ""}
2025-02-12T22:55:51.296Z DEBUG events Starting Cassandra for pod cs-6a03e53f4b-cs-6a03e53f4b-default-sts-3 {"type": "Normal", "object": {"kind":"CassandraDatacenter","namespace":"apps","name":"cs-6a03e53f4b","uid":"8fa6708e-5dee-4c89-82c4-70293420b0f2","apiVersion":"cassandra.datastax.com/v1beta1","resourceVersion":"464085205"}, "reason": "StartingCassandra"}
2025-02-12T22:55:51.806Z INFO Starting Cassandra for pod cs-6a03e53f4b-cs-6a03e53f4b-default-sts-3 {"controller": "cassandradatacenter_controller", "controllerGroup": "cassandra.datastax.com", "controllerKind": "CassandraDatacenter", "CassandraDatacenter": {"name":"cs-6a03e53f4b","namespace":"apps"}, "namespace": "apps", "name": "cs-6a03e53f4b", "reconcileID": "8c789763-9d9a-4316-8e8e-3a1166315b48", "reason": "StartingCassandra", "eventType": "Normal"}
2025-02-12T22:55:51.806Z DEBUG events Starting Cassandra for pod cs-6a03e53f4b-cs-6a03e53f4b-default-sts-3 {"type": "Normal", "object": {"kind":"CassandraDatacenter","namespace":"apps","name":"cs-6a03e53f4b","uid":"8fa6708e-5dee-4c89-82c4-70293420b0f2","apiVersion":"cassandra.datastax.com/v1beta1","resourceVersion":"464085205"}, "reason": "StartingCassandra"}
2025-02-12T22:55:51.806Z INFO calling Management API start node - POST /api/v0/lifecycle/start {"controller": "cassandradatacenter_controller", "controllerGroup": "cassandra.datastax.com", "controllerKind": "CassandraDatacenter", "CassandraDatacenter": {"name":"cs-6a03e53f4b","namespace":"apps"}, "namespace": "apps", "name": "cs-6a03e53f4b", "reconcileID": "8c789763-9d9a-4316-8e8e-3a1166315b48", "pod": "cs-6a03e53f4b-cs-6a03e53f4b-default-sts-3", "podIP": "172.21.70.11", "replaceIP": ""}
2025-02-12T22:55:53.012Z INFO Starting Cassandra for pod cs-6a03e53f4b-cs-6a03e53f4b-default-sts-3 {"controller": "cassandradatacenter_controller", "controllerGroup": "cassandra.datastax.com", "controllerKind": "CassandraDatacenter", "CassandraDatacenter": {"name":"cs-6a03e53f4b","namespace":"apps"}, "namespace": "apps", "name": "cs-6a03e53f4b", "reconcileID": "8c789763-9d9a-4316-8e8e-3a1166315b48", "reason": "StartingCassandra", "eventType": "Normal"}
2025-02-12T22:55:53.012Z INFO calling Management API start node - POST /api/v0/lifecycle/start {"controller": "cassandradatacenter_controller", "controllerGroup": "cassandra.datastax.com", "controllerKind": "CassandraDatacenter", "CassandraDatacenter": {"name":"cs-6a03e53f4b","namespace":"apps"}, "namespace": "apps", "name": "cs-6a03e53f4b", "reconcileID": "8c789763-9d9a-4316-8e8e-3a1166315b48", "pod": "cs-6a03e53f4b-cs-6a03e53f4b-default-sts-3", "podIP": "172.21.70.11", "replaceIP": ""}
2025-02-12T22:55:53.012Z DEBUG events Starting Cassandra for pod cs-6a03e53f4b-cs-6a03e53f4b-default-sts-3 {"type": "Normal", "object": {"kind":"CassandraDatacenter","namespace":"apps","name":"cs-6a03e53f4b","uid":"8fa6708e-5dee-4c89-82c4-70293420b0f2","apiVersion":"cassandra.datastax.com/v1beta1","resourceVersion":"464085205"}, "reason": "StartingCassandra"}
2025-02-12T22:55:55.532Z INFO Starting Cassandra for pod cs-6a03e53f4b-cs-6a03e53f4b-default-sts-3 {"controller": "cassandradatacenter_controller", "controllerGroup": "cassandra.datastax.com", "controllerKind": "CassandraDatacenter", "CassandraDatacenter": {"name":"cs-6a03e53f4b","namespace":"apps"}, "namespace": "apps", "name": "cs-6a03e53f4b", "reconcileID": "8c789763-9d9a-4316-8e8e-3a1166315b48", "reason": "StartingCassandra", "eventType": "Normal"}
2025-02-12T22:55:55.532Z INFO calling Management API start node - POST /api/v0/lifecycle/start {"controller": "cassandradatacenter_controller", "controllerGroup": "cassandra.datastax.com", "controllerKind": "CassandraDatacenter", "CassandraDatacenter": {"name":"cs-6a03e53f4b","namespace":"apps"}, "namespace": "apps", "name": "cs-6a03e53f4b", "reconcileID": "8c789763-9d9a-4316-8e8e-3a1166315b48", "pod": "cs-6a03e53f4b-cs-6a03e53f4b-default-sts-3", "podIP": "172.21.70.11", "replaceIP": ""}
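For reference, the kind of retry described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual patch: `backoffDelay`, `retryWithBackoff`, `errRefused`, and the simulated start call are hypothetical names introduced here, and the attempt count and delays are arbitrary assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errRefused simulates the "connection refused" error seen in the log (hypothetical).
var errRefused = errors.New("connect: connection refused")

// backoffDelay returns the wait before retry number `attempt` (0-based):
// base doubled `attempt` times, capped at max.
func backoffDelay(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	if d > max {
		return max
	}
	return d
}

// retryWithBackoff calls op up to `attempts` times, sleeping with
// exponential backoff between failed calls.
func retryWithBackoff(attempts int, base, max time.Duration, op func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var lastErr error
	for i := 0; i < attempts; i++ {
		if lastErr = op(); lastErr == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(backoffDelay(i, base, max))
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, lastErr)
}

func main() {
	// Stand-in for the remote start call: refused twice, then accepted.
	calls := 0
	startCassandra := func() error {
		calls++
		if calls < 3 {
			return errRefused
		}
		return nil
	}
	err := retryWithBackoff(5, 10*time.Millisecond, 100*time.Millisecond, startCassandra)
	fmt.Println("calls:", calls, "err:", err)
}
```

Sleeping inside the reconcile loop keeps the sketch short. In a controller, requeueing the reconcile with a delay may be a better fit than blocking in-line.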
Could you kindly advise if this retry mechanism is an appropriate fix for the problem, or if there is a potential issue with how the Cassandra pod is being marked as ready to accept API requests from the operator?
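To help narrow down the readiness question, one option is to check what a plain TCP dial to the management API port reports. "Connection refused" generally means the host was reachable but nothing was listening on that port yet. The sketch below is only an illustration under that assumption: `dialStatus` is a hypothetical helper, not part of cass-operator, and the address in `main` is a local stand-in for the pod IP and port.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"
	"time"
)

// dialStatus tries to open a TCP connection to addr and classifies the result:
// "open" if the port accepts connections, "refused" if the host answered but
// nothing is listening, and "unreachable" for timeouts and other errors.
func dialStatus(addr string, timeout time.Duration) string {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err == nil {
		conn.Close()
		return "open"
	}
	if errors.Is(err, syscall.ECONNREFUSED) {
		return "refused"
	}
	return "unreachable"
}

func main() {
	// Local listener standing in for the management API port.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	addr := ln.Addr().String()
	fmt.Println("while listening:", dialStatus(addr, time.Second))

	// Once the listener is closed, the same address no longer accepts connections.
	ln.Close()
	fmt.Println("after close:", dialStatus(addr, time.Second))
}
```

If the status is "refused" at the moment the operator issues the start call, that would be consistent with the pod being treated as ready before the management API has bound its port.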
What did you expect to happen?
No response
How can we reproduce it (as minimally and precisely as possible)?
These issues are intermittent, but they occur more frequently when the Kubernetes cluster is under heavy load. We are not sure how to reproduce them locally.
cass-operator version
1.22.4
Kubernetes version
1.30.8
Method of installation
helm
Anything else we need to know?
No response
Jira issue number: CASS-92 (synchronized by Unito)