Issue with Cassandra Pods Restarting in Large Kubernetes Clusters #759

@garrynigel

What happened?

We have observed an issue in our internal Kubernetes clusters (approximately 5000 pods and 400 nodes) where the Cassandra pods restart continuously and fail to come up. Specifically, Cassandra never starts inside the pod, although the management API process is running.

Upon further investigation, we found that when a pod starts, the cass-operator attempts to make a remote call to initiate the Cassandra process on the specific pod. However, the process fails to start.

To investigate further, I modified the code to log any errors before the pod is deleted. Here is the log that was captured:

2025-02-12T21:57:39.491Z        DEBUG   events  Starting Cassandra for pod cs-6a03e53f4b-cs-6a03e53f4b-default-sts-2    {"type": "Normal", "object": {"kind":"CassandraDatacenter","namespace":"apps","name":"cs-6a03e53f4b","uid":"8fa6708e-5dee-4c89-82c4-70293420b0f2","apiVersion":"cassandra.datastax.com/v1beta1","resourceVersion":"463825506"}, "reason": "StartingCassandra"}
2025-02-12T21:57:39.492Z        ERROR   Call to start Cassandra remotely failed.        {"controller": "cassandradatacenter_controller", "controllerGroup": "cassandra.datastax.com", "controllerKind": "CassandraDatacenter", "CassandraDatacenter": {"name":"cs-6a03e53f4b","namespace":"apps"}, "namespace": "apps", "name": "cs-6a03e53f4b", "reconcileID": "b66fb680-15cd-4b47-b2e9-edd1bd20e232", "namespace": "apps", "datacenterName": "cs-6a03e53f4b", "clusterName": "cs-6a03e53f4b", "error": "Post \"http://172.20.68.35:8080/api/v0/lifecycle/start\": dial tcp 172.20.68.35:8080: connect: connection refused"}
github.com/k8ssandra/cass-operator/pkg/reconciliation.(*ReconciliationContext).startCassandra.func1
        /workspace/pkg/reconciliation/reconcile_racks.go:1965

The issue appears to be a timing problem that surfaces under heavy load on the Kubernetes cluster. Even though the pod itself is up, the cass-operator's request to the management API is refused (see the "connection refused" error above). You can see further details about the implementation here: ReconciliationContext.

To temporarily mitigate this issue, I added a retry mechanism with exponential backoff to the cass-operator code. After several attempts, the cass-operator was eventually able to start the Cassandra process on the pod:

2025-02-12T22:55:50.698Z	INFO	calling Management API start node - POST /api/v0/lifecycle/start	{"controller": "cassandradatacenter_controller", "controllerGroup": "cassandra.datastax.com", "controllerKind": "CassandraDatacenter", "CassandraDatacenter": {"name":"cs-6a03e53f4b","namespace":"apps"}, "namespace": "apps", "name": "cs-6a03e53f4b", "reconcileID": "8c789763-9d9a-4316-8e8e-3a1166315b48", "pod": "cs-6a03e53f4b-cs-6a03e53f4b-default-sts-3", "podIP": "172.21.70.11", "replaceIP": ""}
2025-02-12T22:55:50.698Z	DEBUG	events	Starting Cassandra for pod cs-6a03e53f4b-cs-6a03e53f4b-default-sts-3	{"type": "Normal", "object": {"kind":"CassandraDatacenter","namespace":"apps","name":"cs-6a03e53f4b","uid":"8fa6708e-5dee-4c89-82c4-70293420b0f2","apiVersion":"cassandra.datastax.com/v1beta1","resourceVersion":"463938256"}, "reason": "StartingCassandra"}
2025-02-12T22:55:51.296Z	INFO	Starting Cassandra for pod cs-6a03e53f4b-cs-6a03e53f4b-default-sts-3	{"controller": "cassandradatacenter_controller", "controllerGroup": "cassandra.datastax.com", "controllerKind": "CassandraDatacenter", "CassandraDatacenter": {"name":"cs-6a03e53f4b","namespace":"apps"}, "namespace": "apps", "name": "cs-6a03e53f4b", "reconcileID": "8c789763-9d9a-4316-8e8e-3a1166315b48", "reason": "StartingCassandra", "eventType": "Normal"}
2025-02-12T22:55:51.296Z	INFO	calling Management API start node - POST /api/v0/lifecycle/start	{"controller": "cassandradatacenter_controller", "controllerGroup": "cassandra.datastax.com", "controllerKind": "CassandraDatacenter", "CassandraDatacenter": {"name":"cs-6a03e53f4b","namespace":"apps"}, "namespace": "apps", "name": "cs-6a03e53f4b", "reconcileID": "8c789763-9d9a-4316-8e8e-3a1166315b48", "pod": "cs-6a03e53f4b-cs-6a03e53f4b-default-sts-3", "podIP": "172.21.70.11", "replaceIP": ""}
2025-02-12T22:55:51.296Z	DEBUG	events	Starting Cassandra for pod cs-6a03e53f4b-cs-6a03e53f4b-default-sts-3	{"type": "Normal", "object": {"kind":"CassandraDatacenter","namespace":"apps","name":"cs-6a03e53f4b","uid":"8fa6708e-5dee-4c89-82c4-70293420b0f2","apiVersion":"cassandra.datastax.com/v1beta1","resourceVersion":"464085205"}, "reason": "StartingCassandra"}
2025-02-12T22:55:51.806Z	INFO	Starting Cassandra for pod cs-6a03e53f4b-cs-6a03e53f4b-default-sts-3	{"controller": "cassandradatacenter_controller", "controllerGroup": "cassandra.datastax.com", "controllerKind": "CassandraDatacenter", "CassandraDatacenter": {"name":"cs-6a03e53f4b","namespace":"apps"}, "namespace": "apps", "name": "cs-6a03e53f4b", "reconcileID": "8c789763-9d9a-4316-8e8e-3a1166315b48", "reason": "StartingCassandra", "eventType": "Normal"}
2025-02-12T22:55:51.806Z	DEBUG	events	Starting Cassandra for pod cs-6a03e53f4b-cs-6a03e53f4b-default-sts-3	{"type": "Normal", "object": {"kind":"CassandraDatacenter","namespace":"apps","name":"cs-6a03e53f4b","uid":"8fa6708e-5dee-4c89-82c4-70293420b0f2","apiVersion":"cassandra.datastax.com/v1beta1","resourceVersion":"464085205"}, "reason": "StartingCassandra"}
2025-02-12T22:55:51.806Z	INFO	calling Management API start node - POST /api/v0/lifecycle/start	{"controller": "cassandradatacenter_controller", "controllerGroup": "cassandra.datastax.com", "controllerKind": "CassandraDatacenter", "CassandraDatacenter": {"name":"cs-6a03e53f4b","namespace":"apps"}, "namespace": "apps", "name": "cs-6a03e53f4b", "reconcileID": "8c789763-9d9a-4316-8e8e-3a1166315b48", "pod": "cs-6a03e53f4b-cs-6a03e53f4b-default-sts-3", "podIP": "172.21.70.11", "replaceIP": ""}
2025-02-12T22:55:53.012Z	INFO	Starting Cassandra for pod cs-6a03e53f4b-cs-6a03e53f4b-default-sts-3	{"controller": "cassandradatacenter_controller", "controllerGroup": "cassandra.datastax.com", "controllerKind": "CassandraDatacenter", "CassandraDatacenter": {"name":"cs-6a03e53f4b","namespace":"apps"}, "namespace": "apps", "name": "cs-6a03e53f4b", "reconcileID": "8c789763-9d9a-4316-8e8e-3a1166315b48", "reason": "StartingCassandra", "eventType": "Normal"}
2025-02-12T22:55:53.012Z	INFO	calling Management API start node - POST /api/v0/lifecycle/start	{"controller": "cassandradatacenter_controller", "controllerGroup": "cassandra.datastax.com", "controllerKind": "CassandraDatacenter", "CassandraDatacenter": {"name":"cs-6a03e53f4b","namespace":"apps"}, "namespace": "apps", "name": "cs-6a03e53f4b", "reconcileID": "8c789763-9d9a-4316-8e8e-3a1166315b48", "pod": "cs-6a03e53f4b-cs-6a03e53f4b-default-sts-3", "podIP": "172.21.70.11", "replaceIP": ""}
2025-02-12T22:55:53.012Z	DEBUG	events	Starting Cassandra for pod cs-6a03e53f4b-cs-6a03e53f4b-default-sts-3	{"type": "Normal", "object": {"kind":"CassandraDatacenter","namespace":"apps","name":"cs-6a03e53f4b","uid":"8fa6708e-5dee-4c89-82c4-70293420b0f2","apiVersion":"cassandra.datastax.com/v1beta1","resourceVersion":"464085205"}, "reason": "StartingCassandra"}
2025-02-12T22:55:55.532Z	INFO	Starting Cassandra for pod cs-6a03e53f4b-cs-6a03e53f4b-default-sts-3	{"controller": "cassandradatacenter_controller", "controllerGroup": "cassandra.datastax.com", "controllerKind": "CassandraDatacenter", "CassandraDatacenter": {"name":"cs-6a03e53f4b","namespace":"apps"}, "namespace": "apps", "name": "cs-6a03e53f4b", "reconcileID": "8c789763-9d9a-4316-8e8e-3a1166315b48", "reason": "StartingCassandra", "eventType": "Normal"}
2025-02-12T22:55:55.532Z	INFO	calling Management API start node - POST /api/v0/lifecycle/start	{"controller": "cassandradatacenter_controller", "controllerGroup": "cassandra.datastax.com", "controllerKind": "CassandraDatacenter", "CassandraDatacenter": {"name":"cs-6a03e53f4b","namespace":"apps"}, "namespace": "apps", "name": "cs-6a03e53f4b", "reconcileID": "8c789763-9d9a-4316-8e8e-3a1166315b48", "pod": "cs-6a03e53f4b-cs-6a03e53f4b-default-sts-3", "podIP": "172.21.70.11", "replaceIP": ""}

Could you kindly advise if this retry mechanism is an appropriate fix for the problem, or if there is a potential issue with how the Cassandra pod is being marked as ready to accept API requests from the operator?

What did you expect to happen?

No response

How can we reproduce it (as minimally and precisely as possible)?

The issue is intermittent, but it occurs more frequently when the Kubernetes cluster is under heavy load. We are not sure how to reproduce it locally.

cass-operator version

1.22.4

Kubernetes version

1.30.8

Method of installation

helm

Anything else we need to know?

No response

┆Issue is synchronized with this Jira Story by Unito
┆Issue Number: CASS-92

Labels

bug (Something isn't working)
