
Commit 6538475

CLOUDP-350185 Fix flaky e2e_multi_cluster_sharded_snippets test (#503)
# Fix flaky e2e_multi_cluster_sharded_snippets test

## Problem

The `e2e_multi_cluster_sharded_snippets` test fails intermittently when the Kubernetes API server times out during resource creation. Example run: https://spruce.mongodb.com/task/mongodb_kubernetes_e2e_multi_cluster_kind_e2e_multi_cluster_sharded_snippets_12f405afd0f823091430f0be8f4ac21d87a9559c_25_10_05_20_58_10/files?execution=0&sorts=STATUS%3AASC

**What I noticed in my investigation:**

1. The test deploys 5 sharded MongoDB clusters simultaneously (~75-100 services across 3 clusters).
2. Around 7-8 minutes in, the K8s API server times out on a service update operation.
3. The operator marks the resource as Failed with the error: `"the server was unable to return a response in the time allotted, but may still be processing the request"`.
4. The test immediately fails.
5. Minutes later, the resource actually reaches Running (the timeout was transient).

## Investigation

- The operator issues hundreds of K8s API operations during reconciliation.
- This overloads the kind cluster's API server.
- The K8s API timeouts are transient: services and pods are created successfully, just more slowly than expected.
- After being marked Failed, resources recover within 4-5 minutes.

## This Fix

Add K8s API timeout patterns to the `intermediate_events` list in `mongodb.py`:

- `"but may still be processing the request"` (server-side timeout)
- `"Client.Timeout exceeded while awaiting headers"` (client-side timeout)
- `"context deadline exceeded"`

**Effect:**

- When the operator marks a resource as Failed with a K8s API timeout error, the test skips the failure.
- The test continues waiting for the resource to reach Running.
- The test passes once the resource recovers (which it does).

A sketch of how such a wait loop consumes `intermediate_events` is included after this description.

This is the same pattern already used for other transient failures such as agent registration timeouts and Ops Manager connection issues.

## Proper Fix (Future Work)

The operator should not mark resources as Failed on a K8s API timeout. Instead, for example, it could:

1. Detect K8s API timeout errors.
2. Retry with exponential backoff.
3. Only mark the resource Failed after multiple consecutive timeouts.

An illustrative sketch of this idea is also included after this description.

## Proof of work

Ran 4 patches to check for flakiness after the fix:

1. [Patch 1](https://spruce.mongodb.com/task/mongodb_kubernetes_e2e_multi_cluster_kind_e2e_multi_cluster_sharded_snippets_patch_c5839ff5b5d5b5b1e338b476c9d299513525f506_68e6597888fa050007a68f9e_25_10_08_12_30_49/logs?execution=0)
2. [Patch 2](https://spruce.mongodb.com/task/mongodb_kubernetes_e2e_multi_cluster_kind_e2e_multi_cluster_sharded_snippets_patch_c5839ff5b5d5b5b1e338b476c9d299513525f506_68e6597ca47d640007870576_25_10_08_12_30_53/logs?execution=0)
3. [Patch 3](https://spruce.mongodb.com/task/mongodb_kubernetes_e2e_multi_cluster_kind_e2e_multi_cluster_sharded_snippets_patch_c5839ff5b5d5b5b1e338b476c9d299513525f506_68e6598068612800074a09bc_25_10_08_12_30_58/logs?execution=0)
4. [Patch 4](https://spruce.mongodb.com/task/mongodb_kubernetes_e2e_multi_cluster_kind_e2e_multi_cluster_sharded_snippets_patch_c5839ff5b5d5b5b1e338b476c9d299513525f506_68e6598bb1a26200071edb2c_25_10_08_12_31_08/logs?execution=0)

All patches reached Running despite intermediate failures like:

```
[2025/10/08 15:03:38.199] DEBUG 2025-10-08 13:03:38,198 [mongodb_utils_state] Found intermediate event in failure: Client.Timeout exceeded while awaiting headers in Failed to create configmap: a-1759927824-grtlr6pj55z/pod-template-shards-0-hostname-override in cluster: kind-e2e-cluster-1, err: Put "https://10.97.0.1/api/v1/namespaces/a-1759927824-grtlr6pj55z/configmaps/pod-template-shards-0-hostname-override?timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers). Skipping the failure state
```

The test now skips these transient API timeout failures and waits for the resources to recover.

`backup_minio` tests are failing, but they are also failing on many other branches.
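For readers unfamiliar with how `intermediate_events` is consumed, the following is a minimal sketch of the kind of wait loop `assert_reaches_phase` implements. The helper names (`get_status_phase`, `get_status_message`) and the exact control flow are illustrative assumptions, not the actual kubetester code.

```python
import time


def wait_for_phase(resource, target_phase, intermediate_events, timeout=600):
    """Illustrative wait loop: a Failed status whose message matches any
    intermediate event is treated as transient and skipped, not fatal."""
    start = time.time()
    while time.time() - start < timeout:
        phase = resource.get_status_phase()      # assumed helper
        message = resource.get_status_message()  # assumed helper
        if phase == target_phase:
            return
        if phase == "Failed":
            if any(event in (message or "") for event in intermediate_events):
                # Transient failure (e.g. a K8s API timeout); keep waiting.
                time.sleep(3)
                continue
            raise AssertionError(f"Resource reached Failed: {message}")
        time.sleep(3)
    raise TimeoutError(f"Resource did not reach {target_phase} within {timeout}s")
```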
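The "Proper Fix (Future Work)" idea could look roughly like the sketch below. It is written in Python purely for illustration (the operator itself is a Go program), and `apply_resource`, `is_k8s_timeout_error`, and the retry parameters are hypothetical names, not existing operator code.

```python
import time

TRANSIENT_PATTERNS = (
    "but may still be processing the request",
    "Client.Timeout exceeded while awaiting headers",
    "context deadline exceeded",
)


def is_k8s_timeout_error(err: Exception) -> bool:
    # Hypothetical classifier: treat known K8s API timeout messages as transient.
    return any(pattern in str(err) for pattern in TRANSIENT_PATTERNS)


def apply_with_backoff(apply_resource, max_attempts=5, base_delay=1.0):
    """Retry a K8s API call with exponential backoff; only let the failure
    propagate (and the resource be marked Failed) after repeated timeouts."""
    for attempt in range(max_attempts):
        try:
            return apply_resource()
        except Exception as err:
            if not is_k8s_timeout_error(err) or attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```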
1 parent c45bc73 commit 6538475

docker/mongodb-kubernetes-tests/kubetester/mongodb.py

Lines changed: 5 additions & 0 deletions
```diff
@@ -89,6 +89,11 @@ def assert_reaches_phase(self, phase: Phase, msg_regexp=None, timeout=None, igno
             "Failed to enable Authentication",
             # Sometimes agents need longer to register with OM.
             "some agents failed to register or the Operator",
+            # Kubernetes API server may timeout during high load, but the request may still complete.
+            # This is particularly common when deploying many resources simultaneously.
+            "but may still be processing the request",
+            "Client.Timeout exceeded while awaiting headers",
+            "context deadline exceeded",
         )

         start_time = time.time()
```
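For context, tests exercise this list indirectly through `assert_reaches_phase`; a minimal hedged usage example (the resource variable and timeout value are illustrative):

```python
# With the new patterns in intermediate_events, a transient K8s API timeout
# no longer fails this assertion; the test keeps polling until Running.
sharded_cluster.assert_reaches_phase(Phase.Running, timeout=3600)
```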
