Describe the bug
Kafka pod is OOMKilled during the init of the astronomy-shop, causing the entire deployment to fail with a timeout after 300 seconds.
To Reproduce
Steps to reproduce the behavior:
Run poetry run python clients/openrouter.py. When we run the case, the kafka pod is OOMKilled:
kubectl get pods --all-namespaces | grep -v Running
NAMESPACE NAME READY STATUS RESTARTS AGE
astronomy-shop accounting-567f87bbcd-6qg59 0/1 Init:0/1 0 7m21s
astronomy-shop ad-856c77f595-v875m 0/1 CrashLoopBackOff 6 (72s ago) 7m19s
astronomy-shop checkout-5bc54f8cd8-h8j4z 0/1 Init:0/1 0 7m21s
astronomy-shop fraud-detection-65868bcdb5-jgl9w 0/1 Init:0/1 0 7m19s
astronomy-shop kafka-595d889c56-l2ks9 0/1 OOMKilled 6 (3m7s ago) 7m22s
default wrk2-job-tqj7r 0/1 Completed 0 23m
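The OOMKill can be confirmed from the container's last termination state in the pod JSON. A minimal sketch (the helper name is hypothetical; against a live cluster you would pipe `kubectl -n astronomy-shop get pod kafka-595d889c56-l2ks9 -o json` into it):

```shell
# Print .status.containerStatuses[0].lastState.terminated.reason from pod JSON
# read on stdin. Uses python3 for JSON parsing so no jq dependency is needed.
last_termination_reason() {
  python3 -c 'import json,sys; p=json.load(sys.stdin); print(p["status"]["containerStatuses"][0]["lastState"]["terminated"]["reason"])'
}
# Usage against a live cluster (pod name taken from the listing above):
#   kubectl -n astronomy-shop get pod kafka-595d889c56-l2ks9 -o json | last_termination_reason
#   -> OOMKilled
```

A reason of OOMKilled here distinguishes a kernel OOM kill (memory limit too low) from an application crash, which would show a nonzero exit code instead.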
The result is a timeout after 300 seconds:
- All services are available via the Frontend proxy: http://localhost:8080
by running these commands:
kubectl --namespace astronomy-shop port-forward svc/frontend-proxy 8080:8080
The following services are available at these paths after the frontend-proxy service is exposed with port forwarding:
Webstore http://localhost:8080/
Jaeger UI http://localhost:8080/jaeger/ui/
Grafana http://localhost:8080/grafana/
Load Generator UI http://localhost:8080/loadgen/
Feature Flags UI http://localhost:8080/feature/
[06:19:12] Waiting for all pods in namespace 'astronomy-shop' to be ready... kubectl.py:74
Traceback (most recent call last):
File "/home/user/leo.li/AIOpsLab/clients/openrouter.py", line 216, in <module>
problem_desc, instructs, apis = orchestrator.init_problem(pid)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/leo.li/AIOpsLab/aiopslab/orchestrator/orchestrator.py", line 72, in init_problem
prob.app.deploy()
File "/home/user/leo.li/AIOpsLab/aiopslab/service/apps/astronomy_shop.py", line 32, in deploy
Helm.assert_if_deployed(self.helm_configs["namespace"])
File "/home/user/leo.li/AIOpsLab/aiopslab/service/helm.py", line 123, in assert_if_deployed
raise e
File "/home/user/leo.li/AIOpsLab/aiopslab/service/helm.py", line 121, in assert_if_deployed
kubectl.wait_for_ready(namespace)
File "/home/user/leo.li/AIOpsLab/aiopslab/service/kubectl.py", line 100, in wait_for_ready
raise Exception(f"[red]Timeout: Not all pods in namespace '{namespace}' reached the Ready state within {max_wait} seconds.")
Exception: [red]Timeout: Not all pods in namespace 'astronomy-shop' reached the Ready state within 300 seconds.
Expected behavior
After I manually increase the memory limit for kafka, it works:
== Helm Install ==
NAME: astronomy-shop
LAST DEPLOYED: Mon Aug 25 06:59:03 2025
NAMESPACE: astronomy-shop
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
=======================================================================================
██████╗ ████████╗███████╗██╗ ██████╗ ███████╗███╗ ███╗ ██████╗
██╔═══██╗╚══██╔══╝██╔════╝██║ ██╔══██╗██╔════╝████╗ ████║██╔═══██╗
██║ ██║ ██║ █████╗ ██║ ██║ ██║█████╗ ██╔████╔██║██║ ██║
██║ ██║ ██║ ██╔══╝ ██║ ██║ ██║██╔══╝ ██║╚██╔╝██║██║ ██║
╚██████╔╝ ██║ ███████╗███████╗ ██████╔╝███████╗██║ ╚═╝ ██║╚██████╔╝
╚═════╝ ╚═╝ ╚══════╝╚══════╝ ╚═════╝ ╚══════╝╚═╝ ╚═╝ ╚═════╝
- All services are available via the Frontend proxy: http://localhost:8080
by running these commands:
kubectl --namespace astronomy-shop port-forward svc/frontend-proxy 8080:8080
The following services are available at these paths after the frontend-proxy service is exposed with port forwarding:
Webstore http://localhost:8080/
Jaeger UI http://localhost:8080/jaeger/ui/
Grafana http://localhost:8080/grafana/
Load Generator UI http://localhost:8080/loadgen/
Feature Flags UI http://localhost:8080/feature/
[06:59:05] Waiting for all pods in namespace 'astronomy-shop' to be ready... kubectl.py:83
[06:59:46] All pods in namespace 'astronomy-shop' are ready. kubectl.py:100
== Fault Injection ==
ConfigMap 'flagd-config' updated in namespace 'astronomy-shop'
Fault injected: Feature flag 'adFailure' set to 'on'.
Fault: adServiceFailure | Namespace: astronomy-shop
== Start Workload ==
Workload skipped since AstronomyShop has a built-in load generator.
===== Agent (OpenRouter - qwen/qwen3-30b-a3b-instruct-2507) ====
Is it appropriate to increase the limit directly in https://github.com/xlab-uiuc/opentelemetry-helm-charts/blob/68d88623f480687a0a50dcd7ca1ff3340bb1ff7a/charts/opentelemetry-demo/values.yaml like this?
diff --git a/charts/opentelemetry-demo/values.yaml b/charts/opentelemetry-demo/values.yaml
index b09ccbd3..77732a02 100644
--- a/charts/opentelemetry-demo/values.yaml
+++ b/charts/opentelemetry-demo/values.yaml
@@ -648,10 +648,12 @@ components:
       - name: OTEL_EXPORTER_OTLP_ENDPOINT
         value: http://$(OTEL_COLLECTOR_NAME):4318
       - name: KAFKA_HEAP_OPTS
-        value: "-Xmx400M -Xms400M"
+        value: "-Xmx512M -Xms512M"
     resources:
       limits:
-        memory: 600Mi
+        memory: 1Gi
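As an alternative to editing the chart fork, the memory limit alone can be raised at install time with a Helm values override. A sketch, assuming the key path follows the components.kafka section of the upstream opentelemetry-demo chart (the filename is hypothetical):

```yaml
# kafka-memory.yaml (hypothetical filename): raises only the kafka memory limit.
# Note: components.kafka.env is a list, and Helm replaces lists rather than
# merging them, so changing KAFKA_HEAP_OPTS this way would require repeating
# the full env list; the heap change may still need the chart edit above.
components:
  kafka:
    resources:
      limits:
        memory: 1Gi
```

This would be applied with something like helm upgrade --install astronomy-shop <chart> --namespace astronomy-shop --values kafka-memory.yaml, which keeps the fix out of the vendored values.yaml.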