Kafka OOM in OpenTelemetry demo - memory configuration insufficient #59

@iFurySt

Describe the bug
The Kafka pod is OOMKilled while the astronomy-shop application is initializing, so the entire deployment fails with a timeout after 300 seconds.

To Reproduce
Steps to reproduce the behavior:

  1. poetry run python clients/openrouter.py
  2. While the problem is being initialized, the kafka pod is OOMKilled:
kubectl get pods --all-namespaces | grep -v Running
NAMESPACE            NAME                                                       READY   STATUS             RESTARTS       AGE
astronomy-shop       accounting-567f87bbcd-6qg59                                0/1     Init:0/1           0              7m21s
astronomy-shop       ad-856c77f595-v875m                                        0/1     CrashLoopBackOff   6 (72s ago)    7m19s
astronomy-shop       checkout-5bc54f8cd8-h8j4z                                  0/1     Init:0/1           0              7m21s
astronomy-shop       fraud-detection-65868bcdb5-jgl9w                           0/1     Init:0/1           0              7m19s
astronomy-shop       kafka-595d889c56-l2ks9                                     0/1     OOMKilled          6 (3m7s ago)   7m22s
default              wrk2-job-tqj7r                                             0/1     Completed          0              23m
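
To confirm the same failure without reading the pod list by eye, here is a minimal sketch that scans the JSON from `kubectl get pods -n astronomy-shop -o json` for containers whose current or last termination reason is OOMKilled. The function name and the sample data are hypothetical (the container name `kafka` in particular is an assumption); only the `status.containerStatuses[].state` / `lastState` fields come from the Kubernetes pod status format.

```python
import json


def oom_killed_containers(pod_list: dict) -> list[tuple[str, str]]:
    """Return (pod, container) pairs whose current or last state is OOMKilled."""
    hits = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        for status in pod.get("status", {}).get("containerStatuses", []):
            # A restarting container reports the kill under lastState;
            # one that has not restarted yet reports it under state.
            for key in ("state", "lastState"):
                terminated = status.get(key, {}).get("terminated") or {}
                if terminated.get("reason") == "OOMKilled":
                    hits.append((pod_name, status["name"]))
                    break
    return hits


# Hypothetical, trimmed-down sample shaped like `kubectl get pods -o json` output.
sample = json.loads("""
{
  "items": [
    {
      "metadata": {"name": "kafka-595d889c56-l2ks9"},
      "status": {
        "containerStatuses": [
          {
            "name": "kafka",
            "state": {"waiting": {"reason": "CrashLoopBackOff"}},
            "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}
          }
        ]
      }
    },
    {
      "metadata": {"name": "ad-856c77f595-v875m"},
      "status": {
        "containerStatuses": [
          {
            "name": "ad",
            "state": {"waiting": {"reason": "CrashLoopBackOff"}},
            "lastState": {"terminated": {"reason": "Error", "exitCode": 1}}
          }
        ]
      }
    }
  ]
}
""")

oom_hits = oom_killed_containers(sample)
```

With real data, replace `sample` with the parsed output of the kubectl command. A check like this could also run before the 300-second wait gives up, so the error names the pod that was killed instead of reporting only a generic timeout.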

The deployment then times out after 300 seconds:

- All services are available via the Frontend proxy: http://localhost:8080
  by running these commands:
     kubectl --namespace astronomy-shop port-forward svc/frontend-proxy 8080:8080

  The following services are available at these paths after the frontend-proxy service is exposed with port forwarding:
  Webstore             http://localhost:8080/
  Jaeger UI            http://localhost:8080/jaeger/ui/
  Grafana              http://localhost:8080/grafana/
  Load Generator UI    http://localhost:8080/loadgen/
  Feature Flags UI     http://localhost:8080/feature/

[06:19:12] Waiting for all pods in namespace 'astronomy-shop' to be ready...                                    kubectl.py:74
Traceback (most recent call last):
  File "/home/user/leo.li/AIOpsLab/clients/openrouter.py", line 216, in <module>
    problem_desc, instructs, apis = orchestrator.init_problem(pid)
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/leo.li/AIOpsLab/aiopslab/orchestrator/orchestrator.py", line 72, in init_problem
    prob.app.deploy()
  File "/home/user/leo.li/AIOpsLab/aiopslab/service/apps/astronomy_shop.py", line 32, in deploy
    Helm.assert_if_deployed(self.helm_configs["namespace"])
  File "/home/user/leo.li/AIOpsLab/aiopslab/service/helm.py", line 123, in assert_if_deployed
    raise e
  File "/home/user/leo.li/AIOpsLab/aiopslab/service/helm.py", line 121, in assert_if_deployed
    kubectl.wait_for_ready(namespace)
  File "/home/user/leo.li/AIOpsLab/aiopslab/service/kubectl.py", line 100, in wait_for_ready
    raise Exception(f"[red]Timeout: Not all pods in namespace '{namespace}' reached the Ready state within {max_wait} seconds.")
Exception: [red]Timeout: Not all pods in namespace 'astronomy-shop' reached the Ready state within 300 seconds.

Expected behavior
After I manually increased the memory limit for Kafka, the deployment succeeds:

== Helm Install ==
NAME: astronomy-shop
LAST DEPLOYED: Mon Aug 25 06:59:03 2025
NAMESPACE: astronomy-shop
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
=======================================================================================


 ██████╗ ████████╗███████╗██╗         ██████╗ ███████╗███╗   ███╗ ██████╗
██╔═══██╗╚══██╔══╝██╔════╝██║         ██╔══██╗██╔════╝████╗ ████║██╔═══██╗
██║   ██║   ██║   █████╗  ██║         ██║  ██║█████╗  ██╔████╔██║██║   ██║
██║   ██║   ██║   ██╔══╝  ██║         ██║  ██║██╔══╝  ██║╚██╔╝██║██║   ██║
╚██████╔╝   ██║   ███████╗███████╗    ██████╔╝███████╗██║ ╚═╝ ██║╚██████╔╝
 ╚═════╝    ╚═╝   ╚══════╝╚══════╝    ╚═════╝ ╚══════╝╚═╝     ╚═╝ ╚═════╝


- All services are available via the Frontend proxy: http://localhost:8080
  by running these commands:
     kubectl --namespace astronomy-shop port-forward svc/frontend-proxy 8080:8080

  The following services are available at these paths after the frontend-proxy service is exposed with port forwarding:
  Webstore             http://localhost:8080/
  Jaeger UI            http://localhost:8080/jaeger/ui/
  Grafana              http://localhost:8080/grafana/
  Load Generator UI    http://localhost:8080/loadgen/
  Feature Flags UI     http://localhost:8080/feature/

[06:59:05] Waiting for all pods in namespace 'astronomy-shop' to be ready...                                                                                           kubectl.py:83
[06:59:46] All pods in namespace 'astronomy-shop' are ready.                                                                                                          kubectl.py:100
== Fault Injection ==
ConfigMap 'flagd-config' updated in namespace 'astronomy-shop'
Fault injected: Feature flag 'adFailure' set to 'on'.
Fault: adServiceFailure | Namespace: astronomy-shop

== Start Workload ==
Workload skipped since AstronomyShop has a built-in load generator.
===== Agent (OpenRouter - qwen/qwen3-30b-a3b-instruct-2507) ====

Is it appropriate to increase the limit directly in https://github.com/xlab-uiuc/opentelemetry-helm-charts/blob/68d88623f480687a0a50dcd7ca1ff3340bb1ff7a/charts/opentelemetry-demo/values.yaml like this?

diff --git a/charts/opentelemetry-demo/values.yaml b/charts/opentelemetry-demo/values.yaml
index b09ccbd3..77732a02 100644
--- a/charts/opentelemetry-demo/values.yaml
+++ b/charts/opentelemetry-demo/values.yaml
@@ -648,10 +648,12 @@ components:
       - name: OTEL_EXPORTER_OTLP_ENDPOINT
         value: http://$(OTEL_COLLECTOR_NAME):4318
       - name: KAFKA_HEAP_OPTS
-        value: "-Xmx400M -Xms400M"
+        value: "-Xmx512M -Xms512M"
     resources:
       limits:
-        memory: 600Mi
+        memory: 1Gi
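
For context on the numbers in the diff: a JVM uses memory beyond its heap (metaspace, thread stacks, direct buffers), so the gap between `-Xmx` and the container limit is the budget for everything else. The sketch below does that arithmetic for the old and new values. The helper names are hypothetical, and it only handles the unit suffixes that appear in this values.yaml.

```python
import re

# JVM size suffixes (k/m/g) and Kubernetes binary suffixes (Ki/Mi/Gi) are both powers of 1024.
_JVM_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3}
_K8S_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def jvm_max_heap_bytes(heap_opts: str) -> int:
    """Extract the -Xmx value from a KAFKA_HEAP_OPTS-style string."""
    match = re.search(r"-Xmx(\d+)([kKmMgG]?)", heap_opts)
    if match is None:
        raise ValueError(f"no -Xmx flag found in {heap_opts!r}")
    return int(match.group(1)) * _JVM_UNITS[match.group(2).upper()]


def k8s_memory_bytes(quantity: str) -> int:
    """Convert a Kubernetes memory quantity such as '600Mi' or '1Gi' to bytes."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi)", quantity)
    if match is None:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    return int(match.group(1)) * _K8S_UNITS[match.group(2)]


def headroom_mib(heap_opts: str, limit: str) -> float:
    """Memory left for non-heap JVM usage, in MiB."""
    return (k8s_memory_bytes(limit) - jvm_max_heap_bytes(heap_opts)) / 1024**2


before = headroom_mib("-Xmx400M -Xms400M", "600Mi")  # values removed by the diff
after = headroom_mib("-Xmx512M -Xms512M", "1Gi")     # values added by the diff
print(f"non-heap headroom: {before:.0f} MiB -> {after:.0f} MiB")
```

By this calculation the change raises the non-heap headroom from 200 MiB to 512 MiB. Whether 200 MiB was actually the bottleneck is an assumption on my part; I have not profiled the broker's off-heap usage.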
