Architecture Review: K8s Troubleshooting with Nemoclaw + Nemotron-3-120B #966

uzunenes · 2026-03-26T08:40:30Z

Hi everyone, looking for best practice advice on this architecture:

The Stack: nvidia/nemotron-3-super-120b-a12b (2x H200) + Nemoclaw on a separate VM (same DC, HTTP endpoint).

The Plan: Give Nemoclaw K8s API access (via token) to a test namespace. Trigger it via CronJob or Telegram to scan for issues and send SMTP alerts.

Important context: I already use Grafana alerts for deterministic problems. This LLM setup is strictly for rapid detection of complex, non-deterministic edge cases.

My Questions:

1. Is querying the K8s API directly the best practice for Nemoclaw in this scenario?
2. Alternatively, should I ship all namespace logs/events to my Grafana stack first and have Nemoclaw analyze them from there instead of direct K8s access?
3. Any quick tips on filtering the data to avoid blowing up the context window?

Thanks!

BenediktSchackenberg · 2026-03-26T12:58:19Z

Interesting setup. I've been running OpenClaw in a similar split-infra pattern (not K8s specifically, but agent + external APIs on separate hosts). A few thoughts:

1. Direct K8s API vs Grafana

Direct K8s API is better for your use case. The agent can query exactly what it needs (kubectl get events --field-selector reason=BackOff, pod logs, describe output) instead of parsing pre-aggregated dashboards. Grafana is great for time-series, but LLMs work better with structured, point-in-time snapshots.

That said, don't give it cluster-admin. Create a ServiceAccount with a Role scoped to your test namespace: get/list/watch on pods, events, deployments, replicasets, and get on pods/log (logs are a subresource of pods, not a resource of their own). That's enough for troubleshooting, and read-only access keeps the blast radius small.
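
A minimal sketch of that Role and its binding, written as Python that emits JSON manifests (kubectl apply -f accepts JSON as well as YAML). The names nemoclaw-readonly, nemoclaw-agent and llm-test are placeholders I made up for illustration, not anything from your setup:

```python
import json


def troubleshooting_role(namespace: str, name: str = "nemoclaw-readonly") -> dict:
    """Build a namespaced, read-only Role manifest (the name is a placeholder)."""
    read = ["get", "list", "watch"]
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": name, "namespace": namespace},
        "rules": [
            # Pods live in the core API group, written as "".
            {"apiGroups": [""], "resources": ["pods"], "verbs": read},
            # Events exist in the core group and in events.k8s.io.
            {"apiGroups": ["", "events.k8s.io"], "resources": ["events"], "verbs": read},
            # Logs are the pods/log subresource; "get" is sufficient.
            {"apiGroups": [""], "resources": ["pods/log"], "verbs": ["get"]},
            # Deployments and ReplicaSets live in the apps group.
            {"apiGroups": ["apps"], "resources": ["deployments", "replicasets"], "verbs": read},
        ],
    }


def role_binding(namespace: str, role_name: str, service_account: str) -> dict:
    """Bind the Role to a ServiceAccount in the same namespace."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{role_name}-binding", "namespace": namespace},
        "subjects": [
            {"kind": "ServiceAccount", "name": service_account, "namespace": namespace}
        ],
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "Role",
            "name": role_name,
        },
    }


if __name__ == "__main__":
    ns = "llm-test"  # placeholder namespace
    role = troubleshooting_role(ns)
    binding = role_binding(ns, role["metadata"]["name"], "nemoclaw-agent")
    for manifest in (role, binding):
        print(json.dumps(manifest, indent=2))
```

Because both objects are namespaced (Role, not ClusterRole), the token cannot read anything outside that namespace, and there are no write verbs at all.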

2. Context window management

This is the real challenge. A busy namespace can produce megabytes of events and logs per hour. What works:

Pre-filter in the query, not in the prompt. Use --since=15m on logs, --field-selector on events. Don't dump everything and ask the LLM to find the problem.
Two-pass approach: first pass queries high-level state (events, pod status, restart counts), second pass drills into specific pod logs only if the first pass flags something. Keeps most runs small.
Truncate per-pod logs to last ~200 lines. If the root cause isn't in the last 200 lines of a crashing pod, it's usually in events or describe output instead.
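
Roughly what the two-pass idea could look like in Python. This is a sketch, not something I run in production: the function names and the restart threshold of 3 are my own assumptions, and it assumes kubectl is on the PATH and already authenticated with the scoped token.

```python
import json
import subprocess


def flag_pods(pod_list: dict, max_restarts: int = 3) -> list:
    """First pass: names of pods that look unhealthy.

    Expects the parsed output of `kubectl get pods -o json`.
    The restart threshold is an arbitrary illustrative default.
    """
    flagged = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        restarts = sum(
            c.get("restartCount", 0) for c in status.get("containerStatuses", [])
        )
        unhealthy_phase = status.get("phase") not in ("Running", "Succeeded")
        if unhealthy_phase or restarts > max_restarts:
            flagged.append(pod["metadata"]["name"])
    return flagged


def tail_lines(text: str, limit: int = 200) -> str:
    """Keep only the last `limit` lines so one noisy pod cannot fill the prompt."""
    return "\n".join(text.splitlines()[-limit:])


def fetch_logs(namespace: str, pod: str, since: str = "15m", limit: int = 200) -> str:
    """Second pass: pull recent logs for one flagged pod, pre-filtered by kubectl."""
    result = subprocess.run(
        ["kubectl", "logs", pod, "-n", namespace,
         f"--since={since}", f"--tail={limit}", "--all-containers=true"],
        capture_output=True, text=True, check=False,
    )
    # --tail applies per container, so cap the combined output again.
    return tail_lines(result.stdout, limit)


def collect_context(namespace: str) -> dict:
    """Run both passes and return {pod_name: truncated_logs} for flagged pods only."""
    pods = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    return {name: fetch_logs(namespace, name) for name in flag_pods(json.loads(pods))}
```

On a healthy namespace collect_context returns an empty dict, so the prompt stays tiny and you can skip the model call altogether.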

With Nemotron-3-120B you have a decent context window, but token cost per run adds up fast if you're scanning every 5 minutes.

3. Practical tip for the CronJob trigger

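Rather than running the full analysis on every tick, consider having your CronJob check for recent warning events first (kubectl get events --field-selector type=Warning -o json, then keep only the events whose lastTimestamp falls within the last 5 minutes; kubectl get has no --since flag, that one only exists on kubectl logs). If the count is 0, skip the full analysis. Saves a lot of inference calls on quiet clusters.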
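
One way to implement that gate is to filter the event timestamps client-side. A small sketch, assuming the input is the parsed output of kubectl get events --field-selector type=Warning -o json; the function names are made up for illustration, and the fallback from lastTimestamp to eventTime to creationTimestamp is my assumption about which field will be populated on your cluster:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_k8s_time(value: str) -> datetime:
    """Parse an RFC 3339 timestamp such as 2026-03-26T08:40:30Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def recent_warnings(event_list: dict, now: datetime,
                    window: timedelta = timedelta(minutes=5)) -> list:
    """Return Warning events whose most recent timestamp is within `window` of `now`."""
    recent = []
    for event in event_list.get("items", []):
        if event.get("type") != "Warning":
            continue
        # lastTimestamp can be empty on some events, so fall back to other fields.
        stamp = (
            event.get("lastTimestamp")
            or event.get("eventTime")
            or event.get("metadata", {}).get("creationTimestamp")
        )
        if stamp and now - parse_k8s_time(stamp) <= window:
            recent.append(event)
    return recent


def should_run_analysis(event_list: dict, now: Optional[datetime] = None) -> bool:
    """Gate for the CronJob: only call the LLM if something warned recently."""
    if now is None:
        now = datetime.now(timezone.utc)
    return len(recent_warnings(event_list, now)) > 0
```

The CronJob entrypoint would call should_run_analysis first and exit early when it returns False.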

What namespace complexity are we talking about? (number of pods, typical churn rate) That would help narrow down the filtering strategy.

uzunenes · 2026-03-28T17:57:51Z

Author

The test namespace doesn't exist yet. I am currently creating it from scratch just to test Nemoclaw. My plan is to start small by setting up dummy backend apps, test databases, etc. to validate the workflow and see how the LLM handles the context.

Once I have proven that the concept works, my main goal is to roll it out to real applications. I will probably split my applications into different namespaces and assign different roles across various VMs.

I will definitely apply your filtering and RBAC tips while building the system. I'll make sure to share my test results and findings here once the setup is up and running.

Thanks again.
