feat: Add cache size to DedupTTLCache and log cache stats #36

Open
t29-cristian wants to merge 3 commits into Checkmk:main from t29-cristian:cache-max-size-improvements

@t29-cristian
Contributor

Add cache monitoring and alerting for container metrics

Added cache size monitoring to the cluster collector. It detects and reports when the container metrics cache approaches capacity, so that metric data loss no longer goes unnoticed.

Changes:

  • Added size() and utilization() methods to DedupTTLCache to report current cache usage
  • Enhanced update_container_metrics endpoint with tiered logging:
    • DEBUG: Normal operations (< 80% full)
    • ERROR: High utilization (80% to below 95% full)
    • CRITICAL: Cache nearly full alert (≥ 95% full)
  • All log messages include cache_size, maxsize, and utilization percentage
  • Alerts provide actionable guidance to increase --cache-maxsize
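
For readers who do not have the diff open, the size reporting described in the first bullet can be sketched as follows. This is a minimal illustration, not the PR's code: `SizedTTLCache`, its constructor arguments, and `put()` are hypothetical stand-ins for `DedupTTLCache`, and only the `size()` and `utilization()` names come from the description above.

```python
import time
from collections import OrderedDict


class SizedTTLCache:
    """Hypothetical stand-in for DedupTTLCache; only the size reporting matters here."""

    def __init__(self, maxsize, ttl, timer=time.monotonic):
        if maxsize <= 0:
            raise ValueError("maxsize must be positive")
        self.maxsize = maxsize
        self.ttl = ttl
        self._timer = timer
        self._store = OrderedDict()  # key -> (expiry deadline, value)

    def _expire(self):
        # Drop every entry whose deadline has passed.
        now = self._timer()
        expired = [key for key, (deadline, _) in self._store.items() if deadline <= now]
        for key in expired:
            del self._store[key]

    def put(self, key, value):
        self._expire()
        if key in self._store:
            del self._store[key]  # re-insert so the entry counts as fresh
        elif len(self._store) >= self.maxsize:
            self._store.popitem(last=False)  # cache full: evict the oldest entry
        self._store[key] = (self._timer() + self.ttl, value)

    def size(self):
        """Number of live (non-expired) entries."""
        self._expire()
        return len(self._store)

    def utilization(self):
        """Live entries as a percentage of maxsize."""
        return self.size() * 100.0 / self.maxsize
```

Expiring entries before counting keeps stale entries out of the reported utilization; whether the real cache does the same is an assumption of this sketch.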

This makes cache exhaustion visible at default log levels (previously silent) and helps operators identify when cache capacity needs to be increased.
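
The tiered logging can be sketched roughly like this. The thresholds and the DEBUG message format follow the description and the sample log output in this PR; the function names, the logger name, and the wording of the guidance suffix are assumptions made for illustration.

```python
import logging

LOGGER = logging.getLogger("cache_stats_example")  # hypothetical logger name

HIGH_UTILIZATION = 80.0      # percent; ERROR at or above this value
CRITICAL_UTILIZATION = 95.0  # percent; CRITICAL at or above this value


def level_for_utilization(utilization):
    """Map a utilization percentage to the log level of the matching tier."""
    if utilization >= CRITICAL_UTILIZATION:
        return logging.CRITICAL
    if utilization >= HIGH_UTILIZATION:
        return logging.ERROR
    return logging.DEBUG


def log_cache_stats(received, cache_size, maxsize):
    """Log one update at the level matching the current utilization; return that level."""
    utilization = cache_size * 100.0 / maxsize
    level = level_for_utilization(utilization)
    message = "Container metrics updated: received=%d, cache_size=%d/%d (%.1f%% full)"
    if level >= logging.ERROR:
        # Assumed wording; the PR only says alerts point at --cache-maxsize.
        message += "; consider increasing --cache-maxsize"
    LOGGER.log(level, message, received, cache_size, maxsize, utilization)
    return level
```

Because the two upper tiers use ERROR and CRITICAL, they pass Python's default WARNING threshold, while the routine per-update line stays at DEBUG.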

@github-actions

github-actions bot commented Jan 16, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@t29-cristian
Contributor Author

I have read the CLA Document and I hereby sign the CLA or my organization already has a signed CLA.

@t29-cristian
Contributor Author

I would argue that the best default is to have no cache limit at all: in Kubernetes, memory limits already handle a pod that uses far more memory than it needs. Since I am not entirely sure what the developers intended, this PR only adds what I need to right-size the cache for our environments and to tell me when values are at risk of going temporarily missing.

- needed so that we can have these log lines exposed externally
@t29-cristian
Contributor Author

t29-cristian commented Jan 16, 2026

With the logger set to DEBUG, we now see:
2026-01-16 14:57:38,669 - checkmk_kube_agent.api - DEBUG - Container metrics updated: received=1926, cache_size=18404/100000 (18.4% full)
2026-01-16 14:57:38,732 - checkmk_kube_agent.api - DEBUG - Container metrics updated: received=890, cache_size=18404/100000 (18.4% full)
[...]

The best approach would be to expose a metric that we can scrape in the cluster so that we can alert on it, but at least this way we have something to go on.
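
As a rough sketch of what such a metric could look like, the function below renders the two cache values as gauges in the Prometheus text exposition format. It assumes a Prometheus-style scraper, which this thread does not specify, and the metric names and the function itself are hypothetical; nothing like this exists in the PR.

```python
def render_cache_metrics(cache_size, maxsize):
    """Render cache statistics as gauges in the Prometheus text exposition format."""
    gauges = [
        # (metric name, help text, value); the names are illustrative assumptions
        ("cluster_collector_cache_entries",
         "Entries currently held in the container metrics cache.",
         cache_size),
        ("cluster_collector_cache_max_entries",
         "Configured cache capacity (--cache-maxsize).",
         maxsize),
    ]
    lines = []
    for name, help_text, value in gauges:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"
```

Exposing the raw entry count and the capacity separately leaves the utilization threshold to the alerting rule, so it can be tuned without redeploying the collector.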
