Memory leak: bank_id as metric label causes unbounded OTel histogram growth (~3 GB after 15 days) #850
Description
Bug Description
MetricsCollector.record_operation() in hindsight_api/metrics.py (line ~333) includes bank_id as an attribute in OpenTelemetry histogram and counter recordings. Since bank_id is a per-user value (e.g., user-123), every unique user creates a permanent, never-evicted time series in the OTel SDK's in-memory aggregation buffers.
Over time this causes unbounded memory growth proportional to unique_users × operations × budgets × statuses.
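The product above can be made concrete with a small calculation. This is a minimal sketch: the attribute names mirror the ones mentioned in this report, but the cardinalities are hypothetical values chosen only for illustration, not measurements from the affected instance.

```python
from math import prod


def estimate_series(cardinalities: dict[str, int]) -> int:
    """Upper bound on distinct attribute sets for one instrument:
    the product of the per-attribute cardinalities."""
    return prod(cardinalities.values())


# Hypothetical cardinalities, for illustration only.
with_bank_id = {"bank_id": 50, "operation": 3, "budget": 3, "status": 2}
without_bank_id = {k: v for k, v in with_bank_id.items() if k != "bank_id"}

series_with = estimate_series(with_bank_id)        # 50 * 3 * 3 * 2
series_without = estimate_series(without_bank_id)  # 3 * 3 * 2

print(series_with, series_without)
```

Every factor except bank_id is fixed by the application, so removing it turns a bound that grows with the user base into a constant.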
Steps to Reproduce
- Run hindsight-api with default configuration (metrics enabled)
- Issue recall/retain/reflect requests for multiple distinct bank_id values
- Observe memory growth via vmmap --summary <pid> (macOS) or /proc/<pid>/smaps (Linux)
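For the Linux path in the last step, the per-mapping fields in /proc/<pid>/smaps can be summed to track growth over time. The following is a rough sketch: the helper names are hypothetical, the sample text is made up for the example, and it assumes the usual `Name:   <n> kB` field layout. Watching Swap alongside Rss is useful because swapped-out pages are not counted in Rss.

```python
import re
from collections import defaultdict
from pathlib import Path

# Matches smaps field lines such as "Rss:                 460 kB".
_FIELD = re.compile(r"^(\w+):\s+(\d+) kB$")


def sum_smaps_fields(smaps_text: str) -> dict[str, int]:
    """Sum every kB-valued field across all mappings in smaps output."""
    totals: dict[str, int] = defaultdict(int)
    for line in smaps_text.splitlines():
        match = _FIELD.match(line.strip())
        if match:
            totals[match.group(1)] += int(match.group(2))
    return dict(totals)


def read_process_totals(pid: int) -> dict[str, int]:
    """Read and sum smaps for a live process (Linux only)."""
    return sum_smaps_fields(Path(f"/proc/{pid}/smaps").read_text())


# Made-up sample with two mappings, for illustration only.
SAMPLE = """\
00400000-0048a000 r-xp 00000000 fd:03 960637 /usr/bin/example
Size:                552 kB
Rss:                 460 kB
Private_Dirty:         0 kB
Swap:                  0 kB
7f0000000000-7f0000021000 rw-p 00000000 00:00 0
Size:                132 kB
Rss:                  40 kB
Private_Dirty:        40 kB
Swap:                 16 kB
"""

totals = sum_smaps_fields(SAMPLE)
print(totals["Rss"], totals["Private_Dirty"], totals["Swap"])
```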
After 15 days of normal usage on a dev instance with ~50 users, we observed:
- Physical footprint: 3.1 GB (peak 3.2 GB)
- MALLOC_SMALL: 1.7 GB virtual, 1.6 GB dirty, 15.2 million allocations
- RSS reported by ps was only ~15 MB because macOS compressed the allocations, so PM2's RSS-based max_memory_restart never triggered
Root cause
# hindsight_api/metrics.py, MetricsCollector.record_operation()
attributes = {
    "operation": operation,
    "bank_id": bank_id,  # <-- HIGH CARDINALITY: one series per user
    "source": source,
    "tenant": _get_tenant(),
}
The OTel SDK's ExplicitBucketHistogramAggregation stores a full bucket array per unique attribute set. With default 16 histogram buckets, each unique {operation, bank_id, source, tenant, budget, max_tokens, success} tuple allocates ~400 bytes that are never freed. The combinatorial explosion creates millions of allocations.
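The growth pattern can be illustrated with a toy model of per-attribute-set storage. This is not the OTel SDK's code: `ToyHistogram` and `simulate` are hypothetical names, the "api" source value is made up, and the model only mimics the idea that each distinct attribute set gets its own bucket array that is kept for the life of the process. The boundaries are the default explicit bucket boundaries from the OTel specification (15 boundaries, hence 16 buckets).

```python
from bisect import bisect_left

# Default explicit bucket boundaries: 15 boundaries -> 16 buckets.
BOUNDARIES = (0, 5, 10, 25, 50, 75, 100, 250, 500, 750,
              1000, 2500, 5000, 7500, 10000)


class ToyHistogram:
    """Toy stand-in for a histogram instrument: one bucket array
    per distinct attribute set, never evicted."""

    def __init__(self) -> None:
        self._series: dict[frozenset, list[int]] = {}

    def record(self, value: float, attributes: dict[str, str]) -> None:
        key = frozenset(attributes.items())
        buckets = self._series.get(key)
        if buckets is None:
            # First time this attribute set is seen: allocate a new series.
            buckets = self._series[key] = [0] * (len(BOUNDARIES) + 1)
        # Upper-inclusive buckets: index of the first boundary >= value.
        buckets[bisect_left(BOUNDARIES, value)] += 1

    @property
    def series_count(self) -> int:
        return len(self._series)


def simulate(users: int, include_bank_id: bool) -> int:
    """Record one sample per (user, operation) and return the series count."""
    hist = ToyHistogram()
    for u in range(users):
        for operation in ("recall", "retain", "reflect"):
            attrs = {"operation": operation, "source": "api"}
            if include_bank_id:
                attrs["bank_id"] = f"user-{u}"
            hist.record(12.5, attrs)
    return hist.series_count


print(simulate(1000, include_bank_id=True))   # grows with the user count
print(simulate(1000, include_bank_id=False))  # stays at the operation count
```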
Suggested fix
Remove bank_id from metric attributes. It belongs in tracing spans (which are exported and then evicted), not in metrics (which accumulate in-process for the lifetime of the SDK).
attributes = {
    "operation": operation,
    # bank_id removed: high-cardinality label causes unbounded memory growth
    "source": source,
    "tenant": _get_tenant(),
}
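To guard against a high-cardinality key creeping back in later, one option is to filter attributes through an explicit allowlist before recording, and send everything else to the span. The sketch below is illustrative only: `METRIC_ATTRIBUTE_ALLOWLIST` and `split_attributes` are hypothetical names, not part of hindsight-api or the OTel SDK.

```python
# Hypothetical allowlist of low-cardinality keys that may reach metrics.
METRIC_ATTRIBUTE_ALLOWLIST = frozenset({"operation", "source", "tenant"})


def split_attributes(
    attributes: dict[str, str],
) -> tuple[dict[str, str], dict[str, str]]:
    """Split attributes into (metric_attrs, span_attrs).

    Allowlisted keys go to metrics; everything else is kept for the span.
    """
    metric_attrs = {
        k: v for k, v in attributes.items() if k in METRIC_ATTRIBUTE_ALLOWLIST
    }
    span_attrs = {
        k: v for k, v in attributes.items() if k not in METRIC_ATTRIBUTE_ALLOWLIST
    }
    return metric_attrs, span_attrs


metric_attrs, span_attrs = split_attributes(
    {"operation": "recall", "bank_id": "user-123", "source": "api", "tenant": "t1"}
)
print(metric_attrs)
print(span_attrs)
```

Here metric_attrs would be passed to the histogram and counter recordings, while span_attrs would be attached to the current tracing span.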
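If per-bank observability is needed in metrics, consider a bounded approach such as hashing bank_id into a small, fixed number of buckets (for example, bank_id_bucket: str(hash(bank_id) % 64)). Note that Python's built-in hash() is salted per process for str values unless PYTHONHASHSEED is set, so a stable digest would be needed to keep bucket assignments consistent across restarts.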
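One way to make such buckets stable across process restarts is to derive them from a cryptographic digest instead of the built-in hash(). This is a minimal sketch: the function name `bank_id_bucket` and the default of 64 buckets follow the suggestion in this report, while the choice of SHA-256 is an assumption.

```python
import hashlib


def bank_id_bucket(bank_id: str, num_buckets: int = 64) -> str:
    """Map a bank_id to one of num_buckets stable bucket labels."""
    digest = hashlib.sha256(bank_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce modulo num_buckets.
    return str(int.from_bytes(digest[:8], "big") % num_buckets)


# The label set is bounded by num_buckets no matter how many users exist.
labels = {bank_id_bucket(f"user-{i}") for i in range(10_000)}
print(len(labels))
```

The metric attribute would then be something like "bank_id_bucket": bank_id_bucket(bank_id), which caps that dimension's cardinality at 64 at the cost of mixing several banks into each bucket.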
Environment
- hindsight-api-slim v0.4.18
- Python 3.13.3
- macOS 26.1 (ARM64)
- OpenTelemetry SDK (via opentelemetry-sdk)
Workaround
Patch hindsight_api/metrics.py locally to remove bank_id from the attributes dict in record_operation().
Expected Behavior
hindsight-api should maintain stable memory usage over time when serving a fixed number of users. The OTel metrics subsystem should use bounded, low-cardinality labels so that memory consumption is proportional to the number of distinct metric dimensions (operation types, sources, statuses) rather than the number of distinct users.
Actual Behavior
MetricsCollector.record_operation() includes bank_id (a per-user value like user-123) as an OpenTelemetry metric attribute. Every unique user creates never-evicted time series in the OTel SDK's in-memory histogram aggregation buffers. Memory grows linearly with the number of distinct users seen over the process lifetime.
Version
0.4.18
LLM Provider
None