Memory leak: bank_id as metric label causes unbounded OTel histogram growth (~3 GB after 15 days) #850

@sireika

Bug Description

MetricsCollector.record_operation() in hindsight_api/metrics.py (line ~333) includes bank_id as an attribute in OpenTelemetry histogram and counter recordings. Since bank_id is a per-user value (e.g., user-123), every unique user creates a permanent, never-evicted time series in the OTel SDK's in-memory aggregation buffers.

Over time this causes unbounded memory growth proportional to unique_users × operations × budgets × statuses.

Steps to Reproduce

  1. Run hindsight-api with default configuration (metrics enabled)
  2. Issue recall/retain/reflect requests for multiple distinct bank_id values
  3. Observe memory growth via vmmap --summary <pid> (macOS) or /proc/<pid>/smaps (Linux)
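For step 3 on Linux, the per-mapping figures in /proc/<pid>/smaps can be totalled with a few lines of Python. This is a minimal sketch: the helper names sum_smaps_kb and process_memory_kb are hypothetical and not part of hindsight-api. Summing Swap alongside Rss is suggested here because, as with the macOS compression noted below, memory that has been paged out no longer shows up in RSS.

```python
import re
from pathlib import Path

# Matches smaps value lines such as "Rss:                 100 kB".
_FIELD = re.compile(r"^(\w+):\s+(\d+) kB$")


def sum_smaps_kb(text: str, field: str = "Rss") -> int:
    """Sum one field (in kB) across every mapping in smaps-formatted text."""
    total = 0
    for line in text.splitlines():
        match = _FIELD.match(line.strip())
        if match and match.group(1) == field:
            total += int(match.group(2))
    return total


def process_memory_kb(pid: int, field: str = "Rss") -> int:
    """Read /proc/<pid>/smaps (Linux only) and total the requested field."""
    return sum_smaps_kb(Path(f"/proc/{pid}/smaps").read_text(), field)


# Example (Linux only): resident plus swapped-out memory for a given pid.
#   total_kb = process_memory_kb(pid, "Rss") + process_memory_kb(pid, "Swap")
```

Sampling this value periodically while issuing requests for new bank_id values should show whether the total keeps climbing.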

After 15 days of normal usage on a dev instance with ~50 users, we observed:

  • Physical footprint: 3.1 GB (peak 3.2 GB)
  • MALLOC_SMALL: 1.7 GB virtual, 1.6 GB dirty, 15.2 million allocations
  • RSS reported by ps was only ~15 MB because macOS had compressed the allocations; as a result, PM2's RSS-based max_memory_restart never triggered

Root cause

# hindsight_api/metrics.py, MetricsCollector.record_operation()
  attributes = {
      "operation": operation,
      "bank_id": bank_id,      # <-- HIGH CARDINALITY: one series per user
      "source": source,
      "tenant": _get_tenant(),
  }

The OTel SDK's ExplicitBucketHistogramAggregation stores a full bucket array per unique attribute set. With the default 16 histogram buckets, each unique {operation, bank_id, source, tenant, budget, max_tokens, success} tuple allocates ~400 bytes that are never freed. Because the number of tuples is the product of the per-label cardinalities, the combinatorial explosion creates millions of allocations.
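The growth described above can be approximated with simple arithmetic. The sketch below multiplies per-label cardinalities to get a worst-case series count and applies the ~400 bytes per series quoted in this issue. The cardinalities are hypothetical values chosen only to show the shape of the growth; they are not measurements from the affected instance.

```python
from math import prod

# Figure quoted in this issue: approximate bytes held per unique attribute set.
BYTES_PER_SERIES = 400


def series_count(cardinalities: dict[str, int]) -> int:
    """Worst-case number of series: the product of per-label cardinalities."""
    return prod(cardinalities.values())


def estimated_bytes(cardinalities: dict[str, int], instruments: int = 1) -> int:
    """Rough aggregation-state size across `instruments` instruments."""
    return series_count(cardinalities) * instruments * BYTES_PER_SERIES


# Hypothetical cardinalities (assumptions for illustration only).
with_bank_id = {"operation": 3, "bank_id": 50, "source": 2, "tenant": 1, "success": 2}
without_bank_id = {k: v for k, v in with_bank_id.items() if k != "bank_id"}

print(series_count(with_bank_id))     # 600
print(series_count(without_bank_id))  # 12
```

Every additional label in the tuple multiplies the product, so a second label with many distinct values would compound the growth rather than add to it.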

Suggested fix

Remove bank_id from metric attributes. It belongs in tracing spans (which are exported and evicted), not in metrics (which accumulate in process for the lifetime of the SDK).

  attributes = {
      "operation": operation,
      # bank_id removed — high-cardinality label causes unbounded memory growth
      "source": source,
      "tenant": _get_tenant(),
  }
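One way to keep bank_id available for debugging while dropping it from metrics is to split the attributes at the call site. The following is a sketch only: split_attributes and METRIC_ATTRIBUTE_KEYS are hypothetical names, and the allowlist simply mirrors the keys kept in the snippet above.

```python
from typing import Any, Mapping

# Assumed allowlist: the low-cardinality keys retained in the suggested fix.
METRIC_ATTRIBUTE_KEYS = frozenset({"operation", "source", "tenant"})


def split_attributes(
    attributes: Mapping[str, Any],
    metric_keys: frozenset = METRIC_ATTRIBUTE_KEYS,
) -> tuple[dict, dict]:
    """Return (metric_attrs, span_attrs).

    metric_attrs keeps only allowlisted keys, so the number of series stays
    bounded. span_attrs keeps everything, including bank_id.
    """
    metric_attrs = {k: v for k, v in attributes.items() if k in metric_keys}
    span_attrs = dict(attributes)
    return metric_attrs, span_attrs


metric_attrs, span_attrs = split_attributes(
    {"operation": "recall", "bank_id": "user-123", "source": "api", "tenant": "t1"}
)
# metric_attrs would go to histogram.record(value, attributes=metric_attrs);
# span_attrs would go to span.set_attributes(span_attrs) on the active span.
```

An allowlist is used instead of a blocklist so that a high-cardinality key added later is excluded from metrics by default.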

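If per-bank observability is needed in metrics, consider a bounded approach such as hashing bank_id into a small, fixed number of buckets (for example, bank_id_bucket: str(hash(bank_id) % 64)). Note that Python's built-in hash() is salted per process for str values unless PYTHONHASHSEED is set, so a stable hash such as zlib.crc32 is needed if bucket assignments must stay the same across restarts.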
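A minimal sketch of the bucketing idea, assuming 64 buckets as in the example above. The function name bank_id_bucket is hypothetical. zlib.crc32 is used so that a given bank_id maps to the same bucket in every process.

```python
import zlib


def bank_id_bucket(bank_id: str, buckets: int = 64) -> str:
    """Map a bank_id to one of `buckets` stable label values.

    The label's cardinality is capped at `buckets`, no matter how many
    distinct bank_id values the service sees.
    """
    return str(zlib.crc32(bank_id.encode("utf-8")) % buckets)


# At most 64 distinct label values, even for 10,000 distinct users.
labels = {bank_id_bucket(f"user-{i}") for i in range(10_000)}
print(len(labels) <= 64)  # True
```

The trade-off is that several users share each bucket, so the label can narrow down a problem to a group of banks but cannot identify a single one.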

Environment

  • hindsight-api-slim v0.4.18
  • Python 3.13.3
  • macOS 26.1 (ARM64)
  • OpenTelemetry SDK (via opentelemetry-sdk)

Workaround

Patch hindsight_api/metrics.py locally to remove bank_id from the attributes dict in record_operation().

Expected Behavior

hindsight-api should maintain stable memory usage over time when serving a fixed number of users. The OTel metrics subsystem should use bounded, low-cardinality labels so that memory consumption is proportional to the number of distinct metric dimensions (operation types, sources, statuses) rather than the number of distinct users.

Actual Behavior

MetricsCollector.record_operation() includes bank_id (a per-user value like user-123) as an OpenTelemetry metric attribute. Every unique user creates time series in the OTel SDK's in-memory histogram aggregation buffers that are never evicted. Memory grows linearly with the number of distinct users seen over the process lifetime.

Version

0.4.18

LLM Provider

None

Labels: bug