fix: remove bank_id from OTel metric attributes to prevent unbounded memory growth #857

Open
octo-patch wants to merge 1 commit into vectorize-io:main from octo-patch:fix/issue-850-remove-bank-id-from-otel-metric-attributes

@octo-patch
Contributor

Fixes #850

Problem

MetricsCollector.record_operation() includes bank_id as an attribute on OpenTelemetry histogram and counter recordings. Because bank_id is a per-user value (e.g., user-123), every unique user creates a time series in the OTel SDK's in-memory aggregation buffers that is never evicted. Memory therefore grows without bound, in proportion to unique_users × operations × budgets × statuses.

After 15 days of normal usage with ~50 users, this produced a physical memory footprint of ~3 GB, made up of millions of never-freed allocations in MALLOC_SMALL.
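
To make the cardinality argument concrete, here is a rough sketch that counts the distinct attribute sets a metrics SDK would have to keep, with and without bank_id. The operation, budget, and status values are hypothetical placeholders, not the project's real label values; only the multiplication is the point.

```python
from itertools import product


def series_count(attribute_values: dict) -> int:
    """Count distinct attribute sets, i.e. the time series an SDK would retain."""
    keys = sorted(attribute_values)
    seen = {
        tuple(zip(keys, combo))
        for combo in product(*(attribute_values[k] for k in keys))
    }
    return len(seen)


# Hypothetical label values (assumptions for illustration only).
base = {
    "operation": ["retain", "recall", "reflect"],
    "budget": ["low", "mid", "high"],
    "status": ["success", "error"],
}

# Adding a per-user attribute multiplies the series count by the user count.
with_bank_id = {**base, "bank_id": [f"user-{i}" for i in range(50)]}

bounded = series_count(base)            # operations * budgets * statuses
unbounded = series_count(with_bank_id)  # additionally * unique users

print(bounded, unbounded)
```

Without bank_id the count is fixed by the label vocabulary; with it, the count keeps rising as new users appear.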

Solution

Remove bank_id from the metric attributes in record_operation(). The bank_id parameter stays in the method signature for API compatibility but is no longer added to the OTel attributes dict. bank_id belongs in tracing spans, which are exported and evicted, not in metrics, which accumulate in-process for the lifetime of the SDK.
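
The shape of the change might look roughly like the sketch below. It is not the project's actual code: the record_operation() signature, the attribute names, and the RecordingInstrument class (a stand-in for an OTel histogram whose record(amount, attributes) call it mimics) are all assumptions made for illustration.

```python
from typing import Any, Optional


class RecordingInstrument:
    """Hypothetical stand-in for an OTel histogram; it only stores what it is given."""

    def __init__(self) -> None:
        self.calls = []

    def record(self, amount: float, attributes: Optional[dict] = None) -> None:
        self.calls.append((amount, dict(attributes or {})))


class MetricsCollector:
    """Simplified collector; the real class's signature is assumed here."""

    def __init__(self, histogram: Any) -> None:
        self._histogram = histogram

    def record_operation(self, operation, bank_id, duration, budget=None, success=True):
        # bank_id is still accepted for API compatibility, but it is
        # deliberately left out of the metric attributes.
        attributes = {
            "operation": operation,
            "status": "success" if success else "error",
        }
        if budget is not None:
            attributes["budget"] = budget
        self._histogram.record(duration, attributes)


histogram = RecordingInstrument()
collector = MetricsCollector(histogram)
for user in ("user-1", "user-2", "user-3"):
    collector.record_operation("recall", user, 0.1, budget="low")

# Three different users still map to a single attribute set.
distinct = {tuple(sorted(attrs.items())) for _, attrs in histogram.calls}
print(len(distinct))
```

Callers keep passing bank_id unchanged, so no call site needs to be touched.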

Updated the test assertion to verify bank_id is NOT present in the attributes.

Testing

  • Updated test_metrics.py to assert "bank_id" not in attributes
  • Existing tests continue to pass as the fix is minimal (one line removed)
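
An assertion of the kind described above could be written with unittest.mock along these lines. The record_operation function here is a hypothetical stand-in for the method under test, and the attribute names are assumed; the actual test in test_metrics.py may be structured differently.

```python
from unittest.mock import MagicMock


def record_operation(histogram, operation, bank_id, duration, status="success"):
    # Hypothetical stand-in for MetricsCollector.record_operation():
    # bank_id is accepted but never placed in the attributes dict.
    histogram.record(duration, {"operation": operation, "status": status})


def test_bank_id_not_in_attributes():
    histogram = MagicMock()

    record_operation(histogram, "recall", "user-123", 0.25)

    # Inspect the positional arguments of the single record() call.
    histogram.record.assert_called_once()
    value, attributes = histogram.record.call_args.args
    assert value == 0.25
    assert attributes["operation"] == "recall"
    assert "bank_id" not in attributes
```

Checking the recorded attributes directly, rather than only the call count, is what catches a regression that reintroduces the label.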

…memory growth

bank_id is a per-user high-cardinality label that caused unbounded OTel
histogram growth in-process. Each unique bank_id creates a permanent
time series in OTel's in-memory aggregation buffers, leading to ~3 GB
memory growth after 15 days with ~50 users.

Fixes vectorize-io#850

Development

Successfully merging this pull request may close these issues.

Memory leak: bank_id as metric label causes unbounded OTel histogram growth (~3 GB after 15 days)
