fix: use replication/role metric for authoritative node role detection (GTI-608)#23

Merged
ymendez-redis merged 5 commits into main from GTI-608/fix-replica-only-node-role on Apr 24, 2026

Conversation

@ymendez-redis
Collaborator

Problem

memorystore.py determines node roles (Master/Replica) by reading the role label from redis.googleapis.com/commands/calls. After a failover, GCP does not immediately update this label, so both nodes can report as replica. This affects ~93 Standard Tier instances and leaves ~408.5 GB of memory uncounted in downstream calculations.

Root Cause

The role label on commands/calls is metadata — not designed to authoritatively report node roles. A dedicated metric exists: redis.googleapis.com/replication/role (1=primary, 0=replica).

Fix

  • Added replication_role to REDIS_METRICS
  • Added _attach_node_role() function that queries replication/role and overwrites the unreliable role label
  • Called _attach_node_role() after initial data collection in collect_for_product()
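
The overwrite step can be sketched roughly as follows. This is not the code from the PR: the function name `attach_node_role`, the row keys `InstanceId` and `NodeId`, and the shape of `role_values` are assumptions made for illustration. Only `NodeRole` and the 1=primary / 0=replica encoding come from the description above, and the Cloud Monitoring query itself is left out.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping from the replication/role value to the label used in scan output.
ROLE_NAMES = {1: "Master", 0: "Replica"}


def attach_node_role(
    rows: List[dict], role_values: Dict[Tuple[str, str], int]
) -> List[dict]:
    """Overwrite each row's NodeRole using replication/role values.

    role_values maps (instance_id, node_id) to the latest metric value (1 or 0).
    Rows with no matching data point keep their existing NodeRole.
    """
    for row in rows:
        value = role_values.get((row["InstanceId"], row["NodeId"]))
        if value in ROLE_NAMES:
            row["NodeRole"] = ROLE_NAMES[value]
    return rows


# Both nodes were labelled Replica by commands/calls; replication/role says node-0 is primary.
rows = [
    {"InstanceId": "example-instance", "NodeId": "node-0", "NodeRole": "Replica"},
    {"InstanceId": "example-instance", "NodeId": "node-1", "NodeRole": "Replica"},
]
updated = attach_node_role(rows, {("example-instance", "node-0"): 1})
```

Leaving rows untouched when no data point is available keeps the previous behavior as a fallback rather than guessing a role.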

Testing — Reproduced on live GCP instance

Triggered a manual failover on memorystore-redis-instance (Standard HA) in redislabs-sales-pivotal:

| File | Fix | Result |
| --- | --- | --- |
| scan_115638_NO_FIX.csv | ❌ No | Both nodes: Replica |
| scan_115638_WITH_FIX.csv | ✅ Yes | Master node identified |
| 10 consecutive scans with fix | ✅ Yes | All 10: Master present |

Raw GCP metrics after failover confirmed the discrepancy:

commands/calls (unreliable):  node-0=replica, node-1=replica  ← BOTH REPLICA
replication/role (reliable):  node-0=primary                  ← CORRECT
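
A check like the one below could flag affected instances in a scan, i.e. those where no node is labelled Master. It is a minimal sketch: `instances_without_master` and the row keys `InstanceId` and `NodeRole` are hypothetical names, not taken from memorystore.py.

```python
from collections import defaultdict
from typing import Dict, List, Set


def instances_without_master(rows: List[dict]) -> List[str]:
    """Return the sorted IDs of instances whose rows contain no Master node."""
    roles_by_instance: Dict[str, Set[str]] = defaultdict(set)
    for row in rows:
        roles_by_instance[row["InstanceId"]].add(row["NodeRole"])
    return sorted(
        instance
        for instance, roles in roles_by_instance.items()
        if "Master" not in roles
    )


scan = [
    {"InstanceId": "healthy", "NodeRole": "Master"},
    {"InstanceId": "healthy", "NodeRole": "Replica"},
    {"InstanceId": "after-failover", "NodeRole": "Replica"},
    {"InstanceId": "after-failover", "NodeRole": "Replica"},
]
flagged = instances_without_master(scan)
```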

Ref: https://cloud.google.com/memorystore/docs/redis/supported-monitoring-metrics

The commands/calls metric's 'role' label sometimes reports both nodes
as 'replica' for Standard Tier instances. This causes redis2re to
calculate 0 bytes for those clusters and fall back to the 0.1 GB minimum.
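
To illustrate why a missing Master matters downstream, here is a simplified sketch of the fallback described above. redis2re's actual logic is not shown in this PR, so the function `master_memory_gb`, the `UsedMemoryBytes` key, and the 1024^3 bytes-per-GB conversion are all assumptions; only the 0-byte result and the 0.1 GB minimum come from the text.

```python
from typing import List

MIN_GB = 0.1               # minimum size mentioned in the description
BYTES_PER_GB = 1024 ** 3   # assumed conversion factor


def master_memory_gb(rows: List[dict]) -> float:
    """Sum memory over Master rows only, falling back to the minimum size."""
    total_bytes = sum(
        row["UsedMemoryBytes"] for row in rows if row["NodeRole"] == "Master"
    )
    return max(total_bytes / BYTES_PER_GB, MIN_GB)


both_replica = [
    {"NodeRole": "Replica", "UsedMemoryBytes": 4 * BYTES_PER_GB},
    {"NodeRole": "Replica", "UsedMemoryBytes": 4 * BYTES_PER_GB},
]
one_master = [
    {"NodeRole": "Master", "UsedMemoryBytes": 4 * BYTES_PER_GB},
    {"NodeRole": "Replica", "UsedMemoryBytes": 4 * BYTES_PER_GB},
]
mislabelled_gb = master_memory_gb(both_replica)
corrected_gb = master_memory_gb(one_master)
```

Under these assumptions, the mislabelled cluster is sized at the 0.1 GB minimum while the corrected one reports its real 4 GB.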

Query redis.googleapis.com/replication/role (1=primary, 0=replica) after
initial data collection to overwrite NodeRole with the authoritative value.

Ref: https://cloud.google.com/memorystore/docs/redis/supported-monitoring-metrics
Fixes: GTI-608
Comment thread msstats.py
)

-(options, _) = parser.parse_args()
+options, _ = parser.parse_args()
Collaborator Author

Reformatted this line because the checks were not passing.

@ymendez-redis ymendez-redis merged commit 76aa806 into main Apr 24, 2026
6 checks passed