mysql-k8s r404: database-relation-broken hook fails with KeyError 'cluster-name' on scale-down #205

@dmvdm

Description

Note: This issue was generated with AI assistance (GitHub Copilot) based on automated log analysis and triage.
Filed by @canonical/solutions-qa


Summary

mysql-k8s charm revision 404 (channel 8.4/edge) fails scale-down operations due to an unhandled KeyError: 'cluster-name' in the database-relation-broken hook handler. This prevents units from being removed and causes integration tests to time out.

Root Cause

The _on_database_broken() handler in src/relations/mysql_provider.py (line 272) calls the charm's _mysql property, which at src/charm.py line 204 accesses self.app_peer_data["cluster-name"] without checking whether the key exists:

# File: src/relations/mysql_provider.py, line 272
def _on_database_broken(self, event: RelationBrokenEvent):
    # ... code ...
    # Accessing self.charm._mysql here invokes the property from src/charm.py:
    if self.charm._mysql.does_mysql_user_exist(self._get_username(relation_id), "%"):
        ...

# File: src/charm.py, line 204
@property
def _mysql(self) -> MySQL:
    return MySQL(
        self.app_peer_data["cluster-name"],  # ← KeyError when the key doesn't exist
        # ... other params ...
    )

Exception Traceback:

File "/var/lib/juju/agents/unit-target-0/charm/src/relations/mysql_provider.py", line 272, in _on_database_broken
    if self.charm._mysql.does_mysql_user_exist(self._get_username(relation_id), "%"):
       ^^^^^^^^^^^^^^^^^
File "/var/lib/juju/agents/unit-target-0/charm/src/charm.py", line 204, in _mysql
    self.app_peer_data["cluster-name"],
    ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
KeyError: 'cluster-name'

Impact

  • Scale-down operations fail - Units cannot be removed
  • Unit enters error state - target/0* error idle hook failed: "database-relation-broken"
  • Juju continuously retries - Hook is retried 6+ times, all fail with same error
  • Integration tests timeout - Tests wait 10 minutes for unit removal, then fail
  • Blocks deployments - Prevents cleanup and redeployment of mysql-k8s charm

Test Failure Details

  • Failed Test: test_scale_in_and_scale_out_charm
  • Execution ID: 443500
  • Test Result ID: 10174540
  • Charm: mysql-k8s
  • Revision: 404
  • Channel: 8.4/edge
  • Failure Rate: 100% (consistent failure on this revision)
  • Error: JujuWaitTimeoutError: Timed out while waiting for unit removal (applications: ['target'], units: ['target/0'])

Evidence from Juju Debug Logs

Hook Execution Failure (repeats 6+ times):

2026-03-31T19:37:51.500Z [container-agent] 2026-03-31 19:37:51 ERROR juju-log database:5: root:Uncaught exception while in charm code:
2026-03-31T19:37:51.500Z [container-agent] Traceback (most recent call last):
2026-03-31T19:37:51.500Z [container-agent]   File "/var/lib/juju/agents/unit-target-0/charm/src/charm.py", line 1108, in <module>
2026-03-31T19:37:51.500Z [container-agent]     main(MySQLOperatorCharm)
2026-03-31T19:37:51.500Z [container-agent]   ...
2026-03-31T19:37:51.500Z [container-agent]   File "/var/lib/juju/agents/unit-target-0/charm/src/relations/mysql_provider.py", line 272, in _on_database_broken
2026-03-31T19:37:51.500Z [container-agent]     if self.charm._mysql.does_mysql_user_exist(self._get_username(relation_id), "%"):
2026-03-31T19:37:51.500Z [container-agent]        ^^^^^^^^^^^^^^^^^
2026-03-31T19:37:51.500Z [container-agent]   File "/var/lib/juju/agents/unit-target-0/charm/src/charm.py", line 204, in _mysql
2026-03-31T19:37:51.500Z [container-agent]     self.app_peer_data["cluster-name"],
2026-03-31T19:37:51.500Z [container-agent]     ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
2026-03-31T19:37:51.500Z [container-agent] KeyError: 'cluster-name'
2026-03-31T19:37:51.974Z [container-agent] 2026-03-31 19:37:51 ERROR juju.worker.uniter.operation runhook.go:180 hook "database-relation-broken" (via hook dispatching script: dispatch) failed: exit status 1

Hook Retry Timeline:

19:37:51 - Attempt 1: KeyError 'cluster-name'
19:37:58 - Attempt 2: KeyError 'cluster-name'
19:38:11 - Attempt 3: KeyError 'cluster-name'
19:38:34 - Attempt 4: KeyError 'cluster-name'
19:39:18 - Attempt 5: KeyError 'cluster-name'
19:40:46 - Attempt 6: KeyError 'cluster-name'
... (continues)
19:47:31 - Test timeout
19:48:02 - Unit status shows error state

Unit Status at Failure:

target/0:
  workload-status:
    current: error
    message: 'hook failed: "database-relation-broken" for neighbor:database'
    since: 31 Mar 2026 19:48:02Z
  juju-status:
    current: idle
    since: 31 Mar 2026 19:48:02Z

Regression Analysis

  • Revision 404 (current): ✗ FAILS (100% failure rate on scale-down)
  • Previous revisions: Likely PASS (needs verification)
  • Conclusion: Bug most likely introduced in revision 404 (pending verification against earlier revisions)

Probable Cause

During scale-down of a mysql-k8s cluster:

  1. The relation-broken hook is triggered when removing the relation to the remote application (e.g., wordpress-k8s)
  2. The hook tries to access the MySQL object to check if the remote user exists
  3. The _mysql property attempts to read self.app_peer_data["cluster-name"]
  4. At this point in the scale-down lifecycle, the cluster-name key may no longer be present in peer data (a minimal sketch of this sequence follows the list)
  5. An unhandled KeyError is raised, causing hook failure
  6. Juju marks the unit as in error state and retries (up to 10 times)
  7. Unit cannot transition to removed state, causing test timeout
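
The sequence above can be condensed into a minimal, self-contained sketch. The class and function names below are hypothetical stand-ins (the real charm stores this data in a peer relation application databag, not a plain dict), but they show why indexing the databag from a property turns a cleared "cluster-name" entry into a failed hook:

# Hypothetical stand-ins for the failing pattern; not the charm's real classes.
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class CharmStandIn:
    def __init__(self, app_peer_data: dict):
        # Stand-in for the peer relation application databag.
        self.app_peer_data = app_peer_data

    @property
    def _mysql(self) -> str:
        # Direct indexing, mirroring src/charm.py line 204.
        return self.app_peer_data["cluster-name"]


def on_database_broken(charm: CharmStandIn) -> None:
    # Mirrors the handler's first use of the property in _on_database_broken.
    cluster = charm._mysql
    logger.info("would clean up relation users on cluster %s", cluster)


on_database_broken(CharmStandIn({"cluster-name": "cluster-1"}))  # normal case: works
try:
    on_database_broken(CharmStandIn({}))  # scale-down case: databag already cleared
except KeyError as err:
    # This is the uncaught exception that fails the real hook.
    logger.error("hook would fail: KeyError %s", err)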

Recommended Fix

The charm should use defensive dictionary access in the _mysql property or the _on_database_broken hook:

Option 1 (Recommended - in _mysql property):

@property
def _mysql(self) -> MySQL:
    cluster_name = self.app_peer_data.get("cluster-name")
    if not cluster_name:
        # Handle gracefully during scale-down when cluster-name may not be available
        raise RuntimeError("Cluster name not available - cluster may be scaling down")
    return MySQL(
        cluster_name,
        # ... other params ...
    )
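
Note that Option 1 by itself only swaps the KeyError for a clearer RuntimeError; any relation-broken handler that can run during scale-down would still need to treat that error as "skip cleanup" rather than letting it fail the hook. A possible (hypothetical) guard in the handler, assuming the property raises as sketched above:

def _on_database_broken(self, event: RelationBrokenEvent):
    relation_id = event.relation.id
    try:
        mysql = self.charm._mysql
    except RuntimeError:
        # Raised by the Option 1 property when cluster-name is missing,
        # e.g. during scale-down; skip user cleanup instead of failing the hook.
        logger.warning("Skipping user removal: cluster metadata unavailable")
        return
    if mysql.does_mysql_user_exist(self._get_username(relation_id), "%"):
        ...  # rest of the handler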

Option 2 (Guard in hook handler):

def _on_database_broken(self, event: RelationBrokenEvent):
    relation_id = event.relation.id
    # Check if we can access cluster before attempting to remove users
    if "cluster-name" not in self.app_peer_data:
        # Cluster name unavailable, likely during scale-down - skip user cleanup
        logger.warning("Skipping user removal during scale-down: cluster-name not available")
        return
    
    if self.charm._mysql.does_mysql_user_exist(self._get_username(relation_id), "%"):
        ...  # rest of the handler

Option 3 (Conditional property access):

def _on_database_broken(self, event: RelationBrokenEvent):
    relation_id = event.relation.id
    try:
        if self.charm._mysql.does_mysql_user_exist(self._get_username(relation_id), "%"):
            ...  # remove the user
    except KeyError as e:
        # During scale-down, peer data may be unavailable
        logger.warning(f"Skipping user cleanup during relation break: {e}")
        return

The root issue is that the code assumes the cluster-name key always exists, when it may not yet have been set or may already have been cleared during scale-down.
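
Whichever option lands, a regression test should cover the missing-key path. A minimal sketch of what such a test could assert, written against a hypothetical standalone helper that mirrors the defensive pattern above (not the charm's real API):

# Hypothetical helper and tests; the real fix would exercise the charm classes instead.
import logging

logger = logging.getLogger(__name__)


def cleanup_relation_user(app_peer_data: dict, username: str) -> bool:
    """Drop the relation user; return False (skip) when cluster-name is missing."""
    cluster_name = app_peer_data.get("cluster-name")
    if not cluster_name:
        logger.warning("cluster-name missing from peer data; skipping cleanup of %s", username)
        return False
    # ... connect to `cluster_name` and drop `username` here ...
    return True


def test_cleanup_skipped_when_cluster_name_missing():
    # The scale-down case from this issue: peer databag already cleared.
    assert cleanup_relation_user({}, "relation-5_user") is False


def test_cleanup_runs_when_cluster_name_present():
    assert cleanup_relation_user({"cluster-name": "cluster-1"}, "relation-5_user") is True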

Test Observer Link

View the failure with complete juju logs:
https://test-observer.canonical.com/#/charms/406079?testExecutionId=443500&testResultId=10174540

Related Issues

This issue follows the same pattern as Issue #202 (mysql-k8s r400: logging-relation-broken hook fails with KeyError 'logs_synced' on scale-down), which reports a similar unhandled KeyError in a different relation-broken hook. The fix pattern is identical - use defensive dictionary access instead of direct key access.

Related Files

  • Source: src/relations/mysql_provider.py (line 272)
  • Source: src/charm.py (line 204)
  • Test: charm-integration-testing/test_scale_in_and_scale_out_charm
  • Charm: canonical/mysql-operators (mysql-k8s package)

Metadata

Labels

bug (Something isn't working as expected)
