@liangxin1300 liangxin1300 commented Sep 28, 2025

This PR introduces the sbd option for the 'crm cluster health' command, and adds the class sbd.SBDTimeoutChecker to provide methods for checking and fixing SBD-related timeout values.

Check SBD-related configurations' consistency

# crm sbd configure show sysconfig 
INFO: crm sbd configure show sysconfig
SBD_PACEMAKER=yes
SBD_STARTMODE=always
SBD_DELAY_START=71
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_TIMEOUT_ACTION=flush,reboot
SBD_MOVE_TO_ROOT_CGROUP=auto
SBD_SYNC_RESOURCE_STARTUP=yes
SBD_OPTS=
SBD_DEVICE=/dev/sda5
--- sle16-2
+++ sle16-1
@@ -48 +48 @@
-SBD_DELAY_START=no
+SBD_DELAY_START=71
WARNING: /etc/sysconfig/sbd is not consistent across cluster nodes
WARNING: Please ensure the configurations are consistent across all cluster nodes
# crm cluster health sbd
--- sle16-2
+++ sle16-1
@@ -48 +48 @@
-SBD_DELAY_START=no
+SBD_DELAY_START=71
WARNING: /etc/sysconfig/sbd is not consistent across cluster nodes
WARNING: Please ensure the configurations are consistent across all cluster nodes
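
The consistency output above has the shape of a zero-context unified diff between two nodes' copies of /etc/sysconfig/sbd. A minimal sketch of that kind of comparison follows, using Python's standard `difflib`. The function name `diff_sysconfig` and the sample file contents are hypothetical, and this is not necessarily how crmsh produces its output.

```python
import difflib


def diff_sysconfig(name_a, text_a, name_b, text_b):
    """Return unified-diff lines (no context) between two config texts."""
    return list(difflib.unified_diff(
        text_a.splitlines(),
        text_b.splitlines(),
        fromfile=name_a,
        tofile=name_b,
        n=0,          # no context lines, only the changed ones
        lineterm="",  # inputs carry no trailing newlines
    ))


# Hypothetical, shortened file contents for two nodes.
remote = "SBD_PACEMAKER=yes\nSBD_DELAY_START=no\n"
local = "SBD_PACEMAKER=yes\nSBD_DELAY_START=71\n"

diff = diff_sysconfig("sle16-2", remote, "sle16-1", local)
if diff:
    print("\n".join(diff))
    print("WARNING: /etc/sysconfig/sbd is not consistent across cluster nodes")
```

An empty result means the two copies are identical, so no warning is needed.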

Check and fix SBD disk metadata

# crm sbd configure show disk_metadata 
INFO: crm sbd configure show disk_metadata
==Dumping header on disk /dev/sda5
Header version     : 2.1
UUID               : b55e85dc-8dde-4a32-b238-d45a5f017963
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 15
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 16
==Header on disk /dev/sda5 is dumped

WARNING: It's recommended that msgwait(now 16) >= 2*watchdog timeout(now 15)

# crm cluster health sbd
WARNING: It's recommended that msgwait(now 16) >= 2*watchdog timeout(now 15)
ERROR: SBD: Check sbd timeout configuration: FAIL.
WARNING: Please run "crm cluster health sbd --fix"

# crm cluster health sbd --fix
INFO: Adjusting sbd msgwait to 30
INFO: Configuring disk-based SBD
INFO: Initializing SBD device /dev/sda5
INFO: Restarting cluster service
INFO: BEGIN Waiting for cluster
INFO: END Waiting for cluster
INFO: SBD: Check sbd timeout configuration: OK.
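
The rule behind this check is stated in the warning itself: msgwait should be at least twice the watchdog timeout. A minimal sketch of that rule follows; the function names are hypothetical and are not taken from crmsh.

```python
from typing import Optional


def recommended_msgwait(watchdog_timeout: int) -> int:
    """Smallest msgwait that satisfies msgwait >= 2 * watchdog timeout."""
    return 2 * watchdog_timeout


def check_msgwait(watchdog_timeout: int, msgwait: int) -> Optional[str]:
    """Return a warning message if the rule is violated, otherwise None."""
    if msgwait >= recommended_msgwait(watchdog_timeout):
        return None
    return (f"It's recommended that msgwait(now {msgwait}) >= "
            f"2*watchdog timeout(now {watchdog_timeout})")


# Values from the disk metadata dump above: watchdog 15, msgwait 16.
print(check_msgwait(15, 16))
print(recommended_msgwait(15))
```

With the dumped values the check reports the warning, and the recommended value is 30, which matches the "Adjusting sbd msgwait to 30" line in the fix output.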

Check and fix SBD_DELAY_START

# crm sbd configure show sysconfig 
INFO: crm sbd configure show sysconfig
SBD_PACEMAKER=yes
SBD_STARTMODE=always
SBD_DELAY_START=40
SBD_WATCHDOG_DEV=/dev/watchdog0
SBD_TIMEOUT_ACTION=flush,reboot
SBD_MOVE_TO_ROOT_CGROUP=auto
SBD_SYNC_RESOURCE_STARTUP=yes
SBD_OPTS=
SBD_DEVICE=/dev/sda5
WARNING: It's recommended that SBD_DELAY_START is set to 71, now is 40

# crm cluster health sbd
WARNING: It's recommended that SBD_DELAY_START is set to 71, now is 40
ERROR: SBD: Check sbd timeout configuration: FAIL.
WARNING: Please run "crm cluster health sbd --fix"

# crm cluster health sbd --fix
INFO: Update SBD_DELAY_START in /etc/sysconfig/sbd: 71
INFO: Sync file /etc/sysconfig/sbd to sle16-2
INFO: Already synced /etc/sysconfig/sbd to all nodes
INFO: SBD: Check sbd timeout configuration: OK.

Check and fix SBD systemd start timeout

# crm sbd configure show property 
INFO: crm sbd configure show property
have-watchdog=true
stonith-enabled=true
stonith-timeout=119

INFO: crm configure show related:fence_sbd
primitive stonith-sbd stonith:fence_sbd \
        params pcmk_delay_max=30s

INFO: systemctl show -p TimeoutStartUSec sbd.service --value
TimeoutStartUSec=90
WARNING: It's recommended that systemd start timeout for sbd.service is set to 157s, now is 90s

# crm cluster health sbd
WARNING: It's recommended that systemd start timeout for sbd.service is set to 157s, now is 90s
ERROR: SBD: Check sbd timeout configuration: FAIL.
WARNING: Please run "crm cluster health sbd --fix"

# crm cluster health sbd --fix
INFO: Adjusting systemd start timeout for sbd.service to 157s
INFO: Sync directory /etc/systemd/system/sbd.service.d to sle16-2
INFO: SBD: Check sbd timeout configuration: OK.

Check and fix stonith-timeout

# crm sbd configure show property 
INFO: crm sbd configure show property
have-watchdog=true
stonith-enabled=true

INFO: crm configure show related:fence_sbd
primitive stonith-sbd stonith:fence_sbd \
        params pcmk_delay_max=30s

INFO: systemctl show -p TimeoutStartUSec sbd.service --value
TimeoutStartUSec=157
WARNING: It's recommended that stonith-timeout is set to 119, now is not set

# crm cluster health sbd
WARNING: It's recommended that stonith-timeout is set to 119, now is not set
ERROR: SBD: Check sbd timeout configuration: FAIL.
WARNING: Please run "crm cluster health sbd --fix"

# crm cluster health sbd --fix
INFO: Adjusting stonith-timeout to 119
WARNING: "stonith-timeout" in crm_config is set to 119, it was 60s
INFO: SBD: Check sbd timeout configuration: OK.

Check and fix stonith-watchdog-timeout

# crm sbd configure show property 
INFO: crm sbd configure show property
have-watchdog=true
stonith-enabled=true
stonith-timeout=119
stonith-watchdog-timeout=30

INFO: crm configure show related:fence_sbd
primitive stonith-sbd stonith:fence_sbd \
        params pcmk_delay_max=30s

INFO: systemctl show -p TimeoutStartUSec sbd.service --value
TimeoutStartUSec=157
WARNING: It's recommended that stonith-watchdog-timeout is not set when using disk-based SBD

# crm cluster health sbd
WARNING: It's recommended that stonith-watchdog-timeout is not set when using disk-based SBD
ERROR: SBD: Check sbd timeout configuration: FAIL.
WARNING: Please run "crm cluster health sbd --fix"

# crm cluster health sbd --fix
INFO: Removing stonith-watchdog-timeout property
INFO: Delete cluster property "stonith-watchdog-timeout" in crm_config
INFO: SBD: Check sbd timeout configuration: OK.

# For diskless case
# crm sbd configure show property 
INFO: crm sbd configure show property
have-watchdog=true
stonith-enabled=true
stonith-timeout=71

INFO: systemctl show -p TimeoutStartUSec sbd.service --value
TimeoutStartUSec=90
WARNING: It's recommended that stonith-watchdog-timeout is set to at least 30, now is not set

# crm cluster health sbd
WARNING: It's recommended that stonith-watchdog-timeout is set to at least 30, now is not set
ERROR: SBD: Check sbd timeout configuration: FAIL.
WARNING: Please run "crm cluster health sbd --fix"

# crm cluster health sbd --fix
INFO: Adjusting stonith-watchdog-timeout to 30
WARNING: "stonith-watchdog-timeout" in crm_config is set to 30, it was 0
INFO: SBD: Check sbd timeout configuration: OK.
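
The two transcripts above suggest a rule that depends on the SBD mode: with disk-based SBD the property should be unset, and in the diskless case it should be at least some minimum (30 in this example). The sketch below encodes only that observed behavior. The function name is hypothetical, and `minimum` is passed in as a parameter because the transcripts do not show how crmsh derives it.

```python
from typing import Optional


def check_stonith_watchdog_timeout(
    disk_based: bool, current: Optional[int], minimum: int
) -> Optional[str]:
    """Return a warning message, or None if the setting looks fine.

    current is None when the property is not set.
    minimum is an assumed input; its derivation is not shown in the PR.
    """
    if disk_based:
        if current is not None:
            return ("It's recommended that stonith-watchdog-timeout "
                    "is not set when using disk-based SBD")
        return None
    if current is None or current < minimum:
        now = "not set" if current is None else str(current)
        return ("It's recommended that stonith-watchdog-timeout "
                f"is set to at least {minimum}, now is {now}")
    return None


print(check_stonith_watchdog_timeout(True, 30, 30))    # disk-based, property set
print(check_stonith_watchdog_timeout(False, None, 30)) # diskless, property unset
```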

Other cases

corosync token timeout increase

# crm cluster health sbd
WARNING: It's recommended that SBD_DELAY_START is set to 82, now is 71
ERROR: SBD: Check sbd timeout configuration: FAIL.
WARNING: Please run "crm cluster health sbd --fix"

# crm cluster health sbd --fix
INFO: Update SBD_DELAY_START in /etc/sysconfig/sbd: 82
INFO: Sync file /etc/sysconfig/sbd to sle16-2
INFO: Already synced /etc/sysconfig/sbd to all nodes
INFO: Adjusting systemd start timeout for sbd.service to 98s
INFO: Sync directory /etc/systemd/system/sbd.service.d to sle16-2
INFO: Adjusting stonith-timeout to 82
WARNING: "stonith-timeout" in crm_config is set to 82, it was 71
INFO: SBD: Check sbd timeout configuration: OK.

@liangxin1300 liangxin1300 force-pushed the 20250917_health_timeout branch 9 times, most recently from 1c3416b to f8b0f1b Compare September 29, 2025 11:38
@liangxin1300 liangxin1300 force-pushed the 20250917_health_timeout branch 5 times, most recently from ba3a1bc to fe7983c Compare November 10, 2025 02:45
@liangxin1300 liangxin1300 changed the title 20250917 health timeout Check and fix SBD related timeout values Nov 10, 2025
@liangxin1300 liangxin1300 changed the title Check and fix SBD related timeout values Check and fix SBD-related timeout values Nov 10, 2025
@liangxin1300 liangxin1300 force-pushed the 20250917_health_timeout branch from fe7983c to febb236 Compare November 10, 2025 03:51
case 'sbd':
    fix = parsed_args.fix
    try:
        warn = False if fix else True

Suggested change
warn = False if fix else True
warn = not fix
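
The suggested form is equivalent to the original because both depend only on the truthiness of `fix`. A short sketch to confirm that follows; the helper names are hypothetical.

```python
def warn_from_conditional(fix):
    # Original form from the diff.
    return False if fix else True


def warn_from_not(fix):
    # Suggested form.
    return not fix


# Both forms agree for booleans and for other truthy/falsy values.
for fix in (True, False, None, 0, 1, "", "yes"):
    assert warn_from_conditional(fix) is warn_from_not(fix)
```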

crmsh/sbd.py Outdated
from . import xmlutil
from . import watchdog
from . import parallax
from . import healthcheck

It does not make sense to add a dependency on the healthcheck module just for the FixFailure exception. It is just an ordinary subclass of Exception without any extra features.
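
One possible reading of this comment, sketched under assumptions: since such an exception carries no extra behavior, the module could define and raise its own instead of importing one. The names `SBDFixError` and `apply_fix` below are hypothetical and do not come from the PR.

```python
class SBDFixError(Exception):
    """Hypothetical module-local exception: a plain Exception subclass."""


def apply_fix(succeeded: bool) -> None:
    """Raise the local exception when a (simulated) fix did not succeed."""
    if not succeeded:
        raise SBDFixError("failed to fix sbd timeout configuration")


try:
    apply_fix(False)
except SBDFixError as err:
    print(f"ERROR: {err}")
```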

crmsh/sbd.py Outdated
utils.cluster_run_cmd(f"{test_dir_cmd} && {rm_dir_cmd} && {reload_cmd} || exit 0")
class SBDTimeoutChecker(SBDTimeout):

    def __init__(self, warn=True, fix=False, filter_str: str = "", from_bootstrap=False):

Suggested change
def __init__(self, warn=True, fix=False, filter_str: str = "", from_bootstrap=False):
def __init__(self, warn=True, fix=False, check_category: str = "", from_bootstrap=False):

@liangxin1300 liangxin1300 force-pushed the 20250917_health_timeout branch 2 times, most recently from 0ee1753 to eb66873 Compare November 10, 2025 11:56
@liangxin1300 liangxin1300 force-pushed the 20250917_health_timeout branch from eb66873 to d78e638 Compare November 11, 2025 08:57
codecov bot commented Nov 11, 2025

Codecov Report

❌ Patch coverage is 79.69543% with 40 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.81%. Comparing base (9d10628) to head (c762336).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| crmsh/sbd.py | 85.11% | 25 Missing ⚠️ |
| crmsh/ui_cluster.py | 26.31% | 14 Missing ⚠️ |
| crmsh/ui_sbd.py | 83.33% | 1 Missing ⚠️ |

Additional details and impacted files

| Flag | Coverage Δ |
|---|---|
| integration | 55.01% <15.22%> (-0.18%) ⬇️ |
| unit | 53.04% <77.15%> (+0.12%) ⬆️ |

| Files with missing lines | Coverage Δ |
|---|---|
| crmsh/bootstrap.py | 87.85% <100.00%> (ø) |
| crmsh/utils.py | 67.58% <100.00%> (+0.01%) ⬆️ |
| crmsh/ui_sbd.py | 83.80% <83.33%> (+0.03%) ⬆️ |
| crmsh/ui_cluster.py | 70.78% <26.31%> (-1.62%) ⬇️ |
| crmsh/sbd.py | 86.10% <85.11%> (+2.21%) ⬆️ |

@liangxin1300 liangxin1300 force-pushed the 20250917_health_timeout branch 4 times, most recently from 2736d67 to 8baf58e Compare November 12, 2025 12:34
@liangxin1300 liangxin1300 marked this pull request as ready for review November 12, 2025 12:42
@liangxin1300 liangxin1300 requested a review from zzhou1 November 12, 2025 12:42
liangxin1300 commented Nov 13, 2025

  • Missing: check and fix for crashdump timeout

to check and fix SBD-related timeout values.
The check cases include: SBD-related configurations' consistency across cluster nodes,
SBD disk metadata, SBD_WATCHDOG_TIMEOUT, SBD_DELAY_START, sbd systemd start timeout,
stonith-watchdog-timeout property, and stonith-timeout property.

Remove several methods with the same logic from the sbd.SBDTimeout class
@liangxin1300 liangxin1300 force-pushed the 20250917_health_timeout branch from 8baf58e to d3589bb Compare November 13, 2025 10:07
@liangxin1300 liangxin1300 force-pushed the 20250917_health_timeout branch from d3589bb to c762336 Compare November 14, 2025 02:55