Check and fix SBD-related timeout values #1932

liangxin1300 · 2025-09-28T09:17:58Z

This PR introduces the sbd option for the 'crm cluster health' command, and adds the class sbd.SBDTimeoutChecker to provide methods for checking and fixing SBD-related timeout values.

Check SBD-related configurations' consistency

# crm sbd configure show sysconfig 
INFO: crm sbd configure show sysconfig
SBD_PACEMAKER=yes
SBD_STARTMODE=always
SBD_DELAY_START=71
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_TIMEOUT_ACTION=flush,reboot
SBD_MOVE_TO_ROOT_CGROUP=auto
SBD_SYNC_RESOURCE_STARTUP=yes
SBD_OPTS=
SBD_DEVICE=/dev/sda5
--- sle16-2
+++ sle16-1
@@ -48 +48 @@
-SBD_DELAY_START=no
+SBD_DELAY_START=71
WARNING: /etc/sysconfig/sbd is not consistent across cluster nodes
WARNING: Please ensure the configurations are consistent across all cluster nodes

# crm cluster health sbd
--- sle16-2
+++ sle16-1
@@ -48 +48 @@
-SBD_DELAY_START=no
+SBD_DELAY_START=71
WARNING: /etc/sysconfig/sbd is not consistent across cluster nodes
WARNING: Please ensure the configurations are consistent across all cluster nodes
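
The consistency check above reports a unified diff of `/etc/sysconfig/sbd` between nodes with no context lines. As a rough illustration only (not the crmsh implementation), a comparison of this shape can be sketched with Python's standard `difflib`. The helper name `diff_sysconfig` and the two-line sample configs are assumptions made for this example.

```python
import difflib


def diff_sysconfig(name_a, text_a, name_b, text_b):
    """Return unified-diff lines between two nodes' config texts.

    Hypothetical helper for illustration; n=0 suppresses context lines,
    which yields hunk headers of the form "@@ -N +N @@".
    """
    return list(difflib.unified_diff(
        text_a.splitlines(),
        text_b.splitlines(),
        fromfile=name_a,
        tofile=name_b,
        n=0,
        lineterm="",
    ))


# Shortened sample contents (assumed); real files have many more lines.
node2 = "SBD_PACEMAKER=yes\nSBD_DELAY_START=no\n"
node1 = "SBD_PACEMAKER=yes\nSBD_DELAY_START=71\n"

diff = diff_sysconfig("sle16-2", node2, "sle16-1", node1)
for line in diff:
    print(line)
if diff:
    print("WARNING: /etc/sysconfig/sbd is not consistent across cluster nodes")
```

An empty diff means the two files are identical, so the warning is printed only when at least one line differs.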

Check and fix SBD disk metadata

# crm sbd configure show disk_metadata 
INFO: crm sbd configure show disk_metadata
==Dumping header on disk /dev/sda5
Header version     : 2.1
UUID               : b55e85dc-8dde-4a32-b238-d45a5f017963
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 15
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 16
==Header on disk /dev/sda5 is dumped

WARNING: It's recommended that msgwait(now 16) >= 2*watchdog timeout(now 15)

# crm cluster health sbd
WARNING: It's recommended that msgwait(now 16) >= 2*watchdog timeout(now 15)
ERROR: SBD: Check sbd timeout configuration: FAIL.
WARNING: Please run "crm cluster health sbd --fix"

# crm cluster health sbd --fix
INFO: Adjusting sbd msgwait to 30
INFO: Configuring disk-based SBD
INFO: Initializing SBD device /dev/sda5
INFO: Restarting cluster service
INFO: BEGIN Waiting for cluster
INFO: END Waiting for cluster
INFO: SBD: Check sbd timeout configuration: OK.
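
The disk metadata check compares msgwait against twice the watchdog timeout, and in the run above the fix raises msgwait from 16 to 30. A minimal sketch of that comparison follows, assuming the dump format shown above. The function names are hypothetical, and the sketch ignores the crashdump watchdog timeout that a later commit in this PR takes into account.

```python
import re


def parse_sbd_timeouts(dump_text):
    """Extract 'Timeout (name) : value' pairs from dump-style output.

    Hypothetical helper; it only understands the line format shown above.
    """
    pattern = re.compile(r"^Timeout \((\w+)\)[ \t]*:[ \t]*(\d+)[ \t]*$", re.MULTILINE)
    return {name: int(value) for name, value in pattern.findall(dump_text)}


def recommended_msgwait(watchdog_timeout):
    """Recommendation quoted in the warning: msgwait >= 2 * watchdog timeout."""
    return 2 * watchdog_timeout


dump = """\
==Dumping header on disk /dev/sda5
Timeout (watchdog) : 15
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 16
==Header on disk /dev/sda5 is dumped
"""

timeouts = parse_sbd_timeouts(dump)
expected = recommended_msgwait(timeouts["watchdog"])
if timeouts["msgwait"] < expected:
    print(f"msgwait is {timeouts['msgwait']}, recommended at least {expected}")
```

With the values from the transcript (watchdog 15, msgwait 16), the recommended minimum is 30, which matches the value applied by `--fix`.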

Check and fix SBD_DELAY_START

# crm sbd configure show sysconfig 
INFO: crm sbd configure show sysconfig
SBD_PACEMAKER=yes
SBD_STARTMODE=always
SBD_DELAY_START=40
SBD_WATCHDOG_DEV=/dev/watchdog0
SBD_TIMEOUT_ACTION=flush,reboot
SBD_MOVE_TO_ROOT_CGROUP=auto
SBD_SYNC_RESOURCE_STARTUP=yes
SBD_OPTS=
SBD_DEVICE=/dev/sda5
WARNING: It's recommended that SBD_DELAY_START is set to 71, now is 40

# crm cluster health sbd
WARNING: It's recommended that SBD_DELAY_START is set to 71, now is 40
ERROR: SBD: Check sbd timeout configuration: FAIL.
WARNING: Please run "crm cluster health sbd --fix"

# crm cluster health sbd --fix
INFO: Update SBD_DELAY_START in /etc/sysconfig/sbd: 71
INFO: Sync file /etc/sysconfig/sbd to sle16-2
INFO: Already synced /etc/sysconfig/sbd to all nodes
INFO: SBD: Check sbd timeout configuration: OK.

Check and fix SBD systemd start timeout

# crm sbd configure show property 
INFO: crm sbd configure show property
have-watchdog=true
stonith-enabled=true
stonith-timeout=119

INFO: crm configure show related:fence_sbd
primitive stonith-sbd stonith:fence_sbd \
        params pcmk_delay_max=30s

INFO: systemctl show -p TimeoutStartUSec sbd.service --value
TimeoutStartUSec=90
WARNING: It's recommended that systemd start timeout for sbd.service is set to 157s, now is 90s

# crm cluster health sbd
WARNING: It's recommended that systemd start timeout for sbd.service is set to 157s, now is 90s
ERROR: SBD: Check sbd timeout configuration: FAIL.
WARNING: Please run "crm cluster health sbd --fix"

# crm cluster health sbd --fix
INFO: Adjusting systemd start timeout for sbd.service to 157s
INFO: Sync directory /etc/systemd/system/sbd.service.d to sle16-2
INFO: SBD: Check sbd timeout configuration: OK.

Check and fix stonith-timeout

# crm sbd configure show property 
INFO: crm sbd configure show property
have-watchdog=true
stonith-enabled=true

INFO: crm configure show related:fence_sbd
primitive stonith-sbd stonith:fence_sbd \
        params pcmk_delay_max=30s

INFO: systemctl show -p TimeoutStartUSec sbd.service --value
TimeoutStartUSec=157
WARNING: It's recommended that stonith-timeout is set to 119, now is not set

# crm cluster health sbd
WARNING: It's recommended that stonith-timeout is set to 119, now is not set
ERROR: SBD: Check sbd timeout configuration: FAIL.
WARNING: Please run "crm cluster health sbd --fix"

# crm cluster health sbd --fix
INFO: Adjusting stonith-timeout to 119
WARNING: "stonith-timeout" in crm_config is set to 119, it was 60s
INFO: SBD: Check sbd timeout configuration: OK.

Check and fix stonith-watchdog-timeout

# crm sbd configure show property 
INFO: crm sbd configure show property
have-watchdog=true
stonith-enabled=true
stonith-timeout=119
stonith-watchdog-timeout=30

INFO: crm configure show related:fence_sbd
primitive stonith-sbd stonith:fence_sbd \
        params pcmk_delay_max=30s

INFO: systemctl show -p TimeoutStartUSec sbd.service --value
TimeoutStartUSec=157
WARNING: It's recommended that stonith-watchdog-timeout is not set when using disk-based SBD

# crm cluster health sbd
WARNING: It's recommended that stonith-watchdog-timeout is not set when using disk-based SBD
ERROR: SBD: Check sbd timeout configuration: FAIL.
WARNING: Please run "crm cluster health sbd --fix"

# crm cluster health sbd --fix
INFO: Removing stonith-watchdog-timeout property
INFO: Delete cluster property "stonith-watchdog-timeout" in crm_config
INFO: SBD: Check sbd timeout configuration: OK.

# For diskless case
# crm sbd configure show property 
INFO: crm sbd configure show property
have-watchdog=true
stonith-enabled=true
stonith-timeout=71

INFO: systemctl show -p TimeoutStartUSec sbd.service --value
TimeoutStartUSec=90
WARNING: It's recommended that stonith-watchdog-timeout is set to at least 30, now is not set

# crm cluster health sbd
WARNING: It's recommended that stonith-watchdog-timeout is set to at least 30, now is not set
ERROR: SBD: Check sbd timeout configuration: FAIL.
WARNING: Please run "crm cluster health sbd --fix"

# crm cluster health sbd --fix
INFO: Adjusting stonith-watchdog-timeout to 30
WARNING: "stonith-watchdog-timeout" in crm_config is set to 30, it was 0
INFO: SBD: Check sbd timeout configuration: OK.

Other cases

corosync token timeout increase

# crm cluster health sbd
WARNING: It's recommended that SBD_DELAY_START is set to 82, now is 71
ERROR: SBD: Check sbd timeout configuration: FAIL.
WARNING: Please run "crm cluster health sbd --fix"

# crm cluster health sbd --fix
INFO: Update SBD_DELAY_START in /etc/sysconfig/sbd: 82
INFO: Sync file /etc/sysconfig/sbd to sle16-2
INFO: Already synced /etc/sysconfig/sbd to all nodes
INFO: Adjusting systemd start timeout for sbd.service to 98s
INFO: Sync directory /etc/systemd/system/sbd.service.d to sle16-2
INFO: Adjusting stonith-timeout to 82
WARNING: "stonith-timeout" in crm_config is set to 82, it was 71
INFO: SBD: Check sbd timeout configuration: OK.

nicholasyang2022 · 2025-11-10T09:21:31Z

crmsh/ui_cluster.py

+            case 'sbd':
+                fix = parsed_args.fix
+                try:
+                    warn = False if fix else True


Suggested change

-warn = False if fix else True
+warn = not fix
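
As a quick sanity check of the suggestion, the two expressions can be compared over a few truthy and falsy inputs. The wrapper function names below are hypothetical and exist only for this comparison.

```python
def warn_original(fix):
    # Expression from the PR as submitted.
    return False if fix else True


def warn_suggested(fix):
    # Expression proposed in the review.
    return not fix


# Both forms depend only on the truthiness of `fix`, so they agree
# for booleans as well as for other truthy or falsy values.
for fix in (True, False, None, 0, 1, "", "yes"):
    assert warn_original(fix) == warn_suggested(fix)
```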

nicholasyang2022 · 2025-11-10T09:28:52Z

crmsh/sbd.py

 from . import xmlutil
 from . import watchdog
 from . import parallax
+from . import healthcheck


It does not make sense to add a dependency on the healthcheck module just for the FixFailure exception. It is just an ordinary subclass of Exception without any extra features.

nicholasyang2022 · 2025-11-10T09:33:54Z

crmsh/sbd.py

-        utils.cluster_run_cmd(f"{test_dir_cmd} && {rm_dir_cmd} && {reload_cmd} || exit 0")
+class SBDTimeoutChecker(SBDTimeout):
+
+    def __init__(self, warn=True, fix=False, filter_str: str = "", from_bootstrap=False):


Suggested change

-def __init__(self, warn=True, fix=False, filter_str: str = "", from_bootstrap=False):
+def __init__(self, warn=True, fix=False, check_category: str = "", from_bootstrap=False):

codecov · 2025-11-11T09:26:14Z

Codecov Report

❌ Patch coverage is 79.69543% with 40 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.81%. Comparing base (9d10628) to head (c762336).

Files with missing lines	Patch %	Lines
crmsh/sbd.py	85.11%	25 Missing ⚠️
crmsh/ui_cluster.py	26.31%	14 Missing ⚠️
crmsh/ui_sbd.py	83.33%	1 Missing ⚠️

Additional details and impacted files

Flag	Coverage Δ
integration	`55.01% <15.22%> (-0.18%)`	⬇️
unit	`53.04% <77.15%> (+0.12%)`	⬆️

Files with missing lines	Coverage Δ
crmsh/bootstrap.py	`87.85% <100.00%> (ø)`
crmsh/utils.py	`67.58% <100.00%> (+0.01%)`	⬆️
crmsh/ui_sbd.py	`83.80% <83.33%> (+0.03%)`	⬆️
crmsh/ui_cluster.py	`70.78% <26.31%> (-1.62%)`	⬇️
crmsh/sbd.py	`86.10% <85.11%> (+2.21%)`	⬆️

liangxin1300 · 2025-11-13T09:33:27Z

Missing: check and fix for the crashdump timeout

to check and fix SBD-related timeout values. The check cases include: consistency of SBD-related configurations across cluster nodes, SBD disk metadata, SBD_WATCHDOG_TIMEOUT, SBD_DELAY_START, the sbd systemd start timeout, the stonith-watchdog-timeout property, and the stonith-timeout property. Remove several methods with the same logic from the sbd.SBDTimeout class.

liangxin1300 force-pushed the 20250917_health_timeout branch 9 times, most recently from 1c3416b to f8b0f1b Compare September 29, 2025 11:38

liangxin1300 force-pushed the 20250917_health_timeout branch 5 times, most recently from ba3a1bc to fe7983c Compare November 10, 2025 02:45

liangxin1300 changed the title ~~20250917 health timeout~~ Check and fix SBD related timeout values Nov 10, 2025

liangxin1300 changed the title ~~Check and fix SBD related timeout values~~ Check and fix SBD-related timeout values Nov 10, 2025

liangxin1300 force-pushed the 20250917_health_timeout branch from fe7983c to febb236 Compare November 10, 2025 03:51

nicholasyang2022 reviewed Nov 10, 2025

liangxin1300 force-pushed the 20250917_health_timeout branch 2 times, most recently from 0ee1753 to eb66873 Compare November 10, 2025 11:56

Dev: ui_cluster: Introduce sbd option for 'crm cluster health' command

b8b6741

liangxin1300 force-pushed the 20250917_health_timeout branch from eb66873 to d78e638 Compare November 11, 2025 08:57

liangxin1300 force-pushed the 20250917_health_timeout branch 4 times, most recently from 2736d67 to 8baf58e Compare November 12, 2025 12:34

liangxin1300 marked this pull request as ready for review November 12, 2025 12:42

liangxin1300 requested a review from zzhou1 November 12, 2025 12:42

liangxin1300 added 2 commits November 13, 2025 18:07

Dev: doc: Add help info for crm cluster health sbd

359cbc7

liangxin1300 force-pushed the 20250917_health_timeout branch from 8baf58e to d3589bb Compare November 13, 2025 10:07

liangxin1300 added 3 commits November 14, 2025 10:54

Dev: behave: Adjust functional test for previous commits

8a1b276

Dev: sbd: Take crashdump watchdog timeout into account

d146555

when calculating expected msgwait and stonith-watchdog-timeout

Dev: unittests: Adjust unit test for previous commit

c762336

liangxin1300 force-pushed the 20250917_health_timeout branch from d3589bb to c762336 Compare November 14, 2025 02:55
