Commit 495056d

AB#6887: Create windows-server-failover-cluster-health-service.md (#10054)
* Create windows-server-failover-cluster-health-service.md * Update windows-server-failover-cluster-health-service.md
1 parent 44f619b commit 495056d

1 file changed

+203
-0
lines changed

@@ -0,0 +1,203 @@
1+
---
2+
title: Windows Server Failover Cluster Health Service Troubleshooting Guide
3+
description: Resolves issues that affect the Windows Server Failover Cluster Health Service.
4+
ms.date: 10/06/2025
5+
author: kaushika-msft
6+
ms.author: kaushika
7+
manager: dcscontentpm
8+
audience: itpro
9+
ms.topic: troubleshooting
10+
ms.reviewer: kaushika
11+
ms.custom:
12+
- sap: high availability\health service
13+
- pcy: High availability\health service
14+
appliesto:
15+
- <a href=https://learn.microsoft.com/windows/release-health/windows-server-release-info target=_blank>Supported versions of Windows Server</a>
16+
---
17+
18+
# Windows Server failover cluster health service troubleshooting guide
19+
20+
## Summary
21+
22+
Windows Server Health Service is a key component of Windows Server, Failover Clustering, and System Center environments. It proactively monitors the health and availability of critical cluster resources, services, disks, storage, and network connectivity, and generates alerts to help administrators address faults before they lead to outages.
23+
If the Health Service malfunctions, doesn't start, or produces persistent warnings, it can mask underlying problems or generate false positives, endangering both availability and performance.
24+
25+
This article provides a detailed checklist for troubleshooting Health Service issues. It covers diagnostic steps, common error scenarios, verified solutions, and tools for collecting data for root cause analysis.
26+
27+
## Troubleshooting checklist
28+
29+
Use this checklist for systematic troubleshooting:
30+
31+
- Verify Health Service Status
32+
- Check the status in Failover Cluster Manager (mmc), Windows Event Viewer, and PowerShell: `Get-Service -Name HealthService` and `Get-ClusterResource | Where-Object {$_.ResourceType -eq "HealthService"}`
33+
- Review Recent Alerts and Events
34+
- Inspect the System, Application, and FailoverClustering event logs for errors or recurrent warnings (Event Viewer).
35+
- Note Event IDs such as 0, 7024, 7031, 7034, 5120, 5121, 5126.
36+
- Check Cluster Resource State
37+
- Verify that the Health Service is online (green state) in the cluster dashboard.
38+
- Make sure that no dependent resources or critical services are failing.
39+
- Assess System Prerequisites
40+
- Verify that cluster nodes have consistent OS and update levels.
41+
- Confirm that each node has sufficient disk space, memory, and network connectivity.
42+
- Confirm that antivirus or security software exclusions are configured for cluster paths and binaries.
43+
- Review Recent Changes
44+
- Identify any recent updates, upgrades, network changes, cluster configuration modifications, or disk additions.
45+
- Roll back or validate the impact of recent updates if symptoms began after changes.
46+
- Examine Network and Firewall Settings
47+
- Make sure that the ports required by the cluster and the Health Service are open between nodes (for example, TCP 5985 for WinRM, TCP 135 for RPC, TCP 445 for SMB, and TCP/UDP 3343 for the Cluster service).
48+
- Verify Disk and Storage Health
49+
- Check the state of Cluster Shared Volumes (CSV), quorum disks, and the witness disk.
50+
- Run chkdsk and review health diagnostics.
51+
- Synchronize Time and Certificates
52+
- Verify that NTP is properly configured and that there's no significant clock drift across cluster nodes.
53+
- Verify that certificates used by the Health Service (if any) are valid and not expired.
54+
- Run Cluster Validation
55+
- Use the Cluster Validation Wizard in Failover Cluster Manager for automated checks, focusing on storage, networking, and system configuration.
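
The port check in the list above can be scripted. The following is a minimal sketch, not a definitive implementation: `Get-PortCheckPlan` is a hypothetical helper (not part of any Windows module) that expands node names and the example ports into one check per pair, and the node names are placeholders. `Test-NetConnection` tests TCP only, so UDP 3343 isn't covered by this sketch.

```powershell
# Hypothetical helper: builds one (node, port) pair per TCP check.
function Get-PortCheckPlan {
    param(
        [Parameter(Mandatory)][string[]]$Node,
        # Example TCP ports named in the checklist; adjust for your environment.
        [int[]]$Port = @(5985, 135, 445, 3343)
    )
    foreach ($n in $Node) {
        foreach ($p in $Port) {
            [pscustomobject]@{ ComputerName = $n; Port = $p }
        }
    }
}

# Usage on a cluster node (NODE1 and NODE2 are placeholder names):
# Get-PortCheckPlan -Node 'NODE1', 'NODE2' | ForEach-Object {
#     Test-NetConnection -ComputerName $_.ComputerName -Port $_.Port |
#         Select-Object ComputerName, RemotePort, TcpTestSucceeded
# }
```

Rows where `TcpTestSucceeded` is `False` point to a blocked or unreachable port between nodes.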
56+
57+
## Common issues and solutions
58+
59+
### 1. Health Service doesn't start or goes offline
60+
61+
#### Symptoms
62+
63+
- Failover Cluster Manager shows the Health Service as offline or in a failed state.
64+
- Event IDs: 7024, 7031, 0 (service terminated unexpectedly), Cluster event 1205/1069.
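
To search the event logs for the IDs that this guide mentions, a filter hashtable for `Get-WinEvent` is one option. The following sketch assumes the events are in the System and Application logs and looks back 24 hours. `New-HealthEventFilter` is a hypothetical helper name, and the default ID list is copied from the troubleshooting checklist.

```powershell
# Hypothetical helper: builds a Get-WinEvent filter for the event IDs in this guide.
function New-HealthEventFilter {
    param(
        [string[]]$LogName = @('System', 'Application'),
        [int[]]$Id = @(0, 7024, 7031, 7034, 5120, 5121, 5126),
        [int]$Hours = 24   # look-back window (an assumption; widen as needed)
    )
    @{
        LogName   = $LogName
        Id        = $Id
        StartTime = (Get-Date).AddHours(-$Hours)
    }
}

# Usage: count matching events per ID. Get-WinEvent writes an error when
# nothing matches, so that error is suppressed here.
# Get-WinEvent -FilterHashtable (New-HealthEventFilter) -ErrorAction SilentlyContinue |
#     Group-Object Id | Sort-Object Count -Descending | Select-Object Count, Name
```

Pass `-Id 1205, 1069` to look for the cluster events listed in the symptoms instead.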
65+
66+
#### Cause and resolution
67+
68+
- Corrupted Health Service Binary/Install:
69+
- Action: Uninstall and reinstall the Health Service role/component.
70+
- System Integrity Violation:
71+
- Action: Run `sfc /scannow` and a DISM repair on each node: `DISM /Online /Cleanup-Image /RestoreHealth`
72+
- Insufficient Resources (RAM/CPU/Disk):
73+
- Action: Free up space, increase resources, move workloads as needed.
74+
- Pending Windows Updates:
75+
- Action: Install all pending updates and restart cluster nodes.
76+
- Dependent Services Not Running:
77+
- Action: Make sure that WMI, RemoteRegistry, and Cluster Service are running.
78+
- AV/Firewall Interference:
79+
- Action: Add exclusions for:
80+
- C:\Windows\Cluster
81+
- C:\ClusterStorage
82+
- Cluster binaries and SQL/Hyper-V binaries
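
The dependent-services check above can be sketched in PowerShell. This example assumes the service names `Winmgmt` (WMI), `RemoteRegistry`, and `ClusSvc` (Cluster Service). `Get-StoppedDependency` is a hypothetical filter function, not a built-in cmdlet.

```powershell
# Hypothetical filter: passes through only services that aren't running.
function Get-StoppedDependency {
    param(
        [Parameter(ValueFromPipeline)]$Service
    )
    process {
        # Works whether Status is the service status enum or a plain string.
        if ($Service.Status -ne 'Running') {
            [pscustomobject]@{
                Name   = $Service.Name
                Status = "$($Service.Status)"
            }
        }
    }
}

# Usage on a cluster node:
# Get-Service -Name Winmgmt, RemoteRegistry, ClusSvc | Get-StoppedDependency
```

No output means that all three services report a running state.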
83+
84+
### 2. Health Service event log errors or recurrent cluster health alerts
85+
86+
#### Symptoms
87+
88+
- Constant health warnings.
89+
- Event Log IDs: 5120/5121 (storage/network issue), 5126 (resource offline), WMI errors.
90+
91+
#### Cause and resolution
92+
93+
- Resource Dependency Failure:
94+
- Action: Verify all dependencies (disks, SMB, storage, replication) are online and healthy.
95+
- Network Partition:
96+
- Action: Use Cluster Validation and netstat to trace missing/broken communication.
97+
- Reconfigure or repair network adapters, review switch and routing logs.
98+
- Quorum Disk or Witness Failure:
99+
- Action: Check disk health, run chkdsk, replace or reconfigure the quorum or witness disk as needed.
100+
- Storage Spaces/S2D Pool Problems:
101+
- Action: Use PowerShell to check state: `Get-StoragePool`, `Get-VirtualDisk`, and `Get-PhysicalDisk`.
102+
- Run pool optimization, add/replace failed disks.
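
To narrow the storage cmdlet output to objects that need attention, filtering on `HealthStatus` is one approach. The following is a sketch under the assumption that anything other than `Healthy` is worth reviewing. `Select-UnhealthyStorage` is a hypothetical helper name.

```powershell
# Hypothetical filter: keeps storage objects whose HealthStatus isn't 'Healthy'.
function Select-UnhealthyStorage {
    param(
        [Parameter(ValueFromPipeline)]$InputObject
    )
    process {
        if ($InputObject.HealthStatus -ne 'Healthy') {
            $InputObject
        }
    }
}

# Usage on a cluster node:
# Get-StoragePool | Select-UnhealthyStorage
# Get-VirtualDisk | Select-UnhealthyStorage
# Get-PhysicalDisk | Select-UnhealthyStorage |
#     Select-Object FriendlyName, HealthStatus, OperationalStatus
```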
103+
104+
### 3. High CPU and memory usage by Health Service
105+
106+
#### Symptoms
107+
108+
- "HealthService.exe" consumes excessive system resources.
109+
110+
#### Cause and resolution
111+
112+
- Many resources/monitors:
113+
- Action: Optimize monitored objects, remove unneeded performance counters, or split workloads across nodes.
114+
- Log file bloat or leaks:
115+
- Action: Archive or truncate logs, check for stuck transactions, clear temp directories.
116+
- Corrupted performance counters:
117+
- Action: Rebuild counters: lodctr /r
118+
119+
### 4. Health Service certificate/authentication errors
120+
121+
#### Symptoms
122+
123+
- Events mentioning certificate issues, access denied (5), authentication or secure channel errors.
124+
125+
#### Cause and resolution
126+
- Expired/revoked certificate:
127+
- Action: Renew or replace Health Service and cluster certificates. Make sure that CRLs (Certificate Revocation Lists) are accessible.
128+
- Time skew:
129+
- Action: Make sure that time is synchronized (NTP) across cluster nodes. Check the time service by running `w32tm /query /status`.
130+
- Mismatched Security/Authentication Protocols:
131+
- Action: Verify that the Kerberos and NTLM authentication settings match across all cluster nodes.
132+
- Set GPO/registry settings as appropriate.
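
Certificate expiry can be checked from PowerShell. This sketch assumes the relevant certificates are in the local machine's personal store (`Cert:\LocalMachine\My`), which this guide doesn't confirm, and uses a 30-day warning window. `Select-ExpiringCertificate` is a hypothetical helper name.

```powershell
# Hypothetical filter: reports certificates that are expired or expire soon.
function Select-ExpiringCertificate {
    param(
        [Parameter(ValueFromPipeline)]$Certificate,
        [int]$Days = 30,               # warning window (an assumption)
        [datetime]$Now = (Get-Date)
    )
    process {
        if ($Certificate.NotAfter -le $Now.AddDays($Days)) {
            [pscustomobject]@{
                Subject  = $Certificate.Subject
                NotAfter = $Certificate.NotAfter
                Expired  = ($Certificate.NotAfter -lt $Now)
            }
        }
    }
}

# Usage (the store path is an assumption):
# Get-ChildItem Cert:\LocalMachine\My | Select-ExpiringCertificate -Days 30
```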
133+
134+
### 5. "Access Denied" or permissions/registry issues
135+
136+
#### Symptoms
137+
138+
- Health Service can't access cluster resources, system logs, or registry keys.
139+
- Specific error codes (for example, 5, 0x80070005).
140+
141+
#### Cause and resolution
142+
- Lack of permissions:
143+
- Action: Add the cluster service account and SYSTEM to the local Administrators group, and grant them the required resource permissions.
144+
- Computer/service account password problems:
145+
- Action: Reset secure channels by running `Test-ComputerSecureChannel -Repair -Verbose`
146+
147+
### 6. WMI issues affect Health Service
148+
149+
#### Symptoms
150+
151+
- Errors that involve the WMI repository or failures to access management objects.
152+
- Cluster logs show WMI-related errors.
153+
154+
#### Cause and resolution
155+
156+
- Corrupted WMI Repository:
157+
- Action:
158+
159+
```console
160+
winmgmt /verifyrepository
161+
winmgmt /resetrepository
162+
mofcomp cluswmi.mof
163+
```
164+
165+
- Insufficient WMI namespace permissions:
166+
167+
- Action: Use "wmimgmt.msc" to verify security on root\cimv2 and the cluster namespaces. Repair the permissions by using "wmimgmt.msc" or PowerShell.
168+
169+
## Common issues quick reference table
170+
171+
| Symptom | Error/Event IDs | Cause | Resolution |
172+
| --- | --- | --- | --- |
173+
| Health Service offline | 7024, 7031, 0 | Binary corruption, missing dependency | Reinstall, check services |
174+
| Persistent health warnings | 5120, 5121, 5126 | Storage/network/quorum issue | Check disks, CSV/Network |
175+
| High CPU/Memory HealthService | - | Many monitors/perf counters, leaks | Optimize, clear counters |
176+
| Access Denied | 5, 0x80070005 | Permissions, AV, or ACL misconfiguration | Set permissions, AV excl. |
177+
| WMI repository errors | - | WMI repository corrupt | winmgmt /resetrepository |
178+
| SSL/certificate/auth failures | 7034, 1069, 1207 | Expired/missing certificates, time skew | Renew/re-import, sync NTP |
179+
| Health check triggers failover | 1676, 1135, 1177 | Missed heartbeat, IsAlive threshold, VSS | Optimize backup, net diag |
180+
| Service can't access registry | 86, 5126 | Missing Registry key or permission | Add/restore, ACL review |
181+
182+
## Data collection
183+
184+
Before you contact Microsoft Support, you can gather the following information about your issue.
185+
186+
- Cluster diagnostics logs: Get-ClusterLog -Destination \<path> -UseLocalTime
187+
- Service event logs:
188+
- Application, system, and FailoverClustering logs from Event Viewer.
189+
- Service state snapshots: Get-Service HealthService, Get-ClusterResource
190+
- Network and storage health: ipconfig /all, Test-NetConnection, Get-StoragePool, Get-VirtualDisk
191+
- Resource and role list: Get-ClusterGroup, Get-ClusterResource
192+
- WMI diagnostics: winmgmt /verifyrepository
193+
- Cluster validation report: Run cluster validation in Failover Cluster Manager.
194+
- Screenshots of specific errors and of resource status.
195+
196+
Send logs and diagnostics to support or use secure workspace sharing per organizational policy.
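
The collection steps above can be gathered into one folder. The following is a minimal sketch: `New-DiagnosticsFolder` is a hypothetical helper, and the folder and file names are placeholders. The commented commands are the ones listed in this section and must be run on a cluster node.

```powershell
# Hypothetical helper: creates a timestamped output folder and returns its path.
function New-DiagnosticsFolder {
    param(
        [string]$Root = [System.IO.Path]::GetTempPath(),
        [datetime]$Now = (Get-Date)
    )
    $path = Join-Path $Root ('HealthServiceDiag_{0:yyyyMMdd_HHmmss}' -f $Now)
    New-Item -ItemType Directory -Path $path -Force | Out-Null
    $path
}

# Usage on a cluster node (file names are placeholders):
# $out = New-DiagnosticsFolder
# Get-ClusterLog -Destination $out -UseLocalTime
# Get-ClusterGroup          | Out-File (Join-Path $out 'ClusterGroup.txt')
# Get-ClusterResource       | Out-File (Join-Path $out 'ClusterResource.txt')
# ipconfig /all             | Out-File (Join-Path $out 'ipconfig.txt')
# winmgmt /verifyrepository | Out-File (Join-Path $out 'wmi-verify.txt')
```

Keeping everything under one timestamped folder makes it easier to compress and share a single package.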
197+
198+
## References
199+
200+
- [Windows Server Failover Cluster Health Service Documentation](/windows-server/failover-clustering/health-service-overview)
201+
- [Event ID and Diagnostic Reference](/windows/win32/eventlog/event-logging)
202+
- [Cluster Validation Wizard Guide](/troubleshoot/windows-server/high-availability/validate-hardware-failover-cluster?source=recommendations)
203+
- [Health Service reports](/windows-server/failover-clustering/health-service-reports)
