|
| 1 | +--- |
| 2 | +title: Windows Server Failover Cluster Health Service Troubleshooting Guide |
| 3 | +description: Resolves issues that affect the Windows Server Failover Cluster Health Service. |
| 4 | +ms.date: 10/06/2025 |
| 5 | +author: kaushika-msft |
| 6 | +ms.author: kaushika |
| 7 | +manager: dcscontentpm |
| 8 | +audience: itpro |
| 9 | +ms.topic: troubleshooting |
| 10 | +ms.reviewer: kaushika |
| 11 | +ms.custom: |
| 12 | +- sap: high availability\health service |
| 13 | +- pcy: High availability\health service |
| 14 | +appliesto: |
| 15 | + - <a href=https://learn.microsoft.com/windows/release-health/windows-server-release-info target=_blank>Supported versions of Windows Server</a> |
| 16 | +--- |
| 17 | + |
| 18 | +# Windows Server failover cluster health service troubleshooting guide |
| 19 | + |
| 20 | +## Summary |
| 21 | + |
| 22 | +Windows Server Health Service is a key component of Windows Server, Failover Clustering, and System Center environments. It proactively monitors the health and availability of critical cluster resources, services, disks, storage, and network connectivity, and generates alerts to help administrators address faults before they lead to outages. |
| 23 | +If the Health Service malfunctions, doesn't start, or produces persistent warnings, it can mask underlying problems or generate false positives, endangering both availability and performance. |
| 24 | + |
| 25 | +This article provides a detailed checklist for troubleshooting Health Service issues. It covers diagnostic steps, common error scenarios, verified solutions, and tools for collecting data for root cause analysis. |
| 26 | + |
| 27 | +## Troubleshooting checklist |
| 28 | + |
| 29 | +Use this checklist for systematic troubleshooting: |
| 30 | + |
| 31 | +- Verify Health Service Status |
| 32 | + - Check the status in Failover Cluster Manager (mmc), Windows Event Viewer, and PowerShell: `Get-Service -Name HealthService` and `Get-ClusterResource | Where-Object {$_.ResourceType -eq "HealthService"}` |
| 33 | +- Review Recent Alerts and Events |
| 34 | + - Inspect the System, Application, and FailoverClustering event logs for errors or recurrent warnings (Event Viewer). |
| 35 | + - Note Event IDs such as 0, 7024, 7031, 7034, 5120, 5121, 5126. |
| 36 | +- Check Cluster Resource State |
| 37 | + - Verify that the Health Service is online (green state) in the cluster dashboard. |
| 38 | + - Make sure that no dependent resources or critical services are failing. |
| 39 | +- Assess System Prerequisites |
| 40 | + - Verify that cluster nodes have consistent OS and update levels. |
| 41 | + - Verify that sufficient disk space, memory, and network connectivity exist on each node. |
| 42 | + - Verify that antivirus or security software exclusions are configured for cluster paths and binaries. |
| 43 | +- Review Recent Changes |
| 44 | + - Identify any recent updates, upgrades, network changes, cluster configuration modifications, or disk additions. |
| 45 | + - Roll back or validate the impact of recent updates if symptoms began after changes. |
| 46 | +- Examine Network and Firewall Settings |
| 47 | + - Make sure that required ports (both cluster and Health Service) are open between nodes (for example, TCP/UDP 5985, 135, 445, 3343). |
| 48 | +- Verify Disk and Storage Health |
| 49 | + - Check CSV, quorum disks, or witness disk state. |
| 50 | + - Run chkdsk and review health diagnostics. |
| 51 | +- Synchronize Time and Certificates |
| 52 | + - Verify that NTP is properly configured and that there is no significant clock drift across cluster nodes. |
| 53 | + - Verify that certificates used by Health Service (if any) are valid and not expired. |
| 54 | +- Run Cluster Validation |
| 55 | + - Use the Cluster Validation Wizard in Failover Cluster Manager for automated checks, focusing on storage, networking, and system configuration. |
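Several of the checklist items above can be scripted. The following is a minimal sketch, not a definitive procedure. It assumes an elevated PowerShell session on a cluster node with the FailoverClusters module installed. The service name and port list are taken from this article, and the seven-day look-back window is an illustrative assumption.

```powershell
# Sketch only: read-only checks that mirror the checklist above.
Import-Module FailoverClusters

# 1. Service and cluster resource state (names as used in this article).
Get-Service -Name HealthService -ErrorAction SilentlyContinue |
    Select-Object Name, Status, StartType
Get-ClusterResource |
    Select-Object Name, ResourceType, State, OwnerNode

# 2. Recent Service Control Manager events listed in the checklist.
$since = (Get-Date).AddDays(-7)   # assumption: look back one week
Get-WinEvent -FilterHashtable @{
    LogName      = 'System'
    ProviderName = 'Service Control Manager'
    Id           = 7024, 7031, 7034
    StartTime    = $since
} -ErrorAction SilentlyContinue |
    Select-Object TimeCreated, Id, ProviderName, Message

# 3. TCP reachability of the other nodes on the ports named in the checklist.
#    Test-NetConnection tests TCP only; UDP 3343 needs a different tool.
$ports = 5985, 135, 445, 3343
$peers = Get-ClusterNode | Where-Object { $_.Name -ne $env:COMPUTERNAME }
foreach ($node in $peers) {
    foreach ($port in $ports) {
        Test-NetConnection -ComputerName $node.Name -Port $port -WarningAction SilentlyContinue |
            Select-Object ComputerName, RemotePort, TcpTestSucceeded
    }
}
```

Every command in this sketch only reads state, so it can be run before any change is made and its output saved as a baseline.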
| 56 | + |
| 57 | +## Common issues and solutions |
| 58 | + |
| 59 | +### 1. Health service doesn't start or goes offline |
| 60 | + |
| 61 | +#### Symptoms |
| 62 | + |
| 63 | +- Failover Cluster Manager shows the Health Service as offline or in a failed state. |
| 64 | +- Event IDs: 7024, 7031, 0 (service terminated unexpectedly), Cluster event 1205/1069. |
| 65 | + |
| 66 | +#### Cause and resolution |
| 67 | + |
| 68 | +- Corrupted Health Service Binary/Install: |
| 69 | + - Action: Uninstall and reinstall the Health Service role/component. |
| 70 | +- System Integrity Violation: |
| 71 | + - Action: Run `sfc /scannow` and a DISM repair on each node: `DISM /Online /Cleanup-Image /RestoreHealth` |
| 72 | +- Insufficient Resources (RAM/CPU/Disk): |
| 73 | + - Action: Free up space, increase resources, move workloads as needed. |
| 74 | +- Pending Windows Updates: |
| 75 | + - Action: Install all pending updates and restart cluster nodes. |
| 76 | +- Dependent Services Not Running: |
| 77 | + - Action: Make sure that WMI, RemoteRegistry, and Cluster Service are running. |
| 78 | +- AV/Firewall Interference: |
| 79 | + - Action: Add exclusions for: |
| 80 | + - C:\Windows\Cluster |
| 81 | + - C:\ClusterStorage |
| 82 | + - Cluster binaries and SQL/Hyper-V binaries |
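The dependent-service check above can be run against every node at once. This is a minimal sketch that assumes PowerShell remoting (WinRM) is enabled between nodes and that the services use their default short names (`Winmgmt`, `RemoteRegistry`, `ClusSvc`).

```powershell
# Sketch only: report the state of the services named above on every node.
# Winmgmt = WMI, RemoteRegistry = Remote Registry, ClusSvc = Cluster Service.
$services = 'Winmgmt', 'RemoteRegistry', 'ClusSvc'
$nodes    = (Get-ClusterNode).Name

Invoke-Command -ComputerName $nodes -ScriptBlock {
    Get-Service -Name $using:services -ErrorAction SilentlyContinue |
        Select-Object Name, Status, StartType
} | Sort-Object PSComputerName, Name |
    Format-Table PSComputerName, Name, Status, StartType
```

A service that is missing from the output for a node was not found on that node, which is worth investigating separately from one that is present but stopped.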
| 83 | + |
| 84 | +### 2. Health Service event log errors or recurrent cluster health alerts |
| 85 | + |
| 86 | +#### Symptoms |
| 87 | + |
| 88 | +- Constant health warnings. |
| 89 | +- Event Log IDs: 5120/5121 (storage/network issue), 5126 (resource offline), WMI errors. |
| 90 | + |
| 91 | +#### Cause and resolution |
| 92 | + |
| 93 | +- Resource Dependency Failure: |
| 94 | + - Action: Verify all dependencies (disks, SMB, storage, replication) are online and healthy. |
| 95 | +- Network Partition: |
| 96 | + - Action: Use Cluster Validation and netstat to trace missing/broken communication. |
| 97 | + - Reconfigure or repair network adapters, review switch and routing logs. |
| 98 | +- Quorum Disk or Witness Failure: |
| 99 | + - Action: Check disk health, run chkdsk, replace or reconfigure the quorum or witness disk as needed. |
| 100 | +- Storage Spaces/S2D Pool Problems: |
| 101 | + - Action: Use PowerShell to check state: Get-StoragePool, Get-VirtualDisk, Get-PhysicalDisk. |
| 102 | + - Run pool optimization, add/replace failed disks. |
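The storage cmdlets named above return many properties. The following sketch shows one way to narrow the output to the health fields. It is read-only, and the selected columns are an illustrative choice.

```powershell
# Sketch only: read-only view of pool, virtual disk, and physical disk health.
Get-StoragePool -IsPrimordial $false |
    Select-Object FriendlyName, HealthStatus, OperationalStatus, IsReadOnly

Get-VirtualDisk |
    Select-Object FriendlyName, HealthStatus, OperationalStatus, ResiliencySettingName

# List only the physical disks that are not reporting Healthy.
Get-PhysicalDisk |
    Where-Object { $_.HealthStatus -ne 'Healthy' } |
    Select-Object FriendlyName, SerialNumber, MediaType, HealthStatus, OperationalStatus, Usage
```

If the last command returns nothing, no physical disk is currently reporting a status other than Healthy.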
| 103 | + |
| 104 | +### 3. High CPU and memory usage by Health Service |
| 105 | + |
| 106 | +#### Symptoms |
| 107 | + |
| 108 | +- "HealthService.exe" consumes excessive system resources. |
| 109 | + |
| 110 | +#### Cause and resolution |
| 111 | + |
| 112 | +- Many resources/monitors: |
| 113 | + - Action: Optimize monitored objects, remove unneeded performance counters, or split workloads across nodes. |
| 114 | +- Log file bloat or leaks: |
| 115 | + - Action: Archive or truncate logs, check for stuck transactions, clear temp directories. |
| 116 | +- Corrupted performance counters: |
| 117 | + - Action: Rebuild counters: lodctr /r |
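Before changing monitors or counters, it can help to record how much the process is actually using. This is a minimal sketch that assumes the process is named `HealthService`, as in the symptom above. The process name might differ in your environment.

```powershell
# Sketch only: snapshot CPU time and memory for the process named in this section.
Get-Process -Name HealthService -ErrorAction SilentlyContinue |
    Select-Object Name, Id, CPU,
        @{ Name = 'WorkingSetMB'; Expression = { [math]::Round($_.WorkingSet64 / 1MB, 1) } },
        HandleCount, StartTime
```

The `CPU` column is cumulative processor time in seconds, so compare two snapshots taken a known interval apart rather than reading a single value.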
| 118 | + |
| 119 | +### 4. Health Service certificate or authentication errors |
| 120 | + |
| 121 | +#### Symptoms |
| 122 | + |
| 123 | +- Events mentioning certificate issues, access denied (5), authentication or secure channel errors. |
| 124 | + |
| 125 | +#### Cause and resolution |
| 126 | +- Expired/revoked certificate: |
| 127 | + - Action: Renew or replace Health Service and cluster certificates. Make sure that CRLs (Certificate Revocation Lists) are accessible. |
| 128 | +- Time skew: |
| 129 | + - Action: Make sure that time is synchronized through NTP across cluster nodes. Check by running `w32tm /query /status`. |
| 130 | +- Mismatched Security/Authentication Protocols: |
| 131 | + - Action: Verify cluster Kerberos/NTLM protocols match (Windows Authentication must align across nodes). |
| 132 | + - Set GPO/registry settings as appropriate. |
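The certificate and time checks above can be sketched as follows. This example assumes that the relevant certificates are in the local machine Personal store (`Cert:\LocalMachine\My`), which might not be true in every deployment. The 30-day warning window is an illustrative value.

```powershell
# Sketch only: flag certificates in the local machine store that have expired
# or expire within 30 days (the 30-day window is an illustrative assumption).
$cutoff = (Get-Date).AddDays(30)
Get-ChildItem -Path Cert:\LocalMachine\My |
    Where-Object { $_.NotAfter -lt $cutoff } |
    Select-Object Subject, Thumbprint, NotAfter

# Show the time source and last successful sync for each node.
foreach ($node in (Get-ClusterNode).Name) {
    "--- $node ---"
    w32tm /query /status /computer:$node
}
```

Compare the `Source` and `Last Successful Sync Time` lines across nodes. Nodes that report different sources are a common starting point when investigating time skew.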
| 133 | + |
| 134 | +### 5. "Access Denied" or permissions/registry issues |
| 135 | + |
| 136 | +#### Symptoms |
| 137 | + |
| 138 | +- Health Service can't access cluster resources, system logs, or registry keys. |
| 139 | +- Specific error codes (for example, 5, 0x80070005). |
| 140 | + |
| 141 | +#### Cause and resolution |
| 142 | +- Lack of permissions: |
| 143 | + - Action: Add the cluster service account and SYSTEM to local administrator and required resource permissions. |
| 144 | +- Computer/service account password problems: |
| 145 | + - Action: Reset secure channels with `Test-ComputerSecureChannel -Repair -Verbose` |
| 146 | + |
| 147 | +### 6. WMI issues that affect Health Service |
| 148 | + |
| 149 | +#### Symptoms |
| 150 | + |
| 151 | +- Errors involving WMI repository, management object access failures. |
| 152 | +- Cluster logs show WMI-related errors. |
| 153 | + |
| 154 | +#### Cause and resolution |
| 155 | + |
| 156 | +- Corrupted WMI Repository: |
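| 157 | + - Action: Verify the repository first. Reset it only if verification reports an inconsistency, because a reset returns the repository to its initial state. Then recompile the cluster MOF file: |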
| 158 | + |
| 159 | + ```console |
| 160 | + winmgmt /verifyrepository |
| 161 | + winmgmt /resetrepository |
| 162 | + mofcomp cluswmi.mof |
| 163 | + ``` |
| 164 | + |
| 165 | +- Insufficient WMI namespace permissions: |
| 166 | + |
| 167 | + - Action: Use "wmimgmt.msc" to verify security on root\cimv2 and the cluster namespaces. Repair the permissions by using "wmimgmt.msc" or PowerShell. |
| 168 | + |
| 169 | +## Common issues quick reference table |
| 170 | + |
| 171 | +| Symptom | Error/Event IDs | Cause | Resolution | |
| 172 | +| --- | --- | --- | --- | |
| 173 | +| Health Service offline | 7024, 7031, 0 | Binary corruption, missing dependency | Reinstall, check services | |
| 174 | +| Persistent health warnings | 5120, 5121, 5126 | Storage/network/quorum issue | Check disks, CSV/Network | |
| 175 | +| High CPU/memory usage by Health Service | - | Many monitors/perf counters, leaks | Optimize, rebuild counters | |
| 176 | +| Access Denied | 5, 0x80070005 | Permissions, AV, or ACL misconfiguration | Set permissions, AV exclusions | |
| 177 | +| WMI repository errors | - | WMI repository corrupt | winmgmt /resetrepository | |
| 178 | +| SSL/certificate/auth failures | 7034, 1069, 1207 | Expired/missing certificates, time skew | Renew/re-import, sync NTP | |
| 179 | +| Health check triggers failover | 1676, 1135, 1177 | Missed heartbeat, IsAlive threshold, VSS | Optimize backup, net diag | |
| 180 | +| Service can't access registry | 86, 5126 | Missing Registry key or permission | Add/restore, ACL review | |
| 181 | + |
| 182 | +## Data collection |
| 183 | + |
| 184 | +Before you contact Microsoft Support, you can gather the following information about your issue. |
| 185 | + |
| 186 | +- Cluster diagnostics logs: Get-ClusterLog -Destination \<path> -UseLocalTime |
| 187 | +- Service event logs: |
| 188 | + - Application, system, and FailoverClustering logs from Event Viewer. |
| 189 | +- Service state snapshots: Get-Service HealthService, Get-ClusterResource |
| 190 | +- Network and storage health: ipconfig /all, Test-NetConnection, Get-StoragePool, Get-VirtualDisk |
| 191 | +- Resource and role list: Get-ClusterGroup, Get-ClusterResource |
| 192 | +- WMI diagnostics: winmgmt /verifyrepository |
| 193 | +- Cluster validation report: Run cluster validation in Failover Cluster Manager. |
| 194 | +- Screenshots of specific errors and resource status. |
| 195 | + |
| 196 | +Send logs and diagnostics to support or use secure workspace sharing per organizational policy. |
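The items listed in this section can be gathered in one pass. The following is a minimal sketch, assuming an elevated session on a cluster node. `$outDir` is a hypothetical path, and the output file names are illustrative.

```powershell
# Sketch only: collect the items listed above into one folder.
# $outDir is a hypothetical path; change it to suit your environment.
$outDir = 'C:\Temp\HealthServiceDiag'
New-Item -Path $outDir -ItemType Directory -Force | Out-Null

# Cluster log from every node, in local time.
Get-ClusterLog -Destination $outDir -UseLocalTime | Out-Null

# Export event logs (wevtutil epl = export-log; /ow:true overwrites old exports).
wevtutil epl System      "$outDir\System.evtx"      /ow:true
wevtutil epl Application "$outDir\Application.evtx" /ow:true
wevtutil epl 'Microsoft-Windows-FailoverClustering/Operational' "$outDir\FailoverClustering.evtx" /ow:true

# State snapshots as text.
Get-ClusterGroup    | Format-Table -AutoSize | Out-File "$outDir\ClusterGroups.txt"
Get-ClusterResource | Format-Table -AutoSize | Out-File "$outDir\ClusterResources.txt"
Get-StoragePool     | Format-List *          | Out-File "$outDir\StoragePools.txt"
Get-VirtualDisk     | Format-List *          | Out-File "$outDir\VirtualDisks.txt"
ipconfig /all                                | Out-File "$outDir\ipconfig.txt"
winmgmt /verifyrepository                    | Out-File "$outDir\WmiRepository.txt"
```

Review the collected files for sensitive data, such as host names and IP addresses, before you share them outside your organization.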
| 197 | + |
| 198 | +## References |
| 199 | + |
| 200 | +- [Windows Server Failover Cluster Health Service Documentation](/windows-server/failover-clustering/health-service-overview) |
| 201 | +- [Event ID and Diagnostic Reference](/windows/win32/eventlog/event-logging) |
| 202 | +- [Cluster Validation Wizard Guide](/troubleshoot/windows-server/high-availability/validate-hardware-failover-cluster?source=recommendations) |
| 203 | +- [Health Service reports](/windows-server/failover-clustering/health-service-reports) |