Commit dd5aa77

AB#7444: Create cluster-node-quarantine-troubleshooting.md (#10080)
* Create cluster-node-quarantine-troubleshooting.md * Update cluster-node-quarantine-troubleshooting.md * Update cluster-node-quarantine-troubleshooting.md
1 parent 5871c69 commit dd5aa77

1 file changed

+165
-0
lines changed

@@ -0,0 +1,165 @@
---
title: Cluster Node Quarantine Troubleshooting Guide
description: Helps resolve issues that cause cluster node quarantine, which temporarily isolates an unstable or problematic node.
ms.date: 10/06/2025
author: kaushika-msft
ms.author: kaushika
manager: dcscontentpm
audience: itpro
ms.topic: troubleshooting
ms.reviewer: kaushika
ms.custom:
- sap: virtualization and hyper-v\backup and restore of virtual machines
- pcy: Virtualization\backup and restore of virtual machines
appliesto:
- <a href=https://learn.microsoft.com/windows/release-health/windows-server-release-info target=_blank>Supported versions of Windows Server</a>
---

# Cluster node quarantine troubleshooting guide

## Summary

Cluster node quarantine is a protective feature in Windows Server Failover Clustering that temporarily isolates unstable or problematic nodes to safeguard cluster health and application workloads. A quarantined node can't host cluster roles, such as virtual machines (VMs), until it exits quarantine, either automatically or through administrator intervention. Several issues can trigger a quarantine, including repeated health check failures, persistent service failures, hardware issues, or storage or network communication problems. This article provides a thorough approach to investigating and resolving node quarantine incidents to help administrators ensure cluster stability and minimal downtime.

## Troubleshooting checklist

Use this checklist for systematic troubleshooting:

- Identify quarantined nodes
  - Use Failover Cluster Manager or the `Get-ClusterNode` PowerShell cmdlet.
- Review cluster events and error messages
  - Check Event Viewer (System and cluster logs) for quarantine reasons, for example, Event IDs 1641, 1647, and 1649.
  - Examine recent application, system, and failover clustering logs.
- Check node status and the Resource Hosting Subsystem (RHS)
  - Check whether RHS or core services are repeatedly crashing.
  - Review service health and recent restarts.
- Assess network connectivity
  - Verify that the node can communicate with other cluster members (ICMP/ping, port 3343, and others).
  - Verify that there's no packet loss, MTU mismatch, or firewall block.
- Verify cluster storage connectivity
  - Check for access to Cluster Shared Volumes (CSVs) and disk resources.
  - Look for Event IDs 5120 and 5142 (CSV paused or disconnected).
- Check system resource health
  - Investigate disk status, memory, CPU utilization, and hardware alerts.
  - Run diagnostics for hardware issues.
- Review the cluster configuration
  - Validate node certificates, static IP addresses, cluster network settings, and quorum configuration.
- Examine security software or antivirus
  - Determine whether a security agent is interfering with cluster traffic or services.
  - Temporarily disable the agent to test stability.
- Collect relevant data for analysis
  - Cluster logs (`Get-ClusterLog`), event logs, network traces, and crash dumps.
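
The first two checklist items can be sketched in PowerShell as follows. This is a minimal example, not a definitive procedure. It assumes the FailoverClusters module is installed, that the session runs elevated on a cluster node, and that `StatusInformation` reports the quarantine status as text such as `Quarantined`. The one-day lookback window is an arbitrary choice.

```powershell
# Requires the FailoverClusters module (Failover Clustering feature or RSAT).
Import-Module FailoverClusters

# List every node together with its membership state and status information.
Get-ClusterNode | Format-Table -Property Name, State, StatusInformation -AutoSize

# Keep only the nodes that the cluster currently reports as quarantined.
# Assumption: StatusInformation converts to a string that contains "Quarantine".
Get-ClusterNode | Where-Object { $_.StatusInformation -like '*Quarantine*' }

# Look for the quarantine-related event IDs named in the checklist (last 24 hours).
Get-WinEvent -FilterHashtable @{
    LogName   = 'System'
    Id        = 1641, 1647, 1649
    StartTime = (Get-Date).AddDays(-1)
} -ErrorAction SilentlyContinue |
    Select-Object -Property TimeCreated, Id, ProviderName, Message
```

`-ErrorAction SilentlyContinue` is used because `Get-WinEvent` raises an error when no matching events exist, which is the expected result on a healthy node.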

## Common issues and solutions

### 1. Quarantine because of repeated RHS/service failures

#### Symptoms

- The node appears in the "Quarantine" state.
- Event IDs: 1641 ("Node quarantine activated"), 1647, and repeated 7031 (Service Control Manager: unexpected service termination).
- Cluster roles continuously fail over from the node. VMs might repeatedly restart or move.

#### Resolution

- Review failure details in Cluster.log and in the System and Application event logs.
- If a specific process (for example, MsSense.exe or rhs.exe) is implicated:
  - Obtain memory dump files of the process, and analyze them for deadlocks or memory leaks.
  - Remove or disable the problematic VM or service.
- Apply the latest updates for Windows, Hyper-V, and third-party software.
- Add antivirus exclusions for .avhdx, .vhdx, and cluster-related files.
- Check for known bugs (for example, the WDATP/MsSense.exe memory leak), and apply a Microsoft hotfix if one is available.
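
To see which services are terminating unexpectedly, something like the following sketch might help. It assumes that the first property of each Event ID 7031 record holds the service name, which is how the Service Control Manager event is usually laid out but isn't confirmed by this article. The 24-hour window is arbitrary.

```powershell
# Count unexpected service terminations (Event ID 7031) per service.
$since = (Get-Date).AddHours(-24)

Get-WinEvent -FilterHashtable @{
    LogName      = 'System'
    ProviderName = 'Service Control Manager'
    Id           = 7031
    StartTime    = $since
} -ErrorAction SilentlyContinue |
    # Assumption: Properties[0] is the name of the service that terminated.
    Group-Object -Property { $_.Properties[0].Value } |
    Sort-Object -Property Count -Descending |
    Select-Object -Property Count, Name

# Confirm whether the Cluster service itself is running on this node.
Get-Service -Name ClusSvc | Select-Object -Property Name, Status, StartType

# List running RHS processes; a recent StartTime suggests a recent RHS restart.
Get-Process -Name rhs -ErrorAction SilentlyContinue |
    Select-Object -Property Id, StartTime, WorkingSet64
```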

### 2. Network communication issues

#### Symptoms

- The node is quarantined. Network events show dropped packets or heartbeat failures.
- Event IDs: 1135 (Node removed from cluster), 1177, and warnings about broken communication channels (status 10054).
- Other nodes show failed connection attempts to the quarantined node (port 3343).

#### Resolution

- Make sure that all required cluster ports (UDP 3343, SMB, and TCP 6600 for live migration) are open and aren't blocked by firewall or security software.
- Run `Test-NetConnection` for key ports.
- Check for MTU mismatches. Set consistent MTU values across all cluster adapter settings.
- Update network adapter drivers and firmware. Verify the RDMA configuration, if RDMA is used.
- If antivirus or firewall software is blocking traffic, add exceptions, or temporarily disable the software for testing.
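
The port and MTU checks above could look like the following sketch. `NODE2` is a hypothetical node name. Note that `Test-NetConnection -Port` tests TCP only, so it can't confirm that UDP 3343 traffic passes; treat a successful TCP result as a partial check. Port 445 is used here for SMB.

```powershell
# Hypothetical name of the quarantined node; replace with a real node name.
$node = 'NODE2'

# TCP reachability for the cluster port, SMB, and live migration.
# Test-NetConnection can't test UDP, so UDP 3343 needs a separate check.
foreach ($port in 3343, 445, 6600) {
    Test-NetConnection -ComputerName $node -Port $port |
        Select-Object -Property ComputerName, RemotePort, TcpTestSucceeded
}

# Compare the IPv4 MTU of each interface; run this on every node and
# look for values that differ between adapters on the same cluster network.
Get-NetIPInterface -AddressFamily IPv4 |
    Sort-Object -Property InterfaceAlias |
    Format-Table -Property InterfaceAlias, NlMtu, ConnectionState -AutoSize

# Show how the cluster itself sees its networks and per-node interfaces.
Get-ClusterNetwork | Format-Table -Property Name, State, Role, Address -AutoSize
Get-ClusterNetworkInterface | Format-Table -Property Node, Network, State -AutoSize
```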

### 3. Storage/CSV issues that cause quarantine

#### Symptoms

- Event IDs 5120 and 5142 (CSV paused or disconnected) are logged, and the node enters quarantine after the failure.
- VM disks or CSVs are inaccessible from the quarantined node.
- Event logs show storage path failures, degraded MPIO, or controller resets.

#### Resolution

- Verify the health of the storage subsystem (use vendor tools and logs).
- Reconnect or restart storage paths, and check for persistent disks.
- Exclude cluster storage directories from antivirus scanning.
- Run `Get-ClusterSharedVolumeState` on affected nodes.
- Verify MPIO, check failover and failback logs, and apply the latest firmware or storage drivers.
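
As a starting point for the CSV checks, the following sketch lists the state of each CSV per node and pulls recent CSV events. It assumes that Event IDs 5120 and 5142 are written to the System log and uses an arbitrary one-day window.

```powershell
Import-Module FailoverClusters

# Overall state and current owner of each Cluster Shared Volume.
Get-ClusterSharedVolume |
    Format-Table -Property Name, State, OwnerNode -AutoSize

# Per-node view: shows whether a node uses direct or redirected I/O, and why.
Get-ClusterSharedVolumeState |
    Format-Table -Property Name, Node, StateInfo,
        FileSystemRedirectedIOReason, BlockRedirectedIOReason -AutoSize

# Recent CSV paused/disconnected events.
# Assumption: these events are logged in the System log.
Get-WinEvent -FilterHashtable @{
    LogName   = 'System'
    Id        = 5120, 5142
    StartTime = (Get-Date).AddDays(-1)
} -ErrorAction SilentlyContinue |
    Select-Object -Property TimeCreated, Id, Message
```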

### 4. Security platform/antivirus interference

#### Symptoms

- Cluster communication is dropped by third-party security software.
- Cluster event logs or TCP/IP diagnostic logs indicate silent packet drops.
- Process Monitor (Procmon) shows that cluster service execution is blocked.

#### Resolution

- Disable or remove the offending security software, and then retest cluster stability.
- Reinstall the software after you verify that cluster communication works.
- Add the cluster service, processes, and directories to the software's exclusion list.

### 5. Configuration and infrastructure gaps

#### Symptoms

- Quarantine is initiated after a node restart or update.
- IP misconfigurations occur (DHCP is used instead of a static address), or the cluster certificate is missing.
- The quorum witness resource repeatedly fails.

#### Resolution

- Set the cluster IP address resource to static instead of DHCP (`Set-ClusterParameter`).
- Verify cluster node certificates. If a certificate is missing, export it from a healthy node and import it.
- Validate and repair the quorum witness configuration (disk or file share witness settings).
- Use `Update-ClusterNetworkNameResource` to refresh network name resources.
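
The configuration checks above might be sketched as follows. The resource names (`Cluster IP Address`, `Cluster Name`) are the usual defaults and are assumptions here, as are the example address and subnet mask. The commands that change configuration are left commented out so that the block only reads state when run as written.

```powershell
Import-Module FailoverClusters

# Show the address settings of every IP Address resource.
# EnableDhcp = 1 means that the resource obtains its address from DHCP.
Get-ClusterResource |
    Where-Object { $_.ResourceType -eq 'IP Address' } |
    Get-ClusterParameter |
    Where-Object { $_.Name -in 'Address', 'SubnetMask', 'EnableDhcp' } |
    Format-Table -Property ClusterObject, Name, Value -AutoSize

# Show the current quorum configuration, including any witness resource.
Get-ClusterQuorum | Format-List -Property *

# Switch an IP Address resource to a static address (hypothetical values).
# Get-ClusterResource -Name 'Cluster IP Address' |
#     Set-ClusterParameter -Multiple @{
#         Address    = '10.0.0.50'
#         SubnetMask = '255.255.255.0'
#         EnableDhcp = 0
#     }

# Refresh a network name resource (the resource name is an assumption).
# Update-ClusterNetworkNameResource -Name 'Cluster Name'
```

A changed IP address parameter typically takes effect only after the resource is taken offline and brought back online, so plan the change for a maintenance window.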

### 6. Recovering from quarantine state

General recovery steps:

- Use PowerShell to clear the node quarantine: `Start-ClusterNode -Name <NodeName> -ClearQuarantine`.
- If necessary, evict and re-add the node, and restart the cluster service.
- Restart the node after you clear the underlying issues (hardware, network, or service-related).
- Monitor the logs to confirm that the node rejoins the cluster normally.
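
A minimal sketch of the recovery sequence follows. `NODE2` is a hypothetical node name. Clear the quarantine only after the underlying cause is fixed. Otherwise, the node is likely to be quarantined again.

```powershell
Import-Module FailoverClusters

# Hypothetical name of the quarantined node; replace with a real node name.
$node = 'NODE2'

# Review the cluster-wide quarantine settings before you intervene.
Get-Cluster | Format-List -Property Name, QuarantineThreshold, QuarantineDuration

# Clear the quarantine state and start the Cluster service on the node.
Start-ClusterNode -Name $node -ClearQuarantine

# Confirm that the node rejoined the cluster and is no longer quarantined.
Get-ClusterNode -Name $node |
    Format-Table -Property Name, State, StatusInformation -AutoSize
```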

## Data collection

Before you contact Microsoft Support, you can gather the following information about your issue.

- Cluster logs: `Get-ClusterLog -Destination <FolderPath> -UseLocalTime -TimeSpan <Minutes>`
- Event logs: Export the System, Application, and FailoverClustering logs.
- Network trace: `netsh trace start scenario=GENERAL capture=yes tracefile=<path>`
- Process dump files: As needed, by using Sysinternals or built-in Windows tools.
- Storage and hardware diagnostics: Collect through vendor tool output.
- Security software logs: Export relevant security agent filtering and action logs.
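
The collection steps above can be combined into one elevated PowerShell session, as in the following sketch. The output folder, the 120-minute time span, and the `Microsoft-Windows-FailoverClustering/Operational` channel name are assumptions. The network trace is started without a `scenario` argument to keep the example minimal.

```powershell
# Hypothetical output folder; choose a path that has enough free space.
$out = 'C:\Temp\ClusterDiag'
New-Item -ItemType Directory -Path $out -Force | Out-Null

# Generate cluster logs for the last 120 minutes, using local time stamps.
Get-ClusterLog -Destination $out -UseLocalTime -TimeSpan 120

# Export the event logs (/ow:true overwrites existing export files).
wevtutil epl System "$out\System.evtx" /ow:true
wevtutil epl Application "$out\Application.evtx" /ow:true
wevtutil epl Microsoft-Windows-FailoverClustering/Operational "$out\FailoverClustering.evtx" /ow:true

# Start a network capture, reproduce the issue, and then stop the capture.
netsh trace start capture=yes "tracefile=$out\cluster-net.etl"
# ... reproduce the issue here ...
netsh trace stop
```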

## Common issues quick reference table

| Issue | Key symptoms | Resolution | Reference event IDs |
| --- | --- | --- | --- |
| RHS/service failure | Quarantine state, VM failovers, crash logs | Analyze dumps, apply updates, add AV exclusions, remove problematic service/VM | 1641, 7031 |
| Network communication error | Node removal, dropped packets, port failures | Open cluster ports, fix MTU, update NIC drivers, add AV/firewall exclusions | 1135, 1177 (status 10054) |
| Storage/CSV access failure | CSV paused/disconnected, VM disk I/O errors | Storage vendor analysis, update firmware/drivers, fix MPIO, add AV exclusions | 5120, 5142 |
| Antivirus/security platform issues | Packets dropped by filter, service blocked | Disable/remove AV/security software, set exclusions, reinstall as necessary | Varies |
| Configuration/quorum issues | Quorum witness failure, IP/certificate errors | Set static IPs, fix certificates, rebuild quorum witness, update resources | 1069, 1558, 121 |
| Recovery steps (general) | Can't host roles, node stays quarantined | `Start-ClusterNode -ClearQuarantine`, evict/re-add, restart, monitor logs | 1641 |

#### References

- [AV Exclusions for Server and Hyper-V](/troubleshoot/windows-server/virtualization/antivirus-exclusions-for-hyper-v-hosts)
