
Commit 15c80a8

Merge pull request #352 from thomas-dkmt/doc-high-availability
Complete documentation of High availability + add HA troubleshooting section
2 parents 73c22ca + 58c137e commit 15c80a8

File tree

4 files changed: +202 -32 lines changed

docs/assets/img/xo-ha-selector.png

docs/management/ha.md

Lines changed: 142 additions & 32 deletions
@@ -1,106 +1,216 @@
# High availability

In XCP-ng, high availability (or HA) is the ability to detect a failed host and automatically boot all the VMs that were running on it on the remaining live hosts.

## 📋 Introduction

Implementing VM High availability (HA) is a real challenge.

First, because you need to reliably detect when a server has really failed, to avoid unpredictable behavior.

But that's not the only challenge. If you lose the network link but not the shared storage, how do you ensure you will not write simultaneously on the storage and corrupt all your data as a result?

We'll see how to protect your precious VMs in multiple cases, and we'll illustrate that with real examples.

:::info
You can have high availability with as few as 2 hosts, but we strongly recommend at least 3, to avoid the obvious split-brain issues you might otherwise encounter.
:::

:::warning
High availability requires **far more maintenance** and will create some traps if you are not aware of them. In short, it comes at a cost.

Before using it, **please think about it carefully**: do you **REALLY** need it? We've seen people get less uptime with HA than without it, because you **must understand** what you are doing every time you reboot or update a host.
:::

## 🎓 Concepts

The pool concept allows hosts to exchange their data and status:

* If you lose a host, that will be detected by the pool master.
* If you lose the master, another host will take over the master role automatically.

To be sure a host is really unreachable, HA in XCP-ng uses multiple heartbeat mechanisms. As you saw in the introduction, it's not enough to check the network: what about storage? That's why there is also a specific heartbeat for shared storage between hosts in a pool. In practice, each host regularly writes some blocks to a dedicated VDI. That's the principle of the [Dead man's switch](http://en.wikipedia.org/wiki/Dead_man%27s_switch).

This concept is important, and it explains why you need to **configure high availability with a shared storage** (iSCSI, Fibre Channel or NFS) to avoid simultaneous writes to a VM disk.

Here are the possible cases and how they are dealt with:

* **Lost both network and storage heartbeat**: the host is considered unreachable and the HA plan is started.
* **Lost storage but not network**: if the host can contact a majority of pool members, it can stay alive. Indeed, in this scenario, there is no harm to the data (it can't write to the VM disks). If the host is alone (i.e. it can't contact any other host, or only a minority of them), it goes for a reboot procedure.
* **Lost network but not storage (worst case!)**: the host considers itself problematic and starts a reboot procedure (hard power off and restart). This fencing procedure guarantees the sanity of your data.

## Requirements

Enabling HA in XCP-ng requires thorough planning and validation of several prerequisites:

- **Pool-level HA only**: HA can only be configured at the pool level, not across different pools.
- **Minimum of 3 hosts recommended**: While HA can function with just 2 XCP-ng servers in a pool, we recommend using **at least 3** to prevent issues such as a split-brain scenario. With only 2 hosts, they risk getting fenced if the connection between them is lost.
- **Shared storage requirements**: You must have shared storage available, including at least one iSCSI, NFS, XOSTOR or Fibre Channel LUN with a minimum size of **356 MB for the heartbeat Storage Repository (SR)**. The HA mechanism creates two volumes on this SR:
  - A **4 MB heartbeat volume** for monitoring host status.
  - A **256 MB metadata volume** to store pool master information for master failover situations.
- **Dedicated heartbeat SR optional**: It's not necessary to dedicate a separate SR to the heartbeat, but you can choose to do so. Alternatively, you can use the same SR that hosts your VMs.
- **Unsupported storage for heartbeat**: Storage using SMB, or iSCSI authenticated via CHAP, **cannot be used as the heartbeat SR**.
- **Static IP addresses**: Make sure that all hosts have static IP addresses to avoid disruptions from DHCP servers potentially reassigning IPs.
- **Dedicated bonded interface recommended**: For optimal reliability, we recommend using a dedicated bonded interface for the HA management network.
- **VM agility for HA protection**: For a VM to be protected by HA, it must meet certain agility requirements:
  - The VM's virtual disks must **reside on shared storage**, such as an iSCSI, NFS or Fibre Channel LUN, which is also necessary for the storage heartbeat.
  - The VM must **support live migration**.
  - The VM should **not have a local DVD drive connection configured**.
  - The VM's network interfaces should be on **pool-wide networks**.

:::tip
To enable HA, we **strongly recommend** using a bonded management interface on the servers in the pool, and configuring multipathed storage for the heartbeat SR.
:::

If you create VLANs and bonded interfaces via the CLI, they might not be active or properly connected, causing a VM to appear non-agile and, therefore, unprotected by HA.

Use the `pif-plug` command in the CLI to activate VLAN and bond PIFs, ensuring the VM becomes agile.

Additionally, the `xe diagnostic-vm-status` CLI command can help identify why a VM isn't agile, allowing you to take corrective action as needed.
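
For example, here is a minimal sketch of that workflow from dom0, assuming placeholder UUIDs that you would look up yourself:

```
# Find the VLAN or bond PIF that isn't currently attached
xe pif-list params=uuid,device,VLAN,currently-attached

# Plug the PIF so the network becomes active on the host
xe pif-plug uuid=<PIF_UUID>

# Ask XAPI why a given VM is (or isn't) agile
xe diagnostic-vm-status uuid=<VM_UUID>
```
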
## ⚙️ Configuration

### Prepare the pool

You can check whether your pool has HA enabled or not:

* In Xen Orchestra, for each pool where HA has been enabled, go to the **Home → Pool** view and you'll see a small "cloud" icon with a green check.
* In the **Pool → Advanced** tab, you'll see a **High Availability** switch that shows whether HA is enabled or not:

![](../assets/img/xo-ha-enabled-disabled.png)

To enable HA, just toggle the switch on, which displays an SR selector to choose the heartbeat SR.

You can also enable it with this xe CLI command:

```
xe pool-ha-enable heartbeat-sr-uuids=<SR_UUID>
```

:::tip
Remember that you need to use a shared storage repository to enable high availability.
:::

Once enabled, HA status will be displayed with the green toggle.
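
You can also check the pool-level HA state from the CLI. A small sketch, assuming the pool UUID as a placeholder (`ha-statefiles` lists the heartbeat VDIs):

```
xe pool-param-get uuid=<Pool_UUID> param-name=ha-enabled
xe pool-param-get uuid=<Pool_UUID> param-name=ha-statefiles
```
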
### Maximum host failure number

How many host failures can you tolerate before running out of options? For 2 hosts in a pool, the answer is pretty simple: **1** is the maximum, because after losing one host, it will be impossible to ensure an HA policy if the last one also fails.

XCP-ng can calculate this value for you. In our sample case, it looks like this:

```
xe pool-ha-compute-max-host-failures-to-tolerate
1
```

But it could also be **0**: even if you lose only 1 host, will there be enough RAM to boot the HA-protected VMs on the remaining one? If not, you can't ensure their survival.

If you want to set the number yourself, you can do it with this command:

```
xe pool-param-set ha-host-failures-to-tolerate=1 uuid=<Pool_UUID>
```

If more hosts fail than this number, the system will raise an **over-commitment** alert.
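
To double-check the value currently configured, you can simply read the parameter back (pool UUID is a placeholder):

```
xe pool-param-get uuid=<Pool_UUID> param-name=ha-host-failures-to-tolerate
```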

### Configure a VM for HA

#### VM High availability modes

In XCP-ng, you can choose between 3 high availability modes: restart, best-effort, and disabled:

- **Restart**: if a protected VM cannot be immediately restarted after a server failure, HA will attempt to restart the VM when additional capacity becomes available in the pool. However, there is no guarantee that this attempt will be successful.
- **Best-Effort**: for VMs configured with best-effort, HA will try to restart them on another host if their original host goes offline.\
  This attempt will only occur after all VMs set to the "restart" mode have been successfully restarted. HA will make only one attempt to restart a best-effort VM. If it fails, no further attempts will be made.
- **Disabled**: if an unprotected VM or its host is stopped, HA will not attempt to restart the VM.

#### Choosing a high availability mode

This is pretty straightforward with Xen Orchestra. Go to the **Advanced** panel of your VM page and use the **HA** dropdown menu:

![](../assets/img/xo-ha-selector.png)

You can also do that configuration with *xe CLI*:

```
xe vm-param-set uuid=<VM_UUID> ha-restart-priority=restart
```
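
The same parameter also accepts the other modes. A minimal sketch, assuming `best-effort` and an empty string (disabled) as the accepted values:

```
# Best-effort: restarted only after all "restart" VMs, with a single attempt
xe vm-param-set uuid=<VM_UUID> ha-restart-priority=best-effort

# Disabled: HA will not attempt to restart this VM
xe vm-param-set uuid=<VM_UUID> ha-restart-priority=""
```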

#### Start order

##### What's the start order?

The start order defines the sequence in which XCP-ng HA attempts to restart protected VMs following a failure. The order property of each protected VM determines this sequence.

##### How and when does it apply?

While the order property can be set for any VM, HA only uses it for VMs marked as **protected**.

The order value is an integer, with the default set to **0**, indicating the **highest priority**. VMs with an order value of 0 are restarted first, and those with higher values are restarted later in the sequence.

##### How do I set the start order?

You can set the order property value of a VM via the command-line interface:

```
xe vm-param-set uuid=<VM UUID> order=<number>
```
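
For example, a sketch of a typical ordering, assuming a hypothetical database VM that must come back before its application server (UUIDs are placeholders):

```
# Restarted first (default, highest priority)
xe vm-param-set uuid=<DB_VM_UUID> order=0

# Restarted later in the sequence
xe vm-param-set uuid=<APP_VM_UUID> order=1
```
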
#### Configure HA timeout
163+
164+
##### What's the HA timeout?
165+
166+
The HA timeout represents the duration during which networking or storage might be inaccessible to the hosts in your pool.
167+
168+
If any XCP-ng server cannot regain access to networking or storage within the specified timeout period, it may self-fence and restart.
169+
170+
##### How do I configure it?
171+
172+
The **default timeout is 60 seconds**, but you can adjust this value using the following command to suit your needs:
173+
174+
```
175+
xe pool-param-set uuid=<pool UUID> other-config:default_ha_timeout=<timeout in seconds>
176+
```
177+
79178
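
For instance, a small sketch that raises the timeout to 120 seconds and reads the value back (pool UUID is a placeholder):

```
xe pool-param-set uuid=<pool UUID> other-config:default_ha_timeout=120
xe pool-param-get uuid=<pool UUID> param-name=other-config param-key=default_ha_timeout
```
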
## 🔧 Updates/maintenance

Before any update or host maintenance, planned reboot and so on, **ALWAYS** put your host in maintenance mode first. If you don't, XAPI will treat it as an unplanned failure and act accordingly.

If you have enough memory to put one host in maintenance mode (migrating all its VMs to other members of the pool), that will be alright. If you don't, you'll need to shut VMs down manually **from a XAPI client** (Xen Orchestra or `xe`), and **NOT from inside the operating system**.
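
From the CLI, entering maintenance mode is roughly equivalent to disabling the host and evacuating its VMs. A minimal sketch using standard xe commands (host UUID is a placeholder):

```
# Prevent new VMs from starting on this host
xe host-disable uuid=<Host_UUID>

# Live-migrate all resident VMs to other pool members
xe host-evacuate uuid=<Host_UUID>
```

For heavier maintenance, you can also temporarily disable HA for the whole pool with `xe pool-ha-disable` and re-enable it afterwards.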

:::warning
**Be very careful before doing ANY maintenance task**, otherwise HA will kick in and provide unpleasant surprises. You have been warned.
:::

## ↔️ Behavior

### Halting the VM

If you shut the VM down with `Xen Orchestra` or `xe`, the VM will be stopped normally, because XCP-ng knows that's what you want.

However, if you halt the VM directly in the guest OS (via the console or over SSH), XCP-ng is NOT aware of what's going on. The system will think the VM is down and consider that an anomaly. As a result, the VM will be **started automatically!** This behavior prevents an operator from shutting down the system and leaving the VM unavailable for a long time.

:::tip

Starting with XAPI 25.16.0, this restart behavior can be changed. To do so, run this command:

```
xe pool-param-set uuid=... ha-reboot-vm-on-internal-shutdown=false
```

With this setting, a VM shut down from inside the guest is treated the same way as one shut down through the API: it stays halted instead of being restarted automatically.

:::

### Host failure

We'll look at 3 different failure scenarios, using an example with 2 hosts, **lab1** and **lab2**:

* Physically power off the server.
* Physically remove the **storage** connection.
* Physically remove the **network** connection.

**lab1** is not the *Pool Master*, but the results would be the same (just longer to test, because of the time needed for the other host to become the master).

@@ -110,13 +220,13 @@ After each test, **Minion 1** go back to **lab1** to start in the exact same con

#### Pull the power plug

Now, we will pull the plug on the host **lab1**: this is exactly where my VM currently runs. After some time (when XAPI detects and reports that the host is lost, which usually takes 2 minutes), we can see that **lab1** is reported as **Halted**. At the same time, the VM **Minion 1** is booted on the other running host, **lab2**:

If you decide to re-plug the **lab1** host, it will be back online, without any VM on it, which is normal.

#### Pull the storage cable

Another scenario: this time, we will unplug the iSCSI/NFS link on **lab1**, even though **Minion 1** is running on it.

So? **Minion 1** lost access to its disks and, after some time, **lab1** saw that it couldn't access the heartbeat disk. Fencing protection is activated! The machine is rebooted, and after that, any `xe` CLI command on this host will give you that message:

@@ -132,4 +242,4 @@ Immediatly after fencing, **Minion 1** will be booted on the other host.

#### Pull the network cable

Finally, the worst case: keep the storage operational, but "cut" the (management) network interface. Same procedure: unplug the cable physically and wait... Because **lab1** cannot contact any other host in the pool (in this case, **lab2**), it starts the fencing procedure. The result is exactly the same as in the previous test. As far as the pool master is concerned, the host is gone, and it's displayed as **Halted** until we re-plug the cable.
