Description
Hi,
Problem
On boot, after doing a nodedeploy, the node(s) being deployed successfully load boot.img (HTTP boot, but we have the same problem with PXE and boot.ipxe - deployment.useinsecureprotocols is set to firmware) but then hang at this stage:
Eventually it times out, proceeds to the next step where systemd's timeout also kicks in (after a while) and we get dropped to an emergency shell:
In which we can confirm it has no IP address (this is from a test VM to demonstrate the problem):
If we restart confluent (systemctl restart confluent) while the node, or nodes (it doesn't matter if it's one node or all of them), are stuck at the initial hang (just after modprobe: ERROR: could not insert 'edd': No is displayed, which I know is irrelevant to the problem), the nodes spring back to life and the install continues successfully.
genesis image also fails to get a network address:
With tcpdump I have observed:
- Confluent responds to the initial DHCP request with a DHCP reply containing the node's correct IP and the correct next-server value (HTTP or PXE, as appropriate) to get the boot.img/boot.ipxe files.
- When the installer appears hung, tcpdump shows repeated DHCP requests but no DHCP replies are sent (from anywhere).
- After restarting confluent, while the installer is first stalled, confluent does send a DHCP reply to the host (and then the install appears to continue normally). Edit: I was sure I saw this, however I did not save the capture and, based on Jarrod's reply, I think I may have been mistaken about seeing a DHCP reply.
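Since the capture above was not saved, a repeat run could write it to a file (for example with tcpdump's -w option and a filter of port 67 or port 68) and then summarise it. The following is only a minimal sketch in Python: it assumes the non-verbose text output of tcpdump -n -r on that file, where requests look like "BOOTP/DHCP, Request from <MAC>" and replies like "BOOTP/DHCP, Reply". The function name summarize_dhcp and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumed line shapes from non-verbose `tcpdump -n` text output:
#   ... BOOTP/DHCP, Request from aa:bb:cc:dd:ee:ff, length 300
#   ... BOOTP/DHCP, Reply, length 300
REQUEST_RE = re.compile(r"BOOTP/DHCP, Request from ([0-9A-Fa-f:]{17})")
REPLY_RE = re.compile(r"BOOTP/DHCP, Reply")


def summarize_dhcp(lines):
    """Return (Counter of requests per client MAC, total number of replies)."""
    requests = Counter()
    replies = 0
    for line in lines:
        match = REQUEST_RE.search(line)
        if match:
            requests[match.group(1).lower()] += 1
        elif REPLY_RE.search(line):
            replies += 1
    return requests, replies


# Illustrative sample lines only; not taken from a real capture.
sample = [
    "00:38:36.1 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from AA:BB:CC:DD:EE:FF, length 300",
    "00:38:36.2 IP 10.0.0.1.67 > 10.0.0.2.68: BOOTP/DHCP, Reply, length 300",
    "00:38:50.0 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from AA:BB:CC:DD:EE:FF, length 300",
]
requests, replies = summarize_dhcp(sample)
for mac, count in requests.items():
    print(f"{mac}: {count} request(s)")
print(f"replies seen: {replies}")
```

A MAC with many requests and a reply count that stops growing during the hang would match the behaviour described above; a reply appearing right after the restart would settle the open question in the last bullet.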
It looks to me like something is stopping confluent responding to the anaconda installer image's DHCP request, even though it responds to the UEFI's initial request to get to that point and so the nodes are able to boot the installer boot.img successfully, until confluent is restarted.
deployment.apiarmed remains set to once throughout, unless we restart confluent (when everything seems to start working normally until the next deployment).
The nodes appear in nodediscover list while the installer is hung. When we restart Confluent, the nodes disappear from nodediscover list and they then continue to install.
Diagnostics we've done
We have tried:
- Switching from HTTP to PXE boot (no difference)
- Turning off all firewalls and selinux (no difference)
- Complete reinstall of the confluent server and confluent (no difference)
- Confluent 3.14.1 instead of 3.15.0 (no difference)
- Seeing if redeploying a successfully deployed node behaves differently (no difference)
- Deploying by specifying individual nodes and groups (no difference)
- Setting up /etc/hosts with and without FQDN first - i.e. with and without -f to confluent2hosts (no difference)
- Using an external DHCP server (dnsmasq) and setting the ipv4_method to firmwaredhcp; in this case the host does get an IP address (which we can see when the emergency shell comes up) but still hangs and fails to install unless confluent is restarted.
- Rebooting when hung to see if it continues
- Importing Rocky 9.6 and deploying that to confirm not Red Hat specific issue (no difference)
- confluent_selfcheck -n on all nodes reports zero issues (except TFTP when we're configured just for HTTP boot - same issue occurs if we configured for PXE, in which case TFTP Status also passes).
- Test with system with only one NIC, in case related to multiple NICs (no difference)
- Manually setting a UUID (these systems have all come with 00000000-0000-0000-0000-000000000000) in the UEFI (no difference)
- Manually discovering the node (by MAC or UUID, with a valid UUID configured and displayed) while it is stuck and in the discover list (no difference)
Based on this commit (2cb641e):
Fix PXE based on mac
We normally use UUID, on a broken platform with bad UUID,
user may need to use hwaddr. This was supposed to work, but
didn't. Fix it to work correctly.
I noticed ALL of my machines' UUIDs are coming up as zero. I had assumed this was due to not being Lenovo hardware. Manually configuring a UUID into the UEFI on one (and clearing the invalid 00000000-0000-0000-0000-000000000000 with nodeattrib) made no difference, but the new valid UUID was displayed in nodediscover list. Correction: after redeployment (with a restart of confluent to make it work), the UUID did not get re-added to id.uuid, but it was displayed correctly in the discover list. noderemove and re-doing nodedefine (giving it a new id.index) did result in the id.uuid being correctly set to the value configured in UEFI but did not change this hanging behaviour.
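For checking many machines at once, the all-zero case can be detected programmatically. This is a small sketch using Python's standard uuid module; the helper name is_nil_uuid is hypothetical, and where the string comes from (on Linux the firmware value is normally exposed in /sys/class/dmi/id/product_uuid, readable by root) is left to the caller.

```python
import uuid

# The all-zero ("nil") UUID that these systems report.
NIL_UUID = uuid.UUID(int=0)


def is_nil_uuid(value):
    """Return True only if value parses as a UUID and is all zeros."""
    try:
        return uuid.UUID(value.strip()) == NIL_UUID
    except ValueError:
        # Not a well-formed UUID string at all.
        return False


print(is_nil_uuid("00000000-0000-0000-0000-000000000000"))
```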
Confluent's events log shows the HTTP boot offer and nothing else:
Feb 21 00:38:36 {"info": "Offering HTTP boot with static address REDACTED_IP_1 to REDACTED_NODE_1"}
Feb 21 00:38:42 {"info": "Offering HTTP boot with static address REDACTED_IP_2 to REDACTED_NODE_2"}
Feb 21 00:39:35 {"info": "Offering HTTP boot with static address REDACTED_IP_3 to REDACTED_NODE_3"}
If we restart confluent when the installer appears hung, at which point the install continues successfully, we see the start-up message added to the log:
Feb 21 00:40:29 {"info": "Confluent management service starting"}
After the install completed (after restarting confluent), we see the failed HTTP boot (the nodes are configured to network boot first) and the nodes fall-through to their newly installed OS (as expected):
Feb 21 00:43:57 {"info": "Ignoring boot attempt by REDACTED_NODE_1 no deployment profile specified (uuid 00000000-0000-0000-0000-000000000000, hwaddr REDACTED_MAC_1)"}
Feb 21 00:44:57 {"info": "Ignoring boot attempt by REDACTED_NODE_2 no deployment profile specified (uuid 00000000-0000-0000-0000-000000000000, hwaddr REDACTED_MAC_2)"}
Feb 21 00:44:57 {"info": "Ignoring boot attempt by REDACTED_NODE_2 no deployment profile specified (uuid 00000000-0000-0000-0000-000000000000, hwaddr REDACTED_MAC_3)"}
Nodes are all listed in nodediscover list while in the hung state.
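To compare deployments, the event log lines quoted above can be reduced to a timeline per node. The sketch below is an assumption-laden illustration, not part of confluent: it only recognises the three message shapes shown in this report (the HTTP boot offer, the service start message, and the ignored boot attempt), and the function name classify_events is invented here.

```python
import json
import re

# Patterns cover only the message wordings quoted in this report.
OFFER_RE = re.compile(r"^Offering HTTP boot with static address (\S+) to (\S+)$")
IGNORED_RE = re.compile(
    r"^Ignoring boot attempt by (\S+) no deployment profile specified "
    r"\(uuid ([^,\s]+), hwaddr ([^)]+)\)$"
)
START_MSG = "Confluent management service starting"


def classify_events(lines):
    """Return a list of (timestamp, kind, node) tuples; node is None for 'start'."""
    events = []
    for line in lines:
        brace = line.find("{")
        if brace < 0:
            continue  # no JSON payload on this line
        stamp = line[:brace].strip()
        try:
            payload = json.loads(line[brace:])
        except json.JSONDecodeError:
            continue
        info = payload.get("info", "") if isinstance(payload, dict) else ""
        offer = OFFER_RE.match(info)
        ignored = IGNORED_RE.match(info)
        if offer:
            events.append((stamp, "offer", offer.group(2)))
        elif ignored:
            events.append((stamp, "ignored", ignored.group(1)))
        elif info == START_MSG:
            events.append((stamp, "start", None))
    return events


sample = [
    'Feb 21 00:38:36 {"info": "Offering HTTP boot with static address REDACTED_IP_1 to REDACTED_NODE_1"}',
    'Feb 21 00:40:29 {"info": "Confluent management service starting"}',
]
for event in classify_events(sample):
    print(event)
```

An "offer" with nothing after it until a "start" is the pattern seen in the hung deployments.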
Environment
HTTP booting nodes (same behaviour was observed when we tried PXE).
Mix of different (all non-Lenovo) server hardware.
Some systems have multiple NICs, some have only one.
Confluent 3.15.0 installed on Red Hat 9.6:
$ rpm -q lenovo-confluent
lenovo-confluent-3.15.0-1.noarch
$ cat /etc/redhat-release
Red Hat Enterprise Linux release 9.6 (Plow)
Clean install and setup as per the confluent instructions, dnsmasq also installed for DNS and no external DHCP service.
Global node attributes ($DOMAIN is cluster.local, the confluent host and gateway are just two plain RFC1918 IPv4 addresses in the same /16 subnet):
nodegroupattrib everything \
deployment.useinsecureprotocols=firmware \
dns.domain=$DOMAIN \
dns.servers=$CONFLUENT_IP \
net.ipv4_gateway=$GATEWAY \
net.bootable=true \
console.method=ipmi
Test nodes defined using this command ($NODEx_ is a placeholder for redacted information):
sudo -i nodedefine $NODE1_NAME net.cluster.ipv4_address=$NODE1_IP/16 net.cluster.hwaddr=$NODE1_MAC
sudo -i nodedefine $NODE2_NAME net.cluster.ipv4_address=$NODE2_IP/16 net.cluster.hwaddr=$NODE2_MAC
sudo -i nodedefine $NODE3_NAME net.cluster.ipv4_address=$NODE3_IP/16 net.cluster.hwaddr=$NODE3_MAC
/etc/hosts populated with confluent2hosts -a -f everything.
OS deployment initialised with:
osdeploy initialize -a -g -u -s -k -t -l
Red Hat 9.6 ISO imported with:
osdeploy import /tmp/rhel-9.6-x86_64-dvd.iso
Nodes set up to deploy with:
nodedeploy everything -n -p rhel-9.6-x86_64-default
Node is then booted.