nodedeploy fails if MLD snooping is enabled on switches, unless confluent service is restarted at a specific point #205

@LHurst-xCATHPC

Description

Hi,

Problem

On boot, after doing a nodedeploy, the node(s) being deployed successfully load boot.img (HTTP boot, but we have the same problem with PXE and boot.ipxe; deployment.useinsecureprotocols is set to firmware) but then hang at this stage:

Image

Eventually it times out, proceeds to the next step where systemd's timeout also kicks in (after a while) and we get dropped to an emergency shell:

Image

In which we can confirm it has no IP address (this is from a test VM to demonstrate the problem):

Image

If we restart confluent (systemctl restart confluent) while the node or nodes (it doesn't matter if it's one node or all of them) are stuck at the initial hang (just after modprobe: ERROR: could not insert 'edd': No is displayed, which I know is irrelevant to the problem), the nodes spring back to life and the install continues successfully.

The genesis image also fails to get a network address:

Image

With tcpdump I have observed:

  • Confluent responds to the initial DHCP request with a DHCP reply containing the node's correct IP and the correct next-server value (HTTP or PXE, as appropriate) to get the boot.img/boot.ipxe files.
  • When the installer appears hung, tcpdump shows repeated DHCP requests but no DHCP replies are sent (from anywhere).
  • After restarting confluent while the installer is first stalled, I was sure I saw confluent send a DHCP reply to the host (and the install then appears to continue normally). However, I did not save the capture and, based on Jarrod's reply, I think I may have been mistaken about seeing a DHCP reply.
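Since the capture mentioned above was not saved, recording the DHCP exchange to a file may make it easier to compare behaviour before and after the restart. This is a minimal sketch, assuming only IPv4 DHCP (UDP ports 67 and 68) is of interest and that the interface facing the deployment network is passed in as an argument. The function names are illustrative and are not part of confluent.

```shell
# Build the capture filter for IPv4 DHCP/BOOTP traffic
# (server port 67, client port 68).
dhcp_filter() {
    printf '%s\n' 'udp and (port 67 or port 68)'
}

# Record DHCP traffic on an interface to a pcap file for later review.
# $1: interface name (assumption: the NIC facing the deployment network)
# $2: output pcap file
capture_dhcp() {
    tcpdump -i "$1" -n -w "$2" "$(dhcp_filter)"
}

# Print a saved capture, including link-level (MAC) headers.
# $1: pcap file written by capture_dhcp
show_capture() {
    tcpdump -n -e -r "$1"
}
```

For example, running capture_dhcp eth0 /tmp/deploy-dhcp.pcap as root on the confluent server before booting the node (eth0 and the file name are placeholders), stopping it with Ctrl-C once the install has continued, and then running show_capture /tmp/deploy-dhcp.pcap would show whether a reply was ever sent.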

It looks to me like something is stopping confluent from responding to the anaconda installer image's DHCP request until confluent is restarted, even though it responds to the UEFI's initial request, which is what allows the nodes to boot the installer boot.img successfully.

deployment.apiarmed remains set to once throughout, unless we restart confluent (when everything seems to start working normally until the next deployment).

The nodes appear in nodediscover list while the installer is hung. When we restart Confluent, the nodes disappear from nodediscover list and they then continue to install.

Diagnostics we've done

We have tried:

  • Switching from HTTP to PXE boot (no difference)
  • Turning off all firewalls and selinux (no difference)
  • Complete reinstall of the confluent server and confluent (no difference)
  • Confluent 3.14.1 instead of 3.15.0 (no difference)
  • Seeing if redeploying a successfully deployed node behaves differently (no difference)
  • Deploying by specifying individual nodes and groups (no difference)
  • Setting up /etc/hosts with and without FQDN first - i.e. with and without -f to confluent2hosts (no difference)
  • Using an external DHCP server (dnsmasq) and setting the ipv4_method to firmwaredhcp. In this case the host does get an IP address (which we can see when the emergency shell comes up) but still hangs and fails to install unless confluent is restarted.
  • Rebooting when hung to see if it continues
  • Importing Rocky 9.6 and deploying that to confirm it is not a Red Hat specific issue (no difference)
  • confluent_selfcheck -n on all nodes reports zero issues (except TFTP when we're configured just for HTTP boot - same issue occurs if we configure for PXE, in which case TFTP Status also passes).
  • Testing with a system with only one NIC, in case it is related to multiple NICs (no difference)
  • Manually setting a UUID (these systems have all come with 00000000-0000-0000-0000-000000000000) in the UEFI (no difference)
  • Manually discovering the node (by MAC or UUID, with a valid UUID configured and displayed) while it is stuck and in the discover list (no difference)

Based on this commit (2cb641e):

Fix PXE based on mac
We normally use UUID, on a broken platform with bad UUID,
user may need to use hwaddr. This was supposed to work, but
didn't. Fix it to work correctly.

I noticed ALL of my machines' UUIDs are coming up as zero. I had assumed this was due to them not being Lenovo hardware. Manually configuring a UUID in the UEFI on one (and clearing the invalid 00000000-0000-0000-0000-000000000000 with nodeattrib) made no difference, but the new valid UUID was displayed in nodediscover list. Correction: after redeployment the UUID did not get re-added to id.uuid (with a restart of confluent to make it work), but it was displayed correctly in the discover list. Running noderemove and re-doing nodedefine (giving it a new id.index) did result in id.uuid being correctly set to the value configured in UEFI, but did not change this hanging behaviour.
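To confirm what the firmware itself reports, the UUID can be read on the node (for example from the emergency shell) and compared against the all-zero value. This is a small sketch, assuming a Linux environment that exposes /sys/class/dmi/id/product_uuid (normally readable only by root). is_null_uuid and report_uuid are hypothetical helper names, not confluent commands.

```shell
# Return success (0) if the argument is an all-zero UUID.
is_null_uuid() {
    # Strip hyphens, then check that only zeros remain.
    stripped=$(printf '%s' "$1" | tr -d '-')
    case "$stripped" in
        ''|*[!0]*) return 1 ;;   # empty, or contains a non-zero character
        *) return 0 ;;
    esac
}

# Report the SMBIOS system UUID as seen by the running kernel.
report_uuid() {
    uuid=$(cat /sys/class/dmi/id/product_uuid 2>/dev/null)
    if [ -z "$uuid" ]; then
        echo "UUID not readable"
    elif is_null_uuid "$uuid"; then
        echo "null UUID: $uuid"
    else
        echo "valid-looking UUID: $uuid"
    fi
}
```

Running report_uuid on a stuck node would show whether the value the kernel sees matches what nodediscover list displays.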

Confluent's events log show the HTTP boot offer and nothing else:

Feb 21 00:38:36 {"info": "Offering HTTP boot with static address REDACTED_IP_1 to REDACTED_NODE_1"}
Feb 21 00:38:42 {"info": "Offering HTTP boot with static address REDACTED_IP_2  to REDACTED_NODE_2"}
Feb 21 00:39:35 {"info": "Offering HTTP boot with static address REDACTED_IP_3  to REDACTED_NODE_3"}
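When several nodes are deployed at once, it can help to pull just the node and address out of these offer entries. This is a rough sketch, assuming the entries have exactly the format shown above. offered_nodes is a hypothetical helper, and the log path in the usage note is an assumption about where confluent writes its events log.

```shell
# Print "node address" for every HTTP boot offer read from standard input.
# Tolerates one or more spaces between the address and the word "to".
offered_nodes() {
    sed -n 's/.*Offering HTTP boot with static address \([^ ]*\) *to \([^"]*\)".*/\2 \1/p'
}
```

For example, offered_nodes < /var/log/confluent/events (path assumed) lists each node that was offered HTTP boot, which can then be compared with the nodes that actually went on to install.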

If we restart confluent when the installer appears hung, at which point the install continues successfully, we see the start-up message added to the log:

Feb 21 00:40:29 {"info": "Confluent management service starting"}

After the install completed (after restarting confluent), we see the failed HTTP boot (the nodes are configured to network boot first) and the nodes fall through to their newly installed OS (as expected):

Feb 21 00:43:57 {"info": "Ignoring boot attempt by REDACTED_NODE_1 no deployment profile specified (uuid 00000000-0000-0000-0000-000000000000, hwaddr REDACTED_MAC_1)"}
Feb 21 00:44:57 {"info": "Ignoring boot attempt by REDACTED_NODE_2 no deployment profile specified (uuid 00000000-0000-0000-0000-000000000000, hwaddr REDACTED_MAC_2)"}
Feb 21 00:44:57 {"info": "Ignoring boot attempt by REDACTED_NODE_2 no deployment profile specified (uuid 00000000-0000-0000-0000-000000000000, hwaddr REDACTED_MAC_3)"}

Nodes are all listed in nodediscover list while in the hung state.

Environment

HTTP booting nodes (same behaviour was observed when we tried PXE).

Mix of different (all non-Lenovo) server hardware.

Some systems have multiple NICs, some have only one.

Confluent 3.15.0 installed on Red Hat 9.6:

$ rpm -q lenovo-confluent
lenovo-confluent-3.15.0-1.noarch
$ cat /etc/redhat-release
Red Hat Enterprise Linux release 9.6 (Plow)

Clean install and setup as per the confluent instructions; dnsmasq is also installed for DNS, and there is no external DHCP service.

Global node attributes ($DOMAIN is cluster.local, the confluent host and gateway are just two plain RFC1918 IPv4 addresses in the same /16 subnet):

nodegroupattrib everything \
deployment.useinsecureprotocols=firmware \
dns.domain=$DOMAIN \
dns.servers=$CONFLUENT_IP \
net.ipv4_gateway=$GATEWAY \
net.bootable=true \
console.method=ipmi

Test nodes defined using this command ($NODEx_ is a placeholder for redacted information):

sudo -i nodedefine $NODE1_NAME net.cluster.ipv4_address=$NODE1_IP/16 net.cluster.hwaddr=$NODE1_MAC
sudo -i nodedefine $NODE2_NAME net.cluster.ipv4_address=$NODE2_IP/16 net.cluster.hwaddr=$NODE2_MAC
sudo -i nodedefine $NODE3_NAME net.cluster.ipv4_address=$NODE3_IP/16 net.cluster.hwaddr=$NODE3_MAC1

/etc/hosts populated with confluent2hosts -a -f everything.

osdeploy initialised with:

osdeploy initialize -a -g -u -s -k -t -l

Red Hat 9.6 ISO imported with:

osdeploy import /tmp/rhel-9.6-x86_64-dvd.iso

Nodes set up to deploy with:

nodedeploy everything -n -p rhel-9.6-x86_64-default 

Node is then booted.
