nodedeploy fails if MLD snooping is enabled on switches, unless confluent service is restart at a specific point

HI,

## Problem

On boot, after doing an `nodedeploy`, the node(s) being deployed successfully load `boot.img` (HTTP boot but we have the same problem with PXE and `boot.ipxe` - `deployment.useinsecureprotocols` is set to `firmware`) but then hang at this stage:

<img width="1100" height="662" alt="Image" src="https://github.com/user-attachments/assets/6f34f5f3-3749-4ab7-a37e-efc02a84ffac" />

Eventually it times out, proceeds to the next step where systemd's timeout also kicks in (after a while) and we get dropped to an emergency shell:

<img width="981" height="666" alt="Image" src="https://github.com/user-attachments/assets/8ee07cb7-935e-403f-b6bf-e447355d2f06" />

In which we can confirm it has no IP address (this is from a test VM to demonstrate the problem):

<img width="676" height="137" alt="Image" src="https://github.com/user-attachments/assets/63b5f87c-d191-4afa-9adc-d5f68417b992" />

If we restart confluent (`systemctl restart confluent`) while the node, or nodes (it doesn't matter if it's one node or all of them), are stuck at the initial hang (just after `modprobe: ERROR: counld not insert 'edd': No` is displayed, which I know is irrelevant to the problem) the nodes spring back to life and the install continues successfully.

genesis image also fails to get a network address:

<img width="1067" height="665" alt="Image" src="https://github.com/user-attachments/assets/49627e2b-27d9-4f6f-aac7-8a4297b22e59" />

With `tcpdump` I have observed:

* Confluent responds to the initial DHCP request with a DHCP reply containing the node's correct IP and the correct next-server value (HTTP or PXE, as appropriate) to get the `boot.img`/`boot.ipxe` files.
* When the installer appears hung, tcpdump shows repeated DHCP request but no DHCP replies are sent (from anywhere)
* After restarting confluent, while the installer is first stalled, ~~confluent does send a DHCP reply to the host~~ I was sure I saw this, however I did not save the capture and based on Jarrod's reply I think I may have been mistaken about seeing a DHCP reply (and then the install appears to continue normally)

It looks to me like something is stopping confluent responding to the anaconda installer image's DHCP request, even though it responds to the UEFI's initial request to get to that point and so the nodes are able to boot the installer `boot.img` successfully, until confluent is restarted.

`depoloyment.apiarmed` remains set to `once` throughout, unless we restart confluent (when everything seems to start working normally until the next deployment).

The nodes appear in `nodediscover list` while the installer is hung. When we restart Confluent, the nodes disappear from `nodediscover list` and they then continue to install.

## Diagnostics we've done

We have tried:

* Switching from HTTP to PXE boot (no difference)
* Turning off all firewalls and selinux (no difference)
* Complete reinstall of the confluent server and confluent (no difference)
* Confluent 3.14.1 instead of 3.15.0 (no difference)
* Seeing if redeploying a successfully deployed node behaves differently (no difference)
* Deploying by specifying individual nodes and groups (no difference)
* Setting up `/etc/hosts` with and without FQDN first - i.e. with and without `-f` to `confluent2hosts` (no difference)
* Using an external DHCP server (dnsmasq) and setting the ipv4_method to firmwaredhcp, in this case the host does get an IP address (which we can see when the emergency shell comes up) but still hangs and fails to install unless confluent is restarted.
* Rebooting when hung to see if it continues
* Importing Rocky 9.6 and deploying that to confirm not Red Hat specific issue (no difference)
* `confluent_selfcheck -n` on all nodes reports zero issues (except TFTP when we're configured just for HTTP boot - same issue occurs if we configured for PXE, which which case TFTP Status also passes).
* Test with system with only one NIC, in case related to multiple NICs (no difference)
* Manually setting a UUID (these systems have all come with `00000000-0000-0000-0000-000000000000`) in the UEFI (no difference)
* Manually discovering the node (by MAC or UUID (with a valid UUID configured and displayed) while it is stuck and in the discover list (no difference)

Based on this commit (2cb641e734bfffa54b509856b19d74f2e46a5994):

> Fix PXE based on mac
> We normally use UUID, on a broken platform with bad UUID,
> user may need to use hwaddr.  This was supposed to work, but
> didn't. Fix it to work correctly.

I noticed ALL of my machine's UUIDs are coming up as zero. I has assumed this was due to not being Lenovo hardware. Manually configuring a UUID into the UEFI on one (and clearing the invalid `00000000-0000-0000-0000-000000000000` with `nodeattrib`) made no difference, but the new valid UUID was displayed in `nodediscover list` and was ~~correct in the node attributes after redeployment~~ correction, the UUID did not get readded to `id.uuid` (with a restart of confluent to make it work) but it was displayed correcting in the discover list. `noderemove` and re-doing `nodedefine` (giving it a new `id.index`) did result in the `id.uuid` being correctly set to the value configured in UEFI but did not change this hanging behaviour.

Confluent's events log show the HTTP boot offer and nothing else:

```
Feb 21 00:38:36 {"info": "Offering HTTP boot with static address REDACTED_IP_1 to REDACTED_NODE_1"}
Feb 21 00:38:42 {"info": "Offering HTTP boot with static address REDACTED_IP_2  to REDACTED_NODE_2"}
Feb 21 00:39:35 {"info": "Offering HTTP boot with static address REDACTED_IP_3  to REDACTED_NODE_3"}
```

If we restart confluent when the installer appears hung, at which point the install continues successfully, we see the start-up message added to the log:

```
Feb 21 00:40:29 {"info": "Confluent management service starting"}
```

After the install completed (after restarting confluent), we see the failed HTTP boot (the nodes are configured to network boot first) and the nodes fall-through to their newly installed OS (as expected):

```
Feb 21 00:43:57 {"info": "Ignoring boot attempt by REDACTED_NODE_1 no deployment profile specified (uuid 00000000-0000-0000-0000-000000000000, hwaddr REDACTED_MAC_1)"}
Feb 21 00:44:57 {"info": "Ignoring boot attempt by REDACTED_NODE_2 no deployment profile specified (uuid 00000000-0000-0000-0000-000000000000, hwaddr REDACTED_MAC_2)"}
Feb 21 00:44:57 {"info": "Ignoring boot attempt by REDACTED_NODE_2 no deployment profile specified (uuid 00000000-0000-0000-0000-000000000000, hwaddr REDACTED_MAC_3)"}
```

Nodes are all listed in `nodediscover list` while in the hung state.

## Environment

HTTP booting nodes (same behaviour was observed when we tried PXE).

Mix of different (all non-Lenovo) server hardware.

Some systems have multiple NICs, some on only one.

Confluent 3.15.0 installed on Red Hat 9.6:

```
$ rpm -q lenovo-confluent
lenovo-confluent-3.15.0-1.noarch
$ cat /etc/redhat-release
Red Hat Enterprise Linux release 9.6 (Plow)
```

Clean install and setup as per the confluent instructions, dnsmasq also installed for DNS and no external DHCP service.

Global node attributes (`$DOMAIN` is `cluster.local`, the confluent host and gateway are just two plain RFC1918 IPv4 addresses in the same `/16` subnet):

```bash
nodegroupattrib everything \
deployment.useinsecureprotocols=firmware \
dns.domain=$DOMAIN \
dns.servers=$CONFLUENT_IP \
net.ipv4_gateway=$GATEWAY \
net.bootable=true \
console.method=ipmi
```

Test nodes defined using this command (`$NODEx_` is a placeholder for redacted information):

```bash
sudo -i nodedefine $NODE1_NAME net.cluster.ipv4_address=$NODE1_IP/16 net.cluster.hwaddr=$NODE1_MAC
sudo -i nodedefine $NODE2_NAMEnet.cluster.ipv4_address=$NODE2_IP/16 net.cluster.hwaddr=$NODE2_MAC
sudo -i nodedefine $NODE3_NAMEnet.cluster.ipv4_address=$NODE3_IP/16 net.cluster.hwaddr=$NODE3_MAC1
```

`/etc/hosts` populated with `confluent2hosts -a -f everything`.

os deploy initialiased with:

```bash
osdeploy initialize -a -g -u -s -k -t -l
```

Red Hat 9.6 ISO imported with:

```bash
osdeploy import /tmp/rhel-9.6-x86_64-dvd.iso
```

Nodes setup to deploy with:

```bash
nodedeploy everything -n -p rhel-9.6-x86_64-default 
```

Node is then booted.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

nodedeploy fails if MLD snooping is enabled on switches, unless confluent service is restart at a specific point #205

Problem

Diagnostics we've done

Environment

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

nodedeploy fails if MLD snooping is enabled on switches, unless confluent service is restart at a specific point #205

Description

Problem

Diagnostics we've done

Environment

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions