
DNS resolution on Azure Compute hosts running Ubuntu OS stops working once calico-vpp-node pods get up and running #688

@ivansharamok

Description


Environment

  • Calico/VPP version: tigera-operator v3.26.3 / Calico VPP v3.26.0; also tried tigera-operator v3.27.2 / Calico VPP v3.27.0
  • Kubernetes version: v1.28.8
  • Deployment type: kubeadm cluster on Azure Compute instances
  • Network configuration: Calico default with VXLAN enabled
  • Pod CIDR: 192.168.0.0/16
  • Service CIDR: 10.96.0.0/12
  • CRI: containerd 1.6.28 (docker is not installed)
  • OS: Ubuntu 22.04
  • kernel: Linux master 5.15.0-1042-azure #49-Ubuntu SMP Tue Jul 11 17:28:46 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Issue description
The calico-vpp-node pods break DNS resolution on the hosts once the pods are fully initialized and running. The /etc/resolv.conf file on the hosts gets edited while the calico-vpp-node pod is running. DNS resolution from within the calico-vpp-node pods works fine; it is the host's DNS resolution that gets affected, which prevents all Calico VPP components from being configured correctly, as some pods get stuck in the ImagePullBackOff state.
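
A quick check that helps localize the failure on an affected host (a sketch; 127.0.0.53 is the local systemd-resolved stub and 168.63.129.16 is Azure's DNS, both visible in the resolv.conf examples further below; dig comes from the dnsutils package):

# compare resolution through the local stub vs. Azure's DNS directly
dig +short google.com @127.0.0.53      # via the stub listed in /etc/resolv.conf
dig +short google.com @168.63.129.16   # directly against Azure's recursive DNS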

To Reproduce
Steps to reproduce the behavior:

  • provision Azure Compute instances (e.g. control-plane1, worker1); Standard_D4s_v3 instances were used
  • deploy a kubeadm cluster; kubeadm v1.28.8 was used
  • install Calico VPP using calico-vpp-nohuge.yaml
  • edit CALICOVPP_INTERFACES to use interfaceName: eth0 instead of the default eth1, as shown below:
  CALICOVPP_INTERFACES: |-
    {
      "maxPodIfSpec": {
        "rx": 10, "tx": 10, "rxqsz": 1024, "txqsz": 1024
      },
      "defaultPodIfSpec": {
        "rx": 1, "tx":1, "isl3": true
      },
      "vppHostTapSpec": {
        "rx": 1, "tx":1, "rxqsz": 1024, "txqsz": 1024, "isl3": false
      },
      "uplinkInterfaces": [
        {
          "interfaceName": "eth0",
          "vppDriver": "af_packet"
        }
      ]
    }
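
A quick sanity check that the override made it into the manifest before applying it (illustrative; assumes the file name above):

# confirm the uplink interface override is present in the edited manifest
grep -A 4 '"uplinkInterfaces"' calico-vpp-nohuge.yaml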
  • edit installation-default.yaml as follows:
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  # Configures Calico networking.
  calicoNetwork:
    linuxDataplane: VPP
    ipPools:
    - cidr: 192.168.0.0/16
      encapsulation: VXLAN
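
After applying the manifests, rollout can be followed with the tigera-operator's status resource and the dataplane pods (standard commands, shown for completeness):

# wait for the operator to report the Calico components as Available
kubectl get tigerastatus
# watch the VPP dataplane and Calico pods come up
kubectl -n calico-vpp-dataplane get pods -o wide
kubectl -n calico-system get pods -o wide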

Expected behavior
Installing Calico VPP should not disrupt the host's DNS resolution.

Additional context

  • the order of manifest installation:
kubectl apply --server-side --force-conflicts -f tigera-operator.yaml
kubectl apply -f installation-default.yaml
kubectl apply -f calico-vpp-nohuge.yaml
  • while the calico-vpp-node pods are initializing, DNS resolution on the host works as expected. However, once the calico-vpp-dataplane/calico-vpp-node pods reach the Running state, DNS resolution stops working on the host and the /etc/resolv.conf file gets modified.
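One way to pin down the exact moment of the rewrite is a crude watch loop on the host (a sketch; the grep pattern just matches the pod name):

# correlate calico-vpp-node pod phase with changes to /etc/resolv.conf
while true; do
  echo "$(date -Is) $(md5sum /etc/resolv.conf)"
  kubectl -n calico-vpp-dataplane get pods -o wide --no-headers | grep calico-vpp-node
  sleep 2
done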
  • example of /etc/resolv.conf on the host before Calico VPP is installed
nameserver 127.0.0.53
options edns0 trust-ad
search abkhse5g3e5ebd4v3jenyazk4h.xx.internal.cloudapp.net
  • example of the /etc/resolv.conf on the host after calico-vpp-node pod reaches the Running state
nameserver 127.0.0.53
options edns0 trust-ad
search .
  • example of the /etc/resolv.conf inside the calico-vpp-node pods
search abkhse5g3e5ebd4v3jenyazk4h.xx.internal.cloudapp.net
nameserver 168.63.129.16
  • I have no issue getting a response when running curl google.com from within the calico-vpp-node pod, but the same query fails on the host with curl: (6) Could not resolve host: google.com
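Note that 127.0.0.53 is the systemd-resolved stub (the Ubuntu 22.04 default), so the real upstream servers live in resolved's per-link state; the search . line above suggests resolved lost the DNS settings it had learned on eth0. The resolved state can be inspected with:

# show systemd-resolved's global and per-link DNS configuration
resolvectl status
# DNS servers currently associated with eth0
resolvectl dns eth0
# resolve through the stub, with diagnostics on failure
resolvectl query google.com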
  • I noticed that Calico VPP seems to add the service CIDR to the routing table on the host. I'm not sure whether this affects the host's DNS resolution, but the programming of that route seems to correlate with the moment DNS resolution on the host stops working (a sketch for timestamping route changes follows the route listings below).
  • example of the programmed routes on the host before calico-vpp-node is up, or right after manually killing the pod and before it comes back up:
default via 172.10.1.1 dev eth0 proto dhcp src 172.10.1.4 metric 100
168.63.129.16 via 172.10.1.1 dev eth0 proto dhcp src 172.10.1.4 metric 100
169.254.169.254 via 172.10.1.1 dev eth0 proto dhcp src 172.10.1.4 metric 100
172.10.1.0/24 dev eth0 proto kernel scope link src 172.10.1.4 metric 100
172.10.1.1 dev eth0 proto dhcp scope link src 172.10.1.4 metric 100
  • example of programmed routes on the host after the calico-vpp-node pod is up
default via 172.10.1.1 dev eth0 proto dhcp src 172.10.1.4 metric 100
10.96.0.0/12 via 172.10.1.254 dev eth0 proto static mtu 1440
168.63.129.16 via 172.10.1.1 dev eth0 proto dhcp src 172.10.1.4 metric 100
169.254.169.254 via 172.10.1.1 dev eth0 proto dhcp src 172.10.1.4 metric 100
172.10.1.0/24 dev eth0 proto kernel scope link src 172.10.1.4
172.10.1.0/24 dev eth0 proto kernel scope link src 172.10.1.4 metric 100
172.10.1.1 dev eth0 proto dhcp scope link src 172.10.1.4 metric 100
192.168.0.0/16 via 172.10.1.254 dev eth0 proto static mtu 1440
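
Route changes can be timestamped to check that correlation (a sketch; run on the host while the pod starts):

# stream kernel routing-table changes with timestamps
ip monitor route | while read -r event; do
  echo "$(date -Is) $event"
done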
  • one way I can get pods to pull the necessary images after the calico-vpp-node pods are up and running is to manually kill the calico-vpp-node pods and force-restart the pods that are failing to pull images. Since it takes the calico-vpp-node pods a few moments to reach the Running state, the cycled workload pods usually get a chance to start pulling their images before DNS resolution breaks again.
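Roughly (pod names and the workload namespace are placeholders):

# kill the calico-vpp-node pod; host DNS recovers until the replacement is Running
kubectl -n calico-vpp-dataplane delete pod <calico-vpp-node-xxxxx>
# immediately restart the workloads stuck in ImagePullBackOff so they can pull
# their images during that window
kubectl -n <namespace> delete pod <stuck-pod>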
  • a somewhat better workaround is to manually edit the /etc/resolv.conf file on the host to match the one fetched from within the calico-vpp-node pods. DNS then works until calico-vpp-node is restarted, as the restart seems to overwrite /etc/resolv.conf once again.
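i.e., something like this on the host (temporary; the next calico-vpp-node restart overwrites it again; values taken from the pod's resolv.conf above):

# restore the Azure-provided DNS server and search domain by hand
cat <<'EOF' | sudo tee /etc/resolv.conf
search abkhse5g3e5ebd4v3jenyazk4h.xx.internal.cloudapp.net
nameserver 168.63.129.16
EOF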

I would like to understand what breaks DNS resolution on the hosts when the Calico VPP dataplane is installed on the cluster.
