
DNS resolution on Azure Compute hosts running Ubuntu OS stops working once calico-vpp-node pods get up and running #688

@ivansharamok

Description


Environment

  • Calico/VPP version: tigera-operator v3.26.3 / Calico VPP v3.26.0; also tried tigera-operator v3.27.2 / Calico VPP v3.27.0
  • Kubernetes version: v1.28.8
  • Deployment type: kubeadm cluster on Azure Compute instances
  • Network configuration: Calico default with VXLAN enabled
  • Pod CIDR: 192.168.0.0/16
  • Service CIDR: 10.96.0.0/12
  • CRI: containerd 1.6.28 (docker is not installed)
  • OS: Ubuntu 22.04
  • kernel: Linux master 5.15.0-1042-azure #49-Ubuntu SMP Tue Jul 11 17:28:46 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Issue description
The calico-vpp-node pods break DNS resolution on the hosts once the pods are fully initialized and running. The /etc/resolv.conf file on the hosts gets edited while the calico-vpp-node pod is running. DNS resolution from within the calico-vpp-node pods works fine; it is the host's DNS resolution that gets affected, which prevents all Calico VPP components from being configured correctly, as some pods get stuck in the ImagePullBackOff state.
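
A quick check that helps localize the failure on an affected host (a sketch; 127.0.0.53 is the local systemd-resolved stub and 168.63.129.16 is Azure's DNS, both visible in the resolv.conf examples further below; dig comes from the dnsutils package):

# compare resolution through the local stub vs. Azure's DNS directly
dig +short google.com @127.0.0.53      # via the stub listed in /etc/resolv.conf
dig +short google.com @168.63.129.16   # directly against Azure's recursive DNS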

To Reproduce
Steps to reproduce the behavior:

  • provision Azure Compute instances (e.g. control-plane1, worker1); Standard_D4s_v3 instances were used
  • deploy a kubeadm cluster; kubeadm v1.28.8 was used
  • install Calico VPP using calico-vpp-nohuge.yaml
  • edit CALICOVPP_INTERFACES to use interfaceName: eth0 instead of the default eth1, as shown below:
  CALICOVPP_INTERFACES: |-
    {
      "maxPodIfSpec": {
        "rx": 10, "tx": 10, "rxqsz": 1024, "txqsz": 1024
      },
      "defaultPodIfSpec": {
        "rx": 1, "tx":1, "isl3": true
      },
      "vppHostTapSpec": {
        "rx": 1, "tx":1, "rxqsz": 1024, "txqsz": 1024, "isl3": false
      },
      "uplinkInterfaces": [
        {
          "interfaceName": "eth0",
          "vppDriver": "af_packet"
        }
      ]
    }
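
A quick sanity check that the override made it into the manifest before applying it (illustrative; assumes the file name above):

# confirm the uplink interface override is present in the edited manifest
grep -A 4 '"uplinkInterfaces"' calico-vpp-nohuge.yaml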
  • edit installation-default.yaml as follows:
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  # Configures Calico networking.
  calicoNetwork:
    linuxDataplane: VPP
    ipPools:
    - cidr: 192.168.0.0/16
      encapsulation: VXLAN
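
After applying the manifests, rollout can be followed with the tigera-operator's status resource and the dataplane pods (standard commands, shown for completeness):

# wait for the operator to report the Calico components as Available
kubectl get tigerastatus
# watch the VPP dataplane and Calico pods come up
kubectl -n calico-vpp-dataplane get pods -o wide
kubectl -n calico-system get pods -o wide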

Expected behavior
Installing Calico VPP should not disrupt the host's DNS resolution.

Additional context

  • the order of manifest installation:
kubectl apply --server-side --force-conflicts -f tigera-operator.yaml
kubectl apply -f installation-default.yaml
kubectl apply -f calico-vpp-nohuge.yaml
  • while the calico-vpp-node pods are initializing, DNS resolution on the host works as expected. However, once the calico-vpp-dataplane/calico-vpp-node pods reach the Running state, DNS resolution stops working on the host and the /etc/resolv.conf file gets modified.
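One way to pin down the exact moment of the rewrite is a crude watch loop on the host (a sketch; the grep pattern just matches the pod name):

# correlate calico-vpp-node pod phase with changes to /etc/resolv.conf
while true; do
  echo "$(date -Is) $(md5sum /etc/resolv.conf)"
  kubectl -n calico-vpp-dataplane get pods -o wide --no-headers | grep calico-vpp-node
  sleep 2
done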
  • example of /etc/resolv.conf on the host before Calico VPP is installed
nameserver 127.0.0.53
options edns0 trust-ad
search abkhse5g3e5ebd4v3jenyazk4h.xx.internal.cloudapp.net
  • example of the /etc/resolv.conf on the host after calico-vpp-node pod reaches the Running state
nameserver 127.0.0.53
options edns0 trust-ad
search .
  • example of the /etc/resolv.conf inside the calico-vpp-node pods
search abkhse5g3e5ebd4v3jenyazk4h.xx.internal.cloudapp.net
nameserver 168.63.129.16
  • I have no issue getting a response when running curl google.com from within the calico-vpp-node pod, but the same query fails on the host with curl: (6) Could not resolve host: google.com
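Note that 127.0.0.53 is the systemd-resolved stub (the Ubuntu 22.04 default), so the real upstream servers live in resolved's per-link state; the search . line above suggests resolved lost the DNS settings it had learned on eth0. The resolved state can be inspected with:

# show systemd-resolved's global and per-link DNS configuration
resolvectl status
# DNS servers currently associated with eth0
resolvectl dns eth0
# resolve through the stub, with diagnostics on failure
resolvectl query google.com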
  • I noticed that Calico VPP seems to add the service CIDR to the routing table on the host. I'm not sure whether this affects the host's DNS resolution, but the programming of that route seems to correlate with the moment DNS resolution on the host stops working (a sketch for timestamping route changes follows the route listings below).
  • example of the programmed routes on the host before calico-vpp-node is up, or right after manually killing the pod and before it comes back up:
default via 172.10.1.1 dev eth0 proto dhcp src 172.10.1.4 metric 100
168.63.129.16 via 172.10.1.1 dev eth0 proto dhcp src 172.10.1.4 metric 100
169.254.169.254 via 172.10.1.1 dev eth0 proto dhcp src 172.10.1.4 metric 100
172.10.1.0/24 dev eth0 proto kernel scope link src 172.10.1.4 metric 100
172.10.1.1 dev eth0 proto dhcp scope link src 172.10.1.4 metric 100
  • example of programmed routes on the host after the calico-vpp-node pod is up
default via 172.10.1.1 dev eth0 proto dhcp src 172.10.1.4 metric 100
10.96.0.0/12 via 172.10.1.254 dev eth0 proto static mtu 1440
168.63.129.16 via 172.10.1.1 dev eth0 proto dhcp src 172.10.1.4 metric 100
169.254.169.254 via 172.10.1.1 dev eth0 proto dhcp src 172.10.1.4 metric 100
172.10.1.0/24 dev eth0 proto kernel scope link src 172.10.1.4
172.10.1.0/24 dev eth0 proto kernel scope link src 172.10.1.4 metric 100
172.10.1.1 dev eth0 proto dhcp scope link src 172.10.1.4 metric 100
192.168.0.0/16 via 172.10.1.254 dev eth0 proto static mtu 1440
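
Route changes can be timestamped to check that correlation (a sketch; run on the host while the pod starts):

# stream kernel routing-table changes with timestamps
ip monitor route | while read -r event; do
  echo "$(date -Is) $event"
done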
  • one way I can get pods to pull the necessary images after the calico-vpp-node pods are up and running is to manually kill the calico-vpp-node pods and force-restart the pods that are failing to pull images. Since it takes the calico-vpp-node pods a few moments to reach the Running state, the cycled workload pods usually get a chance to start pulling their images before DNS resolution breaks again.
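Roughly (pod names and the workload namespace are placeholders):

# kill the calico-vpp-node pod; host DNS recovers until the replacement is Running
kubectl -n calico-vpp-dataplane delete pod <calico-vpp-node-xxxxx>
# immediately restart the workloads stuck in ImagePullBackOff so they can pull
# their images during that window
kubectl -n <namespace> delete pod <stuck-pod>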
  • a somewhat better workaround is to manually edit the /etc/resolv.conf file on the host to match the one fetched from within the calico-vpp-node pods. DNS then works until calico-vpp-node is restarted, as the restart seems to overwrite /etc/resolv.conf once again.
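i.e., something like this on the host (temporary; the next calico-vpp-node restart overwrites it again; values taken from the pod's resolv.conf above):

# restore the Azure-provided DNS server and search domain by hand
cat <<'EOF' | sudo tee /etc/resolv.conf
search abkhse5g3e5ebd4v3jenyazk4h.xx.internal.cloudapp.net
nameserver 168.63.129.16
EOF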

I would like to understand what breaks DNS resolution on the hosts when the Calico VPP dataplane is installed on the cluster.
