Description
Environment
- Calico/VPP version: tigera-operator v3.26.3 / Calico VPP v3.26.0 also tried tigera-operator v3.27.2 / Calico VPP v3.27.0
- Kubernetes version: v1.28.8
- Deployment type: kubeadm cluster on Azure Compute instances
- Network configuration: Calico default with VXLAN enabled
- Pod CIDR: 192.168.0.0/16
- Service CIDR: 10.96.0.0/12
- CRI: containerd 1.6.28 (docker is not installed)
- OS: Ubuntu 22.04
- kernel: `Linux master 5.15.0-1042-azure #49-Ubuntu SMP Tue Jul 11 17:28:46 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux`
Issue description
The calico-vpp-node pods somehow break DNS resolution on the hosts once those pods are fully initialized and running. The `/etc/resolv.conf` file on the hosts gets edited while the calico-vpp-node pod is running. DNS resolution from within the calico-vpp-node pods works fine; it is the host's DNS resolution that is affected, which prevents all Calico VPP components from being configured correctly, as some pods get stuck in the `ImagePullBackOff` state.
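The symptom can be reproduced on a host with a quick resolution probe (a minimal sketch; the helper name and the probed domain are illustrative, not part of any Calico tooling):

```python
import socket

def can_resolve(hostname: str) -> bool:
    """Return True if the host's resolver can map hostname to an address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False

# On an affected node this mirrors `curl: (6) Could not resolve host: ...`
print(can_resolve("google.com"))
```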
To Reproduce
Steps to reproduce the behavior:
- provision Azure Compute instances (e.g. control-plane1, worker1). Used `Standard_D4s_v3` size instances
- deploy kubeadm cluster. Used kubeadm v1.28.8
- install Calico VPP. Used `calico-vpp-nohuge.yaml`
- edited `CALICOVPP_INTERFACES` to use `interfaceName: eth0` instead of the default `eth1` as shown below:
```yaml
CALICOVPP_INTERFACES: |-
  {
    "maxPodIfSpec": {
      "rx": 10, "tx": 10, "rxqsz": 1024, "txqsz": 1024
    },
    "defaultPodIfSpec": {
      "rx": 1, "tx": 1, "isl3": true
    },
    "vppHostTapSpec": {
      "rx": 1, "tx": 1, "rxqsz": 1024, "txqsz": 1024, "isl3": false
    },
    "uplinkInterfaces": [
      {
        "interfaceName": "eth0",
        "vppDriver": "af_packet"
      }
    ]
  }
```
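As a sanity check, the JSON payload of that ConfigMap value can be validated before applying it (a sketch; the literal string below is the same content as the `CALICOVPP_INTERFACES` value above):

```python
import json

# Same JSON as the CALICOVPP_INTERFACES value from the ConfigMap
calicovpp_interfaces = """
{
  "maxPodIfSpec": { "rx": 10, "tx": 10, "rxqsz": 1024, "txqsz": 1024 },
  "defaultPodIfSpec": { "rx": 1, "tx": 1, "isl3": true },
  "vppHostTapSpec": { "rx": 1, "tx": 1, "rxqsz": 1024, "txqsz": 1024, "isl3": false },
  "uplinkInterfaces": [
    { "interfaceName": "eth0", "vppDriver": "af_packet" }
  ]
}
"""

cfg = json.loads(calicovpp_interfaces)  # raises ValueError on malformed JSON
uplink = cfg["uplinkInterfaces"][0]
print(uplink["interfaceName"], uplink["vppDriver"])  # → eth0 af_packet
```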
`installation-default.yaml` was edited as follows:
```yaml
kind: Installation
metadata:
  name: default
spec:
  # Configures Calico networking.
  calicoNetwork:
    linuxDataplane: VPP
    ipPools:
      - cidr: 192.168.0.0/16
        encapsulation: VXLAN
```
Expected behavior
Installation of Calico VPP should not disrupt the host's DNS resolution.
Additional context
- the order of manifest installation:

```
kubectl apply --server-side --force-conflicts -f tigera-operator.yaml
kubectl apply -f installation-default.yaml
kubectl apply -f calico-vpp-nohuge.yaml
```
- while the `calico-vpp-node` pods are getting initialized, DNS resolution on the host works as expected. However, once the `calico-vpp-dataplane/calico-vpp-node` pods reach the `Running` state, DNS resolution stops working on the host and the `/etc/resolv.conf` file gets modified.
- example of `/etc/resolv.conf` on the host before Calico VPP is installed:
```
nameserver 127.0.0.53
options edns0 trust-ad
search abkhse5g3e5ebd4v3jenyazk4h.xx.internal.cloudapp.net
```
- example of `/etc/resolv.conf` on the host after the `calico-vpp-node` pod reaches the `Running` state:
```
nameserver 127.0.0.53
options edns0 trust-ad
search .
```
- example of `/etc/resolv.conf` inside the `calico-vpp-node` pods:
```
search abkhse5g3e5ebd4v3jenyazk4h.xx.internal.cloudapp.net
nameserver 168.63.129.16
```
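A small parser makes the before/after difference explicit: the search domain is replaced with `.` while the stub nameserver stays (a sketch; the strings are the host resolv.conf contents from before and after the pod reaches `Running`):

```python
def parse_resolv_conf(text):
    """Collect nameserver and search entries from resolv.conf text."""
    out = {"nameserver": [], "search": []}
    for line in text.splitlines():
        parts = line.split()
        if parts and parts[0] in out:
            out[parts[0]].extend(parts[1:])
    return out

before = ("nameserver 127.0.0.53\n"
          "options edns0 trust-ad\n"
          "search abkhse5g3e5ebd4v3jenyazk4h.xx.internal.cloudapp.net\n")
after = ("nameserver 127.0.0.53\n"
         "options edns0 trust-ad\n"
         "search .\n")

print(parse_resolv_conf(before)["search"])  # → the Azure internal search domain
print(parse_resolv_conf(after)["search"])   # → ['.']
```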
- I have no issue getting a response when running `curl google.com` from within the `calico-vpp-node` pod, but the same query fails on the host with the message `curl: (6) Could not resolve host: google.com`
- I noticed that Calico VPP seems to add the service CIDR to the routing table on the host. I'm not sure if this has any impact on the host's DNS resolution, but the programming of that route seems to correlate with the moment DNS resolution on the host stops working.
- example of programmed routes on the host before `calico-vpp-node` is up, or right after you manually kill the pod and before it's back up:
```
default via 172.10.1.1 dev eth0 proto dhcp src 172.10.1.4 metric 100
168.63.129.16 via 172.10.1.1 dev eth0 proto dhcp src 172.10.1.4 metric 100
169.254.169.254 via 172.10.1.1 dev eth0 proto dhcp src 172.10.1.4 metric 100
172.10.1.0/24 dev eth0 proto kernel scope link src 172.10.1.4 metric 100
172.10.1.1 dev eth0 proto dhcp scope link src 172.10.1.4 metric 100
```
- example of programmed routes on the host after the `calico-vpp-node` pod is up:
```
default via 172.10.1.1 dev eth0 proto dhcp src 172.10.1.4 metric 100
10.96.0.0/12 via 172.10.1.254 dev eth0 proto static mtu 1440
168.63.129.16 via 172.10.1.1 dev eth0 proto dhcp src 172.10.1.4 metric 100
169.254.169.254 via 172.10.1.1 dev eth0 proto dhcp src 172.10.1.4 metric 100
172.10.1.0/24 dev eth0 proto kernel scope link src 172.10.1.4
172.10.1.0/24 dev eth0 proto kernel scope link src 172.10.1.4 metric 100
172.10.1.1 dev eth0 proto dhcp scope link src 172.10.1.4 metric 100
192.168.0.0/16 via 172.10.1.254 dev eth0 proto static mtu 1440
```
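One way to test the route theory is to check whether the nameserver addresses involved actually fall inside the newly routed CIDRs (a sketch over the route dumps above; the CIDRs are the two `proto static` routes that appear after the pod is up):

```python
import ipaddress

# Static routes added once calico-vpp-node is up (from the route dump)
added_routes = ["10.96.0.0/12", "192.168.0.0/16"]
# The systemd-resolved stub and Azure's DNS VIP, respectively
nameservers = ["127.0.0.53", "168.63.129.16"]

for ns in nameservers:
    ip = ipaddress.ip_address(ns)
    covered = [r for r in added_routes if ip in ipaddress.ip_network(r)]
    # Neither address falls inside the new CIDRs on this setup
    print(ns, "covered by", covered or "no new route")
```

Since neither resolver address is captured by the new static routes, the route programming alone may not explain the breakage, which points back at whatever rewrites `/etc/resolv.conf`.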
- one way I can get pods to pull the necessary images after the `calico-vpp-node` pods are up and running is to manually kill the `calico-vpp-node` pods and force-restart the pods that are failing to pull images. Since it takes the `calico-vpp-node` pods a few moments to reach the `Running` state, the other cycled workload pods usually get a chance to start pulling their images before DNS resolution is broken again.
- a somewhat better workaround is to manually edit the `/etc/resolv.conf` file on the host and make it look like the one I fetched from within the `calico-vpp-node` pods. DNS starts working until the `calico-vpp-node` pod gets restarted, as the restart of that pod seems to overwrite the `/etc/resolv.conf` file once again.
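That manual workaround can be scripted (a hedged sketch; the function name is illustrative, the nameserver and search domain are the values observed inside the pod on this cluster, and the path is a parameter so it can be exercised against a scratch file instead of the real `/etc/resolv.conf`):

```python
from pathlib import Path

def restore_resolv_conf(path, nameserver, search_domain):
    """Rewrite a resolv.conf file with a working upstream resolver,
    mirroring the contents seen inside the calico-vpp-node pod."""
    content = f"search {search_domain}\nnameserver {nameserver}\n"
    Path(path).write_text(content)
    return content

# On the affected host this would be (requires root):
# restore_resolv_conf("/etc/resolv.conf", "168.63.129.16",
#                     "abkhse5g3e5ebd4v3jenyazk4h.xx.internal.cloudapp.net")
```

Note this only holds until `calico-vpp-node` restarts and overwrites the file again.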
I would like to understand what breaks DNS resolution on the hosts when the Calico VPP dataplane is installed on the cluster.