Troubleshooting¶
This section describes common issues, the error messages they produce, and the steps to troubleshoot and fix them.
Troubleshooting common error scenarios¶
ERROR: Ingress Traffic is not working¶
Cause: No static route is created or TMM is rejecting the static route.
TMM logs:
Run the following command to get the TMM logs:
kubectl logs deploy/f5-tmm -c f5-tmm -f
Sample output:
...
decl_static_route_handler/174: Creating new route_entry: app-ns-static-route-10.244.0.2
decl_static_route_handler/210: Adding gateway: app-ns-static-route-10.244.0.2
decl_static_route_handler/236: route is unresolved: app-ns-static-route-10.244.0.2 ERR_RTE
<134>Oct 1 04:52:57 f5-tmm-57d4488c4-2x9x4 tmm[15]: 01010058:6: audit log: action: CREATE; UUID: app-ns-static-route-10.244.0.2; event: declTmm.static_route; Error: No error
decl_traffic_matching_criteria_handler/837: Received create tmc message app-ns-gatewayapi-tcp-app-1-0-tmc
FIX:¶
Check the node annotation:
Annotate the Host Node with the IP address of the VF interface that is connected to the DPU/TMM. This annotation is required so that static routes are created for sending traffic to the TMM pod on the DPU. The IP address must be in the same CIDR range as the internal network.
For example, 192.20.28.146/22 is the IP address on the Host Node Virtual Function (VF) interface that is connected to the DPU node through the sf_internal bridge.
Run the following command to annotate the Host Node:
kubectl annotate node arm64-sm-2 k8s.ovn.org/node-primary-ifaddr='{"ipv4":"192.20.28.146/22"}'
Note: The IP address is the IP address of the VF interface, and the node is the name of the node where the application pod is running.
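To confirm that the annotation was applied, you can inspect the node with standard kubectl (the node name below is the one from the example above; substitute your own):
kubectl describe node arm64-sm-2 | grep node-primary-ifaddr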
Check the IP address on the interface:
Check whether the external node interface has an IP address in the same CIDR range as your IP address, and verify that you can ping that IP address, as shown in the sketch below.
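A quick way to perform these checks (a sketch; the interface name enp3s0f0v0 is a placeholder for your VF interface, and the address is the example used above):
# Confirm the VF interface carries an address in the internal network CIDR
ip addr show enp3s0f0v0
# Confirm the address responds to ping
ping -c 3 192.20.28.146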
Check the application pods:
kubectl get pods -n <application namespace>
For example, run the following command to verify that the nginx application is running:
kubectl get pods -w -n app-ns -owide
Sample output:
NAME                            READY   STATUS    RESTARTS   AGE   IP             NODE         NOMINATED NODE   READINESS GATES
nginx-deploy-5798c85b9c-qtqnd   1/1     Running   0          8h    10.244.0.122   arm64-sm-2   <none>           <none>
ERROR: TMM not starting¶
Verify the TMM logs in the f5-tmm container. Run the following command to get the logs:
kubectl logs pod/f5-tmm-cf595cb87-wdfrn -n f5-spk
Sample output:
dpdk: mempool_alloc Successfully created RTE_RING
dpdk: mempool_alloc RTE_RING descriptor count: 262144. MBUF header count: 262143.
xnet_dev [mlx5_core.sf.4]: Kernel driver is already unbound or no such device
xnet_dev [mlx5_core.sf.4]: Error: Failed to open /sys/bus/pci/devices/mlx5_core.sf.4/driver_override
dpdk[mlx5_core.sf.4]: Error: **** Fatal xnet DPDK Driver Configuration Error **** Failed to bind uio_pci_generic driver
TMM clock is 0 seconds from system time
ticks since last clock update: 161
ticks since start of poll: 111850104
TMM version: no_pgo aarch64 TMM Version 0.1010.1+0.1.5 Build Date: Tue Sep 17 17:15:59 2024
FIX:¶
Perform the following checks to debug the TMM not starting issue:
Verify whether the vfio_pci kernel module is loaded on the DPU node. If it is not, load it with modprobe vfio_pci.
Sample output:
root@localhost:~# lsmod | grep vfio
root@localhost:~# modprobe vfio_pci
root@localhost:~# lsmod | grep vfio
vfio_pci               16384  0
vfio_pci_core          69632  1 vfio_pci
vfio_virqfd            20480  1 vfio_pci_core
vfio_iommu_type1       49152  0
vfio                   45056  2 vfio_pci_core,vfio_iommu_type1
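To keep the module loaded across reboots, one common approach (generic Linux configuration, not specific to this product; the file name is just an example) is to add it to modules-load.d:
echo vfio_pci | sudo tee /etc/modules-load.d/vfio_pci.conf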
Verify whether the SRIOV plugin is creating the scalable function (SF) resources for K8S.
Run the following command to verify:
kubectl get pods -n kube-system -owide
Sample output:
NAME                              READY   STATUS    RESTARTS   AGE   IP              NODE                    NOMINATED NODE   READINESS GATES
coredns-76f75df574-4pnbt          1/1     Running   0          28h   10.244.0.35     sm-mgx1                 <none>           <none>
coredns-76f75df574-pgjx4          1/1     Running   0          28h   10.244.0.36     sm-mgx1                 <none>           <none>
etcd-sm-mgx1                      1/1     Running   19         28h   10.144.47.136   sm-mgx1                 <none>           <none>
kube-apiserver-sm-mgx1            1/1     Running   15         28h   10.144.47.136   sm-mgx1                 <none>           <none>
kube-controller-manager-sm-mgx1   1/1     Running   2          28h   10.144.47.136   sm-mgx1                 <none>           <none>
kube-proxy-9hnxf                  1/1     Running   0          28h   10.144.47.137   localhost.localdomain   <none>           <none>
kube-proxy-zzbjd                  1/1     Running   0          28h   10.144.47.136   sm-mgx1                 <none>           <none>
kube-scheduler-sm-mgx1            1/1     Running   18         28h   10.144.47.136   sm-mgx1                 <none>           <none>
kube-sriov-device-plugin-sgjl2    1/1     Running   0          7h    10.144.47.137   localhost.localdomain   <none>           <none>
kube-sriov-device-plugin-vgnkt    1/1     Running   0          7h    10.144.47.136   sm-mgx1                 <none>           <none>
Check the kube-sriov-device-plugin logs on the Host Node. Run the following command:
kubectl logs pod/kube-sriov-device-plugin-vgnkt -n kube-system
Look for the configmap in the logs:
"resourceList": [
    {
        "resourceName": "bf3_p0_sf",
        "resourcePrefix": "nvidia.com",
        "deviceType": "auxNetDevice",
        "selectors": [{
            "vendors": ["15b3"],
            "devices": ["a2dc"],
            "pciAddresses": ["0000:03:00.0"],
            "pfNames": ["p0#1"],
            "auxTypes": ["sf"]
        }]
    },
    {
        "resourceName": "bf3_p1_sf",
        "resourcePrefix": "nvidia.com",
        "deviceType": "auxNetDevice",
        "selectors": [{
            "vendors": ["15b3"],
            "devices": ["a2dc"],
            "pciAddresses": ["0000:03:00.1"],
            "pfNames": ["p1#1"],
            "auxTypes": ["sf"]
        }]
    }
]
}
I1001 09:47:28.773307 1 main.go:68] Initializing resource servers
I1001 09:47:28.773313 1 manager.go:117] number of config: 2
I1001 09:47:28.773327 1 manager.go:121] Creating new ResourcePool: bf3_p0_sf   <----- Trying to create
I1001 09:47:28.773332 1 manager.go:122] DeviceType: auxNetDevice
W1001 09:47:28.774010 1 auxNetDeviceProvider.go:61] auxnetdevice GetDevices(): error creating new device mlx5_core.ctl.0 PCI 0000:01:00.0: "cannot get sfnum for mlx5_core.ctl.0 device: stat /sys/bus/auxiliary/devices/mlx5_core.ctl.0/sfnum: no such file or directory"
W1001 09:47:28.774384 1 auxNetDeviceProvider.go:61] auxnetdevice GetDevices(): error creating new device mlx5_core.eth.0 PCI 0000:01:00.0: "cannot get sfnum for mlx5_core.eth.0 device: stat /sys/bus/auxiliary/devices/mlx5_core.eth.0/sfnum: no such file or directory"
…
I1001 09:47:28.879397 1 manager.go:138] initServers(): selector index 0 will register 0 devices
I1001 09:47:28.879523 1 manager.go:142] no devices in device pool, skipping creating resource server for bf3_p0_sf   <--- If you see this, then the SF resource is NOT created.
If the configmap is not found, check whether the scalable functions are created properly. Also, verify that the Network Attachment Definition custom resources are created:
kubectl get net-attach-def -A
NAMESPACE   NAME          AGE
f5-spk      sf-external   41s
f5-spk      sf-internal   41s
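You can also confirm that the SF resources are advertised to Kubernetes by checking the node's allocatable resources. This is a generic kubectl check; the resource names come from the configmap example above and may differ in your cluster:
kubectl get node <dpu-node-name> -o jsonpath='{.status.allocatable}'
# Look for entries such as "nvidia.com/bf3_p0_sf" and "nvidia.com/bf3_p1_sf" with a non-zero count.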
ERROR: VLAN¶
While trying to create a VLAN, the following error is observed:
octo@arm64-sm-2:~/orchestrator$ kubectl apply -f vlan.yaml
Error from server (InternalError): error when creating "vlan.yaml": Internal error occurred: failed calling webhook "f5validate.f5net.com": failed to call webhook: Post "https://f5-validation-svc.default.svc:5000/f5-validator?timeout=10s": tls: failed to verify certificate: x509: certificate has expired or is not yet valid: current time 2024-10-18T18:48:38Z is after 2024-10-18T05:28:30Z
FIX:¶
Check whether the clusterissuer is created. To check, run the following command:
kubectl get clusterissuer
No resources found
Note: The error from the validating webhook indicates that the webhook cannot authenticate with the API server. It comes from the validating webhook, not the conversion webhook.
Check the CRD installation values. If the values are misconfigured, revisit the validating webhook configuration.
Set the annotation properly in the validatingwebhookconfiguration so that the cainjector injects the ca presented to the API server. Check the validatingwebhookconfiguration:
kubectl get validatingwebhookconfiguration f5validate-default -o yaml | grep cert
cert-manager.io/inject-ca-from: default/tls-f5ingress-webhookvalidating-svr
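It can also help to confirm that cert-manager itself is healthy, since the cainjector is one of its components. A generic check, assuming cert-manager is installed in the cert-manager namespace:
kubectl get pods -n cert-manager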
Check for the secrets that should contain the ca. Run the following command in the terminal:
kubectl get secret | grep tls-f5ingress-webhookvalidating-svr
Sample output:
tls-f5ingress-webhookvalidating-svr-98clw    Opaque              1   20m
tls-f5ingress-webhookvalidating-svr-secret   kubernetes.io/tls   3   26d
The first entry in the sample output indicates that the issuer/clusterissuer is not ready; hence, the secret name is appended with a random suffix. The second entry indicates that secrets from a previous installation were not properly deleted.
Delete the secrets with the helm uninstall f5ingress command or by deleting the spkinstance CR. The secrets remain in the cluster even if the pods or deployments were deleted manually. The sketch below shows one way to find and remove the leftover secrets.
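A minimal clean-up sketch, assuming the leftover secrets live in the default namespace and all begin with the tls- prefix; review the listed secrets before deleting anything:
# List the leftover TLS secrets
kubectl get secrets -n default | grep '^tls-'
# Delete them once confirmed stale (-r prevents running delete if nothing matches)
kubectl get secrets -n default -o name | grep 'secret/tls-' | xargs -r kubectl delete -n default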
Clean up the environment and ensure that all the tls-* secrets are deleted. Configure the ca secret and create the clusterissuer for Cert Manager. To create the clusterissuer, follow the instructions provided in the Create Clusterissuer and Certificates guide.
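For reference, a minimal CA-based ClusterIssuer sketch using the standard cert-manager v1 API; the issuer name and the CA secret name are placeholders, and the referenced guide remains the authoritative procedure:
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: cluster-ca-issuer          # placeholder name
spec:
  ca:
    secretName: ca-key-pair        # placeholder: secret containing the tls.crt and tls.key of your CA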
ERROR: Cannot join DPU node to Kubernetes cluster¶
The following error may occur while trying to join the DPU node to the Kubernetes cluster:
ubuntu@localhost:~$ sudo kubeadm join 10.144.47.34:6443 --token xed0f5.9pvw5csqu9whfup0 --discovery-token-ca-cert-hash sha256:9bb2107501225c0613d30c5bb1b47b89aa3457c86f4708cce9354ef225e06496
[preflight] Running pre-flight checks
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR FileContent--proc-sys-net-bridge-bridge-nf-call-iptables]: /proc/sys/net/bridge/bridge-nf-call-iptables does not exist
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher
FIX:¶
Load the br_netfilter kernel module and enable the required kernel settings:
sudo su
modprobe br_netfilter
echo 1 > /proc/sys/net/bridge/bridge-nf-call-iptables
echo 1 > /proc/sys/net/ipv4/ip_forward
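These settings do not survive a reboot. A common way to make them persistent (generic Linux configuration, not specific to this product; the file names are examples):
echo br_netfilter | sudo tee /etc/modules-load.d/br_netfilter.conf
printf 'net.bridge.bridge-nf-call-iptables = 1\nnet.ipv4.ip_forward = 1\n' | sudo tee /etc/sysctl.d/99-kubernetes.conf
sudo sysctl --system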
For more information on this error and the troubleshooting steps, refer to the Kubernetes community forums.
ERROR: DPU does not have management IP for internet access¶
When the DPU does not have a management interface configured, it has no access to the internet or to the Host Node.
FIX:¶
Configure Kubernetes to advertise on the tmfifo_net0 interface and add an iptables rule to send DPU traffic through the Host.
On the host¶
Run the following command on the Host:
iptables -t nat -I POSTROUTING -o eno1 -j MASQUERADE
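To confirm that the NAT rule is in place (a generic iptables check; the eno1 interface name comes from the command above and may differ on your host):
iptables -t nat -L POSTROUTING -n -v | grep MASQUERADE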
The fix is to have Kubernetes advertise on the RShim tmfifo_net0 interface. Start Kubernetes with --apiserver-advertise-address=192.168.100.1 so that the Kubernetes API is advertised on the tmfifo_net0 interface and the DPU can reach it:
sudo kubeadm init --apiserver-advertise-address=192.168.100.1 --pod-network-cidr=10.244.0.0/16
Route all traffic through the eno1 interface of the Host Node.
Change the DNS server to 1.1.1.1 on the DPU node. To do so, edit the /etc/netplan/50-cloud-init.yaml file and change the nameservers for oob_net0 to 1.1.1.1.
vi /etc/netplan/50-cloud-init.yaml
# This file is generated from information provided by the datasource. Changes
# to it will not persist across an instance reboot. To disable cloud-init's
# network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
network:
  ethernets:
    oob_net0:
      dhcp4: true
    tmfifo_net0:
      addresses:
      - 192.168.100.2/30
      dhcp4: false
      nameservers:
        addresses:
        - 1.1.1.1
      routes:
      - metric: 1025
        to: 0.0.0.0/0
        via: 192.168.100.1
  renderer: NetworkManager
  version: 2
Now, run the following command:
netplan apply
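To verify that the DPU now has internet access and working name resolution (a generic check; the hostname used here is only an example):
ping -c 3 1.1.1.1
getent hosts ghcr.io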
ERROR: multus error¶
The following error may occur if the multus CNI plugin is not installed properly.
kubectl get pods -A -owide
Sample output
...
kube-system kube-multus-ds-jt9jk 0/1 Init:CrashLoopBackOff 5 (2m32s ago) 5m25s 10.146.45.144 arm64-sm-2 <none> <none>
kube-system kube-multus-ds-vmfws 1/1 Running 0 9d 10.146.45.145 localhost.localdomain <none>
FIX:¶
Try the following method to troubleshoot the multus pod CrashLoopBackOff error.
kubectl describe pod/kube-multus-ds-jt9jk -n kube-system
...
Init Containers:
install-multus-binary:
Container ID: containerd://841c0318f6265009e82bc6c82f41f9e27ebb80bde1ae31c6c4c6693c35562791
Image: ghcr.io/k8snetworkplumbingwg/multus-cni:snapshot-thick
Image ID: ghcr.io/k8snetworkplumbingwg/multus-cni@sha256:6879c6efc5dddd562617168f9394480793a26db34180fb19966c8fdd98022bb4
Port: <none>
Host Port: <none>
Command:
cp
/usr/src/multus-cni/bin/multus-shim
/host/opt/cni/bin/multus-shim
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Message: cp: cannot create regular file '/host/opt/cni/bin/multus-shim': Text file busy
Exit Code: 1
Started: Mon, 28 Oct 2024 16:39:22 +0000
Finished: Mon, 28 Oct 2024 16:39:22 +0000
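The Text file busy message means the existing multus-shim binary under /host/opt/cni/bin is still in use while the init container tries to overwrite it. One possible workaround (an assumption based on general Kubernetes practice, not a documented fix for this product) is to delete the failing pod so the DaemonSet recreates it and the copy is retried:
kubectl delete pod kube-multus-ds-jt9jk -n kube-system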
ERROR: BIG-IP Next for Kubernetes not starting up¶
Describe the pod to check the events:
kubectl describe pod/spkinstance-sample-f5ingress-5d7b978f86-xxw9k
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m11s default-scheduler Successfully assigned default/spkinstance-sample-f5ingress-5d7b978f86-xxw9k to sm-mgx2
Warning FailedMount 4m10s (x2 over 4m10s) kubelet MountVolume.SetUp failed for volume "tls-f5-lic-helper-grpc-svr-volume" : secret "tls-f5-lic-helper-grpc-svr-secret" not found
Warning FailedMount 4m10s (x2 over 4m10s) kubelet MountVolume.SetUp failed for volume "tls-f5ingress-grpc-clt-volume" : secret "tls-f5ingress-grpc-clt-secret" not found
Warning FailedMount 4m9s (x3 over 4m10s) kubelet MountVolume.SetUp failed for volume "external-f5ingotelsvr-volume" : secret "external-f5ingotelsvr-secret" not found
Warning FailedMount 4m9s (x3 over 4m10s) kubelet MountVolume.SetUp failed for volume "tls-tpm-grpc-svr-volume" : secret "tls-tpm-grpc-svr-secret" not found
Warning FailedMount 4m9s (x3 over 4m10s) kubelet MountVolume.SetUp failed for volume "tls-tpm-grpc-clt-volume" : secret "tls-tpm-grpc-clt-secret" not found
Warning FailedMount 4m9s (x3 over 4m10s) kubelet MountVolume.SetUp failed for volume "tls-qkviewfluentbitcontroller-grpc-svr-volume" : secret "tls-qkviewfluentbitcontroller-grpc-svr-secret" not found
Warning FailedMount 4m9s (x3 over 4m10s) kubelet MountVolume.SetUp failed for volume "tls-f5-controller-grpc-svr-volume" : secret "tls-f5-controller-grpc-svr-secret" not found
Warning FailedMount 4m9s (x3 over 4m10s) kubelet MountVolume.SetUp failed for volume "tls-f5-lic-helper-amqp-clt-volume" : secret "tls-f5-lic-helper-amqp-clt-secret" not found
Warning FailedMount 4m9s (x3 over 4m10s) kubelet MountVolume.SetUp failed for volume "tls-f5ingress-webhookvalidating-svr-volume" : secret "tls-f5ingress-webhookvalidating-svr-secret" not found
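The FailedMount events show that the expected TLS secrets do not exist yet, which usually points back to the Cert Manager/clusterissuer issues described in the VLAN error fix above. A generic check of which of these secrets are present (the default namespace is an assumption):
kubectl get secrets -n default | grep -E 'tls-|external-f5ingotelsvr'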
ERROR: TMM OOMKilled/CrashLoopBackOff¶
In the TMM pod, the following OOMKilled or CrashLoopBackOff error is observed.
default f5-tmm-dpu-594f5f9d4-lp8mq 3/4 OOMKilled
Fix:¶
If you observe an OOMKilled or CrashLoopBackOff error in the TMM pod, perform the following steps to fix the issue:
Check the TMM Deployment configuration.
Sample output:
- name: USE_PHYS_MEM
  value: "true"
- name: TMM_GENERIC_SOCKET_DRIVER
  value: "false"
- name: TMM_MAPRES_ADDL_VETHS_ON_DP
  value: "false"
Remove both the TMM_GENERIC_SOCKET_DRIVER and TMM_MAPRES_ADDL_VETHS_ON_DP environment variables from the configuration, as shown in the sketch below.
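A minimal sketch of one way to remove the two variables, assuming the deployment is named f5-tmm in the f5-spk namespace (adjust both to your environment); a trailing dash tells kubectl set env to unset a variable:
kubectl -n f5-spk set env deployment/f5-tmm TMM_GENERIC_SOCKET_DRIVER- TMM_MAPRES_ADDL_VETHS_ON_DP-
If the deployment is managed by an operator or by Helm values, make the equivalent change there instead so it is not reverted.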
ERROR: Persistence Problems¶
The fluentd and CWC pods have unbound immediate PersistentVolumeClaims, which can cause an error.
FIX:¶
If persistence is enabled in fluentd or CWC, you must create the persistent volumes that these claims bind to, as sketched below.
For fluentd: If there is a PVC named "f5-toda-fluentd" in the toda pod, a persistent volume for this claim has to be created.
For CWC: If there is a PVC named "cluster-wide-controller" in the CWC pod, a persistent volume for this claim has to be created.
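A minimal PersistentVolume sketch for the fluentd claim, assuming a hostPath volume; the name, size, and path are placeholders and must match what the PVC actually requests in your deployment (including any storage class):
apiVersion: v1
kind: PersistentVolume
metadata:
  name: f5-toda-fluentd-pv            # placeholder name
spec:
  capacity:
    storage: 10Gi                     # must satisfy the PVC request
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /var/lib/f5/fluentd         # placeholder path on the node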
TIP: Inspect TMM start command to verify thread count¶
On the Host, exec into the TMM debug container and inspect the tmm process. The --cpu argument in the process command line shows which CPUs TMM is pinned to, and therefore the thread count (for example, --cpu 0-1 in the output below indicates two TMM threads).
octo@2m2d0t0191:~/orchestrator$ kubectl exec -it deploy/f5-tmm -c debug -- bash
ps aufx
...
root 15 24.8 0.6 70516580 225184 ? SLl 19:58 32:03 \_ tmm.0 --cpu 0-1 --memsize 3072 --physmem --platform Z100 --ignore-bigstart -e --temp-platform-Z102
65535 1 0.0 0.0 792 4 ? Ss 19:58 0:00 /pause
ERROR: f5ingress and f5tmm terminating¶
If the f5ingress and f5tmm pods are terminating, try the following steps to troubleshoot this issue.
Fix:¶
Check the spkinstance-resource.yaml for the controller watchNamespace:
controller:
  watchNamespace: "app-ns"
Here the watchNamespace is app-ns.
Check whether the namespace exists in the cluster. If it does not, create it:
kubectl create namespace app-ns
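A quick way to perform the existence check with standard kubectl (the command returns NotFound if the namespace is missing):
kubectl get namespace app-ns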