Troubleshooting

This section describes how to troubleshoot and fix some common issues. The following are some of the possible error messages, with steps to diagnose and resolve each one.

Troubleshooting common error scenarios

ERROR: Ingress Traffic is not working

Cause: No static route is created or TMM is rejecting the static route.

TMM logs:

Run the following command to get the TMM logs:

  kubectl logs deploy/f5-tmm -c f5-tmm -f

Sample output:

  ...

  decl_static_route_handler/174: Creating new route_entry: app-ns-static-route-10.244.0.2

  decl_static_route_handler/210: Adding gateway: app-ns-static-route-10.244.0.2

  decl_static_route_handler/236: route is unresolved: app-ns-static-route-10.244.0.2 ERR_RTE

  <134>Oct  1 04:52:57 f5-tmm-57d4488c4-2x9x4 tmm[15]: 01010058:6: audit log: action: CREATE; UUID: app-ns-static-route-10.244.0.2; event: declTmm.static_route; Error: No error

  decl_traffic_matching_criteria_handler/837: Received create tmc message app-ns-gatewayapi-tcp-app-1-0-tmc

FIX:

  1. Check node annotation:

    • Add an annotation to the Host Node with the IP address of the VF interface that is connected to the DPU/TMM.

    • This annotation is required so that static routes are created for sending traffic to the TMM pod on the DPU. The IP address must be in the same CIDR range as the internal network.

    • For example, 192.20.28.146/22 is the IP on the Host Node Virtual Function (VF) interface that is connected to the DPU node through sf_internal bridge.

    • Run the following command to annotate the Host Node:

      kubectl annotate node arm64-sm-2 k8s.ovn.org/node-primary-ifaddr='{"ipv4":"192.20.28.146/22"}'

      Note: The IP address is that of the VF interface, and the node is the name of the node where the application pod is running.
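
      To confirm the annotation was applied, you can read it back with kubectl's JSONPath output (the node name matches the example above):

      kubectl get node arm64-sm-2 -o jsonpath='{.metadata.annotations.k8s\.ovn\.org/node-primary-ifaddr}'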

  2. Check IP on interface:

    • Check whether the external Node interface has an IP address in the same CIDR range as your IPAddress, and verify that you can ping the IPAddress (example commands follow the sample output below).

    • Check the application pods in their namespace:

      kubectl get pods -n <application namespace>

    • Run the following command to verify that the nginx application is running:

      kubectl get pods -w -n app-ns -owide

      Sample output:

        NAME                            READY   STATUS    RESTARTS   AGE   IP             NODE         NOMINATED NODE   READINESS GATES
      
        nginx-deploy-5798c85b9c-qtqnd   1/1     Running   0          8h    10.244.0.122   arm64-sm-2   <none>           <none>
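
    • Example commands for these checks (the interface name and address are placeholders for your environment):

      ip addr show <external interface>

      ping -c 3 <IPAddress>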
      

ERROR: TMM not starting

Verify the TMM logs on the f5-tmm container. Run the following command to get the logs:

kubectl logs pod/f5-tmm-cf595cb87-wdfrn -n f5-spk

Sample output:

dpdk: mempool_alloc Successfully created RTE_RING

dpdk: mempool_alloc RTE_RING descriptor count: 262144. MBUF header count: 262143.

xnet_dev [mlx5_core.sf.4]: Kernel driver is already unbound or no such device

xnet_dev [mlx5_core.sf.4]: Error: Failed to open /sys/bus/pci/devices/mlx5_core.sf.4/driver_override

dpdk[mlx5_core.sf.4]: Error: **** Fatal xnet DPDK Driver Configuration Error **** Failed to bind uio_pci_generic driver

TMM clock is 0 seconds from system time

ticks since last clock update: 161

ticks since start of poll: 111850104

TMM version: no_pgo aarch64 TMM Version 0.1010.1+0.1.5 Build Date: Tue Sep 17 17:15:59 2024

FIX:

Perform the following checks to debug the TMM startup issue:

  1. Verify whether the vfio_pci kernel module is loaded on the DPU node:

    Sample output:

    root@localhost:~# lsmod | grep vfio
    
    root@localhost:~# modprobe vfio_pci
    
    root@localhost:~# lsmod | grep vfio
    
    vfio_pci               16384  0
    
    vfio_pci_core          69632  1 vfio_pci
    
    vfio_virqfd            20480  1 vfio_pci_core
    
    vfio_iommu_type1       49152  0
    
    vfio                   45056  2 vfio_pci_core,vfio_iommu_type1
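
    To load the module automatically at boot, a minimal sketch assuming a systemd-based distribution that reads /etc/modules-load.d:

    echo vfio_pci > /etc/modules-load.d/vfio_pci.conf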
    
  2. Verify whether the SRIOV plugin is creating the scalable function (SF) resources for K8S.

    Run the following command to verify:

    kubectl get pods -n kube-system -owide

    Sample output:

    NAME                              READY   STATUS    RESTARTS   AGE   IP              NODE                    NOMINATED NODE   READINESS GATES
    
    coredns-76f75df574-4pnbt          1/1     Running   0          28h   10.244.0.35     sm-mgx1                 <none>           <none>
    
    coredns-76f75df574-pgjx4          1/1     Running   0          28h   10.244.0.36     sm-mgx1                 <none>           <none>
    
    etcd-sm-mgx1                      1/1     Running   19         28h   10.144.47.136   sm-mgx1                 <none>           <none>
    
    kube-apiserver-sm-mgx1            1/1     Running   15         28h   10.144.47.136   sm-mgx1                 <none>           <none>
    
    kube-controller-manager-sm-mgx1   1/1     Running   2          28h   10.144.47.136   sm-mgx1                 <none>           <none>
    
    kube-proxy-9hnxf                  1/1     Running   0          28h   10.144.47.137   localhost.localdomain   <none>           <none>
    
    kube-proxy-zzbjd                  1/1     Running   0          28h   10.144.47.136   sm-mgx1                 <none>           <none>
    
    kube-scheduler-sm-mgx1            1/1     Running   18         28h   10.144.47.136   sm-mgx1                 <none>           <none>
    
    kube-sriov-device-plugin-sgjl2    1/1     Running   0          7h    10.144.47.137   localhost.localdomain   <none>           <none>
    
    kube-sriov-device-plugin-vgnkt    1/1     Running   0          7h    10.144.47.136   sm-mgx1                 <none>           <none>
    
  3. Check the logs of the kube-sriov-device-plugin pod running on the Host Node.

    Run the following command to check:

    kubectl logs pod/kube-sriov-device-plugin-vgnkt -n kube-system

    Look for the configmap in the logs:

    "resourceList": [
    
            {
    
                 "resourceName": "bf3_p0_sf",
    
                  "resourcePrefix": "nvidia.com",
    
                  "deviceType": "auxNetDevice",
    
                  "selectors": [{
    
                      "vendors": ["15b3"],
    
                      "devices": ["a2dc"],
    
                      "pciAddresses": ["0000:03:00.0"],
    
                      "pfNames": ["p0#1"],
    
                      "auxTypes": ["sf"]
    
                  }]
    
              },
    
              {
    
                 "resourceName": "bf3_p1_sf",
    
                  "resourcePrefix": "nvidia.com",
    
                  "deviceType": "auxNetDevice",
    
                  "selectors": [{
    
                      "vendors": ["15b3"],
    
                      "devices": ["a2dc"],
    
                      "pciAddresses": ["0000:03:00.1"],
    
                      "pfNames": ["p1#1"],
    
                      "auxTypes": ["sf"]
    
                  }]
    
              }
    
        ]
    
    }
    
    I1001 09:47:28.773307       1 main.go:68] Initializing resource servers
    
    I1001 09:47:28.773313       1 manager.go:117] number of config: 2
    
    I1001 09:47:28.773327       1 manager.go:121] Creating new ResourcePool: bf3_p0_sf  <----- Trying to create
    
    I1001 09:47:28.773332       1 manager.go:122] DeviceType: auxNetDevice
    
    W1001 09:47:28.774010       1 auxNetDeviceProvider.go:61] auxnetdevice GetDevices(): error creating new device mlx5_core.ctl.0 PCI 0000:01:00.0: "cannot get sfnum for mlx5_core.ctl.0 device: stat /sys/bus/auxiliary/devices/mlx5_core.ctl.0/sfnum: no such file or directory"
    
    W1001 09:47:28.774384       1 auxNetDeviceProvider.go:61] auxnetdevice GetDevices(): error creating new device mlx5_core.eth.0 PCI 0000:01:00.0: "cannot get sfnum for mlx5_core.eth.0 device: stat /sys/bus/auxiliary/devices/mlx5_core.eth.0/sfnum: no such file or directory"
    
    …
    
    I1001 09:47:28.879397       1 manager.go:138] initServers(): selector index 0 will register 0 devices
    
    I1001 09:47:28.879523       1 manager.go:142] no devices in device pool, skipping creating resource server for bf3_p0_sf <--- If you see this, then the SF resource is NOT created.
    

    If the configmap is not found, check whether the scalable functions are created properly. Also, verify that the Network Attachment Definition custom resources are created:

    kubectl get net-attach-def -A

    NAMESPACE   NAME          AGE
    
    f5-spk      sf-external   41s
    
    f5-spk      sf-internal   41s
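
    You can also confirm that the SF resources are advertised on the DPU node (the node name is illustrative; the resource names follow the resourcePrefix and resourceName values in the configmap above):

    kubectl get node localhost.localdomain -o jsonpath='{.status.allocatable}'

    The output should list nvidia.com/bf3_p0_sf and nvidia.com/bf3_p1_sf with non-zero counts.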
    

ERROR: VLAN creation fails

While trying to create a VLAN, the following error is observed:

octo@arm64-sm-2:~/orchestrator$ kubectl apply -f vlan.yaml

Error from server (InternalError): error when creating "vlan.yaml": Internal error occurred: failed calling webhook "f5validate.f5net.com": failed to call webhook: Post "https://f5-validation-svc.default.svc:5000/f5-validator?timeout=10s": tls: failed to verify certificate: x509: certificate has expired or is not yet valid: current time 2024-10-18T18:48:38Z is after 2024-10-18T05:28:30Z

FIX:

  1. Check whether the clusterissuer is created. To check, run the following command:

    kubectl get clusterissuer
    
    No resources found
    

    Note: The error from the validating webhook indicates that the webhook cannot authenticate with the API server. This error comes from the validating webhook, not the conversion webhook.
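
    To confirm the expiry reported in the error, you can decode the serving certificate from its secret (assuming the secret exists and openssl is available):

    kubectl get secret tls-f5ingress-webhookvalidating-svr-secret -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates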

  2. Check the CRD installation values. If the values are misconfigured, go back to the validating webhook configuration.

  3. Set the annotation properly on the validatingwebhookconfiguration so that the cert-manager cainjector injects the CA for the API server.

  4. Check for validatingwebhookconfiguration.

    kubectl get validatingwebhookconfiguration f5validate-default -o yaml | grep cert
    
    cert-manager.io/inject-ca-from: default/tls-f5ingress-webhookvalidating-svr
    
  5. Check for secrets which should contain the ca. Run the following command in the terminal:

    kubectl get secret  | grep tls-f5ingress-webhookvalidating-svr
    

    Sample output:

    tls-f5ingress-webhookvalidating-svr-98clw         Opaque                           1      20m
    
    tls-f5ingress-webhookvalidating-svr-secret        kubernetes.io/tls                3      26d
    
    • The first entry in the sample output indicates that the issuer/clusterissuer is not ready; hence the secret name is appended with a random suffix.

    • The second entry indicates that the secrets were not properly deleted during a previous installation.

      Delete the secrets with the helm uninstall f5ingress command, or delete the spkinstance CR.

    • The secrets remain in the cluster even if the pods or deployments were deleted manually earlier.

  6. Clean up the environment and ensure that all the tls-* secrets are deleted.

  7. Configure the CA secret and create the clusterissuer for Cert Manager. To create the clusterissuer, follow the instructions in the Create Clusterissuer and Certificates guide.

ERROR: Cannot join DPU node to Kubernetes cluster

The following error may occur while trying to join DPU node to the Kubernetes cluster:

ubuntu@localhost:~$ sudo kubeadm join 10.144.47.34:6443 --token xed0f5.9pvw5csqu9whfup0 --discovery-token-ca-cert-hash sha256:9bb2107501225c0613d30c5bb1b47b89aa3457c86f4708cce9354ef225e06496

[preflight] Running pre-flight checks

error execution phase preflight: [preflight] Some fatal errors occurred:

	[ERROR FileContent--proc-sys-net-bridge-bridge-nf-call-iptables]: /proc/sys/net/bridge/bridge-nf-call-iptables does not exist

[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`

To see the stack trace of this error execute with --v=5 or higher

FIX:

Load the br_netfilter kernel module and enable bridged packet filtering and IP forwarding:

  sudo su 

  modprobe br_netfilter

  echo 1 > /proc/sys/net/bridge/bridge-nf-call-iptables

  echo 1 > /proc/sys/net/ipv4/ip_forward
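
  These settings do not persist across reboots. A minimal sketch to persist them, assuming a systemd-based distribution:

  echo br_netfilter > /etc/modules-load.d/k8s.conf

  printf 'net.bridge.bridge-nf-call-iptables = 1\nnet.ipv4.ip_forward = 1\n' > /etc/sysctl.d/k8s.conf

  sysctl --system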

For more information on this error and troubleshooting steps, refer to the Kubernetes community forums.

ERROR: DPU does not have management IP for internet access

When the DPU does not have a management interface configured, it will not have access to the internet nor to the Host Node.

FIX:

Configure Kubernetes to advertise on the tmfifo_net0 interface and add an iptables rule to send DPU traffic through the Host.

On the host

  1. Run the following command on the Host:

    iptables -t nat -I POSTROUTING -o eno1 -j MASQUERADE

  2. Have Kubernetes advertise on the RShim tmfifo_net0 interface.

    Start Kubernetes with --apiserver-advertise-address=192.168.100.1 to advertise the Kubernetes API on the tmfifo_net0 interface so that the DPU can reach it:

    sudo kubeadm init --apiserver-advertise-address=192.168.100.1 --pod-network-cidr=10.244.0.0/16

    This routes all traffic through the eno1 interface of the Host Node.

  3. Change the DNS server to 1.1.1.1 on the DPU node. To change, edit the /etc/netplan/50-cloud-init.yaml file and change the nameservers for oob_net0 to 1.1.1.1.

    vi /etc/netplan/50-cloud-init.yaml

    # This file is generated from information provided by the datasource.  Changes
    
    # to it will not persist across an instance reboot.  To disable cloud-init's
    
    # network configuration capabilities, write a file
    
    # /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
    
    # network: {config: disabled}
    
    network:
    
        ethernets:
    
            oob_net0:
    
                dhcp4: true
    
            tmfifo_net0:
    
                addresses:
    
                - 192.168.100.2/30
    
                dhcp4: false
    
                nameservers:
    
                    addresses:
    
                    - 1.1.1.1 
    
                routes:
    
                -   metric: 1025
    
                    to: 0.0.0.0/0
    
                    via: 192.168.100.1
    
        renderer: NetworkManager
    
        version: 2
    
  4. Apply the netplan configuration by running the following command:

    netplan apply
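
    To verify that the DPU now has internet access and DNS resolution (illustrative checks):

    ping -c 3 1.1.1.1

    nslookup f5.com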

ERROR: Multus CNI error

The following error may occur if the Multus CNI plugin is not installed properly.

kubectl get pods -A -owide

Sample output:

...

kube-system    kube-multus-ds-jt9jk                            0/1     Init:CrashLoopBackOff    5 (2m32s ago)     5m25s   10.146.45.144   arm64-sm-2              <none>           <none>

kube-system    kube-multus-ds-vmfws                            1/1     Running                  0                 9d      10.146.45.145   localhost.localdomain   <none>

FIX:

Try the following to troubleshoot the Multus pod CrashLoopBackOff error. Describe the failing pod:

kubectl describe pod/kube-multus-ds-jt9jk -n kube-system
...

Init Containers:

  install-multus-binary:

    Container ID:  containerd://841c0318f6265009e82bc6c82f41f9e27ebb80bde1ae31c6c4c6693c35562791

    Image:         ghcr.io/k8snetworkplumbingwg/multus-cni:snapshot-thick

    Image ID:      ghcr.io/k8snetworkplumbingwg/multus-cni@sha256:6879c6efc5dddd562617168f9394480793a26db34180fb19966c8fdd98022bb4

    Port:          <none>

    Host Port:     <none>

    Command:

        cp

        /usr/src/multus-cni/bin/multus-shim

        /host/opt/cni/bin/multus-shim

    State:       Waiting

      Reason:    CrashLoopBackOff

    Last State:  Terminated

      Reason:    Error

      Message:   cp: cannot create regular file '/host/opt/cni/bin/multus-shim': Text file busy

      Exit Code:    1

      Started:      Mon, 28 Oct 2024 16:39:22 +0000

      Finished:     Mon, 28 Oct 2024 16:39:22 +0000
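
The "Text file busy" message means the init container cannot overwrite the multus-shim binary while it is still in use on the host. A common remedy (an assumption based on the error above, not an official procedure) is to remove the stale binary on the affected node and then delete the crashing pod so the DaemonSet recreates it:

# On the affected node (arm64-sm-2 in this example); the path follows the pod spec above
rm /opt/cni/bin/multus-shim

kubectl delete pod kube-multus-ds-jt9jk -n kube-system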

ERROR: BIG-IP Next for Kubernetes not starting up

kubectl describe pod/spkinstance-sample-f5ingress-5d7b978f86-xxw9k

Events:

  Type     Reason       Age                    From               Message

  ----     ------       ----                   ----               -------

  Normal   Scheduled    4m11s                  default-scheduler  Successfully assigned default/spkinstance-sample-f5ingress-5d7b978f86-xxw9k to sm-mgx2

  Warning  FailedMount  4m10s (x2 over 4m10s)  kubelet            MountVolume.SetUp failed for volume "tls-f5-lic-helper-grpc-svr-volume" : secret "tls-f5-lic-helper-grpc-svr-secret" not found

  Warning  FailedMount  4m10s (x2 over 4m10s)  kubelet            MountVolume.SetUp failed for volume "tls-f5ingress-grpc-clt-volume" : secret "tls-f5ingress-grpc-clt-secret" not found

  Warning  FailedMount  4m9s (x3 over 4m10s)   kubelet            MountVolume.SetUp failed for volume "external-f5ingotelsvr-volume" : secret "external-f5ingotelsvr-secret" not found

  Warning  FailedMount  4m9s (x3 over 4m10s)   kubelet            MountVolume.SetUp failed for volume "tls-tpm-grpc-svr-volume" : secret "tls-tpm-grpc-svr-secret" not found

  Warning  FailedMount  4m9s (x3 over 4m10s)   kubelet            MountVolume.SetUp failed for volume "tls-tpm-grpc-clt-volume" : secret "tls-tpm-grpc-clt-secret" not found

  Warning  FailedMount  4m9s (x3 over 4m10s)   kubelet            MountVolume.SetUp failed for volume "tls-qkviewfluentbitcontroller-grpc-svr-volume" : secret "tls-qkviewfluentbitcontroller-grpc-svr-secret" not found

  Warning  FailedMount  4m9s (x3 over 4m10s)   kubelet            MountVolume.SetUp failed for volume "tls-f5-controller-grpc-svr-volume" : secret "tls-f5-controller-grpc-svr-secret" not found

  Warning  FailedMount  4m9s (x3 over 4m10s)   kubelet            MountVolume.SetUp failed for volume "tls-f5-lic-helper-amqp-clt-volume" : secret "tls-f5-lic-helper-amqp-clt-secret" not found

  Warning  FailedMount  4m9s (x3 over 4m10s)   kubelet            MountVolume.SetUp failed for volume "tls-f5ingress-webhookvalidating-svr-volume" : secret "tls-f5ingress-webhookvalidating-svr-secret" not found
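
FIX:

The FailedMount warnings indicate that the cert-manager-issued tls-* secrets were never created. Verify the clusterissuer and certificates as described in the ERROR: VLAN creation fails section above, for example:

kubectl get clusterissuer

kubectl get secret | grep tls-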

ERROR: TMM OOMKilled/CrashLoopBackOff

In the TMM pod, the following OOMKilled or CrashLoopBackOff error is observed:

default              f5-tmm-dpu-594f5f9d4-lp8mq                     3/4     OOMKilled

FIX:

If you observe an OOMKilled or CrashLoopBackOff error in the TMM pod, perform the following steps to fix this issue:

  1. Check TMM Deployment configuration.

    Sample output:

    - name: USE_PHYS_MEM
    
      value: "true"
    
    - name: TMM_GENERIC_SOCKET_DRIVER
    
      value: "false"
    
    - name: TMM_MAPRES_ADDL_VETHS_ON_DP
    
      value: "false"
    
  2. Remove both the TMM_GENERIC_SOCKET_DRIVER and TMM_MAPRES_ADDL_VETHS_ON_DP from the configuration.
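
    One way to remove them directly from a running deployment, a sketch assuming the deployment is named f5-tmm in the current namespace (the trailing '-' removes a variable):

    kubectl set env deployment/f5-tmm TMM_GENERIC_SOCKET_DRIVER- TMM_MAPRES_ADDL_VETHS_ON_DP-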

ERROR: Persistence Problems

The fluentd and CWC pods have unbound immediate PersistentVolumeClaims, which can cause an error.

FIX:

If persistence is enabled in fluentd or CWC, the user must create the persistent volumes that these objects depend on (see the sketch after this list).

  • For fluentd: If there is a PVC named "f5-toda-fluentd" in the toda pod, a persistent volume for this claim has to be created.

  • For CWC: If there is a PVC named "cluster-wide-controller" in the CWC pod, a persistent volume for this claim has to be created.
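
A minimal PersistentVolume sketch for the fluentd claim; the name, capacity, and hostPath are illustrative assumptions, so match the claim's requested size and storage class in your environment:

  apiVersion: v1
  kind: PersistentVolume
  metadata:
    name: f5-toda-fluentd-pv
  spec:
    capacity:
      storage: 10Gi
    accessModes:
      - ReadWriteOnce
    hostPath:
      path: /var/lib/f5-toda-fluentd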

TIP: Inspect TMM start command to verify thread count

On the Host

octo@2m2d0t0191:~/orchestrator$ kubectl exec -it deploy/f5-tmm -c debug -- bash

ps aufx

...

root          15 24.8  0.6 70516580 225184 ?     SLl  19:58  32:03  \_ tmm.0 --cpu 0-1 --memsize 3072 --physmem --platform Z100 --ignore-bigstart -e --temp-platform-Z102

65535          1  0.0  0.0    792     4 ?        Ss   19:58   0:00 /pause
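
In this output, the --cpu 0-1 argument shows that TMM is pinned to CPUs 0 and 1, which corresponds to a thread count of two.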

ERROR: f5ingress and f5tmm terminating

If the f5ingress and f5tmm pods are terminating, try the following steps to troubleshoot this issue.

FIX:

  1. Check the spkinstance-resource.yaml for controller watchNamespace:

    controller:
    
        watchNamespace: "app-ns"
    

    Here the watchNamespace is app-ns.

  2. Check whether the namespace is created in the cluster. If not, create it:

      kubectl create namespace app-ns
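
    After creating the namespace, confirm that the pods recover:

      kubectl get pods -A | grep -E 'f5ingress|f5-tmm'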