OpenShift with SR-IOV Network Adapters (Supported)

This guide details how the OpenShift cluster networks are connected and how to validate the environment in preparation for the SPK installation.

Environment Design

This section details the OpenShift environment where SPK is deployed to a worker node with Single Root I/O Virtualization (SR-IOV) network cards.

External clients connect to the external SR-IOV interface on the SPK worker node. Traffic is then routed out of the internal SR-IOV interface on the SPK worker node to the application deployed on the workload worker node through a secondary bridge network (an OVN-Kubernetes configuration).

base environment


Environment requirements

SPK 1.6.0 relies on Red Hat OpenShift 4.10. OpenShift is an opinionated enterprise distribution of Kubernetes that requires a control plane cluster and dedicated worker nodes.

SPK leverages Single Root I/O Virtualization (SR-IOV) and the Open Virtual Network with Kubernetes (OVN-Kubernetes) CNI. In addition, OpenShift provides support for Multus, a CNCF project, with iCNI 2.0. Multus is considered a meta plugin because it supports adding multiple CNI plugins to the same pod. This functionality is required for SPK because the f5-tmm pod requires two SR-IOV interfaces: one connected to the internal network and the other to the external network.
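As an illustrative sketch of how Multus attaches multiple networks, a pod can request additional interfaces through the k8s.v1.cni.cncf.io/networks annotation. The network names and image below are placeholders, not the actual SPK object names:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: multus-example
  annotations:
    # Attach two additional interfaces; the names refer to hypothetical
    # NetworkAttachmentDefinitions available in the pod's namespace.
    k8s.v1.cni.cncf.io/networks: spk-internal, spk-external
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest   # placeholder image
```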

SR-IOV hardware support for SPK is limited by OpenShift. Be sure to compare the list of supported SR-IOV network cards against the cards you have.

You will also need to install the Performance Addon Operator (PAO) and enable Topology Manager. Topology Manager is a component of the kubelet that works with the CPU Manager and Device Manager to allocate the resources assigned to a pod or container. When configuring Topology Manager, keep in mind that the CPU resources assigned to the f5-tmm pod must be on the same NUMA node where the SR-IOV card is physically installed. In some environments the server may have multiple interfaces, but not all of them are necessarily SR-IOV compatible, such as the server's integrated interfaces.

While working with PAO you will also need to configure hugepages. Hugepages are memory pages that support chunks of data larger than the default 4Ki page size. Common hugepage sizes are 2Mi and 1Gi.
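As a sketch only (not the exact profile mandated by the SPK documentation), a PAO PerformanceProfile can reserve 2Mi hugepages and enable a single-NUMA-node topology policy on the SPK nodes. The name, CPU sets, page count, and node selector below are placeholder values for your environment:

```yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: spk-performance               # hypothetical name
spec:
  cpu:
    isolated: "4-15"                  # placeholder CPU set
    reserved: "0-3"                   # placeholder CPU set
  hugepages:
    defaultHugepagesSize: 2M
    pages:
    - size: 2M
      count: 1500                     # roughly one SPK instance's worth
  numa:
    topologyPolicy: single-numa-node
  nodeSelector:
    node-role.kubernetes.io/worker-spk: ""   # placeholder label
```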

Any nodes that will support SPK must have the iptable_raw kernel module enabled.
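One common way to load a kernel module on OpenShift nodes (shown here as an assumption, not necessarily the method your SPK documentation prescribes) is a MachineConfig that writes a modules-load.d entry. The object name is hypothetical:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-iptable-raw         # hypothetical name
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - path: /etc/modules-load.d/iptable_raw.conf
        mode: 0644
        contents:
          # "iptable_raw" base64-encoded
          source: data:text/plain;charset=utf-8;base64,aXB0YWJsZV9yYXcK
```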

You will also need to provide a Persistent Volume. This can be something local and simple, such as the hostPath type, or a deployed storage application such as OpenEBS.
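For a simple lab setup, a hostPath-backed PersistentVolume might look like the following sketch; the name, path, size, and storage class are all placeholders:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: spk-local-pv                  # hypothetical name
spec:
  capacity:
    storage: 10Gi                     # placeholder size
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage     # placeholder class
  hostPath:
    path: /var/local/spk              # placeholder path on the node
```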

Once you have an OpenShift 4.10 or later cluster up and running, you should be able to deploy SPK to the cluster.

Environment validation

If you are using an existing OpenShift environment you may want to review the environment and validate the settings in place. You will need direct access to all the nodes in your cluster and cluster-admin access to a working OpenShift client to complete these tasks.

  1. Check the nodes

    Start by discovering all the nodes in your cluster and confirming that they are healthy and show a status of Ready. The output also includes the OpenShift node ROLE, which typically defaults to master or worker. The nodes labeled with the master role make up the control plane for OpenShift. The worker nodes run the other workloads in your environment. Additional node roles can be defined using labels. These labels can be used for other requirements, such as configuring hugepages on the SPK node.

    oc get nodes 
    oc get nodes -o wide
    oc get nodes --show-labels
    
  2. Check Cluster Operators status

    The Cluster Operators are a group of components that support the base OpenShift environment. For example, the Cluster Network Operator manages the network components, and the Authentication Operator manages authentication to and within the cluster. Problems with any of these Operators may impact SPK. Review the output from the clusteroperators command and ensure all the services show True in the AVAILABLE column and none indicate a DEGRADED status.

    oc get clusteroperators
    oc describe clusteroperators.config.openshift.io authentication
    
  3. Review system events

    By default, OpenShift does not display events in chronological order. You can control this to some extent using shell pipes and filters. Alternatively, you can use --sort-by= to sort the output based on the creation timestamp. It can be helpful to review all the namespaces for events and then sort them by the number of issues reported. You can use the -A option to display all namespaces or -n to select a particular namespace. Using this information you can quickly focus on the namespaces with the most events and sort them by creation timestamp. The command examples below should provide these details for your environment.

    oc get events
    oc get events -A
    oc get events -A | awk '{print $1}' | grep -iv namespace | sort | uniq -c
    oc get events -A --sort-by=.metadata.creationTimestamp
    oc get events -n {NAMESPACE value from previous output} --sort-by=.metadata.creationTimestamp
    
  4. Validate hugepages resources

    By default, each instance of SPK starts two TMM threads, each allocating approximately 1.5GB of memory, for a total of 3GB per SPK instance. SPK takes advantage of hugepages, which need to be configured at 2MB per page. As a result, by default each instance of SPK will consume approximately 1500 hugepages, or about 3GB of RAM. In the following examples you should be able to see a system with hugepages enabled and one without. The lead/control nodes typically do not require this feature, but your SPK node should have hugepages enabled. Replace the node names in the examples with your environment's node names. These values can also be confirmed locally on the host, assuming you have ssh access to those nodes.

    Note: The number of TMM instances deployed is defined in the SPK override file.

    oc describe nodes <workerNodeName> | grep huge
    oc describe nodes <leadNodeName> | grep huge
    

    You can also log in directly to the nodes and view the hugepage settings, available CPUs, and memory.

    oc debug node/<workerNodeName>
    chroot /host
    grep -i huge /sys/devices/system/node/node0/meminfo
    grep -i huge /sys/devices/system/node/node1/meminfo
    lscpu
    free -h
    
  5. NUMA, CPUs, RAM and SR-IOV

    It is likely that the nodes in your cluster are Non-Uniform Memory Access (NUMA) systems with two or more processors. If those systems also have non-SR-IOV-capable network cards, then you should confirm that Topology Manager is only allocating CPUs from the correct NUMA node. The network card is connected via the bus to one of the CPU sockets. Memory on the system is arranged into zones which are allocated to specific CPU sockets. Only the memory connected to a CPU socket that is also connected to an SR-IOV enabled network interface should have hugepages enabled.

    The following commands provide details about the NUMA nodes and the CPUs. When you sign in to your nodes, you should only expect to see hugepages consumed on hosts with SPK installed, and only for memory zones associated with an SR-IOV capable CPU socket.

    Start by reviewing the PCI details using lspci for a system with Mellanox cards installed. You can increase the detail from lspci using -v, -vv, and -vvv, and then increase it further by elevating your rights using sudo or the root account directly.

    oc debug node/<workerNodeName>
    chroot /host
    lspci -vv
    lspci -vv | grep -i numa
    lspci -vv | grep -A 5 Mellanox
    grep -i huge /sys/devices/system/node/node0/meminfo
    grep -i huge /sys/devices/system/node/node1/meminfo
    

    Next, review the CPU details using lscpu.

    lscpu
    lscpu --all --extended=NODE,CPU
    

    The following iterator will display device to NUMA associations.

    for nic in $(ls /sys/class/net/*/device/numa_node|cut -d/ -f5|sed 's/v[0-9]*$//'|sort -u); do echo "$nic NUMA=$(cat /sys/class/net/$nic/device/numa_node)"; done
    

    You can also review the virtual function details for your interfaces, if SR-IOV is enabled. These interface details will not show on a host that does not have an SR-IOV capable card installed.

    In the command example below, ens3f1 is the SR-IOV network interface.

    ls -l /sys/class/net/ens3f1/device/virtfn*
    

    Next, list the device-to-CPU associations.

    for nic in $(ls /sys/class/net/*/device/numa_node|cut -d/ -f5|sed 's/v[0-9]*$//'|sort -u); do echo "$nic CPUs=$(cat /sys/class/net/$nic/device/local_cpulist)"; done
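Relating back to the hugepages discussion in step 4, the per-node sizing can be sketched as simple arithmetic, assuming the 2MB page size and the approximate figure of 1500 pages per SPK instance described above. The instance count is a hypothetical input:

```shell
# Rough hugepage sizing: ~1500 2MB pages per SPK instance (figure from above).
instances=${1:-2}                       # hypothetical instance count
pages_per_instance=1500
total_pages=$((instances * pages_per_instance))
total_mb=$((total_pages * 2))           # 2MB per hugepage
echo "${instances} SPK instances -> ${total_pages} hugepages (~${total_mb} MB)"
```

Run without arguments, this prints `2 SPK instances -> 3000 hugepages (~6000 MB)`.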
    

Environment Preparation for SPK

The SPK pod interacts with the TMM pod over gRPC using an encrypted TLS tunnel. The SSL/TLS keys and certificates must be created and installed into your cluster. In addition, you will need to add the default service account to your desired SPK target namespace. The online F5 Service Proxy for Kubernetes documentation covers this in detail.
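The exact certificate requirements are defined in the F5 documentation; as a generic sketch only, a self-signed key pair can be generated with openssl and loaded as a Kubernetes TLS secret. The subject, file names, secret name, and namespace below are placeholders:

```shell
# Generate a self-signed key and certificate (placeholder subject/validity).
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout grpc-tls.key -out grpc-tls.crt \
  -days 365 -subj "/CN=spk-grpc-example"

# Load them into the target namespace as a TLS secret (hypothetical names):
# oc create secret tls spk-grpc-tls --key=grpc-tls.key --cert=grpc-tls.crt -n spk-ingress
```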

If you intend to work with the DNS46/NAT64 solution, you will also need to create, configure, and add secrets for dSSM. This process is also covered in detail in the online F5 Service Proxy for Kubernetes documentation.

You should already be aware of your SR-IOV compatible hardware and which interfaces are associated with those cards. The OpenShift administrator should have configured these cards already. These are part of a set of configuration objects that define the network configuration used by SPK. This collection starts with the SR-IOV Network Node Policy. This object controls the virtual functions provided by the SR-IOV capable network card. The number of virtual functions that a card can support is defined by the vendor and configured at the BIOS level of the host system. Validating and updating this setting is unique to each hardware vendor.

Once this has been enabled and the system is rebooted, you should be able to confirm which nodes will support SPK. The following shell iterator should print these details for you, but it assumes your node names include either worker or spk: awk '/worker|spk/{print $1}'. You may need to change those values for your environment.

nvl="$(oc get pod -A -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName \
  | awk '/^f5-tmm-/{print $2}' | sort | uniq -c | awk '{print $2 ":" $1}')"
for node in $(oc get nodes | awk '/worker|spk/{print $1}'); do
  vfm="$(oc describe sriovnetworknodestates.sriovnetwork.openshift.io \
    -n openshift-sriov-network-operator "$node" 2>/dev/null \
    | grep -m1 'Num Vfs:' | awk '{print $3}')"
  if [ -n "$vfm" ]; then
    vfs="$( (echo "$nvl" | grep "^$node:" || echo ':0') | cut -d: -f2)"
    echo "$node: VFs $vfs/$vfm"
  else
    echo "$node: [SR-IOV DISABLED]"
  fi
done

Next, review the SR-IOV Network Node Policies that are installed. When using SR-IOV with SPK you should have an internal and an external SR-IOV network node policy. They should exist in the openshift-sriov-network-operator namespace, but you can check all namespaces using the -A option. Once you find the policies, review them for details using the -o yaml option. The nodeSelector should be defined for the SR-IOV capable devices; this can be based on a Kubernetes role or label.

oc get sriovnetworknodepolicies.sriovnetwork.openshift.io -A
oc get sriovnetworknodepolicies -n openshift-sriov-network-operator
oc get sriovnetworknodepolicies -n openshift-sriov-network-operator <yourSriovNetworkNodePolicyName> -o yaml

The creation of the network node policies should have generated Network Attachment Definitions to match the internal and external SR-IOV network node policies.

Now we need to create the SR-IOV Network objects. These objects are based on the SR-IOV custom resources installed when the SR-IOV hardware was added. They should have unique, descriptive names and be associated with the target SPK namespace. These YAML files should also reference the SR-IOV resource name defined in your SR-IOV network node policy object.

Here is one way to find your SR-IOV resource name:

oc get sriovnetworknodepolicies -n openshift-sriov-network-operator -o "custom-columns=SR-IOV Resource Name:.spec.resourceName"

Using these values you should be able to create your SriovNetwork objects, which might look something like the following example. The metadata name value is intended to describe the object: a development SPK ingress (dev-spk-ingress) instance using the interface ens3f0 for the internal connection.

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: dev-spk-ingress-ens3f0-internal
  namespace: openshift-sriov-network-operator
spec:
  networkNamespace: dev-spk-ingress
  resourceName: sriovens3f0int
  spoofChk: "off"
  trust: "on"
  capabilities: '{"mac": true, "ips": true}'
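
Since SPK needs both an internal and an external network, a matching external SriovNetwork would follow the same pattern. The name and resourceName below are hypothetical and simply mirror the example above; the resourceName must match the one defined in your external SR-IOV network node policy:

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: dev-spk-ingress-ens3f1-external   # hypothetical name
  namespace: openshift-sriov-network-operator
spec:
  networkNamespace: dev-spk-ingress
  resourceName: sriovens3f1ext            # placeholder; match your external policy
  spoofChk: "off"
  trust: "on"
  capabilities: '{"mac": true, "ips": true}'
```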