SPK Fixes and Known Issues

This list highlights fixes and known issues for this SPK release.

SPK Release Information

Version: 1.7.10
Build: 1.7.10

Note: This content is current as of the software release date.
Updates to bug information occur periodically. For the most up-to-date bug data, see Bug Tracker.


Cumulative fixes included in SPK v1.7.10
Known Issues in SPK v1.7.10



Cumulative fixes included in SPK v1.7.10

ID Number Severity Links to More Info Description
1623645 2-Critical The CWC retry connection is blocked.
1617289-1 2-Critical ZebOS can advertise a route with a nexthop of 0.0.0.0 if the interface lookup fails.
1623801-1 3-Major A TMM crash is seen with Mellanox NICs when traffic exceeding the default MTU is interspersed with regular traffic.

Cumulative fix details for SPK v1.7.10


1623645 : The CWC retry connection is blocked.

Component: SPK

Symptoms:
During the upgrade from version 1.7.8 to 1.7.9, the CWC pod successfully connected to RabbitMQ but quickly lost the connection. This caused a race condition in which the logic for starting and stopping the Heartbeat process got out of sync: when the connection dropped, the system tried to stop the process before it had started, producing an error that blocked further actions, including the attempt to reconnect.

Conditions:
The RabbitMQ pod was immediately restarted while the CWC pod was still booting up.

Impact:
CWC is not able to retry communication with RabbitMQ.

Workaround:
Restart the CWC pod.
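The exact command depends on your installation; the following is a hedged sketch for an OpenShift deployment, assuming the CWC pod carries the label app=f5-spk-cwc (verify the selector and namespace against your environment):

```shell
# Delete the CWC pod so its controller recreates it, clearing the blocked
# retry state. The label selector "app=f5-spk-cwc" is an assumption --
# check it against your deployment before running.
oc delete pod -l app=f5-spk-cwc -n <namespace>
```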

Fix:
Communication over the Heartbeat channel is now synchronized so that start and stop operations cannot race.

Summary of Code Level Changes:
Synchronized the Heartbeat channel to ensure thread safety while stopping and starting the Heartbeat Ticker to avoid failures.


1617289-1 : ZebOS can advertise a route with a nexthop of 0.0.0.0 if the interface lookup fails.

Component: SPK

Symptoms:
ZebOS can advertise a route with a nexthop of 0.0.0.0 if the interface lookup fails. This lookup may fail in certain rare circumstances triggered by a TMM restart.

Conditions:
Occurred when a network interface lookup by IPv4 address failed, and no suitable interface was found.

Impact:
Traffic will not be routed properly on the peer router that received the route.

Workaround:
Restart the pod to correct the advertised route.
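A hedged sketch of the restart, assuming the TMM pods carry the label app=f5-tmm (adjust the selector and namespace to your installation):

```shell
# Delete the affected TMM pod so it is recreated and ZebOS re-advertises
# the route with a valid nexthop. The label selector "app=f5-tmm" is an
# assumption -- verify it against your deployment.
oc delete pod -l app=f5-tmm -n <namespace>
```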

Fix:
In the ZebOS codebase, the IPv4 nexthop is now initialized to a default value rather than zero, so a failed interface lookup no longer produces a 0.0.0.0 nexthop. Additional logging was added in case a 0.0.0.0 nexthop occurs again.

Summary of Code Level Changes:
If an interface lookup fails, the nexthop now takes the default value instead of zero. Additional logging was added should the nexthop be 0.0.0.0 again.


1623801-1 : A TMM crash is seen with Mellanox NICs when traffic exceeding the default MTU is interspersed with regular traffic.

Component: SPK

Symptoms:
TMM crashes occur on the ingress side when using Mellanox CX5/CX6 NICs with the DPDK Mellanox MLX5 driver.

Conditions:
The DPDK MLX5 driver causes a crash in TMM after it receives a packet that exceeds its MTU, followed by a stream of legitimate traffic (~30 MB). This issue only affects NICs that use the DPDK MLX5 driver, such as ConnectX-5 and ConnectX-6.

Impact:
TMM crashes, and traffic disruption can occur.

Workaround:
None

Fix:
The MLX5 driver in the DPDK library has been fixed to correctly handle oversized packets.

Summary of Code Level Changes:
Upgraded DPDK from 20.11.4 to 20.11.10.



Known Issues in SPK v1.7.10


SPK Issues

ID Number Severity Links to More Info Description
1612869 2-Critical There is a disruption to TMM traffic flow when all sentinels start up and the configured master DB is down.
1602085-2 3-Major Data in the configmap used for persistence by f5ingress can become temporarily stale after TMM scaling events.
1581649-3 3-Major TMMs receive duplicate self IPs.


Known Issues details for SPK v1.7.10

1612869 : There is a disruption to TMM traffic flow when all sentinels start up and the configured master DB is down.

Component: SPK

Symptoms:
When all the sentinels start up fresh (for example, after scaling the sentinels down to 0 and back up), they require the configured master DB to be up and functional in order to gather complete information about all the configured databases. If the configured master DB is down during sentinel startup, the sentinels fail to retrieve complete database information and therefore fail to create and expose the master/replica DB framework for SPK until the master DB is up and running.

Conditions:
The configured master DB is consistently not accessible or down during the startup of all sentinels.

Impact:
TMM is not able to establish communication with the Redis DB, which causes a disruption to traffic flow.

Workaround:
The mitigation is to scale both the DB and Sentinel pods down to 0 and then up to 3, using the steps below:
1. Scale down DB pods to 0:
   oc scale statefulset/f5-dssm-db --replicas=0 -n <namespace>
2. Scale down Sentinel pods to 0:
   oc scale statefulset/f5-dssm-sentinel --replicas=0 -n <namespace>
3. Scale up DB pods to 3:
   oc scale statefulset/f5-dssm-db --replicas=3 -n <namespace>
4. Scale up Sentinel pods to 3:
   oc scale statefulset/f5-dssm-sentinel --replicas=3 -n <namespace>
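After scaling back up, the following hedged sketch can confirm that a sentinel resolves the configured master DB; the master name "dssm" and the sentinel port 26379 are assumptions for illustration, so substitute the values from your configuration:

```shell
# Query a sentinel for the address of the configured master; a valid
# "<ip> <port>" reply indicates the master/replica framework is exposed.
# The master name "dssm" and port 26379 are assumptions.
oc exec -n <namespace> f5-dssm-sentinel-0 -- \
  redis-cli -p 26379 sentinel get-master-addr-by-name dssm
```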


1602085-2 : Data in the configmap used for persistence by f5ingress can become temporarily stale after TMM scaling events.

Component: SPK

Symptoms:
The f5ingress controller uses the configmap to persist TMM-specific configuration assignments, such as self IPs and SNAT IPs. The configmap data can fall out of sync with the actual state of the system.

Conditions:
Multiple TMM pod scaling events occur in quick succession, such that their processing overlaps, and f5ingress is then restarted before it can bring the persistmap back in sync.

Impact:
If f5ingress is restarted while the persistmap is stale, duplicate self IPs and SNAT routes can occur.

Workaround:
The Product Engineering (PE) and Professional Services (PS) teams are enhancing a Kubernetes cron job script to identify error conditions involving duplicate configuration, which can help recover from the bad state. Contact F5 for the cron job script.


1581649-3 : TMMs receive duplicate self IPs.

Component: SPK

Symptoms:
Multiple TMMs have the same self IPs.

Conditions:
The F5 Ingress Controller uses a ConfigMap to persist the mapping of TMMs and their selfIP/snatIP assignments. After restarting, it uses the information from the ConfigMap to reconfigure TMMs with their previously assigned IPs. However, there are scenarios where the ConfigMap could be out of sync with the actual TMM assignments. For example:


Scenario 1:
Restarting the controller repeatedly with minimal intervals.

Scenario 2:
Multiple TMM pods restart concurrently, and the controller restarts before the TMM pod restarts finish.

Scenario 3:
The controller restarts immediately after the VLAN CR is updated.

Scenario 4:
The TMM pods are scaled up or down while f5ingress is restarting.

Scenario 5:
A user with the appropriate RBAC privileges accidentally deletes the configmap, and f5ingress restarts later.

Impact:
Multiple TMMs end up with the same self IPs, impacting the traffic.

Workaround:
A script run as a cron job periodically checks whether the self IP assignments on TMMs are in sync with the VLAN CRs and the configmap, and takes corrective action if a discrepancy is detected. Contact F5 for the cron job script.
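Until you obtain the supported script, the following hedged sketch illustrates the kind of check it performs; the persistence ConfigMap name is a placeholder, so substitute the one in your namespace:

```shell
# Extract IPv4 addresses recorded in the persistence ConfigMap and print
# any that appear more than once; duplicates indicate the stale state.
# "<persistmap-name>" is a placeholder for the actual ConfigMap name.
oc get configmap <persistmap-name> -n <namespace> -o yaml \
  | grep -Eo '([0-9]{1,3}\.){3}[0-9]{1,3}' | sort | uniq -d
```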




For additional support resources and technical documentation, see: