5.3. Troubleshooting SSL Orchestrator High Availability Issues

5.3.1. What it is

F5 SSL Orchestrator relies on a separate REST-based communication process between the peers to convey synchronization information. When this communication fails, the cause is usually a configuration issue.



5.3.2. How to troubleshoot it

SSL Orchestrator provides a built-in high availability (HA) status utility to help in diagnosing HA communication issues. After selecting SSL Orchestrator -> Configuration, click the HA Status link in the top right corner. This will present a screen (on both devices) that displays the status of the various communication states.


Figure 85/86: SSL Orchestrator HA Status Dashboard


If any of these states are bad (red), or an SSL Orchestrator configuration has failed due to HA issues, use the troubleshooting matrix below to diagnose the HA configuration. First, note the limitations imposed by SSL Orchestrator HA:

HA Limitations

User Input

HA mode

HA is restricted to two (2) devices in active/standby mode.

Sync mode

HA requires manual sync mode, with either full or incremental updates.

Device groups

HA supports one (1) device group.

Before configuring SSL Orchestrator, ensure that you note these prerequisites:

HA Configuration

User Input

BIG-IP version and provisioning

Both devices must be running the same BIG-IP version with the same licensing and modules provisioned.

Sync channel port lockdown

After selecting Network -> Self-IPs, ensure that the self-IP used for peer synchronization has the Port Lockdown set to either Allow All or Allow Default. SSL Orchestrator sync happens via REST communications on port 443.

Time synchronization

After selecting System -> Configuration -> Device -> NTP, ensure that both devices are configured to use NTP and that time is correct (synchronized) on both devices.

Initial config sync

Ensure that both devices are synced before deploying the SSL Orchestrator configuration.

Non-SSL Orchestrator objects

Ensure that any objects not created by SSL Orchestrator are created on both devices (ex. ingress/egress VLANs and self-IPs, SSL/TLS certificates).
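
The time synchronization prerequisite can be spot-checked from the command line. The following is a minimal sketch, not an F5 utility: the function name clock_skew and the default 5-second tolerance are assumptions chosen for illustration. It takes two epoch timestamps (for example, the output of date +%s collected on each device at roughly the same moment) and reports the absolute difference.

```shell
# Hypothetical helper: report the absolute skew between two epoch timestamps.
# Usage: clock_skew LOCAL_EPOCH PEER_EPOCH [MAX_SECONDS]
clock_skew() {
    a=$1
    b=$2
    max=${3:-5}   # assumed tolerance in seconds; adjust to your own policy
    diff=$((a - b))
    # Take the absolute value of the difference
    if [ "$diff" -lt 0 ]; then
        diff=$((-diff))
    fi
    if [ "$diff" -gt "$max" ]; then
        echo "skew ${diff}s exceeds ${max}s"
        return 1
    fi
    echo "skew ${diff}s within ${max}s"
}
```

Collecting the two timestamps by hand or over SSH adds a delay of its own, so treat a difference of a second or two as noise rather than as evidence of an NTP problem.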

Assuming the BIG-IP system is in a correct Active/Standby HA state and the devices have been synchronized, the following troubleshooting matrix guides you through troubleshooting SSL Orchestrator HA issues.

HA Troubleshooting Matrix

User Input

Virtual Edition device-id

If on a VE platform, ensure that the peer devices do not have the same device-id. From the BIG-IP command line, enter the following:

cat /etc/f5-rest-device-id

If the values are the same on both devices, delete them both and restart services to regenerate new (unique) values:

rm -f /etc/f5-rest-device-id
bigstart restart restjavad
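
The comparison itself can be scripted once the peer's device-id file has been copied locally. This is a minimal sketch under assumptions: the function name compare_device_ids is hypothetical, and so is any path you choose for the copied peer file (for example, /var/tmp/peer-device-id).

```shell
# Hypothetical helper: compare two copies of /etc/f5-rest-device-id,
# one from each peer. Prints "duplicate", "unique", or "missing".
# Usage: compare_device_ids LOCAL_FILE PEER_FILE
compare_device_ids() {
    local_file=$1
    peer_file=$2
    # Both files must exist and be non-empty to make a meaningful comparison
    if [ ! -s "$local_file" ] || [ ! -s "$peer_file" ]; then
        echo "missing"
        return 2
    fi
    # cmp -s is silent and succeeds only when the files are byte-identical
    if cmp -s "$local_file" "$peer_file"; then
        echo "duplicate"
        return 1
    fi
    echo "unique"
}
```

A result of "duplicate" corresponds to the case described above, where both files should be deleted and the service restarted.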

Gossip worker active state

The Gossip worker is responsible for notifying the peer of configuration changes. Ensure that the Gossip worker is in an Active state with the correct peer group set to “tm-shared-all-big-ips”. From the BIG-IP command line, enter the following:

restcurl shared/gossip

Observe the output and verify:

"status": "Active"

"gossip": "tm-shared-all-big-ips"

Perform this check on both devices. If the values are not as listed above, tear down and rebuild the BIG-IP system HA.
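
The two values can also be checked mechanically. The sketch below assumes the restcurl output contains the "status" and "gossip" fields exactly as named above; the function name check_gossip is hypothetical. It reads the output on standard input so that it can be tried against a saved capture as well as a live device.

```shell
# Hypothetical helper: verify Gossip worker output read from stdin.
# Usage on a BIG-IP: restcurl shared/gossip | check_gossip
check_gossip() {
    output=$(cat)
    # Allow optional whitespace around the colon in the JSON output
    if ! printf '%s\n' "$output" |
        grep -Eq '"status"[[:space:]]*:[[:space:]]*"Active"'; then
        echo "status is not Active"
        return 1
    fi
    if ! printf '%s\n' "$output" |
        grep -Eq '"gossip"[[:space:]]*:[[:space:]]*"tm-shared-all-big-ips"'; then
        echo "unexpected gossip group"
        return 1
    fi
    echo "gossip OK"
}
```

Run it on both devices; a non-zero exit status on either one points to the rebuild step described above.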

Review SSL Orchestrator HA logs

Look for “Gossip” related warnings in the Restjavad log. From the BIG-IP command line, enter the following:

grep WARNING /var/log/restjavad.*.log | grep Gossip
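
When the logs are large, a count per check can be easier to track than the raw lines. This is a small sketch built on the same grep filter; the function name count_gossip_warnings is hypothetical.

```shell
# Hypothetical helper: count Gossip-related WARNING lines in the given logs.
# Usage: count_gossip_warnings /var/log/restjavad.*.log
count_gossip_warnings() {
    if [ "$#" -eq 0 ]; then
        echo "usage: count_gossip_warnings LOGFILE..." >&2
        return 2
    fi
    # -h drops the file name prefix; -c prints the number of matching lines
    grep -h WARNING "$@" 2>/dev/null | grep -c Gossip
}
```

A count that keeps growing between runs suggests the problem is ongoing rather than a leftover from an earlier event.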

Observe Gossip conflicts

When the Gossip worker cannot apply an update to the local worker, it adds a description of the problem in /shared/gossip-conflicts. To view this from the BIG-IP command line, enter the following:

restcurl shared/gossip-conflicts

Device group values

Ensure that SSL Orchestrator is using the correct management address for synchronization. From the BIG-IP command line, enter the following:

restcurl shared/resolver/device-groups/tm-shared-all-big-ips/devices

Verify that the output is the same on both devices. Specifically, observe the following:

"address": "10.1.10.100"

"managementAddress": "10.1.1.4"

The addresses shown above are examples. The “address” value corresponds to the configSync IP configuration, and the “managementAddress” corresponds to the management IP address of each device in the device group. If the values reported on your devices do not match their actual configSync and management IP addresses, correct the configSync and management IP configurations in the BIG-IP system HA settings.
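
To compare the output from both devices without reading the full JSON by eye, the address fields can be extracted and sorted first. This is a sketch under assumptions: the function names are hypothetical, and the two arguments are files in which you have saved the device-group output from each peer.

```shell
# Hypothetical helper: reduce device-group output (read from stdin) to its
# sorted "address" and "managementAddress" values.
extract_addresses() {
    grep -Eo '"(address|managementAddress)"[[:space:]]*:[[:space:]]*"[^"]*"' |
        sed 's/[[:space:]]//g' |
        sort
}

# Compare two saved captures; prints "match" or "mismatch".
# Usage: compare_device_groups LOCAL_CAPTURE PEER_CAPTURE
compare_device_groups() {
    a=$(extract_addresses < "$1")
    b=$(extract_addresses < "$2")
    # An empty extraction is treated as a mismatch rather than a match
    if [ -n "$a" ] && [ "$a" = "$b" ]; then
        echo "match"
    else
        echo "mismatch"
        return 1
    fi
}
```

Sorting makes the comparison independent of the order in which devices are listed, but it also discards which address belongs to which device, so a "match" should still be followed by a quick look at the values themselves.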

If gossip shows “UNPAIRED”, you may need to do the following on both devices:

Delete existing device information

restcurl -X DELETE shared/resolver/device-groups/tm-shared-all-big-ips/devices

Force updating

restcurl -X POST -d '{}' tm/shared/bigip-failover-state
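
Because the DELETE step is destructive, it can help to preview the sequence before running it. The sketch below is an assumption-laden wrapper, not part of SSL Orchestrator: the function name reset_gossip_pairing and its --run switch are hypothetical, and it assumes the device group name tm-shared-all-big-ips used in the checks earlier in this matrix.

```shell
# Hypothetical wrapper: preview (default) or execute the UNPAIRED recovery steps.
# Usage: reset_gossip_pairing          # prints the commands only
#        reset_gossip_pairing --run    # executes them on this device
reset_gossip_pairing() {
    if [ "${1:-}" = "--run" ]; then
        runner=""      # empty prefix: the commands execute directly
    else
        runner="echo"  # preview mode: echo prints each command instead
    fi
    # Step 1: delete existing device information
    $runner restcurl -X DELETE shared/resolver/device-groups/tm-shared-all-big-ips/devices
    # Step 2: force the failover worker to update
    # (in preview mode the shell strips the quotes, so {} is shown bare)
    $runner restcurl -X POST -d '{}' tm/shared/bigip-failover-state
}
```

As with the manual commands, the sequence needs to be run on both devices.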

Application ID

Check the value of the application ID on each device. From the BIG-IP command line, enter the following:

curl -sku 'admin:admin' https://localhost/mgmt/shared/iapp/global-installed-packages | jq

Replace “admin:admin” with the correct administrative username and password. In the output, verify that the ID value is the same on both devices. If it is not, delete all SSL Orchestrator configurations, uninstall the RPM, and re-install.

Failover state

The BIG-IP Failover worker detects device group and failover settings on the BIG-IP by continually polling these settings. It uses the REST framework’s Gossip mechanism to replicate configuration. Verify the failover state by entering the following on the BIG-IP command line:

restcurl tm/shared/bigip-failover-state

Observe the output and verify that values are the same (except for the “failoverState”, which should be “active” or “standby” on the active and standby peers, respectively). If the output values are not the same, trigger the Failover worker with the following command:

restcurl -X POST -d '{}' tm/shared/bigip-failover-state

If the above fails, delete all SSL Orchestrator configurations, uninstall the RPM, and re-install.
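
The comparison described above can be approximated with a short script once the output from each peer has been saved to a file. This is a sketch only: the function name compare_failover_state is hypothetical, and it assumes the saved output is pretty-printed JSON with one field per line.

```shell
# Hypothetical helper: compare two saved bigip-failover-state captures while
# ignoring the "failoverState" field, which is expected to differ.
# Usage: compare_failover_state ACTIVE_CAPTURE STANDBY_CAPTURE
compare_failover_state() {
    # Drop every line that carries the failoverState field
    a=$(grep -v '"failoverState"' "$1")
    b=$(grep -v '"failoverState"' "$2")
    if [ "$a" = "$b" ]; then
        echo "consistent"
    else
        echo "different"
        return 1
    fi
}
```

If the result is "different", running diff on the two files shows which fields disagree. Any field that legitimately varies per device would also be reported, so review the differences before triggering the Failover worker.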


You may also encounter various error messages if the SSL Orchestrator HA configuration is failing. The following matrix lists the error messages, description, and corresponding remediations.

Error Message

Description

Remediation

System Error Name

“Verification process failed, usually due to internal JavaScript errors or REST call errors.” + err

The whole verification process failed, usually due to internal JavaScript errors or REST framework errors.

HA configuration is likely corrupt, and must be deleted and recreated.

SYSTEM.ERROR

“Unable to retrieve devices”

REST call to CM_DEVICE_URI_PATH failed.

HA configuration is likely corrupt, and must be deleted and recreated.

NONE

“No self device defined. Please reconfigure device group.”

CM_DEVICE_URI_PATH return has issues.

HA configuration is likely corrupt, and must be deleted and recreated.

SYSTEM.NO_SELF_DEVICE

“Cannot find Management IP. Please set it up.”

CM_DEVICE_URI_PATH return has issues.

HA configuration is likely corrupt, and must be deleted and recreated.

SYSTEM.NO_MGMT_IP

“<DEVICE NAME> ConfigSync IP is not set. Please configure it.”

The device (self or peer) does not have ConfigSync IP set.

Configure the configsync IP address correctly.

SYSTEM.NO_CONFIGSYNC_IP

“<DEVICE NAME> has no Self IP configured. Please configure Self IP”

The device (self or peer) does not have any self IP.

Configure the configsync IP address correctly.

SYSTEM.NO_SELF_IP

“<DEVICE NAME> has no NTP server set. Please set NTP.”

The device has no NTP configured.

Configure system NTP settings correctly.

CONFIG.NO_NTP

“<DEVICE NAME> has no DNS server set. Please set DNS.”

The device has no DNS configured.

Configure system DNS settings correctly.

CONFIG.NO_DNS

“<DEVICE NAME> Self-IP <SELF-IP> is locked. Adjust Port Lockdown to ‘Allow All’ or ‘Allow Default’”

The ConfigSync self IP has the wrong Port Lockdown setting.

Set configsync Self-IP port-lockdown to ‘Allow Default’ or ‘Allow All’.

CONFIG.WRONG_PORT_LOCKDOWN

“<DEVICE NAME> has BIGIP or SSLO version different with this box. If it’s not the case, check again few seconds later.”

The device currently has a different BIG-IP or SSLO version. For SSLO, a newly installed RPM needs some time to sync to its peers.

If the versions do match, check again a few seconds later.

CONFIG.SSLO_SOFTWARE_VERSION_MISMATCH

“There should be at least one standby device in SSLO. Please adjust configuration.”

Cannot find a standby device. This usually happens when all devices are active, or when some devices are offline.

For HA, check the sync status and device details, paying attention to the failoverState.

CONFIG.WRONG_NUMBER_OF_STANDBY

“Only one active device is supported by SSLO. Please adjust configuration.”

More than one active device was found.

This is not supported by SSLO. Change configsync to active-standby mode.

CONFIG.WRONG_NUMBER_OF_ACTIVE

“<DEVICE NAME> not present in Failover Group members.”

The Device Management <DEVICE NAME> Members list differs from the CM Devices list.

Go to Device Management, edit your sync-failover group and ensure the correctly named devices are included.

CONFIG.WRONG_SYNC_FAILOVER_GROUP_MEMEBERS

“Sync failover device group not present in the response from <DEVICE GROUP URI PATH>. Check the sync failover device group details.”

Cannot find “sync-failover” device group.

Configure a sync-failover HA group.

GOSSIP.DEVICE_GROUP_NOT_PRESENT

“Sync-failover device group not present. Check sync-failover (HA) device group details.”

Cannot find “sync-failover” device group name.

Try to reset the sync-failover (HA) device group.

GOSSIP.SYNC_FAILOVER_GROUP_NOT_FOUND

“Gossip is not active. Check all other errors for suggestion.”

Gossip status is not “Active”

Check all other errors for suggestions.

GOSSIP.NOT_ACTIVE

“Missing device <DEVICE NAME> (<DEVICE MGMT IP>) in resolver tm-shared-all-big-ips.”

The device is missing from the Gossip devices result.

Check sync-failover (HA) devices group: 1. All devices are there; 2. Sync mode: Manual with Incremental Sync. Try to force update the gossip status by POST {} to tm/shared/bigip-failover-state.

GOSSIP.DEVICE_MISMATCH

“Echo verification failed for <DEVICE PEER>. Please wait 10 seconds and check again. And check other errors.”

Echo content of peer device does not match with local content.

HA configuration is likely corrupt, and must be deleted and recreated.

GOSSIP.ECHO_FAILED

“ConfigSync IPs are not matching for <DEVICE NAME>”

This usually happens when the configsync IP is modified after HA is set.

Check sync-failover (HA) devices group: 1. All devices are there; 2. Sync mode: Manual with Incremental Sync. Try to force update the gossip status by POST {} to tm/shared/bigip-failover-state.

GOSSIP.DEVICE_CONFIGSYNC_IP_MISMATCH

“REST framework version mismatch between HA devices. If it’s not the case, check again few seconds later.”

The REST framework version is expected to match the BIG-IP version.

The same BIG-IP and SSLO versions are required on both peers. Check versions and upgrade accordingly.

GOSSIP.DEVICE_REST_VERSION_MISMATCH

“Auto sync detected for group <FAILOVER GROUP>. Change it to manual with incremental sync.”

Auto sync detected.

Change it to manual with incremental sync.

GOSSIP.DEVICE_GROUP_AUTOSYNC

“Network failover is disabled for sync failover group: <FAILOVER GROUP>. Recreate the sync failover device group.”

Failover must be enabled. Check the network failover unicast address.

Recreate the sync failover device group.

GOSSIP.DEVICE_GROUP_FAILOVER_DISABLED

“Unable to retreive sync-failover information.”

REST call to BIGIP_FAILOVER_STATE_URI_PATH failed

HA configuration is likely corrupt, and must be deleted and recreated.

GOSSIP.UNABLE_TO_GET_SYNC_FAILOVER

“Sync-failover is not enabled. Please enable it.”

Failover must be enabled. Check the network failover unicast address.

Enable sync failover.

GOSSIP.SYNC_FAILOVER_NOT_ENABLED

“FailoverState mismatch with device information. Possible reason: update frequency is low for bigip failover state worker. Try to reset sync-failover (HA) device group.”

Possible reason: update frequency is low for BIG-IP failover state worker. Try to reset sync-failover (HA) device group.

Try to reset sync-failover (HA) device group.

GOSSIP.SYNC_FAILOVER_DEVICE_INFO_MISMATCH

“Unable to verify sync status.”

REST call to <DEVICE GROUP URI PATH> failed

HA configuration is likely corrupt, and must be deleted and recreated.

NONE

“Sync issue detected. Check sync status. Click sync button if it is needed.”

CMI sync issue. This warning is generated when some devices are not in “pending” or “in sync” state.

GOSSIP.SYNC_ISSUE



5.3.3. Solve HA issues with the SSLOFIX script

SSL Orchestrator users with an HA setup may use the sslofix tool to troubleshoot and fix HA setup issues (such as when some REST blocks are missing/out of sync, or even when MCP data is out of sync between devices). The sslofix script includes the diagnostic capability to identify potential issues and can print out all of the issues found with the HA setup. The ha-sync script can then perform a sync-up, which should fix those issues, and ensure that both devices are fully in sync (both in MCP and REST). See the following for additional details:

https://clouddocs.f5.com/sslo-troubleshooting-guide/main/sslofix.html


In SSL Orchestrator deployments, where a service has been created, you may need to manually create non-syncable network objects if they are missing on the peer device before using the ha-sync script. These network objects include VLANs, non-floating IPs, and route domains created for the service.

Below is the usage for the sslofix script:

# sslofix -h
Usage: sslofix [OPTIONS]...
  -d, --dryrun                  Dry-run (simulation) mode
  -D, --devicegroup NAME        Specifies the HA device group name
    , --diagnostic              Runs a diagnostic and attempts to detect possible HA sync problems
  -f, --force                   Enforces a more coercive HA sync (see README for details)
  -h, --help                    Displays help text
  -H, --host HA_PEER            Specifies the HA sync peer
  -m, --manual                  Manual (step-by-step) mode
  -s, --stand-alone             Runs on this stand-alone device that is not part of a failover group
  -t, --target [NAMES]...       Specifies the HA sync target(s) [ALL MCP REST]. Default: ALL
  -v, --verbose                 Provides additional (debug) information
  -V, --version                 Displays the current version of this script

Examples:
sslofix -D ha-failover -H 10.192.228.78 --diagnostic
sslofix -D ha-failover -H 10.192.228.78
sslofix --stand-alone --diagnostic
sslofix --stand-alone
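
One cautious way to combine these options is to run the diagnostic first, then a dry run, and only then the actual sync. The sketch below simply prints that sequence for review. The function name sslofix_plan is hypothetical, and combining --dryrun with -D and -H is an assumption based on the usage text above rather than a documented example.

```shell
# Hypothetical helper: print a cautious sslofix sequence for an HA pair.
# Usage: sslofix_plan DEVICE_GROUP HA_PEER
sslofix_plan() {
    group=$1
    peer=$2
    if [ -z "$group" ] || [ -z "$peer" ]; then
        echo "usage: sslofix_plan DEVICE_GROUP HA_PEER" >&2
        return 2
    fi
    # 1. Run the diagnostic to list detected HA sync problems
    echo "sslofix -D $group -H $peer --diagnostic"
    # 2. Simulate the sync in dry-run mode
    echo "sslofix -D $group -H $peer --dryrun"
    # 3. Perform the sync
    echo "sslofix -D $group -H $peer"
}
```

For instance, sslofix_plan ha-failover 10.192.228.78 prints the three commands for the device group and peer used in the examples above, which can then be run one at a time.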