Troubleshooting

Use this section to read about known issues and common troubleshooting steps. To provide feedback on this documentation, you can file a GitHub Issue.

Note

For simplicity and illustration purposes, the examples below use BIG-IP’s bash shell to run curl locally on the BIG-IP itself and the utility jq to pretty-print the JSON output. However, any REST client will do. See the iControl REST API documentation for more information and the Postman Collection for CFE for more examples.

Confirm CFE Installation


Confirm that the Cloud Failover Extension (CFE) is installed on both BIG-IP instances. See Download and Install Cloud Failover Extension for more information. On the BIG-IP system (via the bash shell), you can use the following command to verify that CFE is installed:


curl -su admin: -X GET http://localhost:8100/mgmt/shared/cloud-failover/info |  jq .

The expected output is a JSON payload containing the version. For example:

{
  "version": "1.14.0",
  "release": "0",
  "schemaCurrent": "1.14.0",
  "schemaMinimum": "0.9.1"
}

If the output is empty, CFE is not installed.
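The same check can be reduced to an HTTP status code, which is easier to script than reading the JSON. The following is a minimal sketch, not part of CFE: the function name classify_cfe_status is hypothetical, and the mapping assumes that 200 means CFE is responding (the same test the tgactive trigger script uses), 404 means the endpoint is not registered, and 000 means curl could not connect at all.

```shell
# Hypothetical helper: turn the HTTP status code returned by /info into a verdict.
# Assumptions: 200 = CFE responding, 404 = endpoint not registered,
# 000 = curl could not connect to restnoded.
classify_cfe_status() {
    case "$1" in
        200) echo "installed" ;;
        404) echo "not installed" ;;
        000) echo "restnoded unreachable" ;;
        *)   echo "unexpected status: $1" ;;
    esac
}

# Example on the BIG-IP (not executed by this block):
#   code=$(curl -su admin: -o /dev/null -w '%{http_code}' \
#       http://localhost:8100/mgmt/shared/cloud-failover/info)
#   classify_cfe_status "$code"
```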


Confirm the trigger call is installed in both the /config/failover/tgactive and /config/failover/tgrefresh files. For example, files from a clean installation of CFE will have the following contents:

[admin@bigip01:Active:In Sync] ~ # cat /config/failover/tgactive
#!/bin/sh
#
# NOTE:
# This file will be installed in /config/failover/tgactive and it will
# be called by /usr/lib/failover/f5tgactive
#
# - This file is for customer additions for tasks
#   to be performed when failover goes to active state.
#
# - Refer to /usr/lib/failover/f5tgactive for more information
#


# Autogenerated by F5 Failover Extension - Triggers failover
i=0
while [ $i -le 50 ]; do
    if [[ "$(curl -u admin:admin --max-time 5 --connect-timeout 5 -s -o /dev/null -w '%{http_code}' http://localhost:8100/mgmt/shared/cloud-failover/info)" == "200" ]]; then
        break;
    fi
    sleep 3
    i=$(( $i + 1 ))
done
curl -u admin:admin --max-time 10 --connect-timeout 10 -d {} -X POST http://localhost:8100/mgmt/shared/cloud-failover/trigger
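To check both files without reading them by eye, a small script can search for the trigger URL. This is a sketch under assumptions: the function names has_trigger_call and check_trigger_files are hypothetical, and the search string is taken from the trigger call shown in the listing above.

```shell
# Hypothetical helper: succeed if the given failover script contains the CFE trigger call.
has_trigger_call() {
    grep -q 'cloud-failover/trigger' "$1" 2>/dev/null
}

# Report on each file passed as an argument; with no arguments, check the
# two BIG-IP failover scripts. Returns non-zero if any file lacks the call.
check_trigger_files() {
    [ "$#" -gt 0 ] || set -- /config/failover/tgactive /config/failover/tgrefresh
    missing=0
    for f in "$@"; do
        if has_trigger_call "$f"; then
            echo "OK: $f"
        else
            echo "MISSING: $f"
            missing=1
        fi
    done
    return "$missing"
}
```

Running check_trigger_files with no arguments on each BIG-IP should print an OK line for both files.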

If the instances have been upgraded, confirm that any previous failover scripts are either disabled or removed.

For example, run the following command:

egrep '/usr/bin/f5-rest-node' /config/failover/tg*

If it returns any output, remove the /usr/bin/f5-rest-node line from each file listed. See the individual cloud provider sections in Configure Cloud Failover Extension for more details.
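Before editing a failover script, it can help to preview what the file would look like with the legacy line removed. The sketch below is illustrative only: preview_without_legacy is a hypothetical name, and the function writes to stdout without modifying the file.

```shell
# Hypothetical helper: print a failover script with any /usr/bin/f5-rest-node
# lines removed. Output goes to stdout; the original file is left untouched.
preview_without_legacy() {
    sed '\|/usr/bin/f5-rest-node|d' "$1"
}

# Example on the BIG-IP (not executed by this block):
#   preview_without_legacy /config/failover/tgactive | diff /config/failover/tgactive -
```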


Confirm Network Configuration

CFE needs to make calls to the Cloud’s Metadata service (which lives on a Link Local address) and its REST APIs, which are typically on the Internet.

  • In order to reach the Metadata service, you will need to enable Link Local addresses on the BIG-IP and potentially create a route out a particular interface on which the Metadata service is enabled.
  • In order to reach the Cloud’s REST APIs, you will need an upstream NAT service (for example: NAT Gateway, custom firewall, Public IP, etc.). Depending on the BIG-IP’s routing configuration, these calls will egress via either the management interface’s or the external interface’s default route.

See the following articles for more information on BIG-IP routing:

See also CFE’s individual cloud provider documentation in the Configure Cloud Failover Extension sections for the specific configuration requirements of each cloud environment.


Warning

One of the most common installation issues is providing the required access to these Cloud APIs only via the Management interface, and then, in a subsequent step, configuring a default route out the external traffic interface (which, per the above articles, ends up taking precedence). As a result, CFE can no longer reach the endpoints.


You can use curl to test the BIG-IP’s access to these endpoints. By default, curl and CFE both use the host’s routing table, so if a request works with curl, that is an indicator that CFE will work as well. For example:

curl <url>

If the default curl command is not working, you can use the --interface flag to verify whether the BIG-IP can reach a service via a particular expected interface. For example:

curl --interface mgmt <url>
curl --interface external <url>

Replace external with the name of your external VLAN and <url> with the service you’re testing. If the request works via the mgmt interface but not the external interface, you have a default traffic route and will need to enable access through the external network instead. However, if neither works, check your DNS configuration, upstream NAT service, and/or cloud security group/firewall outbound rules.
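The decision logic above can be captured in a few lines of shell. This is a rough sketch: diagnose_reachability is a hypothetical name, the inputs are curl exit codes (0 means curl completed a request, even if the server answered with an HTTP error), and the third outcome (reachable via the external interface) is an assumption added for completeness.

```shell
# Hypothetical helper: interpret the curl exit codes (0 = request completed)
# from the mgmt and external interface tests.
diagnose_reachability() {
    mgmt_rc=$1
    ext_rc=$2
    if [ "$mgmt_rc" -eq 0 ] && [ "$ext_rc" -ne 0 ]; then
        echo "reachable via mgmt only: enable access through the external network"
    elif [ "$mgmt_rc" -ne 0 ] && [ "$ext_rc" -ne 0 ]; then
        echo "unreachable: check DNS, upstream NAT, and outbound security group/firewall rules"
    else
        # Assumption: the external interface works, so no change is needed.
        echo "reachable via external interface"
    fi
}

# Example on the BIG-IP (not executed by this block):
#   curl -s -o /dev/null --max-time 10 --interface mgmt <url>; m=$?
#   curl -s -o /dev/null --max-time 10 --interface external <url>; e=$?
#   diagnose_reachability "$m" "$e"
```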


  1. Confirm that CFE can contact the metadata service

AWS:


IMDSv1:

curl http://169.254.169.254/latest/meta-data/

IMDSv2:

TOKEN=`curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600"` \
&& curl -H "X-aws-ec2-metadata-token: $TOKEN" -v http://169.254.169.254/latest/meta-data/

Azure:

curl -s -H "Metadata:true" "http://169.254.169.254/metadata/instance?api-version=2018-10-01" | jq .

GCP:

curl -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance
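If you switch between providers often, the metadata checks can be wrapped in one function. The sketch below is an illustration, not an F5 tool: check_metadata and the CURL override variable are hypothetical, the URLs and headers are copied from the examples above, and the --max-time 5 limit is an added assumption so that an unreachable Link Local address fails quickly. The aws case uses the IMDSv1 form; IMDSv2 still needs the token step shown above.

```shell
# Hypothetical wrapper around the metadata checks shown above.
# Set CURL=echo to print the arguments instead of sending a request.
check_metadata() {
    curl_bin=${CURL:-curl}
    case "$1" in
        aws)
            "$curl_bin" -s --max-time 5 http://169.254.169.254/latest/meta-data/ ;;
        azure)
            "$curl_bin" -s --max-time 5 -H "Metadata:true" \
                "http://169.254.169.254/metadata/instance?api-version=2018-10-01" ;;
        gcp)
            "$curl_bin" -s --max-time 5 -H "Metadata-Flavor: Google" \
                http://metadata.google.internal/computeMetadata/v1/instance ;;
        *)
            echo "usage: check_metadata aws|azure|gcp" >&2
            return 2 ;;
    esac
}
```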

  2. Confirm that CFE can contact the Cloud REST APIs

NOTE: The example curl commands below do not authenticate, but a 4XX response code is enough to confirm general network connectivity.


AWS:

curl -s -I  https://ec2.amazonaws.com

See AWS API Documentation for more information.

Azure:

curl -s -I https://management.azure.com/

See Azure API Documentation for more information.

GCP:

curl -s -I https://compute.googleapis.com

See GCP API Documentation for more information.

If you are unable to reach the Cloud’s REST APIs, check your DNS settings and whichever upstream service (NAT service, Public IP, proxy, or other) is meant to allow the BIG-IP to reach the cloud APIs. See the individual cloud provider sections in Configure Cloud Failover Extension or CFE in Isolated Environments for more details.
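Because the unauthenticated requests are only a connectivity probe, the HTTP status code is all that matters. The following sketch shows one way to interpret it; classify_api_status is a hypothetical name, and treating 000 as "no connectivity" relies on curl reporting 000 for %{http_code} when no response was received.

```shell
# Hypothetical helper: interpret the HTTP status code from an unauthenticated
# request to a cloud API endpoint.
classify_api_status() {
    case "$1" in
        000)         echo "no connectivity" ;;
        2??|3??|4??) echo "reachable" ;;
        5??)         echo "reachable, but the service returned a server error" ;;
        *)           echo "unexpected status: $1" ;;
    esac
}

# Example on the BIG-IP (not executed by this block):
#   code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 https://ec2.amazonaws.com)
#   classify_api_status "$code"
```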


Confirm CFE Configuration

  • Confirm that the Cloud Failover Extension (CFE) is configured

    curl -su admin: -X GET http://localhost:8100/mgmt/shared/cloud-failover/declare |  jq .
    

    The expected output is a JSON payload containing the declaration. For example:

    {
      "message": "success",
      "declaration": {
        "schemaVersion": "1.0.0",
        "class": "Cloud_Failover",
        "controls": {
          "class": "Controls",
          "logLevel": "silly"
        },
      ...
      <output shortened for illustration purposes>
      ...
    }
    

    If the output is empty, CFE is not configured. Otherwise, examine the REST response:

    • A 400-level response will carry an error message. See the Troubleshooting Index below.
    • If the error message is missing, incorrect, or misleading, please let us know by filing a GitHub Issue.
    • Review the logs (/var/log/restnoded/restnoded.log) to check for any errors.

    Note

    • A declaration needs to be posted at least once to each unit in order to initialize it. After that, you only need to post to one unit and the CFE configuration will be synced to the other unit. If auto-sync is enabled, the CFE configuration syncs automatically. Otherwise, manually sync the units to sync the CFE configuration.

    Tip

    If you are deploying for the first time and having issues, F5 recommends deploying an example full-stack deployment template as a working baseline. This can potentially help you isolate and identify any issues/differences with your configuration and/or environment.
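The response check described in this section can be scripted roughly as follows. This is a sketch under assumptions: classify_declare_response is a hypothetical name, and it uses a plain text match on the "message": "success" field shown in the example response rather than a full JSON parse.

```shell
# Hypothetical helper: classify the body of a /declare response read from stdin.
# Uses a simple text match, not a JSON parser.
classify_declare_response() {
    body=$(cat)
    if [ -z "$body" ]; then
        echo "not configured"
    elif printf '%s' "$body" | grep -Eq '"message"[[:space:]]*:[[:space:]]*"success"'; then
        echo "configured"
    else
        echo "error: inspect the response and restnoded.log"
    fi
}

# Example on the BIG-IP (not executed by this block):
#   curl -su admin: http://localhost:8100/mgmt/shared/cloud-failover/declare \
#       | classify_declare_response
```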


Troubleshooting and Debugging Unexpected Behavior


  1. Enable DEBUG Logging (OPTIONAL: Skip to step 2 if debug logging is already enabled)


    Set the CFE configuration Logging level to silly.

    • Download the config:

      curl -su admin: -X GET http://localhost:8100/mgmt/shared/cloud-failover/declare |  jq .declaration > cfe.json
      
    • Use your preferred editor to configure debug logging by setting "logLevel": "silly" in the controls block:

      vim cfe.json
      
    • Repost to update the config:

      curl -su admin: -X POST -d @cfe.json http://localhost:8100/mgmt/shared/cloud-failover/declare | jq .
      
    • Sync the BIG-IP config to the other unit (if auto-sync is not enabled).
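If you prefer not to edit the file by hand, the log level can be rewritten with sed. This is a minimal sketch with stated limits: set_log_level_silly and the output filename cfe-debug.json are hypothetical, and the substitution only rewrites a "logLevel" entry that already exists in the declaration; it does not add a controls block.

```shell
# Hypothetical helper: rewrite an existing "logLevel" entry to "silly".
# Reads JSON on stdin and writes the result to stdout. If the declaration
# has no "logLevel" key, the input passes through unchanged.
set_log_level_silly() {
    sed 's/"logLevel"[[:space:]]*:[[:space:]]*"[^"]*"/"logLevel": "silly"/'
}

# Example (not executed by this block):
#   set_log_level_silly < cfe.json > cfe-debug.json
```

With jq available, the filter .controls.logLevel = "silly" achieves the same result and also creates the key when it is missing.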

  2. RESET the State file

    This only needs to be run once on either unit.

    curl -su admin: -X POST -d '{"resetStateFile":true}' http://localhost:8100/mgmt/shared/cloud-failover/reset | jq .
    
  3. Run INSPECT (on *each* instance)

    Use the /inspect endpoint to validate the cloud resources (IP addresses and routes) currently associated with each unit.

    curl -su admin: http://localhost:8100/mgmt/shared/cloud-failover/inspect | jq .
    
  4. Run DRY-RUN (on the *STANDBY* instance only)

    Use the /trigger endpoint to perform a DRY-RUN and validate which objects will be re-mapped when the Standby instance fails over to Active.

    curl -su admin: -X POST -d '{"action":"dry-run"}' http://localhost:8100/mgmt/shared/cloud-failover/trigger | jq .
    

    Tip

    Check the toActive section of the response to confirm the desired objects will be re-mapped. For example, if failing over a set of expected VIP addresses, confirm those addresses are present in the toActive section.

    curl -su admin: -X POST -d '{"action":"dry-run"}' http://localhost:8100/mgmt/shared/cloud-failover/trigger | jq .addresses.interfaces.associate
    
  5. Review the debug logs (/var/log/restnoded/restnoded.log) to check for any errors. For instance, the logs may reveal that a particular service is unreachable, that CFE does not have permissions to access a service, etc.

    grep f5-cloud-failover /var/log/restnoded/restnoded.log
    

    Important

    If opening a case with F5 Support, you will be asked to provide these logs with DEBUG enabled.


    If any of the above curl calls seem to hang or take a long time to respond, it may indicate that BIG-IP is unable to reach or communicate with the cloud provider’s APIs.


    You may see network-related errors in this log, such as:

    • Function error, retrying: getaddrinfo ENOTFOUND

      Recommendation: CFE can’t resolve the API address. Check the DNS settings on BIG-IP.

    • Function error, retrying: connect ECONNREFUSED

      Recommendation: CFE can’t contact the Cloud API. Check the upstream network environment, upstream NAT/Firewall service, any security/firewall rules, etc. per above.

    Or permissions-related errors, such as:

    • Function error, retrying: Invalid token response, did not find tokenType.

    • Failover initialization failed: No storage account found! Error: No storage account found!

    • CFE hangs on the Listing Storage Accounts message

      Recommendation: Check that the BIG-IP’s RBAC Role is attached and its permissions are configured appropriately.
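The error signatures and recommendations listed above can be checked in one pass over the log. The sketch below is illustrative: suggest_from_log is a hypothetical name, the search strings are the ones quoted above, and the output labels (DNS, NETWORK, PERMISSIONS) are made up for readability.

```shell
# Hypothetical helper: scan a restnoded log for the error signatures listed
# above and print the matching recommendation once per category.
suggest_from_log() {
    log=${1:-/var/log/restnoded/restnoded.log}
    if grep -q 'getaddrinfo ENOTFOUND' "$log"; then
        echo "DNS: CFE cannot resolve the API address; check the DNS settings on BIG-IP"
    fi
    if grep -q 'connect ECONNREFUSED' "$log"; then
        echo "NETWORK: CFE cannot contact the Cloud API; check upstream NAT/firewall and security rules"
    fi
    if grep -Eq 'Invalid token response|No storage account found' "$log"; then
        echo "PERMISSIONS: check that the RBAC role is attached and its permissions are correct"
    fi
}
```

Calling suggest_from_log with no argument reads the default restnoded log; no output means none of the listed signatures were found.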

If the above commands return successfully and no obvious errors are found but the failover objects are not being discovered or remapped as expected, check the following:

  • If you are using the Explicit based configuration, ensure all desired objects are defined in the CFE declaration.
  • If you are using the Discovery via Tag based configuration, ensure any tags or labels used to identify the failover objects are present on the objects and contain no typos, such as leading or trailing spaces, incorrect capitalization, or extra characters. It may be helpful to use the cloud’s CLI to inspect the text output of the objects to confirm there are no differences that may not be visually apparent in the Cloud’s GUI.
  • If failing over routes, ensure the route prefix is an exact string match. For a cloud route with prefix ‘192.168.1.100/24’, configuring a route scoping address of ‘192.168.1.0/24’ will not work.
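Two of the checks above, exact prefix matching and stray whitespace in tags, are easy to get wrong by eye. The following sketch shows how they could be tested in shell; prefixes_match and has_edge_whitespace are hypothetical helper names, and the values you pass in would come from your declaration and your cloud CLI output.

```shell
# Hypothetical helper: succeed only if the two route prefixes are identical strings.
prefixes_match() {
    [ "$1" = "$2" ]
}

# Hypothetical helper: succeed if a tag key or value starts or ends with
# a space or a tab.
has_edge_whitespace() {
    tab=$(printf '\t')
    case "$1" in
        " "*|*" "|"$tab"*|*"$tab") return 0 ;;
        *) return 1 ;;
    esac
}

# Example (not executed by this block):
#   prefixes_match '192.168.1.100/24' '192.168.1.0/24' || echo "prefix mismatch"
```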

In some cases, permission issues may not produce any errors but still prevent CFE from discovering or remapping objects. For example, the Cloud API will simply not return objects that the BIG-IP’s RBAC role does not have permissions for. Double-check that the BIG-IP’s role has the necessary effective permissions after organizational policies, inheritance, etc. are applied.


Troubleshooting Index

Use this section for specific troubleshooting help.



I’m receiving a path not registered error when I try to post a declaration

If you are receiving this error, it means either you did not install Cloud Failover Extension, or it did not install properly. The error contains the following message:

{
    "code":404,
    "message": "Public URI path not registered. Please see /var/log/restjavad.0.log and /var/log/restnoded/restnoded.log for details.",
    ...
}

Recommendation: See Download and Install Cloud Failover Extension to install or re-install Cloud Failover Extension.




I’m receiving a 400 error when I try to post a declaration with no additional helpful message

If you are receiving this error, it typically means the provider prerequisites have not been met and there is an issue performing initialization operations.

Recommendation: Please review the provider prerequisites sections for more information.




I’m receiving a recovery operations are empty error when failover is triggered or I need to reset the state of my failover extension

If you receive this error, it means Cloud Failover Extension had a previous failure which left it in a bad state.

Recommendation: Manually ensure (either through the Cloud’s GUI or CLI) that all the desired resources are attached to the ACTIVE instance again (to return to the expected initial pre-failover state), then RESET the state file and run through the INSPECT and DRY-RUN debugging steps as described above to identify any discrepancies before attempting to fail over again.




I see some of my failover objects are not re-mapped after failover

There could be a number of reasons why some of your failover objects are not re-mapped after failover.

Recommendation: First check the logs (/var/log/restnoded/restnoded.log) to see if there are any transient or environmental discrepancies that may have led to the issue. For instance, due to load or a network event, the Cloud API may have been slow or unavailable.

If some of your failover objects are not re-mapped after failover, this could also mean Cloud Failover Extension was unable to discover or remap those objects. Check to make sure there were no changes to the environment that may have caused the objects to go undiscovered or be mismatched.

Run through the INSPECT and DRY-RUN debugging steps as described above to help determine any discrepancies and gather debug logs.




I’m not seeing any logs in /var/log/restnoded/restnoded.log after a DRY-RUN or failover

Check that the trigger call was installed in the /config/failover/tgactive and /config/failover/tgrefresh files. If the trigger call is not present, the failover extension will not be able to trigger a failover or dry-run.

Recommendation: See the above section for more information on re-installing and confirming the needed trigger call.




I’m receiving a 404 error after upgrading the BIG-IP version

F5 is currently tracking this issue (929213).

Recommendation: As a workaround, re-upload the f5-cloud-failover RPM.




Failover objects are not mapped to the Active BIG-IP after a cluster reboot

After both BIG-IP VMs have been rebooted, failover objects are sometimes not mapped to the Active BIG-IP. For example, consider the following sequence:

  1. BIG-IP 2 is Active (and has failover objects).
  2. Shutdown BIG-IP 1.
  3. Shutdown BIG-IP 2.
  4. Start BIG-IP 1.
  5. Wait 1 minute.
  6. Start BIG-IP 2.
  7. BIG-IP 1 should be Active (and have failover objects).

Failover under these conditions normally works as long as restnoded comes up before HA status is determined and tgactive is called.

If, during a reboot, the objects are mapped to the wrong BIG-IP, you can force a failover event by POSTing to the /trigger endpoint of the currently active BIG-IP.





Note

To provide feedback on Cloud Failover Extension or this documentation, you can file a GitHub Issue.