F5OS-A 1.0.0 - System Health: “tcpdump” and Other Troubleshooting Tools¶
Feature Overview¶
F5OS-A provides qkview, tcpdump, logging, and K3s and Platform Service health monitors to aid in troubleshooting.
Detailed feature overview¶
Terminology:
K3s Health Monitoring: helps end users to monitor the installation, initialization and configuration of k3s in rSeries appliances.
Platform Services Monitoring: helps end users monitor platform service state, status, and components. The monitor watches each service and records the status of the service and its components.
Tcpdump: filters packets on internal interfaces and leverages the FPGA. All native tcpdump options are supported.
There is no GUI for these items in this version.
Customer use case example¶
How do I run tcpdump?
K3s health¶
K3s health monitoring helps end users to monitor the installation, initialization and configuration of k3s in rSeries appliances.
There are three commands to see cluster health. Run these in the confd command line. The following commands show what “healthy” looks like.
cluster cluster-status summary-status "K3S cluster is initialized and ready for use."
cluster cluster-status cluster-status 0
status "2021-11-12 17:44:17.117056 - applianceMainEventLoop::Orchestration manager startup."
cluster cluster-status cluster-status 1
status "2021-11-12 17:44:17.122811 - Can now ping appliance-1.chassis.local (100.65.60.1)."
cluster cluster-status cluster-status 2
status "2021-11-12 17:44:17.536843 - Successfully ssh'd to appliance 127.0.0.1."
cluster cluster-status cluster-status 3
status "2021-11-12 17:44:22.705285 - Appliance 1 is ready in k3s cluster."
cluster cluster-status cluster-status 4
status "2021-11-12 17:44:22.705377 - K3S cluster is ready."
cluster cluster-status cluster-status 5
status "2021-11-12 17:45:37.972604 - K3s RPM update is succeeded."
cluster cluster-status cluster-status 6
status "2021-11-12 17:47:26.226093 - K3s IMAGE update is succeeded."
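For scripted checks, the status lines above can be split into a timestamp and a message. The following is a minimal Python sketch, assuming the CLI output has been captured as text in the format shown. The function name `parse_cluster_status` and the choice of "K3S cluster is ready" as the readiness marker are illustrative assumptions, not part of F5OS.

```python
import re
from datetime import datetime

# Matches lines such as:
#   status "2021-11-12 17:44:22.705377 - K3S cluster is ready."
STATUS_RE = re.compile(r'status "(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) - (.*)"')


def parse_cluster_status(text):
    """Return a list of (timestamp, message) tuples from captured CLI output."""
    events = []
    for line in text.splitlines():
        match = STATUS_RE.search(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
            events.append((stamp, match.group(2)))
    return events


# Two entries copied from the healthy output shown above.
sample = '''cluster cluster-status cluster-status 3
status "2021-11-12 17:44:22.705285 - Appliance 1 is ready in k3s cluster."
cluster cluster-status cluster-status 4
status "2021-11-12 17:44:22.705377 - K3S cluster is ready."
'''

events = parse_cluster_status(sample)
# Assumption: a message containing this text indicates a ready cluster.
cluster_ready = any("K3S cluster is ready" in msg for _, msg in events)
print(len(events), cluster_ready)
```

Lines that carry only the index (for example `cluster cluster-status cluster-status 3`) do not match the pattern and are skipped.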
show cluster events¶
% No entries found.
show service-pods¶
SERVICE POD POD POD
SERVICE CLUSTER SLOT POD RESTART POD IMAGE
SERVICE NAME CLUSTER IP PORT ID STATUS COUNT STATE POD MESSAGE VERSION
---------------------------------------------------------------------------------------------------------------
coredns 0 1 true 1 Running Running Successfully 1.8.3
kube-flannel 0 1 true 0 Running Running Successfully 0.13.0
kube-multus 0 1 true 0 Running Running Successfully 3.6.0
lb-port-443 0 1 true 0 Running Running Successfully v0.2.0
local-path-provisioner 0 1 true 3 Running Running Successfully v0.0.19
metrics-server 100.75.159.10 443 1 true 2 Running Running Successfully v0.3.6
pause 0 1 true 0 Running Running Successfully 3.1
traefik-ingress-lb 0 1 true 0 Running Running Successfully 2.4.8
virt-api 100.75.75.51 443 1 true 0 Running Running Successfully 1.0.17
virt-controller 0 1 true 1 Running Running Successfully 1.0.17
virt-handler 0 1 true 0 Running Running Successfully 1.0.17
Platform services health¶
Platform Services Monitoring helps end users monitor platform service state, status, and components. The monitor watches each service and records the status of the service and its components.
The following confd commands show how to enumerate the services and then check their health.
Enumerate all services. This command produces a large amount of output because it lists the health of every service. Only one of the services, alert-service, is shown in this example.
show system health components component appliance services¶
services appliance/services/alert-service
state name alert-service
state health ok
state severity info
NAME DESCRIPTION HEALTH SEVERITY VALUE UPDATED AT
-----------------------------------------------------------------------------------------------------------------------------------------------------
container:event:attach Container attach event ok info 0 2021-10-26T20:49:41Z
container:event:die Container die event ok info 0 2021-10-26T20:49:41Z
container:event:exec-create Container exec create event ok info 0 2021-10-26T20:49:41Z
container:event:exec-detach Container exec detach event ok info 0 2021-10-26T20:49:41Z
container:event:exec-die Container exec die event ok info 0 2021-10-26T20:49:41Z
container:event:exec-start Container exec start event ok info 0 2021-10-26T20:49:41Z
container:event:kill Container kill event ok info 0 2021-10-26T20:49:41Z
container:event:restart Container restart event ok info 0 2021-10-26T20:49:41Z
container:event:restart-last-hour Container restart count in the last hour ok info 0 2021-10-26T20:49:41Z
container:event:start Container start event ok info 0 2021-10-26T20:49:41Z
container:event:stop Container stop event ok info 0 2021-10-26T20:49:41Z
container:running Container running ok info false 2021-10-26T20:49:41Z
service:component:version Service component version ok info 2021-10-26T20:49:41Z
service:file-watcher-error Service file watcher error ok info false 2021-10-26T20:49:41Z
service:message-error Service health monitor error ok info 2021-11-12T17:44:27Z
service:message-error-count Service health monitor error count ok info 0 2021-10-26T20:49:41Z
service:overall-health Service reported overall health ok info Ok 2021-11-12T17:44:27Z
service:ready Service ready status ok info true 2021-11-12T17:44:27Z
service:restart-count Service total restart count ok info 0 2021-11-12T17:44:27Z
service:restart-last-hour Service restart count in the last hour ok info 0 2021-10-26T20:49:41Z
service:startup-timestamp Service startup timestamp ok info 2021-11-12T17:44:15.213482808Z 2021-11-12T17:44:27Z
service:version Service version ok info 3.7.1 2021-11-12T17:44:27Z
...
To specify a single service, name the specific service.
show system health components component appliance services appliance/services/diag-agent¶
services appliance/services/diag-agent
state name diag-agent
state health ok
state severity info
NAME DESCRIPTION HEALTH SEVERITY VALUE UPDATED AT
-----------------------------------------------------------------------------------------------------------------------------------------------------
container:event:attach Container attach event ok info 0 2021-10-26T20:49:41Z
container:event:die Container die event ok info 0 2021-10-26T20:49:41Z
container:event:exec-create Container exec create event ok info 0 2021-11-08T21:57:36Z
container:event:exec-detach Container exec detach event ok info 0 2021-10-26T20:49:41Z
container:event:exec-die Container exec die event ok info 0 2021-10-26T20:49:41Z
container:event:exec-start Container exec start event ok info 0 2021-11-08T21:57:36Z
container:event:kill Container kill event ok info 0 2021-11-12T17:39:56Z
container:event:restart Container restart event ok info 0 2021-11-12T17:44:27Z
container:event:restart-last-hour Container restart count in the last hour ok info 0 2021-10-26T20:49:41Z
container:event:start Container start event ok info 0 2021-10-26T20:49:41Z
container:event:stop Container stop event ok info 0 2021-10-26T20:49:41Z
container:running Container running ok info true 2021-11-15T21:31:27Z
service:component:version Service component version ok info 2021-11-12T17:44:27Z
service:file-watcher-error Service file watcher error ok info false 2021-10-26T20:49:41Z
service:message-error Service health monitor error ok info 2021-11-12T17:44:27Z
service:message-error-count Service health monitor error count ok info 0 2021-10-26T20:49:41Z
service:overall-health Service reported overall health ok info Ok 2021-11-12T17:44:27Z
service:ready Service ready status ok info true 2021-11-12T17:44:27Z
service:restart-count Service total restart count ok info 0 2021-11-12T17:44:27Z
service:restart-last-hour Service restart count in the last hour ok info 0 2021-10-26T20:49:41Z
service:startup-timestamp Service startup timestamp ok info 2021-11-12T17:44:15.095531479Z 2021-11-12T17:44:27Z
service:version Service version ok info 1.45.0 2021-11-12T17:44:27Z
“Appliance services” is one of three trees that contain system health information:
firmware (show system health components component appliance firmware)
hardware (show system health components component appliance hardware)
services (show system health components component appliance services)
This example uses the “hardware” tree to show CPU health:
show system health components component appliance hardware appliance/hardware/cpu¶
KEY NAME HEALTH SEVERITY NAME DESCRIPTION HEALTH SEVERITY VALUE UPDATED AT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
appliance/hardware/cpu CPU ok info cpu:core:temperature CPU core temperature (C) ok info 2021-10-26T20:49:28Z
cpu:state:fatal-error-fault Fatal error ok info 0 2021-11-15T21:33:17Z
cpu:state:fivr-fault FIVR Fault ok info 0 2021-11-15T21:33:17Z
cpu:state:hw-correctable-error-fault Hardware correctable error ok info 0 2021-11-15T21:33:17Z
cpu:state:internal-error-fault internal unrecoverable error ok info 0 2021-11-15T21:33:17Z
cpu:state:machine-check-error Machine check error ok info 0 2021-11-15T21:33:17Z
cpu:state:non-fatal-error-fault Non-fatal error ok info 0 2021-11-15T21:33:17Z
cpu:state:processor-hot-fault Processor hot Fault ok info 0 2021-11-15T21:33:17Z
cpu:state:thermal-trip-fault Thermal Trip Fault ok info 0 2021-11-15T21:33:17Z
rasdaemon:extlog:invalid-address RAS Extlog invalid address event ok info 2021-11-15T21:33:17Z
rasdaemon:extlog:master-abort RAS Extlog master abort event ok info 2021-11-15T21:33:17Z
rasdaemon:extlog:memory-sparing RAS Extlog memory sparing event ok info 2021-11-15T21:33:17Z
rasdaemon:extlog:mirror-broken RAS Extlog mirror broken event ok info 2021-11-15T21:33:17Z
rasdaemon:extlog:multi-bit-ecc RAS Extlog mullti-bit ECC event ok info 2021-11-15T21:33:17Z
rasdaemon:extlog:multi-symbol-chipkill-ecc RAS Extlog multi-symbol chipkill ECC event ok info 2021-11-15T21:33:17Z
rasdaemon:extlog:no-error RAS Extlog no error event ok info 2021-11-15T21:33:17Z
rasdaemon:extlog:parity-error RAS Extlog parity error event ok info 2021-11-15T21:33:17Z
rasdaemon:extlog:physical-memory-map-out-event RAS Extlog physical memory map-out event ok info 2021-11-15T21:33:17Z
rasdaemon:extlog:scrub-corrected-error RAS Extlog scrub corrected error ok info 2021-11-15T21:33:17Z
rasdaemon:extlog:scrub-uncorrected-error RAS Extlog scrub uncorrected error ok info 2021-11-15T21:33:17Z
rasdaemon:extlog:single-bit-ecc RAS Extlog single-bit ECC event ok info 2021-11-15T21:33:17Z
rasdaemon:extlog:single-symbol-chipkill-ecc RAS Extlog single-symbol chipkill ECC event ok info 2021-11-15T21:33:17Z
rasdaemon:extlog:target-abort RAS Extlog target abort event ok info 2021-11-15T21:33:17Z
rasdaemon:extlog:unknown RAS Extlog unknown event ok info 2021-11-15T21:33:17Z
rasdaemon:extlog:unknown-type RAS Extlog unknown type ok info 2021-11-15T21:33:17Z
rasdaemon:extlog:watchdog-timeout RAS Extlog watchdog timeout event ok info 2021-11-15T21:33:17Z
rasdaemon:mce:address-command-error RAS MCE address/Command error ok info 0 2021-11-15T21:33:18Z
rasdaemon:mce:generic-undefined-request RAS MCE generic undefined request ok info 0 2021-11-15T21:33:18Z
rasdaemon:mce:memory-read-error RAS MCE memory read error ok info 0 2021-11-15T21:33:18Z
rasdaemon:mce:memory-scrubbing-error RAS MCE memory scrubbing error ok info 0 2021-11-15T21:33:18Z
rasdaemon:mce:memory-write-error RAS MCE memory write error ok info 0 2021-11-15T21:33:18Z
rasdaemon:mce:processor-temp-throttling RAS MCE processor temperature throttling ok info 0 2021-11-15T21:33:18Z
rasdaemon:mce:unknown-event RAS MCE unknown error ok info 0 2021-11-15T21:33:18Z
v6h:cpu-fault:msmi-bit MSMI fault ok info 0 2021-11-15T21:33:17Z
v6h:power-domain:cpu:0p6v-vttabcd CPU_0P6V_VTT_ABCD power fault ok info 0 2021-11-15T21:33:17Z
v6h:power-domain:cpu:0p6v-vttefgh CPU_0P6V_VTT_EFGH power fault ok info 0 2021-11-15T21:33:17Z
v6h:power-domain:cpu:0p85v-pvsa CPU_0P85V_PVSA power fault ok info 0 2021-11-15T21:33:17Z
v6h:power-domain:cpu:1p0v-pvccana CPU_1P0V_PVCCANA power fault ok info 0 2021-11-15T21:33:17Z
v6h:power-domain:cpu:1p0v-pvccio CPU_1P0V_PVCCIO power fault ok info 0 2021-11-15T21:33:17Z
v6h:power-domain:cpu:1p2v-vddqabcd CPU_1P2V_VDDQ_ABCD power fault ok info 0 2021-11-15T21:33:17Z
v6h:power-domain:cpu:1p2v-vddqefgh CPU_1P2V_VDDQ_EFGH power fault ok info 0 2021-11-15T21:33:17Z
v6h:power-domain:cpu:1p8v-cpu CPU_1P8V_CPU power fault ok info 0 2021-11-15T21:33:17Z
v6h:power-domain:cpu:1p8v-pvccin CPU_1P8V_PVCCIN power fault ok info 0 2021-11-15T21:33:17Z
v6h:power-domain:cpu:2p5v-vppabcd CPU_2P5V_VPP_ABCD power fault ok info 0 2021-11-15T21:33:17Z
v6h:power-domain:cpu:2p5v-vppefgh CPU_2P5V_VPP_EFGH power fault ok info 0 2021-11-15T21:33:17Z
v6h:thermal-fault:cpu:mem-hot CPU_MEMHOT thermal fault ok info 0 2021-11-15T21:33:17Z
v6h:thermal-fault:cpu:mem-trip CPU_MEMTRIP thermal fault ok info 0 2021-11-15T21:33:17Z
v6h:thermal-fault:pch:hot PCH_HOT thermal fault ok info 0 2021-11-15T21:33:17Z
v6h:thermal-fault:pch:vnn-vr-hot PCH_VNN_VR_HOT thermal fault ok info 0 2021-11-15T21:33:17Z
v6h:thermal-fault:pvccin-vr-hot PVCCIN_VR_HOT thermal fault ok info 0 2021-11-15T21:33:17Z
v6h:thermal-fault:vcciosa-vr-hot VCCIOSA_VR_HOT thermal fault ok info 0 2021-11-15T21:33:17Z
Here are the available health check endpoints for firmware and hardware, as listed by confd tab completion.
show system health components component appliance firmware ¶
Possible completions:
appliance/firmware/bios The firmware unique identifier
appliance/firmware/bios/me The firmware unique identifier
appliance/firmware/cpld The firmware unique identifier
appliance/firmware/drives The firmware unique identifier
appliance/firmware/drives/u.2-slot1 The firmware unique identifier
appliance/firmware/drives/u.2-slot2 The firmware unique identifier
appliance/firmware/fpga/asw The firmware unique identifier
appliance/firmware/fpga/atse0 The firmware unique identifier
appliance/firmware/fpga/atse1 The firmware unique identifier
appliance/firmware/fpga/nso The firmware unique identifier
appliance/firmware/lcd The firmware unique identifier
appliance/firmware/lcd/app The firmware unique identifier
appliance/firmware/lcd/bootloader The firmware unique identifier
appliance/firmware/lop/bootloader The firmware unique identifier
appliance/firmware/lop/lop-app The firmware unique identifier
appliance/firmware/tpm/sirr The firmware unique identifier
show system health components component appliance hardware ¶
Possible completions:
appliance/hardware/cpu The hardware unique identifier
appliance/hardware/cpu/pcie The hardware unique identifier
appliance/hardware/drives The hardware unique identifier
appliance/hardware/fpga/asw The hardware unique identifier
appliance/hardware/fpga/asw/ports/11.0 The hardware unique identifier
appliance/hardware/fpga/asw/ports/12.0 The hardware unique identifier
appliance/hardware/fpga/asw/ports/13.0 The hardware unique identifier
appliance/hardware/fpga/asw/ports/14.0 The hardware unique identifier
appliance/hardware/fpga/asw/ports/15.0 The hardware unique identifier
appliance/hardware/fpga/asw/ports/16.0 The hardware unique identifier
appliance/hardware/fpga/asw/ports/17.0 The hardware unique identifier
appliance/hardware/fpga/asw/ports/18.0 The hardware unique identifier
appliance/hardware/fpga/asw/ports/19.0 The hardware unique identifier
appliance/hardware/fpga/asw/ports/20.0 The hardware unique identifier
appliance/hardware/fpga/atse0 The hardware unique identifier
appliance/hardware/fpga/atse1 The hardware unique identifier
appliance/hardware/fpga/nso The hardware unique identifier
appliance/hardware/fpga/nso/ports/1.0 The hardware unique identifier
appliance/hardware/fpga/nso/ports/2.0 The hardware unique identifier
appliance/hardware/fpga/nso/ports/3.0 The hardware unique identifier
appliance/hardware/fpga/nso/ports/4.0 The hardware unique identifier
appliance/hardware/fpga/nso/ports/5.0 The hardware unique identifier
appliance/hardware/fpga/nso/ports/6.0 The hardware unique identifier
appliance/hardware/fpga/nso/ports/7.0 The hardware unique identifier
appliance/hardware/fpga/nso/ports/8.0 The hardware unique identifier
appliance/hardware/fpga/nso/ports/9.0 The hardware unique identifier
appliance/hardware/fpga/nso/ports/10.0 The hardware unique identifier
appliance/hardware/lop The hardware unique identifier
appliance/hardware/memory The hardware unique identifier
appliance/hardware/optics The hardware unique identifier
appliance/hardware/optics/optic1 The hardware unique identifier
appliance/hardware/optics/optic2 The hardware unique identifier
appliance/hardware/optics/optic3 The hardware unique identifier
appliance/hardware/optics/optic4 The hardware unique identifier
appliance/hardware/optics/optic5 The hardware unique identifier
appliance/hardware/optics/optic6 The hardware unique identifier
appliance/hardware/optics/optic7 The hardware unique identifier
appliance/hardware/optics/optic8 The hardware unique identifier
appliance/hardware/optics/optic9 The hardware unique identifier
appliance/hardware/optics/optic10 The hardware unique identifier
appliance/hardware/optics/optic11 The hardware unique identifier
appliance/hardware/optics/optic12 The hardware unique identifier
appliance/hardware/optics/optic13 The hardware unique identifier
appliance/hardware/optics/optic14 The hardware unique identifier
appliance/hardware/optics/optic15 The hardware unique identifier
appliance/hardware/optics/optic16 The hardware unique identifier
appliance/hardware/optics/optic17 The hardware unique identifier
appliance/hardware/optics/optic18 The hardware unique identifier
appliance/hardware/optics/optic19 The hardware unique identifier
appliance/hardware/optics/optic20 The hardware unique identifier
appliance/hardware/psu The hardware unique identifier
appliance/hardware/qat The hardware unique identifier
appliance/hardware/tpm The hardware unique identifier
You would then access these via the ‘system health components component appliance’ path:
show system health components component appliance firmware appliance/firmware/<module>
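The command pattern above can be generated from any key in the tab-completion lists. This is a small Python sketch under the assumption that every key has the form `appliance/<tree>/...`, as in the listings shown. The helper name `health_command` is hypothetical.

```python
# The three health trees named in this document.
TREES = ("firmware", "hardware", "services")


def health_command(key):
    """Build the confd show command for a component key such as
    'appliance/hardware/cpu'. Raises ValueError for unexpected keys."""
    parts = key.split("/")
    if len(parts) < 3 or parts[0] != "appliance" or parts[1] not in TREES:
        raise ValueError(f"unexpected component key: {key!r}")
    return f"show system health components component appliance {parts[1]} {key}"


print(health_command("appliance/firmware/bios"))
print(health_command("appliance/hardware/fpga/nso/ports/1.0"))
```

The tree name is taken from the second path segment of the key, so the same helper covers firmware, hardware, and services keys.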
API¶
The REST API has endpoints for these services. This API call goes to the hardware system health category. Output is abbreviated.
curl -sku admin:admin -H "Accept: application/yang-data+json" https://<rSeries_device>:8888/restconf/data/openconfig-system:system/f5-system-health:health/components/component=appliance/hardware | more
{
"f5-system-health:hardware": [
{
"key": "appliance/hardware/cpu",
"state": {
"name": "CPU",
"health": "ok",
"severity": "info"
},
"attributes": {
"attribute": [
{
"name": "cpu:core:temperature",
"description": "CPU core temperature (C)",
"health": "ok",
"severity": "info",
"value": "",
"updatedAt": "2021-10-26T20:49:28Z"
},
{
"name": "cpu:state:fatal-error-fault",
"description": "Fatal error",
"health": "ok",
"severity": "info",
"value": "0",
"updatedAt": "2021-11-15T21:33:17Z"
},
{
"name": "cpu:state:fivr-fault",
"description": "FIVR Fault",
"health": "ok",
"severity": "info",
"value": "0",
"updatedAt": "2021-11-15T21:33:17Z"
},
{
:
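A response shaped like the one above can be scanned for attributes whose health is not "ok". The following Python sketch assumes the JSON layout shown in the abbreviated output (a top-level `f5-system-health:hardware` list whose entries carry `key`, `state`, and `attributes.attribute`). The function name `unhealthy_attributes` is illustrative, and the embedded sample is a shortened copy of the output rather than a full response.

```python
import json

# Abbreviated sample mirroring the response shown above; a real response
# contains many more components and attributes.
sample = json.loads("""
{
  "f5-system-health:hardware": [
    {
      "key": "appliance/hardware/cpu",
      "state": {"name": "CPU", "health": "ok", "severity": "info"},
      "attributes": {
        "attribute": [
          {"name": "cpu:core:temperature", "health": "ok",
           "severity": "info", "value": ""},
          {"name": "cpu:state:fivr-fault", "health": "ok",
           "severity": "info", "value": "0"}
        ]
      }
    }
  ]
}
""")


def unhealthy_attributes(response, category="f5-system-health:hardware"):
    """Yield (component key, attribute name, health) for non-ok attributes."""
    for component in response.get(category, []):
        attributes = component.get("attributes", {}).get("attribute", [])
        for attr in attributes:
            if attr.get("health") != "ok":
                yield component["key"], attr["name"], attr.get("health")


problems = list(unhealthy_attributes(sample))
print(problems)  # an empty list means every attribute reported "ok"
```

To use it against a device, the text returned by the curl call would be passed through `json.loads` in place of the embedded sample.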
Tcpdump¶
Tcpdump filters packets on internal interfaces and leverages the FPGA. All native tcpdump options are supported.
You access tcpdump from the confd CLI while logged in as admin. Here are some basic tcpdump commands; all native tcpdump options should be supported.
Command | Comment |
---|---|
system diagnostics tcpdump | Capture all packets on all interfaces |
system diagnostics tcpdump -i 1.0 | Capture packets only on interface 1.0 |
system diagnostics tcpdump -i 1.0 icmp | Capture only ICMP packets on interface 1.0 |
system diagnostics tcpdump -i 1.0 -w t.pcap | Capture packets on interface 1.0 and write them to /var/F5/system/shared/tcpdump/t.pcap |
system diagnostics tcpdump -r "/var/F5/system/shared/tcpdump/t.pcap" | Read the tcpdump capture file t.pcap |
system diagnostics tcpdump ip[9]==1 | Match if the byte at offset 9 of the IP header (the protocol field) equals 1, i.e. ICMP |
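The `ip[9]==1` filter indexes the IP header by zero-based byte offset, and offset 9 of an IPv4 header holds the protocol number (1 for ICMP). The Python sketch below packs a minimal IPv4 header to illustrate where that byte sits. The helper name `build_ipv4_header`, the documentation-range addresses, and the zeroed checksum are assumptions made for illustration only.

```python
import struct

ICMP = 1  # IANA protocol number for ICMP


def build_ipv4_header(protocol):
    """Pack a minimal 20-byte IPv4 header (checksum left as 0 for brevity)."""
    version_ihl = (4 << 4) | 5          # IPv4, header length 5 * 4 = 20 bytes
    return struct.pack(
        "!BBHHHBBH4s4s",
        version_ihl,                    # offset 0: version and header length
        0,                              # offset 1: DSCP/ECN
        20,                             # offsets 2-3: total length
        0,                              # offsets 4-5: identification
        0,                              # offsets 6-7: flags and fragment offset
        64,                             # offset 8: TTL
        protocol,                       # offset 9: protocol
        0,                              # offsets 10-11: header checksum
        bytes([192, 0, 2, 1]),          # offsets 12-15: source address
        bytes([192, 0, 2, 2]),          # offsets 16-19: destination address
    )


header = build_ipv4_header(ICMP)
# ip[9] in the filter expression refers to this zero-based byte offset.
print(len(header), header[9] == ICMP)
```

The same indexing explains other byte-offset filters: changing the compared value to 6 or 17 would select TCP or UDP respectively.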
Logs and log level¶
The log level is configurable at the system component level. For more information about configuring the log level, see F5OS-A - Software Architecture Overview.
The primary log file is /var/F5/system/log/velos.log.
To have tcpdump log when it is started, set the severity of its software component to DEBUG:
system logging sw-components sw-component tcpdumpd-master config severity DEBUG
In confd:
file show log/system/velos.log | include tcpdump
As root:
tail -f /var/F5/system/log/velos.log | grep tcpdump
Qkview¶
The GUI supports generating a “System Report”, also known as a qkview: (1) navigate to System Settings :: System Reports, (2) generate the qkview, (3) upload it to iHealth, and (4) access it.
You can also manage qkviews in confd. Here are the confd tab completion suggestions for qkview:
system diagnostics qkview ¶
Possible completions:
cancel Cancel a qkview in progress
capture Start collecting diagnostics data
delete Delete a qkview file
list List qkview files
status Get the status of a qkview in progress
Alerts / Alarms¶
In confd:
show system alarms¶
% No entries found.
In the GUI, navigate to System Settings :: Alarms & Events
Alerts are monitored on the ZMQ bus by a service called alert-service. Because alert-service is a platform service, you can check its health with confd.
show system health components component appliance services appliance/services/alert-service state¶
state name alert-service
state health ok
state severity info
Remove the ‘state’ parameter to see metadata related to a service.
LCD diagnostics¶
The front-panel LCD has a “Health” button on the touch screen that you can tap to run LCD diagnostics. A video explains the routine.
Firmware diagnostics¶
Firmware diagnostics are gathered automatically by diag-agent. You can find the diag-agent output in the “Health” portion of the qkview analyzer on iHealth.
There is also a dedicated section for diag-agent in the Files menu.
AOM¶
Access the AOM by pressing ESC + ( while connected to the front-panel serial port.
AOM Command Menu:
A --- Reset AOM
B --- Set baud rate
I --- Display platform information
P --- Power on/off host subsystem
R --- Reset host subsystem
U --- Front panel USB port
Q --- Quit menu and return to console
Enter Command: I
Host subsystem information
Serial number : f5-nejk-ojpm
Power status : on
Front-panel USB port : enabled next boot
Power on self test status : passed
Runtime status : healthy
Firmware versions
AOM firmware version : 1.00.213.0.1
AOM bootloader version : 1.02.062.0.1
CPLD version : 02.0A.00
AOM management network configuration
IPv4 address : not available
IPv4 netmask : not available
IPv4 gateway : not available
IPv6 address : not available
MAC address : not available
SSH session idle timeout : not available
Power supply 1 status : Present, Input OK, Output OK
Manufacturer : Murata-PS
Model : MW2100
Serial number : FZ2104Q70043
Power supply 2 status : Present, Input OK, Output OK
Manufacturer : Murata-PS
Model : MW2100
Serial number : FZ2104Q70247
AOM Command Menu:
A --- Reset AOM
B --- Set baud rate
I --- Display platform information
P --- Power on/off host subsystem
R --- Reset host subsystem
U --- Front panel USB port
Q --- Quit menu and return to console
Enter Command: Q
[root@appliance-1 ~]#