SPK Coremond

Overview

The F5 Coremond component runs as a DaemonSet on Service Proxy for Kubernetes (SPK). A core file is a snapshot of the memory and register state of a process at the moment it terminates unexpectedly because of an event that triggers default signal handling. Core files are generated either by a third-party handler or by the kernel itself, and can be used for root-cause analysis.

Because the Coremond pod does not have access to the host operating system, Coremond monitors the /var/crash folder, which is mapped to a volume, to detect new or updated core files. When Coremond starts, it reads the core pattern from /proc/sys/kernel/core_pattern to determine whether the configured pattern is supported.

On the OpenShift platform, SPK stores all generated core files in a single directory on the host, /var/lib/systemd/coredump. This directory is not created by default. You can create it manually or enable it during installation. F5 recommends enabling the directory during installation.

Prerequisites

Ensure you have the following:

  1. A working cluster on the OpenShift platform.

  2. A Linux-based workstation.

  3. A core_pattern file located at /proc/sys/kernel/core_pattern. Some of the supported core patterns are:

    • On the OpenShift platform, the default core dump handler is systemd-coredump, which writes core files with an xz, lz4, or zst extension. For example: |/usr/lib/systemd/systemd-coredump %P %u %g %s %t 9223372036854775808 %h

    • On Robin.io, the native kernel core pattern must be /var/crash/core.%e.%p.%h.%t; otherwise, an error is returned.

      Specifier   Description
      %h          Hostname
      %e          Executable filename
      %p          PID of the process
      %t          UNIX time of the dump
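To make the specifiers concrete, the following Python sketch expands a file-based core pattern the way the Linux kernel documents it in core(5): each known specifier is replaced by its value, `%%` becomes a literal `%`, and unknown specifiers or a lone trailing `%` are dropped. The function name `expand_core_pattern` and the sample values are illustrative assumptions and are not part of Coremond.

```python
def expand_core_pattern(pattern, values):
    """Expand %-specifiers in a file-based core_pattern.

    `values` maps a specifier letter (for example "e") to its value.
    This is an illustration only; the real expansion happens in the kernel.
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch != "%":
            out.append(ch)
            i += 1
            continue
        if i + 1 == len(pattern):
            break  # a single % at the end of the template is dropped
        spec = pattern[i + 1]
        # "%%" yields a literal "%"; unknown specifiers expand to nothing
        out.append("%" if spec == "%" else values.get(spec, ""))
        i += 2
    return "".join(out)


# Hypothetical values for a crashed "observer" process
example = expand_core_pattern(
    "/var/crash/core.%e.%p.%h.%t",
    {
        "e": "observer",
        "p": "29",
        "h": "f5-toda-observer-788ddcd596-6qjpg",
        "t": "1726480067",
    },
)
print(example)
# /var/crash/core.observer.29.f5-toda-observer-788ddcd596-6qjpg.1726480067
```

With these values, the result has the same shape as the source core file name that appears in the Coremond log output later in this document.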

Note: F5 recommends installing Coremond before any other F5 components, because components installed earlier may generate core files before Coremond is running.

Platform-Specific Core Patterns

  • Generic: Ubuntu-based platforms use Apport for crash reporting, and this pattern ensures that core dumps are handled correctly by Apport.

    |/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E
    
  • OCP: OCP uses systemd-coredump to capture and process core dumps. The pattern correctly passes the process ID (%P), user ID (%u), group ID (%g), signal (%s), timestamp (%t), and other relevant metadata.

    |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e
    
  • Robin: Robin.io uses a traditional file-based core dump storage format, where…

    /var/crash/core.%e.%p.%h.%t
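The three patterns above differ in one structural way: a pattern that starts with `|` pipes the dump to a handler program, while any other pattern is a file name template. The following Python sketch classifies a pattern string on that basis. It is only an illustration of the distinction, not Coremond's actual validation logic, and the function names and return labels are assumptions.

```python
import posixpath


def classify_core_pattern(pattern):
    """Return a rough label for a kernel core_pattern string."""
    pattern = pattern.strip()
    if pattern.startswith("|"):
        # Piped pattern: the first token after "|" is the handler program
        tokens = pattern[1:].split()
        handler = posixpath.basename(tokens[0]) if tokens else ""
        if handler in ("apport", "systemd-coredump"):
            return handler
        return "other"
    if pattern.startswith("/"):
        # Absolute path template, such as the Robin.io pattern
        return "file"
    return "other"


def read_core_pattern(path="/proc/sys/kernel/core_pattern"):
    """Read the active pattern from the host (Linux only)."""
    with open(path) as handle:
        return handle.read().strip()


print(classify_core_pattern("|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e"))
# systemd-coredump
print(classify_core_pattern("/var/crash/core.%e.%p.%h.%t"))
# file
```

On a Linux host, `classify_core_pattern(read_core_pattern())` gives a quick indication of which of the three cases applies.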
    

Configure Rotation and Retention

This section outlines the environment variables used to configure the core file retention, rotation, and cleanup of Coremond. These variables allow you to manage retention durations, set file limits per process, and define rotation policies.

Environment Variable         Default Value   Description
COREMON_RETENTION_INTERVAL   5m              Time frame during which additional core dumps from the same process are ignored once the COREMON_CORES_MAX_FILES limit is reached.
COREMON_CORES_MAX_FILES      3               Maximum number of core files allowed for the same process. This limit guards against continuous crashes and rotations.
COREMON_RETENTION            0               Duration to keep core files before deletion. This also applies to the final core file copied to the volume. Set the value to 0 to disable retention.
COREMON_CORES_INTERVAL       5m              Interval at which Coremond scans for and deletes core files older than the COREMON_RETENTION period.
COREMON_DELETE_SRC           true            Deletes the source core files generated by the kernel from the host path /home/crash/f5.
COREMON_ROTATE               false           Replaces old core files with new ones when the COREMON_CORES_MAX_FILES limit is reached. Rotation occurs only after the COREMON_RETENTION_INTERVAL has elapsed and Coremond continues processing core files for that process.
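As a rough illustration of how COREMON_RETENTION and the periodic scan relate, the following Python sketch selects the core files that a cleanup pass would delete. It is one reading of the table above and not Coremond's implementation. The duration syntax (a number followed by s, m, or h) is an assumption inferred from the default value 5m, and the function names are hypothetical.

```python
def parse_duration(text):
    """Convert "0", "30s", "5m", or "2h" to seconds (assumed format)."""
    text = text.strip()
    units = {"s": 1, "m": 60, "h": 3600}
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)


def expired_cores(mtimes, retention, now):
    """Return names of core files older than the retention period.

    `mtimes` maps a file name to its modification time in seconds.
    A retention of 0 disables cleanup, as the table describes.
    """
    keep_for = parse_duration(retention)
    if keep_for == 0:
        return []
    return sorted(name for name, mtime in mtimes.items() if now - mtime > keep_for)


# Hypothetical files: one written 1000 seconds ago, one 100 seconds ago
files = {"core.old.gz": 0, "core.new.gz": 900}
print(expired_cores(files, "5m", now=1000))
# ['core.old.gz']
print(expired_cores(files, "0", now=1000))
# []
```

In this model, a pass like `expired_cores` would run once every COREMON_CORES_INTERVAL.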

Procedures

Installation

Obtain the [TAG/Version] from the CNE 2.1.0 tarball.

  1. Install Coremond on OpenShift and Tanzu platforms using the following syntax:

    helm install coremond tar/<helm-chart>.tgz \
      -f <values>.yaml -n <project>
    

    For example:

    helm install coremond tar/coremond-0.7.27-10.0.14.tgz -n coremond
    
  2. Edit the values.yaml file to suit your use case and requirements. The following are some of the mandatory and optional settings that can be configured in the values.yaml file:

    a. Mandatory settings:

    • Override the image settings by specifying the custom image values:

      image:
        repository: repo.f5.com/images/f5-toda-docker
        name: f5-coremond
        tag: v
        pullPolicy: IfNotPresent
      
    • Coremond supports node selectors and node affinity to control which nodes in the Kubernetes cluster the Coremond pod is scheduled on. By default, Coremond runs on all worker nodes.
      To run the pod only on the node named worker-node, configure both nodeSelector and affinity as shown in the following example.

      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - worker-node
      nodeSelector:
        kubernetes.io/hostname: worker-node
      

    b. Optional settings:

    • Coremond supports storing core files directly in a host directory instead of using Persistent Volumes (PVs). This eliminates the need for ReadWriteMany volumes and shared storage when multiple Coremond pods are deployed. By default, this option is disabled and PVs are used.

      To store core files on the host instead of on PVs, set the following value to true in the values.yaml file:

      useHostPath: true
      
    • To adjust the log level, set COREMON_LOG_LEVEL by adding the following to the values.yaml file:

      env:
        - name: COREMON_LOG_LEVEL
          value: "debug"
      
    • Coremond requires a PV with ReadWriteMany (RWX) access. If the default storage class does not support RWX, the Coremond pod may remain in the Pending state. To avoid this, override the storageClass parameter in the values.yaml file with a storage class that supports RWX.

      The following example overrides the storage class:

      persistence:
        accessMode: ReadWriteMany
        storageClass: your-rwx
      
    • To override the resources settings, specify the custom resources values in the values.yaml file as shown in the following example:

      resources:
        limits:
          cpu: 100m
          memory: 128Mi
        requests:
          cpu: 100m
          memory: 128Mi
      
    • To disable the qkview process, set the following in the values.yaml file:

      f5_csm_qkview:
        enabled: false
      
    • To override the fluentbit_sidecar image settings, specify the custom image values as shown in the following example:

      fluentbit_sidecar:
        image:
          repository: repo.f5.com/images/f5-toda-docker
          name: f5-fluentbit
          tag: v
          pullPolicy: IfNotPresent
      
    • To override the fluentbit_sidecar resources settings, specify the custom resources values as shown in the following example:

      fluentbit_sidecar:  
        resources:  
          limits:
            cpu: "0.5"
            memory: "512Mi"
          requests:
            cpu: "0.25"
            memory: "256Mi"
      
    • To override the fluentbit_sidecar security context settings, specify the custom securityContext values as shown in the following example:

      fluentbit_sidecar:
        securityContext:
          allowPrivilegeEscalation: false
          # runAsUser: 10000
      
    • To override the fluentbit_sidecar additional settings, specify the custom fluentbit values as shown in the following example:

      fluentbit_sidecar:
        fluentbit:
          # Interval to flush output (seconds)
          flush_interval: 1
          # Error/warning/info/debug/trace
          logLevel: debug
          # Pipe reading parameters
          input:
            pipes:
              bufSize: 8096
              intervalSec: 1
              intervalNsec: 0
          tls:
            enabled: false
            # TLS debug verbosity level, values: 0 (No debug), 1 (Error), 2 (State change), 3 (Informational) and 4 (Verbose)
            debug: 1
            # Force certificate validation
            verify: Off
            # key string known by the remote Fluentd used for authorization.
            shared_key: f5-toda-shared-key
      fluentd:
        host: '127.0.0.1'
        port: 54321
      
    • To disable the fluentbit_sidecar container, set fluentbit_sidecar.enabled to false in the values.yaml file:

      fluentbit_sidecar:
        enabled: false
      

Generate a Core File

To generate a core file, follow these steps:

  1. Run the following command to list the pods.

    kubectl get pods -A
    

    Sample output with the list of pods:

    NAME                                          READY   STATUS    RESTARTS   AGE
    client                                        1/1     Running   0          2m28s
    dssm-f5-dssm-db-0                             2/2     Running   0          2m26s
    dssm-f5-dssm-db-1                             2/2     Running   0          96s
    dssm-f5-dssm-sentinel-0                       2/2     Running   0          2m26s
    dssm-f5-dssm-sentinel-1                       2/2     Running   0          90s
    f5-cert-manager-84f857f786-gk6xq              1/1     Running   0          4m10s
    f5-cert-manager-cainjector-695866d7ff-m2h2g   1/1     Running   0          4m10s
    f5-cert-manager-webhook-8554fd5b58-xc89x      1/1     Running   0          4m10s
    f5-coremond-7gqfp                             2/2     Running   0          2m54s
    f5-crdconversion-7df678d8fc-2vplv             1/1     Running   0          2m51s
    f5-rabbit-f9c58487c-vhtw2                     1/1     Running   0          2m53s
    f5-spk-cwc-669f8c9dc-ptjb2                    2/2     Running   0          2m52s
    f5-tmm-7b685cd57c-lp7cl                       0/4     Pending   0          2m9s
    f5-tmm-7b685cd57c-rq92s                       4/4     Running   0          2m9s
    f5-toda-fluentd-6bc5cb8bfb-wqsvx              1/1     Running   0          2m11s
    f5-toda-observer-788ddcd596-6qjpg             2/2     Running   0          2m12s
    f5-toda-stats-77cb79c44d-4cn4x                2/2     Running   0          2m25s
    otel-collector-5f48b7ccf7-s6wx7               1/1     Running   0          2m9s
    router                                        2/2     Running   0          2m27s
    server                                        1/1     Running   0          2m28s
    spk-f5ingress-797bdbb59-zssd6                 4/4     Running   0          2m9s
    
  2. Run the command to get the process list.

    kubectl exec <pod-name> -- ps aux
    

    Sample Output

    Defaulting container name to f5-observer.
    USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
    f5docker       1  0.0  0.0 711880  3024 ?        Ssl  09:45   0:00 /init
    f5docker      25  0.0  0.0   3024  1200 ?        S    09:45   0:00 s6-svscan -c30 -t0 /var/run/s6/services
    f5docker      27  0.0  0.0   3036  1264 ?        S    09:45   0:00 s6-supervise observer
    f5docker      28  0.0  0.0   3036  1268 ?        S    09:45   0:00 s6-supervise qkview-collect-daemon
    f5docker      29  1.2  0.3 1270624 49540 ?       Ssl  09:45   0:01 observer
    f5docker      30  0.0  0.0 1235736 9892 ?        Ssl  09:45   0:00 /usr/bin/qkview-collect-daemon
    f5docker     212  0.0  0.0   7072  1592 ?        Rs   09:47   0:00 ps aux
    
  3. Run the following command to send signal 11 (SIGSEGV) to a process, which terminates it and generates a core dump.

    kubectl exec <pod-name> -- kill -11 <process-id>
    

    Sample Output

    Defaulting container name to f5-observer out of: f5-toda-observer, fluentbit
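When scripting steps 2 and 3, the process ID can be picked out of the `ps aux` output programmatically. The following Python sketch does this for output shaped like the sample in step 2. The function name `find_pids` is an assumption, and the sketch only parses text; it does not call kubectl itself.

```python
import posixpath


def find_pids(ps_output, executable):
    """Return the PIDs whose command (argv[0] base name) equals `executable`."""
    pids = []
    in_table = False
    for line in ps_output.splitlines():
        # ps aux has 11 columns; the last one (COMMAND) may contain spaces
        fields = line.split(None, 10)
        if not in_table:
            # Skip kubectl notices until the ps header line appears
            in_table = fields[:2] == ["USER", "PID"]
            continue
        if len(fields) < 11:
            continue
        argv0 = fields[10].split()[0]
        if posixpath.basename(argv0) == executable:
            pids.append(int(fields[1]))
    return pids


SAMPLE_PS = """Defaulting container name to f5-observer.
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
f5docker      27  0.0  0.0   3036  1264 ?        S    09:45   0:00 s6-supervise observer
f5docker      29  1.2  0.3 1270624 49540 ?       Ssl  09:45   0:01 observer
f5docker      30  0.0  0.0 1235736 9892 ?        Ssl  09:45   0:00 /usr/bin/qkview-collect-daemon
"""

print(find_pids(SAMPLE_PS, "observer"))
# [29]
```

Matching on the base name of the first command token avoids selecting the `s6-supervise observer` supervisor process, which only mentions `observer` as an argument.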
    

Validate the Core File

To verify the generated core file, follow the instructions below:

  1. Run the following command to get the Coremond pod name.

    kubectl get pods -A
    

    Sample Output

      NAME                                          READY   STATUS    RESTARTS   AGE
      client                                        1/1     Running   0          2m28s
      dssm-f5-dssm-db-0                             2/2     Running   0          2m26s
      dssm-f5-dssm-db-1                             2/2     Running   0          96s
      dssm-f5-dssm-sentinel-0                       2/2     Running   0          2m26s
      dssm-f5-dssm-sentinel-1                       2/2     Running   0          90s
      f5-cert-manager-84f857f786-gk6xq              1/1     Running   0          4m10s
      f5-cert-manager-cainjector-695866d7ff-m2h2g   1/1     Running   0          4m10s
      f5-cert-manager-webhook-8554fd5b58-xc89x      1/1     Running   0          4m10s
      f5-coremond-7gqfp                             2/2     Running   0          2m54s
      f5-crdconversion-7df678d8fc-2vplv             1/1     Running   0          2m51s
      f5-rabbit-f9c58487c-vhtw2                     1/1     Running   0          2m53s
      f5-spk-cwc-669f8c9dc-ptjb2                    2/2     Running   0          2m52s
      f5-tmm-7b685cd57c-lp7cl                       0/4     Pending   0          2m9s
      f5-tmm-7b685cd57c-rq92s                       4/4     Running   0          2m9s
      f5-toda-fluentd-6bc5cb8bfb-wqsvx              1/1     Running   0          2m11s
      f5-toda-observer-788ddcd596-6qjpg             2/2     Running   0          2m12s
      f5-toda-stats-77cb79c44d-4cn4x                2/2     Running   0          2m25s
      otel-collector-5f48b7ccf7-s6wx7               1/1     Running   0          2m9s
      router                                        2/2     Running   0          2m27s
      server                                        1/1     Running   0          2m28s
      spk-f5ingress-797bdbb59-zssd6                 4/4     Running   0          2m9s
    
  2. Run the following command to view the Coremond logs and find the core file that was created.

    kubectl -n f5-utils logs <coremond-pod> -c f5-coremond
    

    Sample Output

    Defaulted container "f5-coremond" out of: f5-coremond, fluentbit, init-coremond-dir (init)
    2024-09-16 09:44:37,954 CRIT Supervisor is running as root.  Privileges were not dropped because no user is specified in the config file.  If you intend to run as root, you can set user=root in the config file to avoid this message.
    2024-09-16 09:44:37,957 INFO supervisord started with pid 1
    2024-09-16 09:44:38,960 INFO spawned: 'coremond' with pid 13
    2024-09-16 09:44:38,962 INFO spawned: 'qkview-collect' with pid 14
    2024/09/16 09:44:38 INFO qkview-collect f5-log-ID=15216-000000 lt=A msg="Details: Client config details {Base:/etc/qkview-collect Overlay:/etc/qkview-collect/qkview-collect.config.yml GlobalTimeout:-1s LocalTimeout:-1s Outfile:/tmp/qkview.tar.gz PkgType:container MaxFileSize:25 RemovePrivateKeyFromFiles:true} base config file..."
    2024/09/16 09:44:38 INFO qkview-collect f5-log-ID=15216-000000 lt=A msg="Details: Environment details &{IsDevVersion:false HostMode:false TLSCABundle:/etc/ssl/certs/ca-root-cert.pem TLSCertificateFile:/etc/ssl/certs/server-cert.pem TLSKeyFile:/etc/ssl/certs/server-key.pem TLSCertRetryWait:5s SecureOnly:true UsingCertOrchestrator:true ContainerName:f5-coremond GrpcPort:19891 MaxFileSize:25 BaseCfgPath:/etc/qkview-collect ContainerOverlayPath:/etc/qkview-collect/qkview-collect.config.yml TotalCollectionTimeout:-1s IndividualCmdTimeout:-1s Outfile:/tmp/qkview.tar.gz PkgType:container RemovePrivateKeyFromFiles:true} base config file..."
    2024/09/16 09:44:38 INFO qkview-collect f5-log-ID=15216-000000 lt=A msg="Info: Starting GRPC server in secured mode"
    2024/09/16 09:44:38 INFO qkview-collect f5-log-ID=15216-000000 lt=A msg="Info: starting secure server"
    "ts"="2024-09-16 09:44:39.000"|"l"="error"|"m"="failed to read levels file /logs/.minlevel.yaml: open /logs/.minlevel.yaml: no such file or directory"|"lt"="A"|"pod"="f5-coremond-7gqfp"|"ct"="f5-coremond"|"cv"="v0.5.12"|"ns"="default"|"v"="1.0"
    "ts"="2024-09-16 09:44:39.000"|"l"="info"|"m"="coremond started"|"lt"="A"|"version"="0.5.12"|"commitHash"="22bb5c8"|"buildDate"="2024-08-27T20:57:25Z"|"pod"="f5-coremond-7gqfp"|"ct"="f5-coremond"|"cv"="v0.5.12"|"ns"="default"|"v"="1.0"
    "ts"="2024-09-16 09:44:39.207"|"l"="error"|"m"="no such file or directory"|"lt"="A"|"path"="/logs/.minlevel.yaml"|"pod"="f5-coremond-7gqfp"|"ct"="f5-coremond"|"cv"="v0.5.12"|"ns"="default"|"v"="1.0"
    2024-09-16 09:44:40,209 INFO success: coremond entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
    2024-09-16 09:44:40,209 INFO success: qkview-collect entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
    "ts"="2024-09-16 09:47:47.997"|"l"="info"|"m"="new core file detected"|"lt"="A"|"file"="/var/crash/core.observer.29.f5-toda-observer-788ddcd596-6qjpg.1726480067"|"pod"="f5-coremond-7gqfp"|"ct"="f5-coremond"|"cv"="v0.5.12"|"ns"="default"|"v"="1.0"
    "ts"="2024-09-16 09:47:48.014"|"l"="info"|"m"="creating coredump"|"lt"="A"|"src"="/var/crash/core.observer.29.f5-toda-observer-788ddcd596-6qjpg.1726480067"|"dst"="/var/cores/core.f5-toda-observer.f5-toda-observer.observer.29.1726480067"|"pod"="f5-coremond-7gqfp"|"ct"="f5-coremond"|"cv"="v0.5.12"|"ns"="default"|"v"="1.0"
    
  3. Run the following command to validate the source core file generated by the kernel in /var/crash.

    kubectl -n f5-utils exec <coremond-pod> -- ls /var/crash
    

    Sample Output

    dev@datkube-devbox:~/ws/datkube$ oc exec f5-coremond-7gqfp -- ls /var/crash
    Defaulted container "f5-coremond" out of: f5-coremond, fluentbit, init-coremond-dir (init)
    core.observer.29.f5-toda-observer-788ddcd596-6qjpg.1726480067
    
  4. Run the following command to validate the processed core file created by Coremond in /var/cores.

    oc exec <coremond-pod> -- ls /var/cores

    Sample output:

    dev@datkube-devbox:~/ws/datkube$ oc exec f5-coremond-7gqfp -- ls /var/cores
    Defaulted container "f5-coremond" out of: f5-coremond, fluentbit, init-coremond-dir (init)
    core.f5-toda-observer.f5-toda-observer.observer.29.1726480067.gz
    core.f5-toda-observer.f5-toda-observer.observer.29.1726480067.gz.crc
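The Coremond log lines shown in step 2 use a `"key"="value"` layout with fields separated by `|`. As a convenience, the following Python sketch extracts the source and destination paths from the "creating coredump" entries. The field names come from the sample output above; the function names are assumptions, and the parser is a sketch that assumes values contain no double quotes.

```python
import re

# Matches one "key"="value" pair; values are assumed not to contain quotes
FIELD_RE = re.compile(r'"([^"]+)"="([^"]*)"')


def parse_log_line(line):
    """Return the fields of a Coremond log line as a dict (empty if none)."""
    return dict(FIELD_RE.findall(line))


def created_cores(log_text):
    """Return (src, dst) pairs for every "creating coredump" entry."""
    pairs = []
    for line in log_text.splitlines():
        fields = parse_log_line(line)
        if fields.get("m") == "creating coredump" and "src" in fields and "dst" in fields:
            pairs.append((fields["src"], fields["dst"]))
    return pairs


SAMPLE_LOG = (
    '2024-09-16 09:44:37,957 INFO supervisord started with pid 1\n'
    '"ts"="2024-09-16 09:47:48.014"|"l"="info"|"m"="creating coredump"|"lt"="A"'
    '|"src"="/var/crash/core.observer.29.f5-toda-observer-788ddcd596-6qjpg.1726480067"'
    '|"dst"="/var/cores/core.f5-toda-observer.f5-toda-observer.observer.29.1726480067"'
    '|"pod"="f5-coremond-7gqfp"\n'
)

for src, dst in created_cores(SAMPLE_LOG):
    print(src, "->", dst)
```

Lines from supervisord and qkview-collect do not follow the `"key"="value"` layout, so they produce an empty dict and are skipped.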