Distributed TODA for Stats Aggregation

Overview

Cloud-Native Network Functions (CNFs) generate massive amounts of data, at rates of terabytes per second. To manage this high volume of statistics efficiently, the Distributed Telemetry over Data Aggregation (TODA) for Stats Aggregation system has been enhanced with four primary pods: Receiver, Observer, Coordinator (Operator), and TMM Scraper.

Distributed TODA Pod Roles

Following are the key responsibilities of each Distributed TODA pod:

  • Receiver: The receiver runs as a StatefulSet. It collects metrics from TMM, stores them persistently, and sends them to the Observer over gRPC (Remote Procedure Call) with mutual TLS (mTLS) for secure aggregation.

  • Observer (Aggregator): The observer runs as a StatefulSet. It aggregates the metrics received from the Receivers across multiple TMMs and securely forwards them to the OTEL collector over gRPC with mTLS for further aggregation and standardization.

  • Coordinator (Operator): The operator oversees the entire metric collection and aggregation process. It coordinates the collection and aggregation of resources with corresponding requests over gRPC with mTLS, ensuring efficient and secure metrics flow.

  • TMM Scraper: TMM Scraper is an observer container that runs inside each TMM pod, replacing the tmstatsd tool. It directly serves metrics from tmctl over a gRPC response stream upon receiving requests from the Receiver.
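
The Observer's aggregation step can be sketched as summing the per-TMM counter snapshots for the same metric and attribute set into one system-wide value. The following minimal Python illustration uses hypothetical in-memory snapshots; the real pods exchange these values over gRPC with mTLS:

```python
from collections import defaultdict

def aggregate(tmm_snapshots):
    """Sum counter values reported by each TMM, keyed by (metric, attributes)."""
    totals = defaultdict(int)
    for snapshot in tmm_snapshots:
        for metric_name, attrs, value in snapshot:
            totals[(metric_name, attrs)] += value
    return dict(totals)

# Hypothetical snapshots from two TMM pods for the same virtual server.
tmm_a = [("f5.virtual_server.clientside.received.bytes",
          (("f5.virtual_server.name", "vs-1"),), 1200)]
tmm_b = [("f5.virtual_server.clientside.received.bytes",
          (("f5.virtual_server.name", "vs-1"),), 800)]

print(aggregate([tmm_a, tmm_b]))
# → one system-wide total of 2000 bytes for vs-1
```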

Metrics Flow Architecture

This section highlights the metrics flow architecture for V2 and V1 metrics.

V2 Metrics

The following image depicts the V2 metrics flow architecture:

Example:

Following is an example of the virtual server clientside.received.bytes V2 metric collected from OTEL.

{
    "resourceMetrics": [
        {
            "scopeMetrics": [
                {
                    "scope": {
                        "name": "io.f5.toda.observer",
                        "version": "5.5.30"
                    },
                    "metrics": [
                        {
                            "name": "f5.virtual_server.clientside.received.bytes",
                            "unit": "By",
                            "sum": {
                                "dataPoints": [
                                    {
                                        "attributes": [
                                            {
                                                "key": "f5.virtual_server.destination",
                                                "value": {
                                                    "stringValue": "00:00:00:00:00:00:00:00:00:00:FF:FF:37:37:37:01:00:00:00:00"
                                                }
                                            },
                                            {
                                                "key": "f5.virtual_server.name",
                                                "value": {
                                                    "stringValue": "spk-app-1-spk-app-tcp-8050-f5ing-testapp-virtual-server"
                                                }
                                            },
                                            {
                                                "key": "f5.virtual_server.source",
                                                "value": {
                                                    "stringValue": "00:00:00:00:00:00:00:00:00:00:FF:FF:00:00:00:00:00:00:00:00"
                                                }
                                            }
                                        ],
                                        "startTimeUnixNano": "1751183898216943595",
                                        "timeUnixNano": "1751889465853307727",
                                        "asInt": "0"
                                    }
                                ],
                                "aggregationTemporality": 2,
                                "isMonotonic": true
                            }
                        }
                    ]
                }
            ]
        }
    ]
}
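
A V2 payload like the one above can be flattened programmatically. This sketch walks the resourceMetrics structure shown in the example and extracts one row per sum data point; the abbreviated doc dict below is a stand-in for a payload parsed with json.load:

```python
def flatten_v2(payload):
    """Extract (metric_name, attributes, value) from each sum data point."""
    rows = []
    for rm in payload.get("resourceMetrics", []):
        for sm in rm.get("scopeMetrics", []):
            for metric in sm.get("metrics", []):
                for dp in metric.get("sum", {}).get("dataPoints", []):
                    attrs = {a["key"]: a["value"]["stringValue"]
                             for a in dp.get("attributes", [])}
                    rows.append((metric["name"], attrs, int(dp["asInt"])))
    return rows

# Abbreviated stand-in for an OTLP JSON payload such as the example above.
doc = {"resourceMetrics": [{"scopeMetrics": [{"metrics": [{
    "name": "f5.virtual_server.clientside.received.bytes",
    "sum": {"dataPoints": [{
        "attributes": [{"key": "f5.virtual_server.name",
                        "value": {"stringValue": "vs-1"}}],
        "asInt": "0"}]}}]}]}]}

print(flatten_v2(doc))
# → [('f5.virtual_server.clientside.received.bytes', {'f5.virtual_server.name': 'vs-1'}, 0)]
```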

V1 Metrics

In the V1 metrics flow architecture, metrics are streamed directly from TMM by tmstatsd to OTEL, without aggregation.

The following image depicts the V1 metrics flow architecture:

Example:

Following is an example of the virtual server clientside.received.bytes V1 metric collected from OTEL.


{
    "resourceMetrics": [
        {
            "resource": {
                "attributes": [
                    {
                        "key": "host.name",
                        "value": {
                            "stringValue": "f5-tmm-fb54985cc-6nbl2"
                        }
                    }
                ]
            },
            "scopeMetrics": [
                {
                    "scope": {
                        "name": "demo-client-meter"
                    },
                    "metrics": [
                        {
                            "name": "virtual_server_stat/spk-app-1-spk-app-tcp-8050-f5ing-testapp-virtual-server/clientside.bytes_out",
                            "description": "TMM tmstatsd: table[virtual_server_stat] row[spk-app-1-spk-app-tcp-8050-f5ing-testapp-virtual-server] column[clientside.bytes_out] type:[Gauge] metric[virtual_server_stat/spk-app-1-spk-app-tcp-8050-f5ing-testapp-virtual-server/clientside.bytes_out]",
                            "gauge": {
                                "dataPoints": [
                                    {
                                        "attributes": [
                                            {
                                                "key": "column",
                                                "value": {
                                                    "stringValue": "clientside.bytes_out"
                                                }
                                            },
                                            {
                                                "key": "name",
                                                "value": {
                                                    "stringValue": "spk-app-1-spk-app-tcp-8050-f5ing-testapp-virtual-server"
                                                }
                                            },
                                            {
                                                "key": "tableName",
                                                "value": {
                                                    "stringValue": "virtual_server_stat"
                                                }
                                            },
                                            {
                                                "key": "tmmID",
                                                "value": {
                                                    "stringValue": "f5-tmm-fb54985cc-6nbl2"
                                                }
                                            }
                                        ],
                                        "startTimeUnixNano": "11651379494838206464",
                                        "timeUnixNano": "1751891173853319814",
                                        "asInt": "0"
                                    }
                                ]
                            }
                        }
                    ]
                }
            ],
            "schemaUrl": "https://opentelemetry.io/schemas/1.17.0"
        }
    ]
}

Advantages of V2 Metrics over V1 Metrics

The V2 metrics system introduces several enhancements over the V1 metrics system, improving standardization, aggregation, and descriptive capabilities across the telemetry data. A few of the advantages are listed here:

  • V2 metrics are aggregated across all TMMs, providing a unified view of performance and resource usage across the entire system.

  • Each metric in V2 is standardized with fixed names, ensuring consistency and simplifying interpretation across different platforms and tools.

  • V2 metrics include additional labels (also known as OpenTelemetry (OTEL) attributes) to provide detailed descriptions for each metric, enabling better context and improved observability.
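
The contrast between the two naming schemes is visible in the examples above: V1 packs the table, row, and column into one path-style metric name, while V2 carries a fixed name plus attributes. This illustrative sketch splits a V1 name back into its parts, assuming the row name itself contains no slash, as in the examples:

```python
def split_v1_name(metric_name):
    """Split a V1 'table/row/column' metric name into its components.
    Assumes the row name contains no '/', as in the examples above."""
    table, row, column = metric_name.split("/", 2)
    return {"tableName": table, "name": row, "column": column}

v1 = ("virtual_server_stat/"
      "spk-app-1-spk-app-tcp-8050-f5ing-testapp-virtual-server/"
      "clientside.bytes_out")
print(split_v1_name(v1))
# → {'tableName': 'virtual_server_stat',
#    'name': 'spk-app-1-spk-app-tcp-8050-f5ing-testapp-virtual-server',
#    'column': 'clientside.bytes_out'}
```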

Distributed TODA Pods

This section outlines the procedures to install the required stats environment for V2 and V1 metrics.

Installation on V2

Following is the procedure to install the stats infrastructure for V2 metrics:

Note: Install the Observer in the same namespace as the F5Ingress.

  1. Enable the V2 Metrics on TMM during the installation of the f5ingress pod. For information on how to enable V2 Metrics, see step 11 in the TMM Values of BIG-IP Controller.

  2. Change into the directory containing the latest CNFs Software, and obtain the f5-toda-observer Helm chart version.

    In this example, the CNF files are in the cnfinstall directory:

    cd cnfinstall
    
    ls -1 tar | grep observer
    

    The example output should appear similar to the following:

    f5-toda-observer-v4.56.4-0.0.15.tgz
    
  3. Create a Helm values file named observer_values.yaml and set the image.repository and fluentbit_sidecar.image.repository parameters.

    image:
      repository: registry.com
    
    persistence:
       storageClassName: ""
       accessMode: ReadWriteOnce
       # size: 3Gi
    
    platformType: "robin"
    
    fluentbit_sidecar:
      image:
        repository: registry.com
      fluentbit:
        tls:
          enabled: true
      fluentd:       
        host: f5-toda-fluentd.cnf-gateway.svc.cluster.local.
    

    Note: If the persistence profile is not defined (default) or explicitly set to null, the storageClassName specification will not be set, and the default provisioner will be used.

  4. Install the observer using helm.

    helm install observer f5-toda-observer-<VERSION>.tgz -f observer_values.yaml
    

    Note: Run the helm show values <observer-chart> command to view advanced options.

    Important: The Operator and Receivers share the same volume. When they both are deployed on the same node, any storage class is supported. However, if Receivers are distributed across multiple nodes, a ReadWriteMany-compatible storage class, such as NFS, is required.

Installation on V1

Enable the V1 Metrics on TMM during the installation of the f5ingress pod. For information on how to enable V1 Metrics, see step 11 in the TMM Values of BIG-IP Controller.

OTEL Statistics

The full list of OTEL statistics can be reviewed here.

Prometheus and Grafana

Following are a few examples of how to view metrics using Prometheus and Grafana:

Prometheus Integration for Metrics

This section describes the procedure to expose Metrics to Prometheus. The commands and Custom Resources (CRs) provided in this section are for the OpenTelemetry (OTEL) running in the default namespace. If OTEL is running in a different namespace, modify the commands and CRs accordingly.

Note: To create required certificates for OTEL to communicate with third party applications such as Prometheus, see OTEL Collectors section.

  1. Create a dedicated namespace for Prometheus.

    oc create namespace prometheus
    
  2. Create a valid certificate for Prometheus using cert-manager.

    a. Copy the following data into a prom-certs.yaml file and replace arm-ca-cluster-issuer with the name of your CA issuer. You can modify other fields as needed.

    apiVersion: cert-manager.io/v1
    kind: Certificate
    metadata:
      name: prometheus-client
      namespace: prometheus
    spec:
      subject:
        countries:
          - US
        provinces:
          - Washington
        localities:
          - Seattle
        organizations:
          - F5 Networks
        organizationalUnits:
          - PD
      emailAddresses:
        - clientcert@f5net.com
      commonName: f5net.com
      secretName: prometheus-client-secret
      issuerRef:
        name: arm-ca-cluster-issuer
        group: cert-manager.io
        kind: ClusterIssuer
      # Lifetime of the certificate; 2160h = 90 days
      duration: 2160h
      privateKey:
        rotationPolicy: Always
        encoding: PKCS1
        algorithm: RSA
        size: 4096
    

    b. Apply the certificate manifest using oc command.

    oc apply -f prom-certs.yaml
    
  3. Create a custom values.yaml file to configure and customize the Prometheus installation using Helm, and copy the following data into it.

    prometheus-pushgateway:
      enabled: false
    
    prometheus-node-exporter:
      enabled: false 
    
    kube-state-metrics:
      enabled: false 
    
    alertmanager:
      enabled: false 
    
    configmapReload:
      prometheus: 
        enabled: false 
    serverFiles:
      prometheus.yml:
        scrape_configs:
          - job_name: bnk-otel
            scheme: https
            static_configs:
              - targets:
                  - otel-collector-svc.default.svc.cluster.local:9090
            tls_config:
              cert_file: /etc/prometheus/certs/tls.crt
              key_file: /etc/prometheus/certs/tls.key
              ca_file: /etc/prometheus/certs/ca.crt
              insecure_skip_verify: false
    server:
      extraVolumes:
        - name: prometheus-tls
          secret:
            secretName: prometheus-client-secret
      extraVolumeMounts:
        - name: prometheus-tls
          mountPath: /etc/prometheus/certs
          readOnly: true
      global:
        scrape_interval: 10s
      service:
        type: "NodePort"
        nodePort: 31929
      configmapReload:
        enabled: false
      persistentVolume:
        enabled: false
    
  4. Deploy Prometheus with Helm.

    helm install prometheus oci://ghcr.io/prometheus-community/charts/prometheus -n prometheus --atomic -f values.yaml
    
  5. Ensure the scrape configuration is active, status is healthy, and there are no errors.

    curl http://172.18.0.4:31929/api/v1/targets | jq
    

    Note: Replace 172.18.0.4 with the IP address of the node where Prometheus is running.

  6. Run the following command to list all metrics currently ingested by Prometheus.

    curl http://172.18.0.4:31929/api/v1/label/__name__/values | jq
    

    Note: Replace 172.18.0.4 with the IP address of the node where Prometheus is running.

  7. Get total server-side connections for all pool members.

    curl "http://172.18.0.4:31929/api/v1/query?query=f5_tmm_f5_pool_member_serverside_connections_count_total" | jq
    

    Note: Replace 172.18.0.4 with the IP address of the node where Prometheus is running.
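
The query in step 7 can also be constructed programmatically. This sketch builds the Prometheus instant-query URL with Python's urllib; the node address is the same placeholder used in the steps above:

```python
from urllib.parse import urlencode

def prom_query_url(base, expr):
    """Build a Prometheus HTTP API instant-query URL for a PromQL expression."""
    return f"{base}/api/v1/query?{urlencode({'query': expr})}"

url = prom_query_url(
    "http://172.18.0.4:31929",
    "f5_tmm_f5_pool_member_serverside_connections_count_total")
print(url)
# → http://172.18.0.4:31929/api/v1/query?query=f5_tmm_f5_pool_member_serverside_connections_count_total
```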

Grafana

Grafana can be connected to Prometheus to display dashboards based on the metrics sent to Prometheus. Here is an example of a Grafana dashboard for Virtual Server metrics.
