How to join labels of 2 different metrics on PromQL

Contents

Look into the situation

Đầu tiên chúng ta có 2 metric labels như sau:

Metrics and Labels:

windows_container_cpu_usage_seconds_kernelmode: This metric likely represents the total CPU time (in seconds) that a Windows container has spent in kernel mode.
kube_pod_container_info: This metric provides information about Kubernetes pods and their containers.

Look into `kube_pod_container_info` metrics.

With kube_pod_container_info

container_id: A unique ID for the container.
container: The name of the container.
image: The image used by the container.
image_id: A unique ID for the image used by the container.
namespace: The namespace in which the pod resides.
pod: The name of the pod.
container_restart_count: The number of times this container has been restarted.
container_state: The state of the container, e.g., “running”, “waiting”, “terminated”.
node: The node where the pod is running.
pod_ip: The IP address of the pod.
pod_uid: The unique ID of the pod.

Look into `windows_container_cpu_usage_seconds_kernelmode`

windows_container_cpu_usage_seconds_kernelmode thì có container_id, nhưng lại không có namespace, pod, container
==> Bằng mắt: thì chúng ta không thể nào nhận biết được CPU usage này của container nào.

Từ đây chúng ta lấy namespace, pod, container của kube_pod_container_info sang windows_container_cpu_usage_seconds_kernelmode

Query sẽ là:

windows_container_cpu_usage_seconds_kernelmode * on(container_id) group_left(namespace, pod, container) kube_pod_container_info{container_id!=""}

on(container_id) => This is a join operation. It joins the windows_container_cpu_usage_seconds_kernelmode metric with the kube_pod_container_info metric based on the container_id label.

group_left(namespace, pod, container) ensures that the resulting metric retains the namespace, pod, and container labels from the kube_pod_container_info metric.

kube_pod_container_info được cung cấp bới kube-state-metrics component.

mà trong cluster mình cài 2 kube-state-metrics components:
– kubecost-nimtechnology-staging
– prometheus-nimtechnology-staging
Và khi query mình bị lỗi bên dưới.

The error message you’re seeing is due to a group_left join in a Prometheus query that results in duplicate time series on the right side. This is common when you’re trying to join two metrics on a specific label (in this case, container_id), and the right metric (kube_pod_container_info in your query) has multiple entries with the same container_id.

mình đã sửa bằng cách: extra the Query: app_kubernetes_io_instance=”prometheus-nimtechnology-staging”

windows_container_cpu_usage_seconds_kernelmode * on(container_id) group_left(namespace, pod, container) kube_pod_container_info{app_kubernetes_io_instance="prometheus-nimtechnology-staging",container_id!=""}

Giờ chart trên grafana:
https://github.com/prometheus-community/windows_exporter/blob/master/docs/collector.container.md

windows_container_cpu_usage_seconds_total có type là counter

windows_container_cpu_usage_seconds_kernelmode * on(container_id) group_left(namespace, pod, container) kube_pod_container_info{app_kubernetes_io_instance="prometheus-nimtechnology-staging",container_id!=""}

Tiếp theo mình sẽ sử dụng rate để vẻ bảng chart.

rate(windows_container_cpu_usage_seconds_kernelmode * on(container_id) group_left(namespace, pod, container) kube_pod_container_info{app_kubernetes_io_instance="prometheus-nimtechnology-staging",container_id!=""}[1m])

chart khi có **rate(…[1m])** thì nó lại bị **spike**

https://stackoverflow.com/questions/34923788/prometheus-convert-cpu-user-seconds-to-cpu-usage
mình giảm 1m -> 30s

rate(windows_container_cpu_usage_seconds_kernelmode * on(container_id) group_left(namespace, pod, container) kube_pod_container_info{app_kubernetes_io_instance="prometheus-nimtechnology-staging",container_id!=""}[30])