Prometheus Basics: Common Alerting Rules
Commonly used Prometheus alerting rules.
kubernetes
node
node memory usage above 90%
(node_memory_MemTotal_bytes{} - node_memory_MemAvailable_bytes{cluster="mota-gke"}) / node_memory_MemTotal_bytes{} * 100 > 90
node CPU usage above 90%
sum by (cluster, instance) (avg by (cluster, mode, instance) (rate(node_cpu_seconds_total{mode!="idle"}[2m]))) * 100 > 90
node 1-minute load per CPU core above 2
sum by (instance) (node_load1{}) / count by (instance) (node_cpu_seconds_total{mode="idle"}) > 2
node disk usage above 90%
(node_filesystem_size_bytes - (node_filesystem_avail_bytes{device=~"/dev/s.*"})) / node_filesystem_size_bytes * 100 > 90
node disk free inodes below 20% (inode usage above 80%)
node_filesystem_files_free{fstype!=""} / node_filesystem_files{fstype!=""} * 100 < 20
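The node expressions above are bare PromQL. To fire alerts they have to be wrapped in a rule file. The following is a minimal sketch of that wrapping for two of them; the group name, alert names, `for` durations, `severity` label and annotation text are illustrative assumptions, not part of the original rules.

```yaml
# Sketch of a Prometheus rule file. Names, durations and labels are assumptions.
groups:
  - name: node-alerts
    rules:
      - alert: NodeMemoryUsageHigh
        expr: |
          (node_memory_MemTotal_bytes{} - node_memory_MemAvailable_bytes{cluster="mota-gke"}) / node_memory_MemTotal_bytes{} * 100 > 90
        # The condition must hold for 5 minutes before the alert fires.
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'Memory usage on {{ $labels.instance }} is {{ printf "%.1f" $value }}%'
      - alert: NodeFreeInodesLow
        expr: |
          node_filesystem_files_free{fstype!=""} / node_filesystem_files{fstype!=""} * 100 < 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'Free inodes on {{ $labels.instance }} ({{ $labels.mountpoint }}) are down to {{ printf "%.1f" $value }}%'
```

A file like this can be validated with `promtool check rules <file>` before it is loaded.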
pod
pod in an abnormal phase (Pending, Unknown or Failed)
min_over_time(sum by (cluster, namespace, pod, phase) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[5m:1m]) > 0
pod CPU usage above 90% of its limit
(sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (cluster, pod, container) / sum(container_spec_cpu_quota{container!=""}/container_spec_cpu_period{container!=""}) by (cluster, pod, container) * 100) > 90
pod memory usage above 90% of its limit
(sum(container_memory_working_set_bytes{container!=""}) BY (cluster, container, pod) / sum(container_spec_memory_limit_bytes > 0) BY (cluster, container, pod) * 100) > 90
pod OOM-killed
(kube_pod_container_status_restarts_total{} - kube_pod_container_status_restarts_total{} offset 10m >= 1) and ignoring (reason) (min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) == 1)
pod restarting frequently
increase(kube_pod_container_status_restarts_total{cluster="mota-gke"}[15m]) > 1
pod not ready for a long time
created_by_kind!="Job" excludes pods created by Jobs, which are reported as not ready once they have completed.
min_over_time(sum by(namespace,host_ip,pod_ip,instance,pod,node)( kube_pod_info{created_by_kind!="Job"} AND ON (pod, namespace) kube_pod_status_ready{condition!="true"} == 1)[3m:1m])
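pod ephemeral storage usage
container_fs_usage_bytes reports the ephemeral filesystem space a container uses, in bytes. Many articles describe container_fs_limit_bytes as the container's filesystem limit, but in testing it returned the host's disk capacity. Further testing showed that kube_pod_container_resource_limits with resource="ephemeral_storage" holds the real ephemeral-storage limit. The query below is written for a Grafana dashboard and uses its variables ($Pod, $Container, $Node, $NameSpace).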
sum(container_fs_usage_bytes{pod=~"$Pod",container =~"$Container",container !="",container!="POD",node=~"^$Node$",namespace=~"$NameSpace"}) by (container,pod,node,namespace) / sum(kube_pod_container_resource_limits{resource="ephemeral_storage", pod=~"$Pod",container =~"$Container",container !="",container!="POD",node=~"^$Node$",namespace=~"$NameSpace"}) by (container,pod,node,namespace)
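Grafana variables such as `$Pod` are not available in Prometheus rule files, so the dashboard query cannot be used as an alert as written. The following is one possible alerting variant with the variable matchers dropped. The 90% threshold, the `for` duration and all names are assumptions, and it presumes, as the query above does, that both metrics carry a `node` label in this environment.

```yaml
# Sketch only: threshold, duration and names are assumptions.
groups:
  - name: pod-ephemeral-storage
    rules:
      - alert: PodEphemeralStorageUsageHigh
        expr: |
          sum by (container, pod, node, namespace) (
            container_fs_usage_bytes{container!="", container!="POD"}
          )
          /
          sum by (container, pod, node, namespace) (
            kube_pod_container_resource_limits{resource="ephemeral_storage", container!="", container!="POD"}
          )
          * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'Ephemeral storage of {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is at {{ printf "%.1f" $value }}% of its limit'
```

Containers without an ephemeral-storage limit have no series on the right-hand side, so they produce no result and never trigger this alert.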
deployment
Deployment available replicas differ from the desired count
kube_deployment_spec_replicas{} != kube_deployment_status_replicas_available{} != 0
ingress
nginx-ingress config reload failures (number of controller instances whose last reload failed)
count(nginx_ingress_controller_config_last_reload_successful{} == 0)
connections per nginx-ingress-controller instance
sum(nginx_ingress_controller_nginx_process_connections) by (controller_namespace,controller_pod)
number of processes in each nginx-ingress-controller instance
nginx_ingress_controller_nginx_process_num_procs
requests per second per nginx-ingress-controller instance
sum(irate(nginx_ingress_controller_requests{}[1m])) by (controller_namespace,controller_pod)
nginx-ingress request latency (P90, P95 and P99, in milliseconds)
histogram_quantile(0.90, sum(rate(nginx_ingress_controller_request_duration_seconds_bucket{status="200"}[1m])) by (le, ingress, host, path)) * 1000
histogram_quantile(0.95, sum(rate(nginx_ingress_controller_request_duration_seconds_bucket{status="200"}[1m])) by (le, ingress, host, path)) * 1000
histogram_quantile(0.99, sum(rate(nginx_ingress_controller_request_duration_seconds_bucket{status="200"}[1m])) by (le, ingress, host, path)) * 1000
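The three quantile queries above share the same inner `sum(rate(...))`. One way to avoid computing it three times is to precompute it with a recording rule and derive the quantiles from the recorded series. The following is a sketch of that idea; the group name and recorded metric names are hypothetical, loosely following the `level:metric:operations` naming convention.

```yaml
# Sketch only: group and recorded series names are assumptions.
groups:
  - name: nginx-ingress-latency
    rules:
      # Shared per-bucket request rate, aggregated as in the queries above.
      - record: ingress_host_path_le:nginx_ingress_controller_request_duration_seconds_bucket:rate1m
        expr: |
          sum by (le, ingress, host, path) (
            rate(nginx_ingress_controller_request_duration_seconds_bucket{status="200"}[1m])
          )
      # P99 latency in milliseconds, derived from the recorded rate.
      - record: ingress_host_path:nginx_ingress_controller_request_duration_milliseconds:p99
        expr: |
          histogram_quantile(0.99, ingress_host_path_le:nginx_ingress_controller_request_duration_seconds_bucket:rate1m) * 1000
```

P90 and P95 can be recorded the same way by changing the first argument of `histogram_quantile`.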
request success rate (percentage of responses that are not 4xx or 5xx)
sum(rate(nginx_ingress_controller_requests{status!~"[4-5].*"}[5m])) / sum(rate(nginx_ingress_controller_requests{}[5m])) * 100
general
availability
service availability over the last 7 days (fraction of successful scrapes)
sum_over_time(up{}[7d]) / count_over_time(up{}[7d])
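Because `up` is 1 for a successful scrape and 0 for a failed one, the ratio above equals `avg_over_time(up[7d])`. The following sketch turns it into an alert against an availability target. The 99.9% target, the names and the severity label are assumptions chosen for illustration.

```yaml
# Sketch only: the 99.9% target and all names are assumptions.
groups:
  - name: availability
    rules:
      - alert: AvailabilityBelowTarget
        # Same value as sum_over_time(up[7d]) / count_over_time(up[7d]), as a percentage.
        expr: avg_over_time(up{}[7d]) * 100 < 99.9
        labels:
          severity: warning
        annotations:
          summary: '7-day availability of {{ $labels.job }} / {{ $labels.instance }} is {{ printf "%.3f" $value }}%'
```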