Kubernetes Monitoring: Monitoring a Kubernetes Cluster with Prometheus

Monitoring a Kubernetes cluster with a self-hosted Prometheus.

Motivation

Google's GKE is not as polished an experience as the domestic cloud vendors, and monitoring is not straightforward to set up. The official documentation is all written around Google's own cloud products, and those products are not cheap either. So we run a self-hosted Prometheus outside the cluster and connect it to GKE to implement monitoring and alerting.

To monitor a GKE cluster this way, three components need to be configured:

  • kube-state-metrics

  • cadvisor

  • node-exporter

I originally wrote these notes for GKE, but later found that the same approach works just as well for self-hosted Kubernetes clusters.

sa token

Create a ServiceAccount and grant it the necessary permissions. For convenience, I simply bound it to cluster-admin here.

kubectl apply -f - <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ksa-prometheus
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-admins
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: ksa-prometheus
  namespace: default
EOF

Newer Kubernetes versions no longer generate a secret automatically when a ServiceAccount is created. If you do need a token, you have to create the secret yourself and let the controller populate it.

kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  annotations:
    kubernetes.io/service-account.name: ksa-prometheus
  name: ksa-prometheus-token
  namespace: default
type: kubernetes.io/service-account-token
EOF

Once you have the token, the remaining steps can proceed.
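Once the controller has populated the secret, the token can be dumped to the file the later Prometheus configs read. A sketch (the secret name matches the one created above; the output path is an assumption, adjust it to your setup):

```shell
# Extract the service-account token from the secret and save it
# where Prometheus can read it.
kubectl get secret ksa-prometheus-token -n default \
  -o jsonpath='{.data.token}' | base64 -d > /etc/prometheus/gke.token
```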

kube-state-metrics

kube-state-metrics is the key component for pod-related metrics.

Deployment

Clone the kube-state-metrics repository and modify examples/standard/service.yaml to match how you want to expose the service in your cluster; I used a load balancer here.

git clone https://github.com/kubernetes/kube-state-metrics.git

cd kube-state-metrics
kubectl apply -f examples/standard
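The service change mentioned above might look like this (a sketch; the LoadBalancer type is what I used, your environment may call for NodePort or an Ingress instead, and the labels follow the upstream manifests):

```yaml
# examples/standard/service.yaml, exposed via a cloud load balancer.
# The upstream file sets clusterIP: None; drop that line when
# switching the type to LoadBalancer.
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: kube-system
  labels:
    app.kubernetes.io/name: kube-state-metrics
spec:
  type: LoadBalancer
  ports:
  - name: http-metrics
    port: 8080
    targetPort: http-metrics
  selector:
    app.kubernetes.io/name: kube-state-metrics
```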

Configuration

Prometheus configuration.

- job_name: 'kube-state-metrics'
  honor_timestamps: true
  scrape_interval: 60s
  scrape_timeout: 30s
  metrics_path: /metrics
  scheme: http
  static_configs:
    - targets: ['10.131.xxx.xxx:8080']
  metric_relabel_configs:
  - target_label: cluster
    replacement: gke
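Before wiring up Prometheus, the target can be sanity-checked by hand (the address is the load-balancer IP from the static config above, with the elided octets left as-is):

```shell
# Expect Prometheus text-format metrics such as kube_pod_status_phase
curl -s http://10.131.xxx.xxx:8080/metrics | head
```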

cadvisor

cadvisor is the key component for container-related metrics.

Prometheus configuration.

- job_name: k8s-cadvisor
  honor_timestamps: true
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
  - api_server: https://34.xxx.xxx.xxx
    role: node
    bearer_token_file: /etc/prometheus/gke.token
    tls_config:
      insecure_skip_verify: true
  bearer_token_file: /etc/prometheus/gke.token
  tls_config:
    insecure_skip_verify: true
  relabel_configs:
  - source_labels: ['__meta_kubernetes_node_label_kubernetes_io_hostname']
    target_label: node
  - separator: ;
    regex: __meta_kubernetes_node_label_(.+)
    replacement: $1
    action: labelmap
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: 34.xxx.xxx.xxx
    action: replace
  - source_labels: [__meta_kubernetes_node_name]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    action: replace
  metric_relabel_configs:
  - target_label: cluster
    replacement: gke
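The __metrics_path__ rewrite above means every scrape goes through the apiserver's node proxy rather than hitting the kubelets directly. The same path can be tested by hand (the node name is a placeholder; the apiserver address is the elided one from the config):

```shell
# Fetch cadvisor metrics for one node through the apiserver proxy
curl -sk -H "Authorization: Bearer $(cat /etc/prometheus/gke.token)" \
  "https://34.xxx.xxx.xxx/api/v1/nodes/<node-name>/proxy/metrics/cadvisor" | head
```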

node-exporter

node-exporter collects metrics from the host machines.

Deploy node-exporter.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  namespace: gmp-public
  name: node-exporter
  labels:
    app.kubernetes.io/name: node-exporter
    app.kubernetes.io/version: 1.3.1
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-exporter
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 10%
  template:
    metadata:
      labels:
        app.kubernetes.io/name: node-exporter
        app.kubernetes.io/version: 1.3.1
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values:
                - arm64
                - amd64
              - key: kubernetes.io/os
                operator: In
                values:
                - linux
      containers:
      - name: node-exporter
        image: quay.io/prometheus/node-exporter:v1.3.1
        args:
        - --web.listen-address=:9100
        - --path.sysfs=/host/sys
        - --path.rootfs=/host/root
        - --no-collector.wifi
        - --no-collector.hwmon
        - --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+|var/lib/kubelet/pods/.+)($|/)
        - --collector.netclass.ignored-devices=^(veth.*)$
        - --collector.netdev.device-exclude=^(veth.*)$
        ports:
        - name: metrics
          containerPort: 9100
        resources:
          limits:
            memory: 180Mi
          requests:
            cpu: 102m
            memory: 180Mi
        volumeMounts:
        - mountPath: /host/sys
          mountPropagation: HostToContainer
          name: sys
          readOnly: true
        - mountPath: /host/root
          mountPropagation: HostToContainer
          name: root
          readOnly: true
      hostNetwork: true
      hostPID: true
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534
      volumes:
      - hostPath:
          path: /sys
        name: sys
      - hostPath:
          path: /
        name: root

Prometheus configuration.

- job_name: gke-node
  scheme: http
  tls_config:
    insecure_skip_verify: true
  bearer_token_file: /etc/prometheus/gke.token
  kubernetes_sd_configs:
  - role: node
    api_server: https://34.xxx.xxx.xxx
    tls_config:
      insecure_skip_verify: true
    bearer_token_file: /etc/prometheus/gke-mota.token
  relabel_configs:
    - source_labels: [__address__]
      regex: '(.*):10250'
      replacement: '${1}:9100'
      target_label: __address__
      action: replace
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)
  metric_relabel_configs:
  - target_label: cluster
    replacement: gke
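The first relabel rule is what points Prometheus at node-exporter: node discovery returns kubelet addresses on port 10250, and the regex capture rewrites them to port 9100. The substitution is equivalent to this sed expression (a standalone illustration, not part of the setup):

```shell
# Same rewrite the relabel_configs rule performs on __address__
printf '10.128.0.5:10250' | sed -E 's/^(.*):10250$/\1:9100/'
# -> 10.128.0.5:9100
```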

apiserver

Prometheus configuration.

- job_name: 'k8s-apiservers'
  scheme: https
  tls_config:
    insecure_skip_verify: true
  bearer_token_file: /prometheus/tokens/dev-xxx.token
  kubernetes_sd_configs:
  - role: endpoints
    api_server: https://34.xxx.xxx.xxx
    tls_config:
      insecure_skip_verify: true
    bearer_token_file: /prometheus/tokens/dev-xxx.token
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
    action: keep
    regex: default;kubernetes;https
  metric_relabel_configs:
  - target_label: cluster
    replacement: dev-xxx
  - target_label: env
    replacement: dev
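The keep rule above matches only the default/kubernetes endpoint, i.e. the apiserver itself. The endpoint can be checked directly with the token (a sketch; the elided address and token path are kept as in the config):

```shell
# Expect apiserver metrics such as apiserver_request_total
curl -sk -H "Authorization: Bearer $(cat /prometheus/tokens/dev-xxx.token)" \
  https://34.xxx.xxx.xxx/metrics | head
```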

etcd

Note: this section does not apply to GKE; it covers etcd monitoring for self-hosted clusters.

etcd takes a few extra steps: first associate a Service with the etcd endpoints, then download the etcd certificates to the Prometheus server.
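On a kubeadm-built cluster, the client certificate referenced later normally lives under /etc/kubernetes/pki/etcd on the master nodes. Copying it to the Prometheus server might look like this (a sketch assuming kubeadm default paths and root SSH access to one of the masters listed below):

```shell
# Copy the etcd CA and healthcheck client cert/key from a master node
mkdir -p ./certs
scp root@192.168.2.232:/etc/kubernetes/pki/etcd/ca.crt ./certs/
scp root@192.168.2.232:/etc/kubernetes/pki/etcd/healthcheck-client.crt ./certs/
scp root@192.168.2.232:/etc/kubernetes/pki/etcd/healthcheck-client.key ./certs/
```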

Associate the Service with the endpoints.

apiVersion: v1
kind: Service
metadata:
  name: kube-etcd-svc
  namespace: kube-system
  labels:
    component: etcd
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: port
    port: 2379
    protocol: TCP

---
apiVersion: v1
kind: Endpoints
metadata:
  name: kube-etcd-svc
  namespace: kube-system
  labels:
    component: etcd
subsets:
- addresses:
  - ip: 192.168.2.232
  - ip: 192.168.2.233
  - ip: 192.168.2.234
  ports:
  - name: port
    port: 2379
    protocol: TCP

Prometheus configuration.

- job_name: "k8s-etcd"
  scheme: https
  tls_config:
    ## Path to the etcd certificates (file paths inside the Prometheus container)
    ca_file: ./certs/ca.crt
    cert_file: ./certs/healthcheck-client.crt
    key_file: ./certs/healthcheck-client.key
    insecure_skip_verify: false
  kubernetes_sd_configs:
  ## Service discovery: restrict to the namespace that holds the etcd Service
  - role: endpoints
    api_server: https://34.xxx.xxx.xxx
    tls_config:
      insecure_skip_verify: true
    namespaces:
      names: ["kube-system"]
    bearer_token_file: /prometheus/tokens/dev-xxx.token
  relabel_configs:
  ## Keep only targets from services whose component label equals etcd
  - action: keep
    source_labels: [__meta_kubernetes_service_label_component]
    regex: etcd
  metric_relabel_configs:
  - target_label: cluster
    replacement: dev-xxx
  - target_label: env
    replacement: dev

coredns

Kubernetes already provides the kube-dns Service, so unlike etcd there is no need to create one; we can use it directly.

  - job_name: "k8s-coredns"
    tls_config:
      insecure_skip_verify: true
    kubernetes_sd_configs:
    - role: endpoints
      api_server: https://devk8s.yixiukeji.com.cn:6443
      tls_config:
        insecure_skip_verify: true
      namespaces:
        names: ["kube-system"]
      bearer_token_file: /prometheus/tokens/dev-yixiukeji.token
    relabel_configs:
    - action: keep
      source_labels: [__meta_kubernetes_service_label_k8s_app, __meta_kubernetes_endpoint_port_name]
      regex: kube-dns;metrics
    metric_relabel_configs:
    - target_label: cluster
      replacement: dev-yixiukeji
    - target_label: env
      replacement: dev
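The kube-dns Service exposes CoreDNS metrics on the port named metrics (9153 on a standard kubeadm install, matching the keep rule above). A quick port-forward confirms the endpoint before pointing Prometheus at it (assumes kubectl access to the cluster):

```shell
# Expect CoreDNS metrics such as coredns_dns_requests_total
kubectl -n kube-system port-forward svc/kube-dns 9153:9153 &
sleep 2
curl -s http://127.0.0.1:9153/metrics | head
```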

scheduler

On each master node, edit /etc/kubernetes/manifests/kube-scheduler.yaml and change the listen address.

...
- --bind-address=0.0.0.0
...

A Service is also needed so Prometheus can scrape it. Note that the port used to be 10251 in older versions; newer versions use 10259.

apiVersion: v1
kind: Service
metadata:
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler
  namespace: kube-system
spec:
  internalTrafficPolicy: Cluster
  ipFamilies:
    - IPv4
  ipFamilyPolicy: SingleStack
  ports:
    - name: http-metrics
      port: 10259
      protocol: TCP
      targetPort: 10259
  selector:
    component: kube-scheduler
  sessionAffinity: None
  type: ClusterIP

Prometheus configuration.

  - job_name: "k8s-scheduler"
    scheme: https
    bearer_token_file: /prometheus/tokens/dev-yixiukeji.token
    tls_config:
      insecure_skip_verify: true
    kubernetes_sd_configs:
    - role: endpoints
      api_server: https://devk8s.yixiukeji.com.cn:6443
      tls_config:
        insecure_skip_verify: true
      namespaces:
        names: ["kube-system"]
      bearer_token_file: /prometheus/tokens/dev-yixiukeji.token
    relabel_configs:
    - action: keep
      source_labels: [__meta_kubernetes_service_label_k8s_app, __meta_kubernetes_endpoint_port_name]
      regex: kube-scheduler;http-metrics
    metric_relabel_configs:
    - target_label: cluster
      replacement: dev-yixiukeji
    - target_label: env
      replacement: dev
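With the bind address opened up, the secure metrics endpoint can be checked from a machine that reaches the masters (10259 serves HTTPS and requires a token; the master IP is a placeholder):

```shell
# Query kube-scheduler metrics on the secure port of a master node
curl -sk -H "Authorization: Bearer $(cat /prometheus/tokens/dev-yixiukeji.token)" \
  https://<master-ip>:10259/metrics | head
```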

Summary

All in all, this is not as convenient as the Prometheus Operator, but it is nothing we can't handle; it just takes a bit more care. The next step is building the monitoring dashboards.