day06 prometheus

Prometheus

A new generation of container-cloud monitoring: prometheus + grafana

1. The concept of cloud-native (Cloud Native) applications

Container packaging decouples an application from the infrastructure it used to depend on, so it no longer has a hard dependency on the operating system.
The packaging is uniform: docker run.

Configuration is a set of configurable variables kept separate from the program; the same program behaves differently under different configurations.

  • Pass the program's configuration into the container through environment variables (docker run -e VAR=value)
  • Make the program's configuration take effect through startup arguments (java -D... -jar a.jar)
  • Manage the configuration centrally (CRUD) in a configuration center (k8s has ConfigMap)
    With one image, different environments can be served by different environment variables plus startup arguments (see the sketch below)
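
A minimal sketch of the same image taking different configuration in different environments (the image name, variable and property below are made up for illustration):

docker run -e SPRING_PROFILES_ACTIVE=prod harbor.od.com/public/demo:v1          # config through an environment variable
docker run harbor.od.com/public/demo:v1 java -Dserver.port=8081 -jar a.jar      # config through a startup argument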

2. What a DevOps engineer should do

  • Containerize the company's in-house applications (repackage them with Docker)
  • Push containerized applications toward cloud-native applications (build once, run anywhere)
  • Use a container orchestration framework (Kubernetes) to orchestrate business containers in a reasonable, standardized, professional way

3. Course outline

4. Prometheus (a project of the CNCF, the Cloud Native Computing Foundation, which also hosts k8s)

Written by former Google engineers.
Website: https://prometheus.io
Source code: https://github.com/prometheus/prometheus

5. Features of Prometheus

  • Multi-dimensional data model: time series identified by a metric name and key/value labels (similar to multiple columns in HBase)
  • Built-in time series database: TSDB, a time series database (compare HBase, MongoDB); what relational databases such as MySQL do best is strong transactions
  • PromQL: a flexible query language that uses the multi-dimensional data for complex queries (different search and filter conditions retrieve the monitored metrics, identified by metric name)
This is still fairly hard, and quite different from the structured query language (SQL) you already know.

A set of query conditions filters out not a single number but a function (a set of time series).

  • Time series data is collected over HTTP with a pull model (exporters)
Prometheus also has a client/server structure; the client-side agents are called exporters, and data is pulled from them.
How does an exporter differ from a zabbix agent?
An exporter only exposes an HTTP interface; for MySQL and Redis you download dedicated exporters, configure them to connect to MySQL/Redis, and the exporter then exposes its own HTTP endpoint.
Monitoring a container cloud requires 4 exporters.
  • The PushGateway component is also supported for collecting data
Only for specific scenarios:
short-lived jobs have to push their data actively, which is what PushGateway is for (see the sketch after this list).
Rarely used in production.
  • Targets are discovered through service discovery or static configuration
Static configuration is the easy part, about 3% of the course.
Service discovery (auto-discovery) is about 97% of the course.
  • Multiple graphing modes and dashboard support
  • Can be plugged into Grafana as a data source
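
A minimal sketch of that push model, assuming a Pushgateway were reachable at a made-up address (it is not deployed in this course's environment); the metric and job names are purely illustrative:

echo "backup_job_last_success_timestamp 1650000000" | curl --data-binary @- http://pushgateway.od.com:9091/metrics/job/backup_job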

6. Prometheus architecture


k8s_sd: service discovery based on k8s metadata; the core of this course
file_sd: for applications that do not run inside k8s, discovery is driven by files

  • Retrieval: the data collection component
  • TSDB: the time series database, stored on SSD
  • HTTP server: Prometheus ships its own HTTP server (1. it provides the query API that Grafana reads from; 2. it feeds data to Alertmanager)

7. Prometheus vs. zabbix


For container monitoring, zabbix auto-discovery cannot cope: in k8s, after a code update a pod is deleted and a new one is pulled up, and zabbix interprets this as the old machine going down (and alerts) plus a new machine that has just been added and needs monitoring.

8. The kube-state-metrics exporter

8.1 Overview

kube-state-metrics is the monitoring agent that collects basic state information about the K8S cluster:
how many nodes the cluster has, how many deployments there are, how many revisions a deployment has gone through; kube-state-metrics collects all of this (a couple of example metrics below).
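
For illustration, two of the metrics it exposes (metric names as documented for kube-state-metrics v1.5; they can be queried through PromQL once Prometheus scrapes this exporter):

kube_node_info                              # one series per cluster node
kube_deployment_status_replicas_available   # available replicas per deployment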

8.2 Pull the image

The code is hosted in the quay.io container registry, which is provided by Red Hat and similar to docker.io.

# On the ops host: pull, tag, and push the image
[root@jdss7-200]# docker pull quay.io/coreos/kube-state-metrics:v1.5.0
[root@jdss7-200]# docker images | grep kube-state
[root@jdss7-200]# docker tag 91599517197a harbor.od.com/public/kube-state-metrics:v1.5.0
[root@jdss7-200]# docker push harbor.od.com/public/kube-state-metrics:v1.5.0
[root@jdss7-200]# cd /data/k8s-yaml/
[root@jdss7-200]# mkdir -pv kube-state-metrics

8.3 Prepare the resource manifests rbac.yaml and dp.yaml

rbac.yaml

apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: kube-state-metrics
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: kube-state-metrics
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs:
  - list
  - watch
- apiGroups:
  - policy
  resources:
  - poddisruptionbudgets
  verbs:
  - list
  - watch
- apiGroups:
  - extensions
  resources:
  - daemonsets
  - deployments
  - replicasets
  verbs:
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - statefulsets
  verbs:
  - list
  - watch
- apiGroups:
  - batch
  resources:
  - cronjobs
  - jobs
  verbs:
  - list
  - watch
- apiGroups:
  - autoscaling
  resources:
  - horizontalpodautoscalers
  verbs:
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: kube-system  

dp.yaml
It contains a readinessProbe (readiness probe): a container is only considered started once the probe's condition is satisfied; if the condition is not met, no traffic is routed to it.

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "2"
  labels:
    grafanak8sapp: "true"
    app: kube-state-metrics
  name: kube-state-metrics
  namespace: kube-system
spec:
  selector:
    matchLabels:
      grafanak8sapp: "true"
      app: kube-state-metrics
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        grafanak8sapp: "true"
        app: kube-state-metrics
    spec:
      containers:
      - name: kube-state-metrics
        image: harbor.od.com/public/kube-state-metrics:v1.5.0
        imagePullPolicy: Always
        ports:
        - containerPort: 8080
          name: http-metrics
          protocol: TCP
        readinessProbe: # readiness probe
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
      serviceAccountName: kube-state-metrics 

8.4 Apply the resource manifests

kubectl apply -f http://k8s-yaml.od.com/kube-state-metrics/rbac.yaml
kubectl apply -f http://k8s-yaml.od.com/kube-state-metrics/dp.yaml

8.5 Verify

In the kube-system namespace:

[root@jdss7-21 ~]# kubectl get pods -n kube-system -o wide
NAME                                    READY   STATUS    RESTARTS   AGE    IP           NODE                NOMINATED NODE   READINESS GATES
coredns-6b6c4f9648-zszt9                1/1     Running   1          83d    172.7.22.4   jdss7-22.host.com   <none>           <none>
kube-state-metrics-585576bc4-bsb9t      1/1     Running   0          3m4s   172.7.22.9   jdss7-22.host.com   <none>           <none>
kubernetes-dashboard-76dcdb4677-t7m7t   1/1     Running   1          64d    172.7.22.5   jdss7-22.host.com   <none>           <none>
traefik-ingress-8bk4l                   1/1     Running   1          81d    172.7.21.2   jdss7-21.host.com   <none>           <none>
traefik-ingress-bzmw6                   1/1     Running   1          81d    172.7.22.3   jdss7-22.host.com   <none>           <none>
[root@jdss7-21 ~]# curl 'http://172.7.22.9:8080/healthz'
ok[root@jdss7-21 ~]# curl 'http://172.7.22.9:8080/metrics'

9.node-exporter

9.1 Prepare the image

node-exporter collects infrastructure information from the k8s worker nodes: CPU usage, memory usage, disk usage, network I/O.
We still use version 0.15.0.

[root@jdss7-200]# docker pull prom/node-exporter:v0.15.0
[root@jdss7-200]# docker images |grep node-exporter
[root@jdss7-200]# docker tag 12d51ffa2b22 harbor.od.com/public/node-exporter:v0.15.0
[root@jdss7-200]# docker push harbor.od.com/public/node-exporter:v0.15.0
[root@jdss7-200]# cd /data/k8s-yaml/
[root@jdss7-200]# mkdir -pv node-exporter

9.2 Prepare the resource manifest

ds.yaml

kind: DaemonSet
apiVersion: extensions/v1beta1
metadata:
  name: node-exporter
  namespace: kube-system
  labels:
    daemon: "node-exporter"
    grafanak8sapp: "true"
spec:
  selector:
    matchLabels:
      daemon: "node-exporter"
      grafanak8sapp: "true"
  template:
    metadata:
      name: node-exporter
      labels:
        daemon: "node-exporter"
        grafanak8sapp: "true"
    spec:
      volumes:
      - name: proc
        hostPath:
          path: /proc
          type: ""
      - name: sys
        hostPath:
          path: /sys
          type: ""
      containers:
      - name: node-exporter
        image: harbor.od.com/public/node-exporter:v0.15.0
        imagePullPolicy: Always
        args:
        - --path.procfs=/host_proc
        - --path.sysfs=/host_sys
        ports:
        - name: node-exporter
          hostPort: 9100
          containerPort: 9100
          protocol: TCP
        volumeMounts:
        - name: sys
          readOnly: true 
          mountPath: /host_sys
        - name: proc
          readOnly: true 
          mountPath: /host_proc
      hostNetwork: true # uses the host's network namespace

9.3 Apply the resource manifest

kubectl apply -f http://k8s-yaml.od.com/node-exporter/ds.yaml

9.4 Verify

curl http://localhost:9100/metrics
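
To narrow that output down to the CPU series queried later in PromQL (metric name node_cpu as exposed by node-exporter v0.15; just a quick sanity check, not a required step):

curl -s http://localhost:9100/metrics | grep '^node_cpu'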

10.cadvisor

An important exporter for monitoring the resources used inside our containers.
It observes from the outside how much CPU and other resources a container consumes; it was originally integrated into kubelet and was split out after version 1.9.
cadvisor asks kubelet for data, and kubelet asks the docker engine for data.
We use version 0.28.3.

10.1 Pull the image

For k8s before version 1.9, cadvisor comes bundled with kubelet and needs no extra setup.

[root@jdss7-200]# docker pull google/cadvisor:v0.28.3
[root@jdss7-200]# docker images | grep cadvisor
[root@jdss7-200]# docker tag 75f00e3ec333 harbor.od.com/public/cadvisor:v0.28.3
[root@jdss7-200]# docker push harbor.od.com/public/cadvisor:v0.28.3
[root@jdss7-200]# cd /data/k8s-yaml/
[root@jdss7-200]# mkdir -p cadvisor

10.2 Three ways to manually influence the K8S scheduling policy

  • Taints and tolerations
    • Taint: a taint set on a worker node
      kubectl taint node jdss7-21.host.com node-role.kubernetes.io/master=master:NoSchedule
      kubectl get nodes -o wide
      # delete the taint
      kubectl taint node jdss7-21.host.com node-role.kubernetes.io/master-

    • Toleration: whether a pod can tolerate a taint
  • nodeName: run the Pod on a specified node
  • nodeSelector: a label selector that makes pods run on a specified class of nodes (see the sketch after this list)
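
A minimal sketch of the last two methods inside a pod template (the disktype label is made up for illustration and would first have to be put on a node, e.g. kubectl label node jdss7-21.host.com disktype=ssd):

spec:
  nodeName: jdss7-21.host.com   # pin the pod to this exact node
  # or:
  nodeSelector:
    disktype: ssd               # schedule onto any node carrying this label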

10.3 Create the resource manifest

ds.yaml

kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: cadvisor
  namespace: kube-system
  labels:
    app: cadvisor
spec:
  selector:
    matchLabels:
      name: cadvisor
  template:
    metadata:
      labels:
        name: cadvisor
    spec:
      hostNetwork: true  # uses the host's network namespace
      tolerations: # tolerate taints (an important way for users to influence k8s scheduling and intervene in kube-scheduler)
      - key: node-role.kubernetes.io/master # the master-node taint
        effect: NoSchedule # whose effect is NoSchedule; tolerating it lets cadvisor run on master nodes too
      containers:
      - name: cadvisor
        image: harbor.od.com/public/cadvisor:v0.28.3
        imagePullPolicy: Always
        volumeMounts:
        - name: rootfs
          readOnly: true 
          mountPath: /rootfs
        - name: var-run
          mountPath: /var/run
        - name: sys
          mountPath: /sys
          readOnly: true
        - name: docker
          mountPath: /var/lib/docker
          readOnly: true
        ports:
        - name: http
          containerPort: 4194
          protocol: TCP
        readinessProbe:
          tcpSocket:
            port: 4194
          initialDelaySeconds: 5
          periodSeconds: 10
        args:
        - --housekeeping_interval=10s
        - --port=4194
      terminationGracePeriodSeconds: 30
      volumes:
      - name: rootfs
        hostPath:
          path: /
      - name: var-run
        hostPath:
          path: /var/run
      - name: sys
        hostPath:
          path: /sys
      - name: docker
        hostPath:
          path: /data/docker

10.4 Fix the symlink on the worker nodes

mount -o remount,rw /sys/fs/cgroup/
ln -s /sys/fs/cgroup/cpu,cpuacct /sys/fs/cgroup/cpuacct,cpu # the container expects this path

The final result:

10.5 Apply the resource manifest

kubectl apply -f http://k8s-yaml.od.com/cadvisor/ds.yaml
kubectl get pods -n kube-system

10.6 cadvisor listens on port 4194

11.blackbox-exporter

A very important tool: it probes whether your business containers are alive and whether a container is down.
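
Once it is running behind the ingress configured below, a probe can also be triggered by hand through its standard /probe endpoint; the target host:port here is made up for illustration:

curl 'http://blackbox.od.com/probe?module=tcp_connect&target=172.7.21.5:20880'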

11.1 Prepare the image

[root@jdss7-200]# docker pull  prom/blackbox-exporter:v0.15.1
[root@jdss7-200]# docker images | grep blackbox
[root@jdss7-200]# docker tag 81b70b6158be harbor.od.com/public/blackbox-exporter:v0.15.1
[root@jdss7-200]# docker push harbor.od.com/public/blackbox-exporter:v0.15.1

11.2 Prepare the resource manifests

cd /data/k8s-yaml/blackbox-exporter
cm.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    app: blackbox-exporter
  name: blackbox-exporter
  namespace: kube-system
data:
  blackbox.yaml: |-
    modules:
      http_2xx:
        prober: http
        timeout: 2s
        http:
          valid_http_versions: ["HTTP/1.1","HTTP/2"]
          valid_status_codes: [200,301,302]
          method: GET
          preferred_ip_protocol: "ip4"
      tcp_connect:
        prober: tcp
        timeout: 2s  

dp.yaml

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: 1
  labels:
    app: blackbox-exporter
  name: blackbox-exporter
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: blackbox-exporter
  template:
    metadata:
      labels:
        app: blackbox-exporter
    spec:
      volumes:
      - name: config
        configMap:
          name: blackbox-exporter
          defaultMode: 420
      containers:
      - name: blackbox-exporter
        image: harbor.od.com/public/blackbox-exporter:v0.15.1
        imagePullPolicy: Always
        args:
        - --config.file=/etc/blackbox_exporter/blackbox.yaml
        - --log.level=info
        - --web.listen-address=:9115
        ports:
        - containerPort: 9115
          name: blackbox-port
          protocol: TCP
        resources:
          limits:
            cpu: 200m
            memory: 256Mi
          requests:
            cpu: 100m
            memory: 50Mi
        volumeMounts:
        - name: config
          mountPath: /etc/blackbox_exporter        
        readinessProbe:
          tcpSocket:
            port: 9115
          initialDelaySeconds: 5
          timeoutSeconds: 5
          periodSeconds: 10
          successThreshold: 1
          failureThreshold: 3   

svc.yaml

kind: Service
apiVersion: v1
metadata:
  name: blackbox-exporter
  namespace: kube-system
spec:
  selector:
    app: blackbox-exporter
  ports:
    - name: blackbox-port
      protocol: TCP
      port: 9115

ingress.yaml

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: blackbox-exporter
  namespace: kube-system
spec:
  rules:
  - host: blackbox.od.com
    http:
      paths:
      - path: /
        backend:
          serviceName: blackbox-exporter
          servicePort: blackbox-port

11.3 Apply the resource manifests

kubectl apply -f http://k8s-yaml.od.com/blackbox-exporter/cm.yaml
kubectl apply -f http://k8s-yaml.od.com/blackbox-exporter/dp.yaml
kubectl apply -f http://k8s-yaml.od.com/blackbox-exporter/svc.yaml
kubectl apply -f http://k8s-yaml.od.com/blackbox-exporter/ingress.yaml

12. Taint and toleration scheduling

Say we do not want this pod to be scheduled onto the jdss7-21 machine.

Get the deployment:

kubectl get deploy nginx-dp -o yaml -n kube-public > nginx-dp.yaml

Notes:

Configuration items to watch in dp.yaml:
dnsPolicy: ClusterFirst // resolve through coreDNS
dnsPolicy: Default // resolve through the host's /etc/resolv.conf

12.1 Add a taint to the 7-21 machine; can pods still be scheduled onto 7-21?

kubectl taint node jdss7-21.host.com quedian=buxijiao:NoSchedule
# adds a taint with key quedian and value buxijiao; the policy when this taint is encountered is NoSchedule (do not schedule)
kubectl describe node jdss7-21.host.com

Scale the deployment up on the dashboard.

You will find pods are still scheduled only onto the jdss7-22 machine.

12.2 The tolerations field lets a pod tolerate a taint: I accept your taint; even if you don't wash your feet (buxijiao), I can live with it and still want to be scheduled here

tolerations literally means to tolerate.
deployment.yaml:

template:
  spec:
    tolerations:
    - key: quedian
      value: buxijiao
      effect: NoSchedule

12.3 Remove the taint named quedian from the 7-21 machine

kubectl taint node jdss7-21.host.com quedian-

12.4 Add the taint named quedian to the 7-21 machine with the NoExecute effect

NoExecute: pods that do not tolerate the taint will no longer be scheduled onto the node, and pods already running on it without a matching toleration are evicted.

template:
  spec:
    tolerations:
    - key: quedian
      value: buxijiao
      effect: NoExecute

12.5 The right way to run Prometheus

Give prometheus a dedicated worker node that runs nothing but prometheus:
taint that node and let only prometheus tolerate the taint,
so prometheus alone ends up running on that worker node (sketch below).
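
A minimal sketch of that setup (the node name and the taint key/value are made up; the toleration would go into prometheus' dp.yaml under spec.template.spec):

kubectl taint node jdss7-23.host.com node-role=prometheus:NoSchedule
# and in the prometheus pod template:
#   tolerations:
#   - key: node-role
#     value: prometheus
#     effect: NoSchedule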

12.6 From taints to labels: where label selectors come in

Label the worker nodes, e.g. disktype=ssd on machine A and disktype=sas on machine B.
For IO-intensive applications, intervene in scheduling with a nodeSelector on that label so they are preferably scheduled onto the ssd machines.

12.7 Note

Taints and tolerations are used rather rarely; prometheus is about the only user, and other scenarios hardly ever need them.

12.8 NoExecute

When a node needs to go offline for maintenance, add a NoExecute taint to it (e.g. jdss7-21): the pods running on it are drained until it carries no traffic,
then remove jdss7-21 from the k8s cluster and take it down for repair.
It also helps when an application was scheduled onto a non-ssd machine and performs badly.

12.9 NoSchedule

NoSchedule only affects new scheduling; pods that are already running are not touched.

12.10 Add multiple taints to a node

kubectl taint node jdss7-21.host.com qu=buxiaojiao:NoSchedule
kubectl taint node jdss7-21.host.com buxiaozao=:NoSchedule // only the key buxiaozao is given, no value

13. Install and deploy prometheus-server

13.1 Prepare the image

[root@jdss7-200]# docker pull prom/prometheus:v2.14.0
[root@jdss7-200]# docker images | grep prometheus
[root@jdss7-200]# docker tag 7317640d55e harbor.od.com/infra/prometheus:v2.14.0
[root@jdss7-200]# docker push harbor.od.com/infra/prometheus:v2.14.0

13.2 Prepare the resource manifests

mkdir -p /data/k8s-yaml/prometheus
rbac.yaml

apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: prometheus
  namespace: infra
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: prometheus
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - nodes/metrics
  - services
  - endpoints
  - pods
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: infra

dp.yaml

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "5"
  labels:
    name: prometheus
  name: prometheus
  namespace: infra
spec:
  progressDeadlineSeconds: 600
  replicas: 1 # only one replica, because of the persistent tsdb data
  revisionHistoryLimit: 7
  selector:
    matchLabels:
      app: prometheus
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      nodeName: jdss7-21.host.com # manual intervention: pin it to the jdss7-21 machine
      containers:
      - name: prometheus
        image: harbor.od.com/infra/prometheus:v2.14.0
        imagePullPolicy: Always
        command:
        - /bin/prometheus
        args:
        - --config.file=/data/etc/prometheus.yaml
        - --storage.tsdb.path=/data/prom-db
        - --storage.tsdb.retention=72h # how long to keep data in tsdb
        - --storage.tsdb.min-block-duration=10m # only keep the most recent 10 minutes of data in memory
        ports:
        - containerPort: 9090
          protocol: TCP
        volumeMounts:
        - mountPath: /data
          name: data        
        resources:
          requests: # resources requested as soon as the container starts
            cpu: "1000m" # 1000 millicores = 1 core
            memory: "1.5Gi"
          limits: # when usage reaches the limits, k8s kills the container (OOM)
            cpu: "2000m"
            memory: "3Gi"        
      imagePullSecrets:
      - name: harbor
      securityContext:
        runAsUser: 0
      serviceAccountName: prometheus
      volumes:
      - name: data
        nfs:
          server: jdss7-200
          path: /data/nfs-volume/prometheus

svc.yaml

kind: Service
apiVersion: v1
metadata:
  name: prometheus
  namespace: infra
spec:
  selector:
    app: prometheus
  ports:
    - protocol: TCP
      port: 9090
      targetPort: 9090

ingress.yaml

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: traefik
  name: prometheus
  namespace: infra
spec:
  rules:
  - host: prometheus.od.com
    http:
      paths:
      - path: /
        backend:
          serviceName: prometheus
          servicePort: 9090

13.3 Prepare the nfs directory

[root@jdss7-200]# cd /data/nfs-volume/
[root@jdss7-200]# mkdir -p prometheus
[root@jdss7-200]# mkdir -p prometheus/{etc,prom-db}

13.4 Copy the certificates

# prometheus talks to the apiServer, so it needs these certificates
[root@jdss7-200]# cp /opt/certs/ca.pem /data/nfs-volume/prometheus/etc/
[root@jdss7-200]# cp /opt/certs/client.pem /data/nfs-volume/prometheus/etc/
[root@jdss7-200]# cp /opt/certs/client-key.pem /data/nfs-volume/prometheus/etc/  # client-key.pem is the private key; its permissions must be 600
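
To enforce that permission explicitly (a small optional step; the path is the one prepared above):

[root@jdss7-200]# chmod 600 /data/nfs-volume/prometheus/etc/client-key.pem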

13.5 Prepare the prometheus configuration file

prometheus.yaml

global:
  scrape_interval: 15s # how often to scrape
  scrape_timeout: 10s # timeout for each scrape
  evaluation_interval: 15s # how often to evaluate rules
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
scrape_configs:
- job_name: etcd
  honor_timestamps: true
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  static_configs: # only etcd uses static configuration; everything else is auto-discovered
  - targets:
    - 10.4.7.12:2379
    - 10.4.7.21:2379
    - 10.4.7.22:2379
  tls_config:
    ca_file: /data/etc/ca.pem
    cert_file: /data/etc/client.pem
    key_file: /data/etc/client-key.pem
    insecure_skip_verify: false
- job_name: 'kubernetes-apiservers'
  kubernetes_sd_configs:
  - role: endpoints
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace,__meta_kubernetes_service_name,__meta_kubernetes_endpoint_port_name]
    action: keep
    regex: default;kubernetes;https
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.*)
  - source_labels: [__address__,__meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
- job_name: 'kubernetes-kubelet'
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __address__
    replacement: ${1}:10255
- job_name: 'kubernetes-cadvisor'
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __address__
    replacement: ${1}:4194
- job_name: kubernetes-kube-state
  honor_timestamps: true
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs: # k8s service discovery configuration
  - role: pod # discovers every pod in the k8s cluster
  relabel_configs: # not every pod should enter this job; only some are filtered in
  - separator: ;
    regex: __meta_kubernetes_pod_label_(.*)
    replacement: $1
    action: labelmap
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: kubernetes_namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_name]
    separator: ;
    regex: (.*)
    target_label: kubernetes_pod_name
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_label_grafanak8sapp] # a pod label from the metadata named grafanak8sapp; when it matches, the action is applied
    separator: ;
    regex: .*true.*
    replacement: $1
    action: keep  
  - source_labels: [__meta_kubernetes_pod_label_daemon,__meta_kubernetes_pod_node_name]
    separator: ;
    regex: node-exporter;(.*)
    target_label: nodename
    replacement: $1
    action: replace  
- job_name: 'blackbox_http_pod_probe'
  metrics_path: /probe
  kubernetes_sd_configs:
  - role: pod
  params:
    module: [http_2xx]
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_blackbox_scheme]
    action: keep
    regex: http
  - source_labels: [__address__,__meta_kubernetes_pod_annotation_blackbox_port,__meta_kubernetes_pod_annotation_blackbox_path]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+);(.+)
    replacement: $1:$2$3
    target_label: __param_target
  - action: replace
    target_label: __address__
    replacement: blackbox-exporter.kube-system:9115   
  - source_labels: [__param_target]
    target_label: instance
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.*)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
- job_name: 'blackbox_tcp_pod_probe'
  metrics_path: /probe
  kubernetes_sd_configs:
  - role: pod
  params:
    module: [tcp_connect]
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_blackbox_scheme]
    action: keep
    regex: tcp
  - source_labels: [__address__,__meta_kubernetes_pod_annotation_blackbox_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __param_target
  - action: replace
    target_label: __address__
    replacement: blackbox-exporter.kube-system:9115   
  - source_labels: [__param_target]
    target_label: instance
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.*)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name   
- job_name: traefik
  honor_timestamps: true
  scrape_interval: 15s
  scrape_timeout: 10s
  scheme: http
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    separator: ;
    regex: traefik
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: $1
    action: replace   
  - source_labels: [__address__,__meta_kubernetes_pod_annotation_prometheus_io_port]
    separator: ;
    regex: ([^:]+)(?::\d+)?;(\d+)
    target_label: __address__
    replacement: $1:$2
    action: replace   
  - separator: ;
    regex: __meta_kubernetes_pod_label_(.*)
    replacement: $1
    action: labelmap
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: kubernetes_namespace  

13.6 Apply the resource manifests

kubectl apply -f http://k8s-yaml.od.com/prometheus/rbac.yaml
kubectl apply -f http://k8s-yaml.od.com/prometheus/dp.yaml
kubectl apply -f http://k8s-yaml.od.com/prometheus/svc.yaml
kubectl apply -f http://k8s-yaml.od.com/prometheus/ingress.yaml

13.7 Verify

prometheus.od.com

On the Status tab, Configuration shows the loaded configuration file,
and Targets shows the scrape targets.

The meaning of the action field inside relabel_configs:

13.8 Run a query


node_cpu{nodename="jdss7-21.host.com"}
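
A couple of further example queries against the same data (metric names as exposed by node-exporter v0.15; the nodename label is attached by the relabel rules configured above):

sum(rate(node_cpu{nodename="jdss7-21.host.com",mode!="idle"}[5m]))   # rough CPU busy rate of one node, in cores
node_memory_MemAvailable{nodename="jdss7-21.host.com"}               # available memory on that node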

14. Monitoring traefik with Prometheus

14.1 prometheus.yaml already contains a traefik job

- job_name: traefik
  honor_timestamps: true
  scrape_interval: 15s
  scrape_timeout: 10s
  scheme: http
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme] # keep pods whose prometheus_io_scheme annotation equals traefik
    separator: ;
    regex: traefik
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: $1
    action: replace   
  - source_labels: [__address__,__meta_kubernetes_pod_annotation_prometheus_io_port]
    separator: ;
    regex: ([^:]+)(?::\d+)?;(\d+)
    target_label: __address__
    replacement: $1:$2
    action: replace   
  - separator: ;
    regex: __meta_kubernetes_pod_label_(.*)
    replacement: $1
    action: labelmap
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: kubernetes_namespace    

14.2 Add the annotations to traefik's deployment (pod controller)

"annotations":{
  "prometheus_io_scheme":"traefik",
  "prometheus_io_path":"/metrics",
  "prometheus_io_port":"8080"
}

The annotations go at the same level as labels, inside the template section (see the sketch below).
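
A minimal sketch of where that lands in the traefik deployment (only the relevant fragment; the label shown is illustrative):

spec:
  template:
    metadata:
      labels:
        app: traefik-ingress
      annotations:
        prometheus_io_scheme: "traefik"
        prometheus_io_path: "/metrics"
        prometheus_io_port: "8080"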

15. Hooking blackbox up to prometheus

15.1 The blackbox_tcp_pod_probe job

Add the following annotations to the deployment (pod controller) of the pod that blackbox should watch.
Liveness is judged by probing the container's TCP port.

"annotations":{
  "blackbox_port":"20880",
  "blackbox_scheme":"tcp"
}

15.2 The blackbox_http_pod_probe job

Add the following annotations to the deployment (pod controller) of the pod that blackbox should watch.
Liveness is judged by probing the container over HTTP.

"annotations":{
  "blackbox_port":"8080",
  "blackbox_scheme":"http",
  "blackbox_path":"/"
}

16. Monitoring the liveness of dubbo-demo-service (the dubbo service provider) with prometheus

Add the annotations to the deployment.yaml (pod controller) of dubbo-demo-service,
at the same level as labels, inside the template section.

"annotations":{
  "blackbox_port":"20880",
  "blackbox_scheme":"tcp"
}

The blackbox_tcp_pod_probe job in prometheus then shows data for it. Check:
blackbox.od.com


blackbox-exporter.kube-system is the name of the service inside the k8s cluster.

17. Monitoring JVM information with prometheus

This uses the kubernetes-pods job.
The key annotations:

"annotations":{
  "prometheus_io_scrape" : "true",
  "prometheus_io_port" : "12346",
  "prometheus_io_path": "/"
}

18. Deploy grafana

18.1 Pull the image (v5.4.2)

[root@jdss7-200]# docker pull grafana/grafana:5.4.2
[root@jdss7-200]# docker images | grep grafana
[root@jdss7-200]# docker tag 6f18ddf9e552 harbor.od.com/infra/grafana:v5.4.2
[root@jdss7-200]# docker push harbor.od.com/infra/grafana:v5.4.2

18.2 Prepare the resource manifests

rbac.yaml

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: grafana
rules:
- apiGroups:
  - "*"
  resources:
  - namespaces
  - deployments
  - pods
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: grafana
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: grafana
subjects:
- kind: User
  name: k8s-node

dp.yaml

kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: grafana
  namespace: infra
  labels:
    app: grafana
    name: grafana
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 7
  selector:
    matchLabels:
      name: grafana
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1      
  template:
    metadata:
      labels:
        app: grafana
        name: grafana
    spec:
      containers:
      - name: grafana
        image: harbor.od.com/infra/grafana:v5.4.2
        imagePullPolicy: Always
        ports:
        - containerPort: 3000
          protocol: TCP
        volumeMounts:
        - name: data
          mountPath: /var/lib/grafana
      imagePullSecrets:
      - name: harbor
      securityContext:
        runAsUser: 0
      volumes:
      - nfs: 
          server: jdss7-200
          path: /data/nfs-volume/grafana
        name: data

svc.yaml

kind: Service
apiVersion: v1
metadata:
  name: grafana
  namespace: infra
spec:
  selector:
    app: grafana
  ports:
    - protocol: TCP
      port: 3000
      targetPort: 3000

ingress.yaml

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: grafana
  namespace: infra
spec:
  rules:
  - host: grafana.od.com
    http:
      paths:
      - path: /
        backend:
          serviceName: grafana
          servicePort: 3000

18.3 Apply the resource manifests

kubectl apply -f http://k8s-yaml.od.com/grafana/rbac.yaml
kubectl apply -f http://k8s-yaml.od.com/grafana/dp.yaml
kubectl apply -f http://k8s-yaml.od.com/grafana/svc.yaml
kubectl apply -f http://k8s-yaml.od.com/grafana/ingress.yaml

18.4 Test

grafana.od.com

The account is admin with password admin;
change both to admin123.

18.5 Configuration

  1. Setting -> Preferences
    Change OrganizationName to OldboyEdu,
    set UI Theme to the Light style,
    keep Home Dashboard as Default,
    and set Timezone to Local browser time.
  2. Configuration -> Plugins
    Install the Kubernetes App plugin:
grafana-cli plugins install grafana-kubernetes-app
grafana-cli plugins install grafana-clock-panel
grafana-cli plugins install grafana-piechart-panel
grafana-cli plugins install briangann-gauge-panel
grafana-cli plugins install natel-discrete-panel
  3. Restart the grafana container so the plugins take effect.

  4. Configuration -> Add data source
    Add the data source as shown in the screenshots.
    You need the CA Cert (on the 7-200 machine: cat /opt/certs/ca.pem),
    the Client Cert (on 7-200: cat /opt/certs/client.pem),
    and the Client Key (on 7-200: cat /opt/certs/client-key.pem).

    Leave Basic Auth unchecked.

  5. Configure kubernetes under Plugins
    Enable it; a new icon appears on the left side. Click New Cluster.
    The http address to enter is the apiServer address.

    Paste the key contents in again, choose prometheus as the datasource, and save.

    The Overview page then shows dashboards.

  6. The datasource under Configuration must be tested (Save & Test), otherwise grafana will not connect to it.
  7. The dashboards are now available.
    Container is the most useful one, but the Container dashboard needs one extra parameter changed.

  8. You can also go to grafana.com/grafana/dashboards and download the most-starred dashboard; each one has an ID.
  9. Import dashboard:
    load ID 11074 to reuse someone else's dashboard.

19.AlertManager

19.1 Pull the image

docker pull docker.io/prom/alertmanager:v0.14.0
docker images | grep alert
docker tag 30954e96cbe8 harbor.od.com/infra/alertmanager:v0.14.0
docker push harbor.od.com/infra/alertmanager:v0.14.0

19.2 Prepare the resource manifests

cm.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: infra
data:
  config.yaml: |-
    global:
      # how long with no further notifications before an alert is declared resolved
      resolve_timeout: 5m
      # email sending configuration
      smtp_smarthost: 'smtp.163.com:25'
      smtp_from: 'wa2319@163.com'
      smtp_auth_username: 'ws2319@163.com'
      smtp_auth_password: 'xxxxx'
      smtp_require_tls: false
    # the root route that every alert enters; it defines the alert distribution policy
    route:
      # labels used to regroup incoming alerts; e.g. alerts carrying cluster=A and alertname=LatencyHigh end up in one group
      group_by: ['alertname','cluster']
      # after a new alert group is created, wait at least group_wait before the first notification, so several alerts of the same group can be sent together
      group_wait: 30s
      # after a notification has been sent, wait group_interval before sending a new batch for the same group
      group_interval: 5m
      # if an alert has already been sent successfully, wait repeat_interval before re-sending it
      repeat_interval: 5m
      # the default receiver: alerts not matched by any route go here
      receiver: default
    receivers:
    - name: 'default'
      email_configs:
      - to: '87527941@qq.com'

dp.yaml

kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: alertmanager
  namespace: infra
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager     
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      containers:
      - name: alertmanager
        image: harbor.od.com/infra/alertmanager:v0.14.0
        args:
        - --config.file=/etc/alertmanager/config.yaml
        - --storage.path=/alertmanager
        imagePullPolicy: Always
        ports:
        - containerPort: 9093
          name: alertmanager
        volumeMounts:
        - name: alertmanager-cm
          mountPath: /etc/alertmanager
      imagePullSecrets:
      - name: harbor
      volumes:
      - name: alertmanager-cm
        configMap:
          name: alertmanager-config

svc.yaml

kind: Service
apiVersion: v1
metadata:
  name: alertmanager
  namespace: infra
spec:
  selector:
    app: alertmanager
  ports:
    - port: 80
      targetPort: 9093

19.3 Apply the resource manifests

kubectl apply -f http://k8s-yaml.od.com/alertmanager/cm.yaml
kubectl apply -f http://k8s-yaml.od.com/alertmanager/dp.yaml
kubectl apply -f http://k8s-yaml.od.com/alertmanager/svc.yaml

19.4 Prometheus alerting rules

vim /data/nfs-volume/prometheus/etc/rules.yml

groups:
- name: base-rules   # the group name is arbitrary
  rules:
  - alert: goroutines_gt_70
    expr: go_goroutines > 70
    for: 5s
    labels:
      status: yellow
    annotations:
      summary: "goroutines over 70, current value {{ $value }}"
      description: "current instance {{ $labels.instance }}"
  - alert: goroutines_gt_90
    expr: go_goroutines > 90
    for: 5s
    labels:
      status: red
    annotations:
      summary: "goroutines over 90, current value {{ $value }}"
      description: "current instance {{ $labels.instance }}"

19.5 Add the alerting configuration to prometheus

vim /data/nfs-volume/prometheus/etc/prometheus.yaml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager"] # this is the service name (in the infra namespace)

rule_files:
 - "/data/etc/rules.yml"

Then reload prometheus without deleting the pod (it is pinned to the 7-21 machine by nodeName):
kill -SIGHUP 24705 // the prometheus pid
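
A sketch of locating that pid on jdss7-21 before sending the signal (the grep pattern is illustrative):

ps -ef | grep '/bin/prometheus' | grep -v grep   # note the pid of the prometheus process
kill -SIGHUP <pid>                               # reload the configuration and rule files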

19.6 Test the alert

Stop the consumer.
