day06 prometheus
Prometheus
A new-generation container cloud monitoring stack: Prometheus + Grafana
1. The concept of cloud-native (Cloud Native) applications
Container packaging decouples an application from the infrastructure it used to depend on; it no longer depends strongly on the operating system
Packaging is uniform: everything is started with docker run
Configuration consists of configurable variables that live outside the program; the same program behaves differently under different configurations
- Configuration passed into the container through environment variables (docker -e)
- Configuration applied through program startup arguments (java -D -jar a.jar)
- Configuration managed centrally (CRUD) in a configuration center (k8s has ConfigMap)
One image works in every environment once the right environment variables and startup arguments are supplied (see the sketch below)
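A minimal sketch of that idea, assuming a hypothetical image harbor.od.com/app/demo:v1.0 and illustrative variable names (not from the course environment):
# same image, different behavior per environment, driven by environment variables
docker run -e SPRING_PROFILES_ACTIVE=test harbor.od.com/app/demo:v1.0
docker run -e SPRING_PROFILES_ACTIVE=prod harbor.od.com/app/demo:v1.0
# or driven by startup arguments
docker run harbor.od.com/app/demo:v1.0 java -Dserver.port=8080 -jar a.jar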
2. What a DevOps engineer should do
- Containerize the company's self-developed applications (re-package them with Docker)
- Push containerized applications toward cloud-native applications (build once, run anywhere)
- Use a container orchestration framework (Kubernetes) to orchestrate business containers in a reasonable, standardized, professional way
3. Course outline
4. Prometheus (CNCF, the Cloud Native Computing Foundation, which hosts the k8s and prometheus projects)
Written by former Google engineers
Official site: https://Prometheus.io
Source code: https://github.com/prometheus/prometheus
5. Features of Prometheus
- Multi-dimensional data model: time-series data identified by a metric name and key/value pairs (similar to multiple columns in HBase)
- Built-in time-series database: TSDB, a time-series database (compare HBase, MongoDB); what a relational database like MySQL does best is strong transactions
- PromQL: a flexible query language that uses the multi-dimensional data for complex queries (different retrieval and filter conditions pull out the monitored metrics, identified by metric name)
It is still quite hard, and different from the structured query language SQL you already know
A set of query conditions filters out a result that is not a single number but a function over time (see the examples below)
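A couple of hedged PromQL examples against metrics that appear later in these notes (node_cpu is the node-exporter v0.15 metric name; the label values are illustrative):
go_goroutines{instance="172.7.21.5:8080"}             # select one series by its labels
sum(rate(node_cpu{mode!="idle"}[5m])) by (instance)   # non-idle CPU rate per node over 5 minutes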
- Pull-based collection of time-series data over HTTP (exporters)
Prometheus also has a client/server split; the client/agent side is called an exporter (data is pulled from it)
What is the difference between an exporter and the zabbix agent?
An exporter only exposes an HTTP endpoint; for MySQL and Redis you download dedicated exporters, configure them to connect to MySQL/Redis, and each exporter then exposes its own HTTP endpoint
Monitoring the container cloud requires 4 exporters
- Also supports collecting data through the PushGateway component
Only for specific scenarios
Short-lived, transient jobs have to push data themselves, which is what PushGateway is for (a hedged sketch follows)
Used very rarely in production
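A hedged sketch of how a short-lived job would push a metric; no PushGateway is deployed in this course, and the address pushgateway.od.com plus the metric name are purely illustrative:
echo "backup_job_last_success_timestamp $(date +%s)" | \
  curl --data-binary @- http://pushgateway.od.com/metrics/job/backup_job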
- Discovers targets through service discovery or static configuration
Static configuration is the easier part, about 3% of the course
Service discovery (auto-discovery) is about 97% of the course
- Multiple graphing modes and dashboard support
- Can be plugged into Grafana as a data source
6. Prometheus architecture
k8s_sd: service discovery based on Kubernetes metadata; the most important part of this course
file_sd: for applications not running in k8s, discovery is done through files
- Retrieval: the data-collection component
- TSDB: the time-series database, stored on SSD
- HTTP server: Prometheus ships its own HTTP server (1. it provides the query API that Grafana reads from; 2. it feeds data/alerts to Alertmanager); a query example follows
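A hedged example of hitting that HTTP API directly once the ingress from section 13 is up (prometheus.od.com is configured later in these notes):
curl -s 'http://prometheus.od.com/api/v1/query?query=up'
# returns JSON; each target's up series is 1 if the last scrape succeeded, 0 if it failed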
7. Prometheus vs. zabbix
For container monitoring, zabbix auto-discovery cannot cope: in k8s, after a code update a pod is deleted and a new pod is pulled up; zabbix treats that as an old machine going down (and alerts) plus a brand-new machine that needs to be added to monitoring
8. The kube-state-metrics exporter
8.1 Overview
kube-state-metrics is the monitoring agent that collects basic state information about the K8S cluster
How many nodes the cluster has, how many deployments there are, how many revisions a deployment has been through; all of that comes from kube-state-metrics
8.2 Download the image
The image is hosted in the quay.io container registry, provided by Red Hat, similar to docker.io
# tag and push the image on the ops host
[root@jdss7-200]# docker pull quay.io/coreos/kube-state-metrics:v1.5.0
[root@jdss7-200]# docker images | grep kube-state
[root@jdss7-200]# docker tag 91599517197a harbor.od.com/public/kube-state-metrics:v1.5.0
[root@jdss7-200]# docker push harbor.od.com/public/kube-state-metrics:v1.5.0
[root@jdss7-200]# cd /data/k8s-yaml/
[root@jdss7-200]# mkdir -pv kube-state-metrics
[root@jdss7-200]#
8.3 Prepare the resource manifests rbac.yaml and dp.yaml
rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
labels:
addonmanager.kubernetes.io/mode: Reconcile
kubernetes.io/cluster-service: "true"
name: kube-state-metrics
namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
labels:
addonmanager.kubernetes.io/mode: Reconcile
kubernetes.io/cluster-service: "true"
name: kube-state-metrics
rules:
- apiGroups:
- ""
resources:
- configmaps
- secrets
- nodes
- pods
- services
- resourcequotas
- replicationcontrollers
- limitranges
- persistentvolumeclaims
- persistentvolumes
- namespaces
- endpoints
verbs:
- list
- watch
- apiGroups:
- policy
resources:
- poddisruptionbudgets
verbs:
- list
- watch
- apiGroups:
- extensions
resources:
- daemonsets
- deployments
- replicasets
verbs:
- list
- watch
- apiGroups:
- apps
resources:
- statefulsets
verbs:
- list
- watch
- apiGroups:
- batch
resources:
- cronjobs
- jobs
verbs:
- list
- watch
- apiGroups:
- autoscaling
resources:
- horizontalpodautoscalers
verbs:
- list
- watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
labels:
addonmanager.kubernetes.io/mode: Reconcile
kubernetes.io/cluster-service: "true"
name: kube-state-metrics
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: kube-state-metrics
subjects:
- kind: ServiceAccount
name: kube-state-metrics
namespace: kube-system
dp.yaml
It contains a readinessProbe (readiness probe): only when the container satisfies the readiness probe condition is it considered started; if the condition is not met, no traffic is routed to it
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
annotations:
deployment.kubernetes.io/revision: "2"
labels:
grafanak8sapp: "true"
app: kube-state-metrics
name: kube-state-metrics
namespace: kube-system
spec:
selector:
matchLabels:
grafanak8sapp: "true"
app: kube-state-metrics
strategy:
rollingUpdate:
maxSurge: 25%
maxUnavailable: 25%
type: RollingUpdate
template:
metadata:
labels:
grafanak8sapp: "true"
app: kube-state-metrics
spec:
containers:
- name: kube-state-metrics
image: harbor.od.com/public/kube-state-metrics:v1.5.0
imagePullPolicy: Always
ports:
- containerPort: 8080
name: http-metrics
protocol: TCP
readinessProbe: # readiness probe
failureThreshold: 3
httpGet:
path: /healthz
port: 8080
scheme: HTTP
initialDelaySeconds: 5
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 5
serviceAccountName: kube-state-metrics
8.4 Apply the resource manifests
kubectl apply -f http://k8s-yaml.od.com/kube-state-metrics/rbac.yaml
kubectl apply -f http://k8s-yaml.od.com/kube-state-metrics/dp.yaml
8.5 Verify
In the kube-system namespace
[root@jdss7-21 ~]# kubectl get pods -n kube-system -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
coredns-6b6c4f9648-zszt9 1/1 Running 1 83d 172.7.22.4 jdss7-22.host.com <none> <none>
kube-state-metrics-585576bc4-bsb9t 1/1 Running 0 3m4s 172.7.22.9 jdss7-22.host.com <none> <none>
kubernetes-dashboard-76dcdb4677-t7m7t 1/1 Running 1 64d 172.7.22.5 jdss7-22.host.com <none> <none>
traefik-ingress-8bk4l 1/1 Running 1 81d 172.7.21.2 jdss7-21.host.com <none> <none>
traefik-ingress-bzmw6 1/1 Running 1 81d 172.7.22.3 jdss7-22.host.com <none> <none>
[root@jdss7-21 ~]# curl 'http://172.7.22.9:8080/healthz'
ok
[root@jdss7-21 ~]# curl 'http://172.7.22.9:8080/metrics'
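To spot-check a few of the cluster-state series (kube_deployment_status_replicas is a standard kube-state-metrics metric family; the pod IP matches the output above):
[root@jdss7-21 ~]# curl -s 'http://172.7.22.9:8080/metrics' | grep '^kube_deployment_status_replicas' | head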
9.node-exporter
9.1 Prepare the image
node-exporter collects infrastructure information from the k8s worker nodes: CPU usage, memory usage, disk usage, network IO
We still use version v0.15.0
[root@jdss7-200]# docker pull prom/node-exporter:v0.15.0
[root@jdss7-200]# docker images |grep node-exporter
[root@jdss7-200]# docker tag 12d51ffa2b22 harbor.od.com/public/node-exporter:v0.15.0
[root@jdss7-200]# docker push harbor.od.com/public/node-exporter:v0.15.0
[root@jdss7-200]# cd /data/k8s-yaml/
[root@jdss7-200]# mkdir -pv node-exporter
9.2 Prepare the resource manifest
ds.yaml
kind: DaemonSet
apiVersion: extensions/v1beta1
metadata:
name: node-exporter
namespace: kube-system
labels:
daemon: "node-exporter"
grafanak8sapp: "true"
spec:
selector:
matchLabels:
daemon: "node-exporter"
grafanak8sapp: "true"
template:
metadata:
name: node-exporter
labels:
daemon: "node-exporter"
grafanak8sapp: "true"
spec:
volumes:
- name: proc
hostPath:
path: /proc
type: ""
- name: sys
hostPath:
path: /sys
type: ""
containers:
- name: node-exporter
image: harbor.od.com/public/node-exporter:v0.15.0
imagePullPolicy: Always
args:
- --path.procfs=/host_proc
- --path.sysfs=/host_sys
ports:
- name: node-exporter
hostPort: 9100
containerPort: 9100
protocol: TCP
volumeMounts:
- name: sys
readOnly: true
mountPath: /host_sys
- name: proc
readOnly: true
mountPath: /host_proc
hostNetwork: true # uses the host's network namespace
9.3 Apply the resource manifest
kubectl apply -f http://k8s-yaml.od.com/node-exporter/ds.yaml
9.4 Verify
curl http://localhost:9100/metrics
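Since --path.procfs/--path.sysfs point at the host mounts, the node-level series should show up; a quick hedged spot check (node_cpu is the metric name in exporter v0.15):
curl -s http://localhost:9100/metrics | grep '^node_cpu' | head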
10.cadvisor
An important exporter for monitoring the resources used inside our containers
It probes from the outside how much CPU and other resources a container consumes; it used to be integrated with kubelet and was split out after version 1.9
cadvisor asks kubelet for data, and kubelet asks the docker engine for data
We use version 0.28.3
10.1 Download the image
For k8s before version 1.9, cadvisor came bundled with kubelet and needed no extra configuration
[root@jdss7-200]# docker pull google/cadvisor:v0.28.3
[root@jdss7-200]# docker images | grep cadvisor
[root@jdss7-200]# docker tag 75f00e3ec333 harbor.od.com/public/cadvisor:v0.28.3
[root@jdss7-200]# docker push harbor.od.com/public/cadvisor:v0.28.3
[root@jdss7-200]# cd /data/k8s-yaml/
[root@jdss7-200]# mkdir -p cadvisor
10.2 Three ways to manually influence the K8S scheduling policy
- Taints and tolerations
- Taint: a taint set on a worker node
kubectl taint node jdss7-21.host.com node-role.kubernetes.io/master=master:NoSchedule
kubectl get nodes -o wide
# remove the taint
kubectl taint node jdss7-21.host.com node-role.kubernetes.io/master-
- Toleration: whether a pod can tolerate a taint
- nodeName: run the Pod on a specified node
- nodeSelector: a label selector; run the pod on a class of nodes carrying a given label (see the sketch below)
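A hedged sketch of how the last two look inside a Deployment's pod template (the disktype=ssd label matches the example in section 12.6; pick either field, not both):
spec:
  template:
    spec:
      nodeName: jdss7-21.host.com   # pin the pod to one specific node
      # -- or --
      nodeSelector:
        disktype: ssd               # run only on nodes labeled disktype=ssd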
10.3 Create the resource manifest
ds.yaml
kind: DaemonSet
apiVersion: apps/v1
metadata:
name: cadvisor
namespace: kube-system
labels:
app: cadvisor
spec:
selector:
matchLabels:
name: cadvisor
template:
metadata:
labels:
name: cadvisor
spec:
hostNetwork: true # uses the host's network namespace
tolerations: # tolerate taints (an important way for users to influence k8s scheduling, i.e. to intervene in kube-scheduler)
- key: node-role.kubernetes.io/master # when the master taint is found
effect: NoSchedule # tolerate its NoSchedule effect, so cadvisor can also run on master nodes
containers:
- name: cadvisor
image: harbor.od.com/public/cadvisor:v0.28.3
imagePullPolicy: Always
volumeMounts:
- name: rootfs
readOnly: true
mountPath: /rootfs
- name: var-run
mountPath: /var/run
- name: sys
mountPath: /sys
readOnly: true
- name: docker
mountPath: /var/lib/docker
readOnly: true
ports:
- name: http
containerPort: 4194
protocol: TCP
readinessProbe:
tcpSocket:
port: 4194
initialDelaySeconds: 5
periodSeconds: 10
args:
- --housekeeping_interval=10s
- --port=4194
terminationGracePeriodSeconds: 30
volumes:
- name: rootfs
hostPath:
path: /
- name: var-run
hostPath:
path: /var/run
- name: sys
hostPath:
path: /sys
- name: docker
hostPath:
path: /data/docker
10.4 Fix the cgroup symlink on the worker nodes
mount -o remount,rw /sys/fs/cgroup/
ln -s /sys/fs/cgroup/cpu,cpuacct /sys/fs/cgroup/cpuacct,cpu # the container expects this path
Final result:
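One hedged way to confirm the link took effect (output shape varies by OS):
ls -ld /sys/fs/cgroup/cpuacct,cpu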
10.5 Apply the resource manifest
kubectl apply -f http://k8s-yaml.od.com/cadvisor/ds.yaml
kubectl get pods -n kube-system
10.6 cadvisor listens on port 4194
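A quick hedged check from any worker node (4194 is bound on the host network by the DaemonSet above; container_cpu_* are standard cadvisor metric families):
curl -s http://localhost:4194/metrics | grep '^container_cpu' | head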
11.blackbox-exporter
A very important tool: it probes whether your business containers are alive or down
11.1 Prepare the image
[root@jdss7-200]# docker pull prom/blackbox-exporter:v0.15.1
[root@jdss7-200]# docker images | grep blackbox
[root@jdss7-200]# docker tag 81b70b6158be harbor.od.com/public/blackbox-exporter:v0.15.1
[root@jdss7-200]# docker push harbor.od.com/public/blackbox-exporter:v0.15.1
11.2 Prepare the resource manifests
cd /data/k8s-yaml/blackbox-exporter
cm.yaml
apiVersion: v1
kind: ConfigMap
metadata:
labels:
app: blackbox-exporter
name: blackbox-exporter
namespace: kube-system
data:
blackbox.yaml: |-
modules:
http_2xx:
prober: http
timeout: 2s
http:
valid_http_versions: ["HTTP/1.1","HTTP/2"]
valid_status_codes: [200,301,302]
method: GET
preferred_ip_protocol: "ip4"
tcp_connect:
prober: tcp
timeout: 2s
dp.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
annotations:
deployment.kubernetes.io/revision: 1
labels:
app: blackbox-exporter
name: blackbox-exporter
namespace: kube-system
spec:
replicas: 1
selector:
matchLabels:
app: blackbox-exporter
template:
metadata:
labels:
app: blackbox-exporter
spec:
volumes:
- name: config
configMap:
name: blackbox-exporter
defaultMode: 420
containers:
- name: blackbox-exporter
image: harbor.od.com/public/blackbox-exporter:v0.15.1
imagePullPolicy: Always
args:
- --config.file=/etc/blackbox_exporter/blackbox.yaml
- --log.level=info
- --web.listen-address=:9115
ports:
- containerPort: 9115
name: blackbox-port
protocol: TCP
resources:
limits:
cpu: 200m
memory: 256Mi
requests:
cpu: 100m
memory: 50Mi
volumeMounts:
- name: config
mountPath: /etc/blackbox_exporter
readinessProbe:
tcpSocket:
port: 9115
initialDelaySeconds: 5
timeoutSeconds: 5
periodSeconds: 10
successThreshold: 1
failureThreshold: 3
svc.yaml
kind: Service
apiVersion: v1
metadata:
name: blackbox-exporter
namespace: kube-system
spec:
selector:
app: blackbox-exporter
ports:
- name: blackbox-port
protocol: TCP
port: 9115
ingress.yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
name: blackbox-exporter
namespace: kube-system
spec:
rules:
- host: blackbox.od.com
http:
paths:
- path: /
backend:
serviceName: blackbox-exporter
servicePort: blackbox-port
11.3 Apply the resource manifests
kubectl apply -f http://k8s-yaml.od.com/blackbox-exporter/cm.yaml
kubectl apply -f http://k8s-yaml.od.com/blackbox-exporter/dp.yaml
kubectl apply -f http://k8s-yaml.od.com/blackbox-exporter/svc.yaml
kubectl apply -f http://k8s-yaml.od.com/blackbox-exporter/ingress.yaml
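Once blackbox.od.com resolves, the probe endpoint can be exercised by hand; the target below is a hypothetical pod IP:port, not something deployed in this section:
curl 'http://blackbox.od.com/probe?module=tcp_connect&target=172.7.21.5:20880'
# probe_success in the output is 1 if the TCP connect worked, 0 otherwise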
12. Scheduling with taints and tolerations
Suppose we simply do not want this pod scheduled onto the jdss7-21 machine
Get the deployment
kubectl get deploy nginx-dp -o yaml -n kube-public > nginx-dp.yaml
Notes:
Configuration items to pay attention to in dp.yaml
dnsPolicy: ClusterFirst // resolve through CoreDNS
dnsPolicy: Default // resolve through the host's /etc/resolv.conf
12.1 Add a taint to the 7-21 machine; can pods still be scheduled onto it?
kubectl taint node jdss7-21.host.com quedian=buxiaojiao:NoSchedule
# adds a taint whose key is quedian and value is buxiaojiao; pods that do not tolerate it are not scheduled here (NoSchedule)
kubectl describe node jdss7-21.host.com
Scale the deployment up from the dashboard
You will find pods now land only on the jdss7-22 machine
12.2 The tolerations field lets a pod tolerate a taint: even if you do not wash your feet (buxiaojiao), I accept it and still get scheduled there
tolerations literally means "to tolerate"
deployment.yaml
template:
spec:
tolerations:
- key: quedian
value: buxiaojiao
effect: NoSchedule
12.3 Remove the taint named quedian from the 7-21 machine
kubectl taint node jdss7-21.host.com quedian-
12.4 Add a taint named quedian to the 7-21 machine with the NoExecute effect
NoExecute: pods that do not tolerate the taint are evicted from the node (and new ones are not scheduled onto it)
template:
spec:
tolerations:
- key: quedian
value: buxiaojiao
effect: NoExecute
12.5 The right way to run Prometheus
Dedicate a separate worker node to Prometheus; that node runs nothing but Prometheus
Put a taint on that node so that only Prometheus tolerates it,
and Prometheus alone ends up running on that node
12.6 From taints to labels: when to use label selectors
Label the worker nodes, e.g. machine A gets disktype=ssd and machine B gets disktype=sas
For IO-intensive applications, intervene in scheduling with a nodeSelector so they are preferably placed on the ssd machines
12.7 Notes
Taints and tolerations are used fairly rarely; Prometheus is about the only consumer, other scenarios hardly need them
12.8 NoExecute
When a node needs to go offline for maintenance, add a taint with the NoExecute effect to jdss7-21; the pods running on it are drained until it carries no traffic,
then remove jdss7-21 from the k8s cluster and take it down for maintenance
It also helps when an application ends up on a non-ssd machine and performs poorly
12.9 NoSchedule
Only affects new scheduling decisions; pods already running on the node are not affected
12.10 Adding multiple taints to a node
kubectl taint node jdss7-21.host.com qu=buxiaojiao:NoSchedule
kubectl taint node jdss7-21.host.com buxiaozao=:NoSchedule // only the key buxiaozao, no value
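A hedged way to list the taints now present on the node:
kubectl describe node jdss7-21.host.com | grep -A3 Taints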
13. Install and deploy prometheus-server
13.1 Prepare the image
[root@jdss7-200]# docker pull prom/prometheus:v2.14.0
[root@jdss7-200]# docker images | grep prometheus
[root@jdss7-200]# docker tag 7317640d55e harbor.od.com/infra/prometheus:v2.14.0
[root@jdss7-200]# docker push harbor.od.com/infra/prometheus:v2.14.0
13.2 Prepare the resource manifests
mkdir -p /data/k8s-yaml/prometheus
rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
labels:
addonmanager.kubernetes.io/mode: Reconcile
kubernetes.io/cluster-service: "true"
name: prometheus
namespace: infra
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
labels:
addonmanager.kubernetes.io/mode: Reconcile
kubernetes.io/cluster-service: "true"
name: prometheus
rules:
- apiGroups:
- ""
resources:
- nodes
- nodes/metrics
- services
- endpoints
- pods
verbs:
- get
- list
- watch
- apiGroups:
- ""
resources:
- configmaps
verbs:
- get
- nonResourceURLs:
- /metrics
verbs:
- get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
labels:
addonmanager.kubernetes.io/mode: Reconcile
kubernetes.io/cluster-service: "true"
name: prometheus
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: prometheus
subjects:
- kind: ServiceAccount
name: prometheus
namespace: infra
dp.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
annotations:
deployment.kubernetes.io/revision: "5"
labels:
name: prometheus
name: prometheus
namespace: infra
spec:
progressDeadlineSeconds: 600
replicas: 1 # only one replica, because the TSDB data is persistent
revisionHistoryLimit: 7
selector:
matchLabels:
app: prometheus
strategy:
rollingUpdate:
maxSurge: 1
maxUnavailable: 1
type: RollingUpdate
template:
metadata:
labels:
app: prometheus
spec:
nodeName: jdss7-21.host.com # manual intervention: pin it to the jdss7-21 machine
containers:
- name: prometheus
image: harbor.od.com/infra/prometheus:v2.14.0
imagePullPolicy: Always
command:
- /bin/prometheus
args:
- --config.file=/data/etc/prometheus.yaml
- --storage.tsdb.path=/data/prom-db
- --storage.tsdb.retention=72h # how long the TSDB keeps data
- --storage.tsdb.min-block-duration=10m # only keep the last 10 minutes of data in memory
ports:
- containerPort: 9090
protocol: TCP
volumeMounts:
- mountPath: /data
name: data
resources:
requests: # resources requested as soon as the container starts
cpu: "1000m" # 1000豪核,就是1核
memory: "1.5Gi"
limits: # when the container reaches the limits, k8s kills it (container OOM)
cpu: "2000m"
memory: "3Gi"
imagePullSecrets:
- name: harbor
securityContext:
runAsUser: 0
serviceAccountName: prometheus
volumes:
- name: data
nfs:
server: jdss7-200
path: /data/nfs-volume/prometheus
svc.yaml
kind: Service
apiVersion: v1
metadata:
name: prometheus
namespace: infra
spec:
selector:
app: prometheus
ports:
- protocol: TCP
port: 9090
targetPort: 9090
ingress.yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
annotations:
kubernetes.io/ingress.class: traefik
name: prometheus
namespace: infra
spec:
rules:
- host: prometheus.od.com
http:
paths:
- path: /
backend:
serviceName: prometheus
servicePort: 9090
13.3 Prepare the NFS directories
[root@jdss7-200]# cd /data/nfs-volume/
[root@jdss7-200]# mkdir -p prometheus
[root@jdss7-200]# mkdir -p prometheus/{etc,prom-db}
13.4 Copy the certificates
# prometheus has to talk to the apiserver, so it depends on these certs
[root@jdss7-200]# cp /opt/certs/ca.pem /data/nfs-volume/prometheus/etc/
[root@jdss7-200]# cp /opt/certs/client.pem /data/nfs-volume/prometheus/etc/
[root@jdss7-200]# cp /opt/certs/client-key.pem /data/nfs-volume/prometheus/etc/ # client-key.pem is the private key; its permissions must be 600
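To make that permission requirement explicit (same path as the copy above):
[root@jdss7-200]# chmod 600 /data/nfs-volume/prometheus/etc/client-key.pem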
13.5 Prepare the Prometheus configuration file
prometheus.yaml
global:
scrape_interval: 15s # how often to scrape
scrape_timeout: 10s # timeout for each scrape
evaluation_interval: 15s # how often to evaluate rules
# metrics_path defaults to '/metrics'
# scheme defaults to 'http'.
scrape_configs:
- job_name: etcd
honor_timestamps: true
scrape_interval: 15s
scrape_timeout: 10s
metrics_path: /metrics
scheme: https
static_configs: # only etcd uses static configuration; everything else is service-discovered
- targets:
- 10.4.7.12:2379
- 10.4.7.21:2379
- 10.4.7.22:2379
tls_config:
ca_file: /data/etc/ca.pem
cert_file: /data/etc/client.pem
key_file: /data/etc/client-key.pem
insecure_skip_verify: false
- job_name: 'kubernetes-apiservers'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace,__meta_kubernetes_service_name,__meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.*)
- source_labels: [__address__,__meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
- job_name: 'kubernetes-kubelet'
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __address__
replacement: ${1}:10255
- job_name: 'kubernetes-cadvisor'
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __address__
replacement: ${1}:4194
- job_name: kubernetes-kube-state
honor_timestamps: true
scrape_interval: 15s
scrape_timeout: 10s
metrics_path: /metrics
scheme: http
kubernetes_sd_configs: # k8s service discovery configuration
- role: pod # discovers every pod in the k8s cluster
relabel_configs: # we cannot let every pod into this job; only the ones we filter in get through
- separator: ;
regex: __meta_kubernetes_pod_label_(.*)
replacement: $1
action: labelmap
- source_labels: [__meta_kubernetes_namespace]
separator: ;
regex: (.*)
target_label: kubernetes_namespace
replacement: $1
action: replace
- source_labels: [__meta_kubernetes_pod_name]
separator: ;
regex: (.*)
target_label: kubernetes_pod_name
replacement: $1
action: replace
- source_labels: [__meta_kubernetes_pod_label_grafanak8sapp] # a pod label named grafanak8sapp in the metadata; when it matches, run the action
separator: ;
regex: ."true."
replacement: $1
action: keep
- source_labels: [__meta_kubernetes_pod_label_daemon,__meta_kubernetes_pod_node_name]
separator: ;
regex: node-exporter;(.*)
target_label: nodename
replacement: $1
action: replace
- job_name: 'blackbox_http_pod_probe'
metrics_path: /probe
kubernetes_sd_configs:
- role: pod
params:
module: [http_2xx]
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_blackbox_scheme]
action: keep
regex: http
- source_labels: [__address__,__meta_kubernetes_pod_annotation_blackbox_port,__meta_kubernetes_pod_annotation_blackbox_path]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+);(.+)
replacement: $1:$2$3
target_label: __param_target
- action: replace
target_label: __address__
replacement: blackbox-exporter.kube-system:9115
- source_labels: [__param_target]
target_label: instance
- action: labelmap
regex: __meta_kubernetes_pod_label_(.*)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
- job_name: 'blackbox_tcp_pod_probe'
metrics_path: /probe
kubernetes_sd_configs:
- role: pod
params:
module: [tcp_connect]
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_blackbox_scheme]
action: keep
regex: tcp
- source_labels: [__address__,__meta_kubernetes_pod_annotation_blackbox_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __param_target
- action: replace
target_label: __address__
replacement: blackbox-exporter.kube-system:9115
- source_labels: [__param_target]
target_label: instance
- action: labelmap
regex: __meta_kubernetes_pod_label_(.*)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
- job_name: traefik
honor_timestamps: true
scrape_interval: 15s
scrape_timeout: 10s
scheme: http
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
separator: ;
regex: traefik
replacement: $1
action: keep
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
separator: ;
regex: (.+)
target_label: __metrics_path__
replacement: $1
action: replace
- source_labels: [__address__,__meta_kubernetes_pod_annotation_prometheus_io_port]
separator: ;
regex: ([^:]+)(?::\d+)?;(\d+)
target_label: __address__
replacement: $1:$2
action: replace
- separator: ;
regex: __meta_kubernetes_pod_label_(.*)
replacement: $1
action: labelmap
- source_labels: [__meta_kubernetes_namespace]
separator: ;
regex: (.*)
target_label: kubernetes_namespace
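Before applying it, the file can be syntax-checked if the promtool binary happens to be available on the ops host (a hedged extra step, not part of the original procedure):
promtool check config /data/nfs-volume/prometheus/etc/prometheus.yaml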
13.6 Apply the resource manifests
kubectl apply -f http://k8s-yaml.od.com/prometheus/rbac.yaml
kubectl apply -f http://k8s-yaml.od.com/prometheus/dp.yaml
kubectl apply -f http://k8s-yaml.od.com/prometheus/svc.yaml
kubectl apply -f http://k8s-yaml.od.com/prometheus/ingress.yaml
13.7 Verify
prometheus.od.com
Check the Status tab: the Configuration page shows the loaded configuration file,
and Targets shows the scrape targets
The meaning of the action field in relabel_configs
13.8 Run a query
node_cpu{nodename="jdss7-21.host.com"}
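Two more hedged examples against the data collected above (nodename is set by the kube-state job's relabeling; probe_success comes from blackbox-exporter):
sum(rate(node_cpu{nodename="jdss7-21.host.com",mode!="idle"}[5m]))  # non-idle CPU rate of one node
probe_success == 0                                                  # blackbox targets that are currently failing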
14. Monitoring traefik with Prometheus
14.1 prometheus.yaml above already contains a traefik job_name
- job_name: traefik
honor_timestamps: true
scrape_interval: 15s
scrape_timeout: 10s
scheme: http
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme] # keep only pods whose prometheus_io_scheme annotation is traefik
separator: ;
regex: traefik
replacement: $1
action: keep
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
separator: ;
regex: (.+)
target_label: __metrics_path__
replacement: $1
action: replace
- source_labels: [__address__,__meta_kubernetes_pod_annotation_prometheus_io_port]
separator: ;
regex: ([^:]+)(?::\d+)?;(\d+)
target_label: __address__
replacement: $1:$2
action: replace
- separator: ;
regex: __meta_kubernetes_pod_label_(.*)
replacement: $1
action: labelmap
- source_labels: [__meta_kubernetes_namespace]
separator: ;
regex: (.*)
target_label: kubernetes_namespace
14.2 Add annotations to the traefik Deployment (pod controller)
"annotations":{
"prometheus_io_scheme":"traefik",
"prometheus_io_path":"/metrics",
"prometheus_io_port":"8080"
}
The annotations go at the same level as labels, inside the template section (see the sketch below)
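A hedged sketch of where those annotations land in the Deployment (only the relevant fragment is shown; the values come straight from the list above):
spec:
  template:
    metadata:
      annotations:
        prometheus_io_scheme: "traefik"
        prometheus_io_path: "/metrics"
        prometheus_io_port: "8080"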
15. Blackbox paired with Prometheus
15.1 The blackbox_tcp_pod_probe job
Add the following to the Deployment (pod controller) of the workload that relies on blackbox
Liveness is judged by probing the container's port
"annotations":{
"blackbox_port":"20880",
"blackbox_scheme":"tcp"
}
15.2 The blackbox_http_pod_probe job
Add the following to the Deployment (pod controller) of the workload that relies on blackbox
Liveness is judged by probing the container's port (over HTTP)
"annotations":{
"blackbox_port":"8080",
"blackbox_scheme":"http",
"blackbox_path":"/"
}
16. Using Prometheus to monitor dubbo-demo-service (the dubbo provider) for liveness
Add annotations to the deployment.yaml (pod controller) of dubbo-demo-service
The annotations go at the same level as labels, inside the template section
"annotations":{
"blackbox_port":"20880",
"blackbox_scheme":"tcp"
}
Then the blackbox_tcp_pod_probe job in Prometheus starts showing data
blackbox.od.com
blackbox-exporter.kube-system is the name of the Service inside the k8s cluster (service name plus namespace)
17. Monitoring JVM information with Prometheus
kubernetes-pods
Core annotations
"annotations":{
"prometheus_io_scrape" : "true",
"prometheus_io_port" : "12346",
"prometheus_io_path": "/"
}
18. Deploy Grafana
18.1 Download the image (v5.4.2)
[root@jdss7-200]# docker pull grafana/grafana:5.4.2
[root@jdss7-200]# docker images | grep grafana
[root@jdss7-200]# docker tag 6f18ddf9e552 harbor.od.com/infra/grafana:v5.4.2
[root@jdss7-200]# docker push harbor.od.com/infra/grafana:v5.4.2
18.2 Prepare the resource manifests
rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
labels:
addonmanager.kubernetes.io/mode: Reconcile
kubernetes.io/cluster-service: "true"
name: grafana
rules:
- apiGroups:
- "*"
resources:
- namespaces
- deployments
- pods
verbs:
- get
- list
- watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
labels:
addonmanager.kubernetes.io/mode: Reconcile
kubernetes.io/cluster-service: "true"
name: grafana
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: grafana
subjects:
- kind: User
name: k8s-node
dp.yaml
kind: Deployment
apiVersion: extensions/v1beta1
metadata:
name: grafana
namespace: infra
labels:
app: grafana
name: grafana
spec:
progressDeadlineSeconds: 600
replicas: 1
revisionHistoryLimit: 7
selector:
matchLabels:
name: grafana
strategy:
rollingUpdate:
maxSurge: 1
maxUnavailable: 1
template:
metadata:
labels:
app: grafana
name: grafana
spec:
containers:
- name: grafana
image: harbor.od.com/infra/grafana:v5.4.2
imagePullPolicy: Always
ports:
- containerPort: 3000
protocol: TCP
volumeMounts:
- name: data
mountPath: /var/lib/grafana
imagePullSecrets:
- name: harbor
securityContext:
runAsUser: 0
volumes:
- nfs:
server: jdss7-200
path: /data/nfs-volume/grafana
name: data
svc.yaml
kind: Service
apiVersion: v1
metadata:
name: grafana
namespace: infra
spec:
selector:
app: grafana
ports:
- protocol: TCP
port: 3000
targetPort: 3000
ingress.yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
name: grafana
namespace: infra
spec:
rules:
- host: grafana.od.com
http:
paths:
- path: /
backend:
serviceName: grafana
servicePort: 3000
18.3 Apply the resource manifests
kubectl apply -f http://k8s-yaml.od.com/grafana/rbac.yaml
kubectl apply -f http://k8s-yaml.od.com/grafana/dp.yaml
kubectl apply -f http://k8s-yaml.od.com/grafana/svc.yaml
kubectl apply -f http://k8s-yaml.od.com/grafana/ingress.yaml
18.4 Test
grafana.od.com
Account admin, password admin
Change the password to admin123
18.5 Configuration
- Settings -> Preferences
Change Organization Name to OldboyEdu
Change UI Theme to the Light style
Keep Home Dashboard as Default
Change Timezone to Local browser time
- Configuration -> Plugins
Install the Kubernetes App plugin plus the panels listed below (a sketch of running these inside the pod follows the command list)
grafana-cli plugins install grafana-kubernetes-app
grafana-cli plugins install grafana-clock-panel
grafana-cli plugins install grafana-piechart-panel
grafana-cli plugins install briangann-gauge-panel
grafana-cli plugins install natel-discrete-panel
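These can be run inside the running grafana pod; a hedged sketch in which the pod name is hypothetical:
kubectl -n infra get pods | grep grafana                      # find the actual pod name
kubectl -n infra exec -it <grafana-pod> -- grafana-cli plugins install grafana-kubernetes-app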
- Restart the grafana container so the plugins take effect
- Configuration -> Add data source
Add the data source as shown in the screenshots
Paste in the CA Cert (on the 7-200 machine: cat /opt/certs/ca.pem),
the Client Cert (on 7-200: cat /opt/certs/client.pem),
and the Client Key (on 7-200: cat /opt/certs/client-key.pem)
Leave Basic Auth unchecked
- Configure the kubernetes entry under Plugins
Enable it; a new icon appears in the left-hand menu
New Cluster
The HTTP address to fill in is the apiserver address
Paste the key contents in again, choose prometheus as the datasource, and save
The Overview page now shows dashboards
- The datasource under Configuration must be tested (click Test), otherwise grafana will not connect to it
- With that, the dashboards appear
Container is the most useful one, but the Container dashboard needs one extra parameter changed
- You can also go to the grafana site, grafana.com/grafana/dashboard, and download the most-starred one; each dashboard has an ID
- Import dashboard
Enter 11074 and click Load to reuse that community dashboard
19.AlertManager
19.1 Pull the image first
docker pull docker.io/prom/alertmanager:v0.14.0
docker images | grep alert
docker tag 30954e96cbe8 harbor.od.com/infra/alertmanager:v0.14.0
docker push harbor.od.com/infra/alertmanager:v0.14.0
19.2 Prepare the resource manifests
cm.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: alertmanager-config
namespace: infra
data:
config.yaml: |-
global:
# how long to wait with no further alerts before declaring an alert resolved
resolve_timeout: 5m
# email sending configuration
smtp_smarthost: 'smtp.163.com:25'
smtp_from: 'wa2319@163.com'
smtp_auth_username: 'ws2319@163.com'
smtp_auth_password: 'xxxxx'
smtp_require_tls: false
# the root route that every alert enters; it sets the alert distribution policy
route:
# labels used to regroup incoming alerts; e.g. many alerts carrying cluster=A and alertname=LatencyHigh are grouped together
group_by: ['alertname','cluster']
# after a new alert group is created, wait at least group_wait before the first notification, so multiple alerts for the same group can be sent together
group_wait: 30s
# after a notification has been sent, wait group_interval before sending a new batch for that group
group_interval: 5m
# if an alert has already been sent successfully, wait repeat_interval before sending it again
repeat_interval: 5m
# the default receiver: alerts not matched by any route are sent to the default receiver
receiver: default
receivers:
- name: 'default'
email_configs:
- to: '87527941@qq.com'
dp.yaml
kind: Deployment
apiVersion: extensions/v1beta1
metadata:
name: alertmanager
namespace: infra
spec:
replicas: 1
selector:
matchLabels:
app: alertmanager
template:
metadata:
labels:
app: alertmanager
spec:
containers:
- name: alertmanager
image: harbor.od.com/infra/alertmanager:v0.14.0
args:
- --config.file=/etc/alertmanager/config.yaml
- --storage.path=/alertmanager
imagePullPolicy: Always
ports:
- containerPort: 9093
name: alertmanager
volumeMounts:
- name: alertmanager-cm
mountPath: /etc/alertmanager
imagePullSecrets:
- name: harbor
volumes:
- name: alertmanager-cm
configMap:
name: alertmanager-config
svc.yaml
kind: Service
apiVersion: v1
metadata:
name: alertmanager
namespace: infra
spec:
selector:
app: alertmanager
ports:
- port: 80
targetPort: 9093
19.3 Apply the resource manifests
kubectl apply -f http://k8s-yaml.od.com/alertmanager/cm.yaml
kubectl apply -f http://k8s-yaml.od.com/alertmanager/dp.yaml
kubectl apply -f http://k8s-yaml.od.com/alertmanager/svc.yaml
19.4 Prometheus alerting rules
vim /data/nfs-volume/prometheus/etc/rules.yml
# Prometheus 2.x only loads YAML-format rule files; the 1.x ALERT syntax from older notes is rewritten below
groups:
- name: general
  rules:
  - alert: goroutines_gt_70
    expr: go_goroutines > 70
    for: 5s
    labels:
      status: yellow
    annotations:
      summary: "goroutines exceeds 70, current value {{ $value }}"
      description: "current instance {{ $labels.instance }}"
  - alert: goroutines_gt_90
    expr: go_goroutines > 90
    for: 5s
    labels:
      status: red
    annotations:
      summary: "goroutines exceeds 90, current value {{ $value }}"
      description: "current instance {{ $labels.instance }}"
19.5 Add the Alertmanager settings to the Prometheus configuration
vim /data/nfs-volume/prometheus/etc/prometheus.yaml
alerting:
alertmanagers:
- static_configs:
- targets: ["alertmanager"] # 这是serviceName
rule_files:
- "/data/etc/rules.yml"
Then reload Prometheus without deleting the pod (the pod is force-pinned to the 7-21 machine):
kill -SIGHUP 24705 // the prometheus pid
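A hedged way to find that PID on the jdss7-21 node (the number 24705 above was specific to that session):
ps -ef | grep '/bin/prometheus' | grep -v grep
kill -SIGHUP <pid>   # <pid> is whatever the previous command prints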
19.6 Test alerting
Stop the consumer