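kspan: a cluster measurement approach
This post is not original work; it draws on the articles listed below. Compared with them, the steps and explanations here stay closer to day-to-day operations: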
- https://mp.weixin.qq.com/s/8A8YDAQd67YACnbZiN6Q5g
- https://felipecruz.es/visualizing-kubernetes-events-with-kspan/
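Background
As cluster administrators, especially when we manage many clusters, we want to know which stages a Pod goes through from creation to startup and how long each stage takes, so that we can work out where the cluster is slow.
Without a visualization tool, we can inspect events to estimate how long each step takes, as follows: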
$ kubectl create deploy nginx --image=nginx
deployment.apps/nginx created
$ kubectl get event
LAST SEEN   TYPE     REASON              OBJECT                       MESSAGE
7s          Normal   Scheduled           pod/nginx-f89759699-whcxz    Successfully assigned default/nginx-f89759699-whcxz to hd-k8s-master003
7s          Normal   Pulling             pod/nginx-f89759699-whcxz    Pulling image "nginx"
7s          Normal   SuccessfulCreate    replicaset/nginx-f89759699   Created pod: nginx-f89759699-whcxz
7s          Normal   ScalingReplicaSet   deployment/nginx             Scaled up replica set nginx-f89759699 to 1
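From the events we can follow the whole process the Pod goes through (scheduling, image pull, create, start) and see roughly how long it takes. The timing is only approximate, because LAST SEEN is a relative age.
A more elegant approach
Each of these Kubernetes events corresponds to an operation we performed. Creating a deployment above, for example, produced several events, including Scheduled, Pulled, and Created. Viewed abstractly, doesn't this look a lot like distributed tracing?
Here we will use Jaeger[1], a CNCF graduated project that I have covered several times in earlier issues of K8S生态周报. Jaeger is an open-source, end-to-end distributed tracing system. Since it is not the focus of this article, we simply follow its documentation and deploy a Jaeger instance quickly. OpenTelemetry[2], another CNCF project (sandbox level at the time of writing), is an observability framework for cloud-native software that can be used together with Jaeger. Neither project is covered in detail here.
The main project used in this article is kspan, open-sourced by Weaveworks. Its core idea is to organize Kubernetes events as spans in a tracing system.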
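To make the per-step timing concrete, here is a minimal Python sketch that sorts events by absolute timestamp and prints the gap between consecutive ones. The function name `step_durations`, the dictionary layout, and all timestamps are hypothetical; they are not taken from the cluster output above.

```python
from datetime import datetime


def step_durations(events):
    """Return (reason, seconds since the previous event) pairs, oldest first."""
    # Parse the ISO 8601 timestamps so that sorting and subtraction are exact.
    parsed = [(datetime.fromisoformat(e["timestamp"]), e["reason"]) for e in events]
    parsed.sort(key=lambda item: item[0])

    result = []
    previous = None
    for current, reason in parsed:
        delta = 0.0 if previous is None else (current - previous).total_seconds()
        result.append((reason, delta))
        previous = current
    return result


# Hypothetical sample data (assumption): reasons mirror typical Pod events.
sample_events = [
    {"reason": "Pulling", "timestamp": "2021-03-01T10:00:01+00:00"},
    {"reason": "Scheduled", "timestamp": "2021-03-01T10:00:00+00:00"},
    {"reason": "Pulled", "timestamp": "2021-03-01T10:00:09+00:00"},
    {"reason": "Started", "timestamp": "2021-03-01T10:00:10+00:00"},
]

for reason, seconds in step_durations(sample_events):
    print(f"{reason:<10} +{seconds:.0f}s")
```

With this sample data the largest gap falls between Pulling and Pulled, which would point at the image pull as the slow step.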
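The idea of treating events as spans can be sketched in a few lines of Python. This is an illustration only, not kspan's actual code: the `Span` class, the `build_trace` function, the owner-based parent lookup, and the timestamps are all assumptions made for this example. Each event becomes a span, and its parent is the latest span recorded for the nearest owning object.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    name: str
    start: float  # seconds on a hypothetical clock
    children: List["Span"] = field(default_factory=list)


def build_trace(events, owners):
    """Turn (start, object, reason) events into a span tree.

    `owners` maps an object to the object that owns it and is assumed
    to contain no cycles.
    """
    roots = []
    latest = {}  # most recent span seen for each object
    for start, obj, reason in sorted(events):
        span = Span(name=f"{obj} {reason}", start=start)
        # Walk up the owner chain until an object with a span is found.
        parent = None
        owner = owners.get(obj)
        while owner is not None and parent is None:
            parent = latest.get(owner)
            owner = owners.get(owner)
        if parent is None:
            roots.append(span)
        else:
            parent.children.append(span)
        latest[obj] = span
    return roots


def render(span, depth=0):
    """Return the span tree as indented lines."""
    lines = ["  " * depth + span.name]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# Hypothetical timestamps (assumption); object names follow the events above.
sample_events = [
    (0.0, "deployment/nginx", "ScalingReplicaSet"),
    (0.1, "replicaset/nginx-f89759699", "SuccessfulCreate"),
    (0.2, "pod/nginx-f89759699-whcxz", "Scheduled"),
    (0.5, "pod/nginx-f89759699-whcxz", "Pulling"),
]
sample_owners = {
    "pod/nginx-f89759699-whcxz": "replicaset/nginx-f89759699",
    "replicaset/nginx-f89759699": "deployment/nginx",
}

for root in build_trace(sample_events, sample_owners):
    print("\n".join(render(root)))
```

The deployment event ends up as the root span, the replicaset event as its child, and the Pod events as children of the replicaset span, which is the kind of nesting a tracing UI displays.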
Deploying kspan
Create the RBAC authorization, because kspan needs to watch events.
The manifests are as follows:
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kspan
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kspan-admin
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  - endpoints
  - persistentvolumeclaims
  - persistentvolumeclaims/status
  - pods
  - replicationcontrollers
  - replicationcontrollers/scale
  - serviceaccounts
  - services
  - services/status
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - bindings
  - events
  - limitranges
  - namespaces/status
  - pods/log
  - pods/status
  - replicationcontrollers/status
  - resourcequotas
  - resourcequotas/status
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - pods/exec
  verbs:
  - create
- apiGroups:
  - ""
  resources:
  - namespaces
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - controllerrevisions
  - daemonsets
  - daemonsets/status
  - deployments
  - deployments/scale
  - deployments/status
  - replicasets
  - replicasets/scale
  - replicasets/status
  - statefulsets
  - statefulsets/scale
  - statefulsets/status
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - autoscaling
  resources:
  - horizontalpodautoscalers
  - horizontalpodautoscalers/status
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - batch
  resources:
  - cronjobs
  - cronjobs/status
  - jobs
  - jobs/status
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - extensions
  resources:
  - daemonsets
  - daemonsets/status
  - deployments
  - deployments/scale
  - deployments/status
  - ingresses
  - ingresses/status
  - networkpolicies
  - replicasets
  - replicasets/scale
  - replicasets/status
  - replicationcontrollers/scale
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - policy
  resources:
  - poddisruptionbudgets
  - poddisruptionbudgets/status
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - networking.k8s.io
  resources:
  - ingresses
  - ingresses/status
  - networkpolicies
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - metrics.k8s.io
  resources:
  - pods
  - nodes
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - metrics.k8s.io
  resources:
  - pods
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  creationTimestamp: null
  name: kspan-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kspan-admin
subjects:
- kind: ServiceAccount
  name: kspan
  namespace: default
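To show how a ClusterRole's rules translate into permissions, the following Python sketch checks whether a list of rules grants a given verb on a resource. It is a simplified model under stated assumptions: `is_allowed` and `rules_excerpt` are hypothetical names, the excerpt is abbreviated from the ClusterRole above, and `resourceNames` and non-resource URLs are ignored.

```python
def is_allowed(rules, api_group, resource, verb):
    """Return True if any rule grants `verb` on `resource` in `api_group`."""

    def matches(values, wanted):
        # "*" is the RBAC wildcard for groups, resources and verbs.
        return "*" in values or wanted in values

    return any(
        matches(rule.get("apiGroups", []), api_group)
        and matches(rule.get("resources", []), resource)
        and matches(rule.get("verbs", []), verb)
        for rule in rules
    )


# Abbreviated excerpt of the ClusterRole above (assumption: trimmed for brevity).
rules_excerpt = [
    {"apiGroups": [""], "resources": ["events", "pods/status"],
     "verbs": ["get", "list", "watch"]},
    {"apiGroups": [""], "resources": ["pods/exec"], "verbs": ["create"]},
    {"apiGroups": ["apps"], "resources": ["deployments", "replicasets"],
     "verbs": ["get", "list", "watch"]},
]

print(is_allowed(rules_excerpt, "", "events", "watch"))
print(is_allowed(rules_excerpt, "", "events", "delete"))
```

On a real cluster, the equivalent check can be run with `kubectl auth can-i watch events --as=system:serviceaccount:default:kspan`, assuming the ServiceAccount was created in the default namespace.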
Create the Pod
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: kspan
  name: kspan
spec:
  containers:
  - image: docker.io/weaveworks/kspan:v0.0
    name: kspan
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  serviceAccountName: kspan
Deploy Jaeger
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: jaeger
  name: jaeger
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  strategy: {}
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
      - image: jaegertracing/opentelemetry-all-in-one
        name: opentelemetry-all-in-one
        resources: {}
        ports:
        - containerPort: 16685
        - containerPort: 16686
        - containerPort: 5775
          protocol: UDP
        - containerPort: 6831
          protocol: UDP
        - containerPort: 6832
          protocol: UDP
        - containerPort: 5778
          protocol: TCP
Create the Jaeger Service. By default, kspan sends spans to otlp-collector.default:55680, so the Service must be reachable under that name and port.
apiVersion: v1
kind: Service
metadata:
  labels:
    app: jaeger
  name: otlp-collector
spec:
  ports:
  - port: 55680
    protocol: TCP
    targetPort: 55680
  selector:
    app: jaeger
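The endpoint otlp-collector.default:55680 follows the in-cluster `<service>.<namespace>:<port>` naming convention, which is why the Service name and port above matter. The Python sketch below splits such an endpoint and compares it with a Service manifest. `parse_endpoint`, `endpoint_matches`, and the dictionary form of the manifest are hypothetical, and the Service is assumed to have been applied in the default namespace.

```python
def parse_endpoint(endpoint):
    """Split '<service>.<namespace>:<port>' into its parts."""
    host, _, port = endpoint.rpartition(":")
    labels = host.split(".")
    return {
        "service": labels[0],
        "namespace": labels[1] if len(labels) > 1 else None,
        "port": int(port),
    }


def endpoint_matches(endpoint, service):
    """Check that the endpoint points at the given Service manifest."""
    target = parse_endpoint(endpoint)
    metadata = service["metadata"]
    ports = [p["port"] for p in service["spec"]["ports"]]
    return (
        target["service"] == metadata["name"]
        # Assumption: a manifest without a namespace lands in "default".
        and target["namespace"] == metadata.get("namespace", "default")
        and target["port"] in ports
    )


# Dictionary form of the Service above (assumption: applied in "default").
otlp_service = {
    "metadata": {"name": "otlp-collector"},
    "spec": {"ports": [{"port": 55680, "targetPort": 55680}]},
}

print(parse_endpoint("otlp-collector.default:55680"))
print(endpoint_matches("otlp-collector.default:55680", otlp_service))
```

If the Service were created under another name or in another namespace, the check would fail, mirroring the fact that spans sent to the default endpoint would have nowhere to go.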
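Once all the Pods have started successfully, we can run a test.
Results
Create a namespace and a deployment in it, which in turn creates a Pod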
$ kubectl create ns moelove
namespace/moelove created
$ kubectl -n moelove create deploy nginx --image=nginx
deployment.apps/nginx created
Open the Jaeger UI (port 16686 of the Jaeger container) and inspect the resulting trace
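Conclusion
The kspan repository currently does not offer a customizable deployment method, or at least I could not find detailed documentation for one. I therefore do not recommend running kspan as a standing Kubernetes component. Deploy it when the need arises, check how long the dispatched work takes, and find the bottleneck.
If you run a multi-tenant environment and need alerts for slow scheduling and similar problems, OpenTelemetry is worth investigating.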