[Bug] Pods are not cascade-deleted when a ReplicaSet is deleted
2020-07-17
Problem description
Cascading deletion in Kubernetes has stopped working (a minimal reproduction sketch follows the list):
- Deleting a ReplicaSet does not cascade-delete its Pods
- Deleting a CronJob does not cascade-delete its Jobs
- Deleting a Job does not delete the corresponding Pods either
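Under normal conditions the owning controller's Pods disappear shortly after the owner is deleted. A minimal sketch of the symptom, using hypothetical names (my-rs, app=my-app, my-job) purely for illustration:

# delete a ReplicaSet; its Pods should be garbage-collected shortly afterwards
kubectl delete replicaset my-rs
kubectl get pods -l app=my-app        # broken cluster: the Pods are still Running
# same story for a Job and its Pods
kubectl delete job my-job
kubectl get pods -l job-name=my-job   # Pods linger instead of being cleaned up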
Troubleshooting
A bit of searching confirmed that cascading deletion is handled by the garbage collector. The kubelet gc logs (the gc that cleans up containers and images) showed nothing conclusive.
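The garbage collector that actually implements cascading deletion lives in kube-controller-manager, not in the kubelet: it builds a dependency graph of all objects and removes dependents by following metadata.ownerReferences. A quick way to see the reference that should trigger the cascade (the pod name is a placeholder):

kubectl get pod <pod-name> -o jsonpath='{.metadata.ownerReferences}'
# should print something like kind:ReplicaSet name:<rs-name> blockOwnerDeletion:true,
# i.e. the link the garbage collector follows when the ReplicaSet is deleted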
ReplicaSets and CronJobs are both controller-managed resources, so I checked the controller-manager logs and found that they differed across the three master nodes, with obvious errors:
Errors from the controller-manager on master 1:
I0703 17:42:50.697659 1 serving.go:319] Generated self-signed cert in-memory
I0703 17:42:52.253742 1 controllermanager.go:161] Version: v1.16.0
I0703 17:42:52.258189 1 secure_serving.go:123] Serving securely on 127.0.0.1:10257
I0703 17:42:52.261809 1 deprecated_insecure_serving.go:53] Serving insecurely on [::]:10252
I0703 17:42:52.261993 1 leaderelection.go:241] attempting to acquire leader lease kube-system/kube-controller-manager...
E0703 20:39:10.062914 1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: rpc error: code = Unavailable desc = etcdserver: leader changed
E0706 10:37:12.396567 1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get https://apiserver.cluster.local:6443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: net/http: TLS handshake timeout
E0706 10:37:16.441577 1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: endpoints "kube-controller-manager" is forbidden: User "system:kube-controller-manager" cannot get resource "endpoints" in API group "" in the namespace "kube-system"
E0706 10:37:18.598949 1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get https://apiserver.cluster.local:6443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: dial tcp 192.168.1.90:6443: connect: connection refused
E0706 10:37:21.205271 1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get https://apiserver.cluster.local:6443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: dial tcp 192.168.1.90:6443: connect: connection refused
E0706 10:37:24.042719 1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get https://apiserver.cluster.local:6443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: dial tcp 192.168.1.90:6443: connect: connection refused
E0706 10:37:26.528240 1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get https://apiserver.cluster.local:6443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: dial tcp 192.168.1.90:6443: connect: connection refused
E0706 10:37:29.040759 1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get https://apiserver.cluster.local:6443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: dial tcp 192.168.1.90:6443: connect: connection refused
E0706 10:37:31.755211 1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get https://apiserver.cluster.local:6443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: dial tcp 192.168.1.90:6443: connect: connection refused
E0706 10:37:43.769537 1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get https://apiserver.cluster.local:6443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
E0706 10:37:46.599186 1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: endpoints "kube-controller-manager" is forbidden: User "system:kube-controller-manager" cannot get resource "endpoints" in API group "" in the namespace "kube-system"
Errors from the controller-manager on master 2:
/apis/monitoring.coreos.com/v1/prometheusrules?limit=500&resourceVersion=0: stream error: stream ID 304059; INTERNAL_ERROR
E0707 16:05:15.833656 1 reflector.go:123] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: Get https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/thanosrulers?limit=500&resourceVersion=0: stream error: stream ID 304063; INTERNAL_ERROR
E0707 16:05:16.067473 1 reflector.go:123] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: Get https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/servicemonitors?limit=500&resourceVersion=0: stream error: stream ID 304065; INTERNAL_ERROR
E0707 16:05:16.718849 1 reflector.go:123] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: Get https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/prometheusrules?limit=500&resourceVersion=0: stream error: stream ID 304071; INTERNAL_ERROR
E0707 16:05:16.841991 1 reflector.go:123] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: Get https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/thanosrulers?limit=500&resourceVersion=0: stream error: stream ID 304073; INTERNAL_ERROR
E0707 16:05:17.070573 1 reflector.go:123] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: Get https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/servicemonitors?limit=500&resourceVersion=0: stream error: stream ID 304075; INTERNAL_ERROR
E0707 16:05:17.721035 1 reflector.go:123] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: Get https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/prometheusrules?limit=500&resourceVersion=0: stream error: stream ID 304077; INTERNAL_ERROR
E0707 16:05:17.850094 1 reflector.go:123] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: Get https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/thanosrulers?limit=500&resourceVersion=0: stream error: stream ID 304079; INTERNAL_ERROR
E0707 16:05:18.073291 1 reflector.go:123] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: Get https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/servicemonitors?limit=500&resourceVersion=0: stream error: stream ID 304081; INTERNAL_ERROR
E0707 16:05:18.224933 1 shared_informer.go:200] unable to sync caches for garbage collector
E0707 16:05:18.224983 1 garbagecollector.go:230] timed out waiting for dependency graph builder sync during GC sync (attempt 803)
E0707 16:05:18.388394 1 namespace_controller.go:148] deletion of namespace monitoring failed: [Delete https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/namespaces/monitoring/servicemonitors: stream error: stream ID 1728289; INTERNAL_ERROR, Delete https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/namespaces/monitoring/prometheusrules: stream error: stream ID 1728343; INTERNAL_ERROR, Delete https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/namespaces/monitoring/thanosrulers: stream error: stream ID 1728381; INTERNAL_ERROR]
E0707 16:05:18.723249 1 reflector.go:123] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: Get https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/prometheusrules?limit=500&resourceVersion=0: stream error: stream ID 304087; INTERNAL_ERROR
E0707 16:05:18.859193 1 reflector.go:123] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: Get https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/thanosrulers?limit=500&resourceVersion=0: stream error: stream ID 304089; INTERNAL_ERROR
Logs from the controller-manager on master 3:
I0706 21:13:58.923131 1 serving.go:319] Generated self-signed cert in-memory
I0706 21:13:59.235491 1 controllermanager.go:161] Version: v1.16.0
I0706 21:13:59.235870 1 secure_serving.go:123] Serving securely on 127.0.0.1:10257
I0706 21:13:59.236162 1 deprecated_insecure_serving.go:53] Serving insecurely on [::]:10252
I0706 21:13:59.236201 1 leaderelection.go:241] attempting to acquire leader lease kube-system/kube-controller-manager...
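Master 3 being stuck at "attempting to acquire leader lease" is expected: in an HA control plane only one kube-controller-manager instance holds the lease and actually runs the controllers (including the garbage collector), which is why the three logs differ. The current holder is recorded on the lock object, for example:

kubectl -n kube-system get endpoints kube-controller-manager -o yaml
# the control-plane.alpha.kubernetes.io/leader annotation shows the holderIdentity
# of the instance that is actually doing the work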
So the controller-manager's requests to the apiserver were failing with INTERNAL_ERROR, which pointed at the apiserver itself. The apiserver on master 2 had an obvious error in its log:
goroutine 9351585 [running]:
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.logPanic(0x3a32ce0, 0xc01f4aa850)
/workspace/anago-v1.16.0-rc.2.1+2bd9643cee5b3b/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0xc029cfdcd8, 0x1, 0x1)
/workspace/anago-v1.16.0-rc.2.1+2bd9643cee5b3b/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x82
panic(0x3a32ce0, 0xc01f4aa850)
/usr/local/go/src/runtime/panic.go:522 +0x1b5
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP(0xc00ca32240, 0x7b10de0, 0xc02014a310, 0xc0340b7d00)
/workspace/anago-v1.16.0-rc.2.1+2bd9643cee5b3b/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:118 +0x3ef
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.WithWaitGroup.func1(0x7b10de0, 0xc02014a310, 0xc0340b7c00)
/workspace/anago-v1.16.0-rc.2.1+2bd9643cee5b3b/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/waitgroup.go:47 +0xf3
net/http.HandlerFunc.ServeHTTP(0xc004f981e0, 0x7b10de0, 0xc02014a310, 0xc0340b7c00)
/usr/local/go/src/net/http/server.go:1995 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithRequestInfo.func1(0x7b10de0, 0xc02014a310, 0xc0340b7b00)
/workspace/anago-v1.16.0-rc.2.1+2bd9643cee5b3b/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters/requestinfo.go:39 +0x2b8
net/http.HandlerFunc.ServeHTTP(0xc004f98210, 0x7b10de0, 0xc02014a310, 0xc0340b7b00)
/usr/local/go/src/net/http/server.go:1995 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithCacheControl.func1(0x7b10de0, 0xc02014a310, 0xc0340b7b00)
/workspace/anago-v1.16.0-rc.2.1+2bd9643cee5b3b/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters/cachecontrol.go:31 +0xa8
net/http.HandlerFunc.ServeHTTP(0xc00ca32260, 0x7b10de0, 0xc02014a310, 0xc0340b7b00)
/usr/local/go/src/net/http/server.go:1995 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.WithLogging.func1(0x7b04720, 0xc013c82c18, 0xc0340b7a00)
/workspace/anago-v1.16.0-rc.2.1+2bd9643cee5b3b/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:89 +0x29c
net/http.HandlerFunc.ServeHTTP(0xc00ca32280, 0x7b04720, 0xc013c82c18, 0xc0340b7a00)
/usr/local/go/src/net/http/server.go:1995 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.withPanicRecovery.func1(0x7b04720, 0xc013c82c18, 0xc0340b7a00)
/workspace/anago-v1.16.0-rc.2.1+2bd9643cee5b3b/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/wrap.go:51 +0x105
net/http.HandlerFunc.ServeHTTP(0xc00ca322a0, 0x7b04720, 0xc013c82c18, 0xc0340b7a00)
/usr/local/go/src/net/http/server.go:1995 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server.(*APIServerHandler).ServeHTTP(0xc004f98240, 0x7b04720, 0xc013c82c18, 0xc0340b7a00)
/workspace/anago-v1.16.0-rc.2.1+2bd9643cee5b3b/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/handler.go:189 +0x51
net/http.serverHandler.ServeHTTP(0xc00172a0d0, 0x7b04720, 0xc013c82c18, 0xc0340b7a00)
/usr/local/go/src/net/http/server.go:2774 +0xa8
net/http.initNPNRequest.ServeHTTP(0xc0169cee00, 0xc00172a0d0, 0x7b04720, 0xc013c82c18, 0xc0340b7a00)
/usr/local/go/src/net/http/server.go:3323 +0x8d
k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*serverConn).runHandler(0xc00a8e1080, 0xc013c82c18, 0xc0340b7a00, 0xc0004b9be0)
/workspace/anago-v1.16.0-rc.2.1+2bd9643cee5b3b/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/server.go:2125 +0x89
created by k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*serverConn).processHeaders
/workspace/anago-v1.16.0-rc.2.1+2bd9643cee5b3b/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/server.go:1859 +0x4f4
E0707 16:00:11.468119 1 wrap.go:39] apiserver panic'd on GET /apis/monitoring.coreos.com/v1/servicemonitors?limit=500&resourceVersion=0
I0707 16:00:11.468205 1 log.go:172] http2: panic serving 192.168.1.91:37280: runtime error: invalid memory address or nil pointer dereference
...
The error shows that the apiserver panics while handling the request GET /apis/monitoring.coreos.com/v1/servicemonitors?limit=500&resourceVersion=0
with: runtime error: invalid memory address or nil pointer dereference.
That looks like a monitoring-related API, and searching the cluster for monitoring-related resources turned up the monitoring namespace:
[root@192-168-1-90 ~]# kubectl get ns
NAME STATUS AGE
......   # several namespaces omitted
default Active 86d
istio-system Active 69d
kube-node-lease Active 86d
kube-public Active 86d
kube-system Active 86d
kubernetes-dashboard Active 86d
monitoring Terminating 30h
The monitoring namespace is stuck in Terminating. Its full definition looks like this:
apiVersion: v1
kind: Namespace
metadata:
annotations:
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"v1","kind":"Namespace","metadata":{"annotations":{},"name":"monitoring"}}
creationTimestamp: "2020-07-06T02:33:50Z"
deletionTimestamp: "2020-07-06T02:36:59Z"
name: monitoring
resourceVersion: "56322781"
selfLink: /api/v1/namespaces/monitoring
uid: 2a41ac04-d86c-4086-9325-5c87dd2a15ac
spec:
finalizers:
- kubernetes
status:
conditions:
- lastTransitionTime: "2020-07-06T02:37:50Z"
message: All resources successfully discovered
reason: ResourcesDiscovered
status: "False"
type: NamespaceDeletionDiscoveryFailure
- lastTransitionTime: "2020-07-06T02:37:50Z"
message: All legacy kube types successfully parsed
reason: ParsedGroupVersions
status: "False"
type: NamespaceDeletionGroupVersionParsingFailure
- lastTransitionTime: "2020-07-06T02:37:50Z"
message: 'Failed to delete all resource types, 3 remaining: Delete https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/namespaces/monitoring/prometheusrules:
stream error: stream ID 190291; INTERNAL_ERROR, Delete https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/namespaces/monitoring/servicemonitors:
stream error: stream ID 190119; INTERNAL_ERROR, Delete https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/namespaces/monitoring/thanosrulers:
stream error: stream ID 190153; INTERNAL_ERROR'
reason: ContentDeletionFailed
status: "True"
type: NamespaceDeletionContentFailure
phase: Terminating
Following the approach described in various posts online, I tried clearing spec.finalizers to get rid of the namespace, with no effect at all.
# Manually edit the namespace and remove the spec.finalizers field: no effect.
kubectl edit ns monitoring
# A direct delete does not work either
[root@192-168-1-90 ~]# kubectl delete ns monitoring --grace-period=0 --force
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
Error from server (Conflict): Operation cannot be fulfilled on namespaces "monitoring": The system is ensuring all content is removed from this namespace. Upon completion, this namespace will automatically be purged by the system.
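For completeness, the variant of that recipe that usually circulates goes through the /finalize subresource, since a namespace's spec.finalizers cannot be changed with kubectl edit. A sketch, assuming jq is available; even if it had forced the namespace out, it would not have fixed the apiserver panic underneath:

kubectl get ns monitoring -o json \
  | jq 'del(.spec.finalizers)' \
  | kubectl replace --raw "/api/v1/namespaces/monitoring/finalize" -f -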
Looking more carefully at the error information above, status.conditions contains the following:
- lastTransitionTime: "2020-07-06T02:37:50Z"
message: 'Failed to delete all resource types, 3 remaining: Delete https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/namespaces/monitoring/prometheusrules:
stream error: stream ID 190291; INTERNAL_ERROR, Delete https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/namespaces/monitoring/servicemonitors:
stream error: stream ID 190119; INTERNAL_ERROR, Delete https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/namespaces/monitoring/thanosrulers:
stream error: stream ID 190153; INTERNAL_ERROR'
These messages show that deleting the namespace also has to go through the apiserver endpoint https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/namespaces/monitoring/servicemonitors,
but that endpoint keeps returning INTERNAL_ERROR, so the namespace can never finish deleting.
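The failing call is easy to reproduce by hand, which also confirms that the apiserver itself is what is breaking:

kubectl get --raw "/apis/monitoring.coreos.com/v1/servicemonitors?limit=500"
# with the bug present this fails with the same stream INTERNAL_ERROR seen in the
# controller-manager logs instead of returning a ServiceMonitorList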
I filed an issue on GitHub: https://github.com/kubernetes/kubernetes/issues/92858
and am waiting for feedback.
===========
Update: following a hint from the Kubernetes maintainers, I moved straight to Kubernetes 1.16.2+, and the problem has not recurred since.
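For reference, assuming a kubeadm-managed control plane (the post does not say how the cluster was installed), the upgrade roughly looks like this on each master:

kubectl version --short        # confirm the server is still v1.16.0
kubeadm upgrade plan           # list the patch releases available in the v1.16 line
kubeadm upgrade apply v1.16.2
# then upgrade the kubelet/kubectl packages on every node as usual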
Summary
In short: while the monitoring namespace was being deleted, cascading deletion of the CoreOS CRD resources hit an apiserver error, and that single failure broke cascading deletion for the entire cluster...
A butterfly effect, heh.