An Adventure with a Kubernetes Eviction
1. Overview
Kubernetes eviction detects disk pressure (the nodefs signal) on the filesystem that holds the kubelet's root-dir. The default root-dir is /var/lib/kubelet and can be changed with the --root-dir flag. Source:
kubernetes/cmd/kubelet/app/options/options.go
const defaultRootDir = "/var/lib/kubelet"
fs.StringVar(&f.RootDirectory, "root-dir", f.RootDirectory, "Directory path for managing kubelet files (volume mounts,etc).")
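To see which filesystem the nodefs signal will actually be watching on a given node, it is enough to Statfs the root-dir yourself. Below is a minimal sketch, assuming a Linux node and the default /var/lib/kubelet; it is illustrative only, not how the kubelet itself collects these numbers (it goes through its stats provider):
package main

import (
	"fmt"
	"syscall"
)

func main() {
	// Assumed to be the kubelet root-dir; adjust if --root-dir was overridden.
	const rootDir = "/var/lib/kubelet"

	var st syscall.Statfs_t
	if err := syscall.Statfs(rootDir, &st); err != nil {
		panic(err)
	}

	capacity := st.Blocks * uint64(st.Bsize)
	available := st.Bavail * uint64(st.Bsize)
	usedPct := 100 * float64(capacity-available) / float64(capacity)

	fmt.Printf("capacity=%d GiB, available=%d GiB, used=%.1f%%\n",
		capacity>>30, available>>30, usedPct)
}
Note that anything else written to the same filesystem, for example a hostPath such as /var/log on the system disk, consumes the same capacity the kubelet is watching.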
kubernetes/pkg/kubelet/eviction/helpers.go
// diskUsage converts used bytes into a resource quantity.
func diskUsage(fsStats *statsapi.FsStats) *resource.Quantity {
	if fsStats == nil || fsStats.UsedBytes == nil {
		return &resource.Quantity{Format: resource.BinarySI}
	}
	usage := int64(*fsStats.UsedBytes)
	return resource.NewQuantity(usage, resource.BinarySI)
}

// rankDiskPressureFunc returns a rankFunc that measures the specified fs stats.
func rankDiskPressureFunc(fsStatsToMeasure []fsStatsType, diskResource v1.ResourceName) rankFunc {
	return func(pods []*v1.Pod, stats statsFunc) {
		orderedBy(exceedDiskRequests(stats, fsStatsToMeasure, diskResource), priority, disk(stats, fsStatsToMeasure, diskResource)).Sort(pods)
	}
}
The nodefs signal itself is built from the node summary's filesystem stats (makeSignalObservations in the same file):
if nodeFs := summary.Node.Fs; nodeFs != nil {
	if nodeFs.AvailableBytes != nil && nodeFs.CapacityBytes != nil {
		result[evictionapi.SignalNodeFsAvailable] = signalObservation{
			available: resource.NewQuantity(int64(*nodeFs.AvailableBytes), resource.BinarySI),
			capacity:  resource.NewQuantity(int64(*nodeFs.CapacityBytes), resource.BinarySI),
			time:      nodeFs.Time,
		}
	}
}
summary.Node above is a statsapi.NodeStats from the kubelet's stats summary API:
type NodeStats struct {
	// Reference to the measured Node.
	NodeName string `json:"nodeName"`
	// Stats of system daemons tracked as raw containers.
	// The system containers are named according to the SystemContainer* constants.
	// +optional
	// +patchMergeKey=name
	// +patchStrategy=merge
	SystemContainers []ContainerStats `json:"systemContainers,omitempty" patchStrategy:"merge" patchMergeKey:"name"`
	// The time at which data collection for the node-scoped (i.e. aggregate) stats was (re)started.
	StartTime metav1.Time `json:"startTime"`
	// Stats pertaining to CPU resources.
	// +optional
	CPU *CPUStats `json:"cpu,omitempty"`
	// Stats pertaining to memory (RAM) resources.
	// +optional
	Memory *MemoryStats `json:"memory,omitempty"`
	// Stats pertaining to network resources.
	// +optional
	Network *NetworkStats `json:"network,omitempty"`
	// Stats pertaining to total usage of filesystem resources on the rootfs used by node k8s components.
	// NodeFs.Used is the total bytes used on the filesystem.
	// +optional
	Fs *FsStats `json:"fs,omitempty"`
	// Stats about the underlying container runtime.
	// +optional
	Runtime *RuntimeStats `json:"runtime,omitempty"`
	// Stats about the rlimit of system.
	// +optional
	Rlimit *RlimitStats `json:"rlimit,omitempty"`
}
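The kubelet exposes exactly this structure on its summary endpoint, which you can pull with kubectl get --raw /api/v1/nodes/<node-name>/proxy/stats/summary. Below is a minimal sketch that decodes only the nodefs fields from a saved summary; the trimmed struct is an assumption made for illustration, not the real statsapi types:
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Trimmed-down mirror of the summary API; only the fields used below.
type summary struct {
	Node struct {
		NodeName string `json:"nodeName"`
		Fs       *struct {
			AvailableBytes *uint64 `json:"availableBytes"`
			CapacityBytes  *uint64 `json:"capacityBytes"`
		} `json:"fs"`
	} `json:"node"`
}

func main() {
	// Feed it the output of:
	//   kubectl get --raw /api/v1/nodes/<node-name>/proxy/stats/summary > summary.json
	raw, err := os.ReadFile("summary.json")
	if err != nil {
		panic(err)
	}

	var s summary
	if err := json.Unmarshal(raw, &s); err != nil {
		panic(err)
	}

	fs := s.Node.Fs
	if fs == nil || fs.AvailableBytes == nil || fs.CapacityBytes == nil {
		panic("summary has no nodefs stats")
	}

	availPct := 100 * float64(*fs.AvailableBytes) / float64(*fs.CapacityBytes)
	fmt.Printf("node %s: nodefs available %.1f%% of capacity\n", s.Node.NodeName, availPct)
}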
2. The Incident
This happened a few months ago: someone changed a fluentd pattern. fluentd is deployed as a DaemonSet with a hostPath mount of /var/log, and its output is routed to syslog. Every log line that no longer matched the pattern was dumped into /var/log/syslog, over 7 GB written in a single hour. Disk usage soon hit 90%, while the eviction policy we had configured in the kubelet was:
evictionHard:
  imagefs.available: 15%
  memory.available: 100Mi
  nodefs.available: 10%
  nodefs.inodesFree: 5%
Once the disk holding the kubelet's root-dir reaches 90% usage (nodefs.available drops below 10%), eviction starts. fluentd itself never reported an error; the lines simply failed to match the pattern and were written to syslog. So make sure the log output path and log level are configured properly when you use it.
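To make the 90% number concrete: a percentage threshold like nodefs.available: 10% is resolved against the filesystem capacity and compared with the observed available bytes. Below is a simplified stand-in for the real threshold logic in eviction/helpers.go, with made-up numbers for illustration:
package main

import "fmt"

// met reports whether the "nodefs.available < threshold%" signal fires.
func met(availableBytes, capacityBytes uint64, thresholdPercent float64) bool {
	threshold := thresholdPercent / 100 * float64(capacityBytes)
	return float64(availableBytes) < threshold
}

func main() {
	const gib = 1 << 30
	capacity := uint64(100 * gib)

	// 9 GiB free on a 100 GiB disk => 91% used => below the 10% threshold.
	fmt.Println(met(9*gib, capacity, 10))  // true  -> node reports DiskPressure, evictions start
	fmt.Println(met(20*gib, capacity, 10)) // false -> healthy
}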
3. Aftermath
We reached this conclusion by reading the source and restored service as an emergency measure. (The alert threshold on the system disk had not been reduced by the eviction threshold configured in the kubelet: with nodefs.available: 10%, eviction already starts at 90% disk usage, so the disk alert has to fire well before that.) We then re-planned the monitoring thresholds and gave the production nodes dedicated attributes (labels/taints) so that different workloads are deployed onto different nodes.